Pandas - 语法基础 Pt.4

发表于 2022-07-31 更新于 2024-03-30 分类于 pandas

这篇随笔主要介绍 Pandas 的两种数据类型：Series 和 DataFrame 相关的内容
Pt.1 部分主要介绍对 Pandas 两种数据类型的基本操作，包括创建、索引和修改
Pt.2 部分详细介绍了 Pandas 的索引操作
Pt.3 部分主要介绍了 Pandas 的计算和函数
Pt.4 部分详细介绍了 Pandas 的统计计算相关的函数

1
2
3

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

统计计算

frame = pd.DataFrame([[1, np.nan], [2, 4],
                      [np.nan, np.nan], [3, 5]],
                    index=['a', 'b', 'c', 'd'],
                    columns=['one', 'two'])

obj.sum( axis, skipna, level ) : 求和

frame

	one	two
a	1.0	NaN
b	2.0	4.0
c	NaN	NaN
d	3.0	5.0

1	frame.sum()

1
2
3

one    6.0
two    9.0
dtype: float64

1	frame.sum(axis=1)

a    1.0
b    6.0
c    0.0
d    8.0
dtype: float64

1	frame.sum(axis=1, skipna=False)

a    NaN
b    6.0
c    NaN
d    8.0
dtype: float64

obj.mean( axis, skipna, level ) : 求平均数

1	frame.mean()

1
2
3

one    2.0
two    4.5
dtype: float64

obj.median( axis, skipna, level ) : 求中位数

1	frame.median()

1
2
3

one    2.0
two    4.5
dtype: float64

obj.mad( axis, skipna, level ) : 求绝对离差

1	frame.mad()

1
2
3

one    0.666667
two    0.500000
dtype: float64

obj.var( axis, skipna, level ) : 求方差

1	frame.var()

1
2
3

one    1.0
two    0.5
dtype: float64

obj.std( axis, skipna, level ) : 求标准差

1	frame.std()

1
2
3

one    1.000000
two    0.707107
dtype: float64

obj.skew( axis, skipna, level ) : 求偏度

1	frame.skew()

1
2
3

one    0.0
two    NaN
dtype: float64

obj.kurt( axis, skipna, level ) : 求峰度

1	pd.DataFrame([1,2,3,4,3,2,1,0]).kurt()

1 2	0 -0.7 dtype: float64

obj.quantile( q, axis ) : 求分位数（二分位、四分位、四分之三分位 ... ）

1	frame, frame.quantile(0.5), frame.mean()

(   one  two
 a  1.0  NaN
 b  2.0  4.0
 c  NaN  NaN
 d  3.0  5.0,
 one    2.0
 two    4.5
 Name: 0.5, dtype: float64,
 one    2.0
 two    4.5
 dtype: float64)

1	frame.quantile(0.75) - frame.quantile(0.25) ### 离差IQR

1
2
3

one    1.0
two    0.5
dtype: float64

obj.cumsum( axis, skipna, level ), obj.cumprod( axis, skipna, level ) : 累加, 累乘

frame

	one	two
a	1.0	NaN
b	2.0	4.0
c	NaN	NaN
d	3.0	5.0

1	frame.cumsum()

	one	two
a	1.0	NaN
b	3.0	4.0
c	NaN	NaN
d	6.0	9.0

1	frame.cumprod()

	one	two
a	1.0	NaN
b	2.0	4.0
c	NaN	NaN
d	6.0	20.0

obj.cummin( axis, skipna, level ), obj.cummax( axis, skipna, level ) : 累计最小值和累计最大值

1	frame.cummax()

	one	two
a	1.0	NaN
b	2.0	4.0
c	NaN	NaN
d	3.0	5.0

obj.idmax( axis, skipna, level ), obj.idmin( axis, skipna, level ) : 返回最大值或最小值的索引

frame

	one	two
a	1.0	NaN
b	2.0	4.0
c	NaN	NaN
d	3.0	5.0

1	frame.idxmax()

1
2
3

one    d
two    d
dtype: object

1	frame.idxmax(axis=1)

a    one
b    two
c    NaN
d    two
dtype: object

obj.describe( ) : 返回多个汇总统计

frame

	one	two
a	1.0	NaN
b	2.0	4.0
c	NaN	NaN
d	3.0	5.0

1	frame.describe()

	one	two
count	3.0	2.000000
mean	2.0	4.500000
std	1.0	0.707107
min	1.0	4.000000
25%	1.5	4.250000
50%	2.0	4.500000
75%	2.5	4.750000
max	3.0	5.000000

对于非数值类型

1 2	obj = pd.Series(['a', 'a', 'b', 'c'] * 4) obj.head(), obj.describe()

(0    a
 1    a
 2    b
 3    c
 4    a
 dtype: object,
 count     16
 unique     3
 top        a
 freq       8
 dtype: object)

series1.cov( series2 ), frame.cov( ) : 计算协方差

# 数据准备
price = pd.read_pickle('pydata-book-2nd-edition/examples/yahoo_price.pkl') #股票价格
volume = pd.read_pickle('pydata-book-2nd-edition/examples/yahoo_volume.pkl') #股票成交量
returns = price.pct_change() #计算变化率 : (后一个值 - 前一个值) / 前一个值
returns

	AAPL	GOOG	IBM	MSFT
Date
2010-01-04	NaN	NaN	NaN	NaN
2010-01-05	0.001729	-0.004404	-0.012080	0.000323
2010-01-06	-0.015906	-0.025209	-0.006496	-0.006137
2010-01-07	-0.001849	-0.023280	-0.003462	-0.010400
2010-01-08	0.006648	0.013331	0.010035	0.006897
...	...	...	...	...
2016-10-17	-0.000680	0.001837	0.002072	-0.003483
2016-10-18	-0.000681	0.019616	-0.026168	0.007690
2016-10-19	-0.002979	0.007846	0.003583	-0.002255
2016-10-20	-0.000512	-0.005652	0.001719	-0.004867
2016-10-21	-0.003930	0.003011	-0.012474	0.042096

1714 rows × 4 columns

1	returns['MSFT'].cov(returns['IBM'])

8.870655479703546e-05

1	returns.cov()

	AAPL	GOOG	IBM	MSFT
AAPL	0.000277	0.000107	0.000078	0.000095
GOOG	0.000107	0.000251	0.000078	0.000108
IBM	0.000078	0.000078	0.000146	0.000089
MSFT	0.000095	0.000108	0.000089	0.000215

series1.corr( series2 ), frame.corr( ) : 计算相关系数

1	returns['MSFT'].corr(returns['IBM'])

1	0.4997636114415114

1	returns.corr()

	AAPL	GOOG	IBM	MSFT
AAPL	1.000000	0.407919	0.386817	0.389695
GOOG	0.407919	1.000000	0.405099	0.465919
IBM	0.386817	0.405099	1.000000	0.499764
MSFT	0.389695	0.465919	0.499764	1.000000

obj1.corrwith( obj2, axis ) : 计算 obj1 与 obj2 对应行或列的相关系数

1	returns.corrwith(returns.IBM)

AAPL    0.386817
GOOG    0.405099
IBM     1.000000
MSFT    0.499764
dtype: float64

1	returns.corrwith(volume) # 一一对应

AAPL   -0.075565
GOOG   -0.007067
IBM    -0.204849
MSFT   -0.092950
dtype: float64

1	returns.T.corrwith(volume.T, axis=1)

AAPL   -0.075565
GOOG   -0.007067
IBM    -0.204849
MSFT   -0.092950
dtype: float64

series.unique( series2 ) : 得到 series 唯一值数组

1 2	obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c']) obj.unique()

array(['c', 'a', 'd', 'b'], dtype=object)

obj.isin( list ) : 判断 obj 的元素是否在 list 中

frame = pd.DataFrame([[1, np.nan], [2, 4],
                   [np.nan, np.nan], [3, 5]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
frame, frame.isin([1, 2])

(   one  two
 a  1.0  NaN
 b  2.0  4.0
 c  NaN  NaN
 d  3.0  5.0,
      one    two
 a   True  False
 b   True  False
 c  False  False
 d  False  False)

obj.value_counts( normalize, sort ) : 计算 obj 中各值出现的频率

1 2	obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c']) obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

frame = pd.DataFrame([[1, 3, 5], [2, 4, 6],
                      [1, 3, 6], [2, 4, 6]],
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two', 'thr'])
frame, frame.value_counts()

(   one  two  thr
 a    1    3    5
 b    2    4    6
 c    1    3    6
 d    2    4    6,
 one  two  thr
 2    4    6      2
 1    3    5      1
           6      1
 dtype: int64)