This post covers hierarchical indexing, combining data, and reshaping data with Pandas. Pt.1 covers hierarchical indexing; Pt.2 covers merging data in detail; Pt.3 covers combining and reshaping data; Pt.4 covers reshaping data.
    import pandas as pd
    import numpy as np
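Combining data

pd.merge( left, right, on, how, suffixes ) : merge on a column shared by both sides

pd.merge( left, right, left_on, right_on, how, suffixes ) : merge on separately specified columns of the left and right sides

pd.merge( left, right, left_index, right_index, how, suffixes ) : merge on the indexes of the left and right sides

left.join( right, how, on ) : a more convenient way to merge on the index, but it requires that the two sides have no overlapping column names

np.concatenate( arrs, axis ) : NumPy's concatenation along an axis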
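The call signatures above can be sketched on two tiny frames. The frames and their column names (`key`, `lval`, `rval`, `rkey`) are hypothetical and exist only for illustration; this is a minimal sketch, not a complete tour of the options.

```python
import numpy as np
import pandas as pd

# Hypothetical example frames; the column names are made up for illustration.
left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "rval": [4, 5, 6]})

# Merge on a shared column; how="inner" keeps only keys present on both sides.
inner = pd.merge(left, right, on="key", how="inner")

# Merge on differently named columns with left_on / right_on.
right2 = right.rename(columns={"key": "rkey"})
outer = pd.merge(left, right2, left_on="key", right_on="rkey", how="outer")

# suffixes disambiguates non-key columns that share a name.
suffixed = pd.merge(left, left, on="key", suffixes=("_l", "_r"))

# Merge on the indexes of both frames.
by_index = pd.merge(left.set_index("key"), right.set_index("key"),
                    left_index=True, right_index=True, how="left")

# join is a shortcut for index-on-index merges when no column names overlap.
joined = left.set_index("key").join(right.set_index("key"), how="left")

# NumPy concatenates raw arrays along an axis and knows nothing about labels.
arr = np.arange(6).reshape(2, 3)
wide = np.concatenate([arr, arr], axis=1)
```

Here `by_index` and `joined` express the same left join on the index; `join` simply needs less typing.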
pd.concat( objs, axis, join, keys, ignore_index ) : pandas concatenation along an axis; it glues the values together along with their indexes

    s1 = pd.Series([0, 1], index=['a', 'b'])
    s2 = pd.Series([2, 3, 4], index=['a', 'b', 'e'])
    s3 = pd.Series([5, 6], index=['a', 'b'])
    s1, s2, s3

    (a    0
     b    1
     dtype: int64,
     a    2
     b    3
     e    4
     dtype: int64,
     a    5
     b    6
     dtype: int64)

With the default axis=0, the Series are appended end to end:

    pd.concat([s1, s2, s3])

    a    0
    b    1
    a    2
    b    3
    e    4
    a    5
    b    6
    dtype: int64
    pd.concat([s1, s2, s3], axis=1)

         0  1    2
    a  0.0  2  5.0
    b  1.0  3  6.0
    e  NaN  4  NaN
join : defaults to 'outer'

    pd.concat([s1, s2, s3], axis=1, join='inner')

    pd.concat([s1, s2, s3], axis=1, join='outer')

         0  1    2
    a  0.0  2  5.0
    b  1.0  3  6.0
    e  NaN  4  NaN
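To make the difference between the two join modes concrete, here is a small sketch that reuses the same three Series and compares which labels survive in each case.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"])
s2 = pd.Series([2, 3, 4], index=["a", "b", "e"])
s3 = pd.Series([5, 6], index=["a", "b"])

# join="inner" keeps only the labels shared by every input: 'a' and 'b'.
inner = pd.concat([s1, s2, s3], axis=1, join="inner")

# join="outer" (the default) keeps the union of labels and fills gaps with NaN.
outer = pd.concat([s1, s2, s3], axis=1, join="outer")

print(inner)
print(outer)
```

The inner result has no missing values because label 'e', which only s2 carries, is dropped.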
keys

When Series are concatenated along axis=0, keys creates a hierarchical index on the concatenation axis

    pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])

    one    a    0
           b    1
    two    a    0
           b    1
    three  a    5
           b    6
    dtype: int64
When Series are concatenated along axis=1, keys become the column headers

    pd.concat([s1, s1, s3], axis=1, keys=['one', 'two', 'three'])
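The two roles of keys can be checked side by side. This sketch rebuilds the same Series and inspects where the keys end up in each result.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"])
s3 = pd.Series([5, 6], index=["a", "b"])

# Along axis=0, keys become the outer level of a two-level index.
stacked = pd.concat([s1, s1, s3], keys=["one", "two", "three"])

# Along axis=1, keys become the column labels of the resulting DataFrame.
side_by_side = pd.concat([s1, s1, s3], axis=1, keys=["one", "two", "three"])

print(stacked.loc["three"])   # the rows that came from s3
print(side_by_side["three"])  # the column that came from s3
```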
The same logic applies to DataFrame

    df1 = pd.DataFrame(np.arange(6).reshape(3, 2),
                       index=['a', 'b', 'c'],
                       columns=['one', 'two'])
    df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2),
                       index=['a', 'c'],
                       columns=['three', 'four'])
    df1, df2

    (   one  two
     a    0    1
     b    2    3
     c    4    5,
        three  four
     a      5     6
     c      7     8)

    pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])

      level1     level2
         one two  three four
    a      0   1    5.0  6.0
    b      2   3    NaN  NaN
    c      4   5    7.0  8.0
If objs is passed as a dict rather than a list, the dict's keys are used as the keys values

    pd.concat({'level1': df1, 'level2': df2}, axis=1)

      level1     level2
         one two  three four
    a      0   1    5.0  6.0
    b      2   3    NaN  NaN
    c      4   5    7.0  8.0
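As a quick check that the list-plus-keys form and the dict form agree, the following sketch builds both results from the same two frames and compares them.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2),
                   index=["a", "b", "c"], columns=["one", "two"])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2),
                   index=["a", "c"], columns=["three", "four"])

from_keys = pd.concat([df1, df2], axis=1, keys=["level1", "level2"])
from_dict = pd.concat({"level1": df1, "level2": df2}, axis=1)

# Both forms build the same two-level column index.
same = from_keys.equals(from_dict)

# Selecting an outer key returns that input's columns, aligned on the union index.
level2 = from_keys["level2"]
print(same)
print(level2)
```

Row 'b' of `level2` is all NaN because df2 has no such label.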
ignore_index : do not keep the index along the concatenation axis; generate a new one instead

    df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
    df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])
    df1, df2

    (          a         b         c         d
     0 -0.384417 -0.394028  0.098734  0.325203
     1  2.274706 -1.128029 -2.126771  0.412944
     2  1.957194  1.825529  0.009230 -0.479239,
               b         d         a
     0 -0.757975 -1.809006 -0.396801
     1  0.112866  2.343804  0.815144)

    pd.concat([df1, df2], ignore_index=False)

              a         b         c         d
    0 -0.384417 -0.394028  0.098734  0.325203
    1  2.274706 -1.128029 -2.126771  0.412944
    2  1.957194  1.825529  0.009230 -0.479239
    0 -0.396801 -0.757975       NaN -1.809006
    1  0.815144  0.112866       NaN  2.343804

    pd.concat([df1, df2], ignore_index=True)

              a         b         c         d
    0 -0.384417 -0.394028  0.098734  0.325203
    1  2.274706 -1.128029 -2.126771  0.412944
    2  1.957194  1.825529  0.009230 -0.479239
    3 -0.396801 -0.757975       NaN -1.809006
    4  0.815144  0.112866       NaN  2.343804
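Because the frames above are random, the effect is easier to verify with fixed values. The small frames below are hypothetical stand-ins chosen so the result is reproducible.

```python
import pandas as pd

# Hypothetical frames with fixed values so the result is reproducible.
df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"b": [7, 8], "a": [9, 10]})

# Default: the original row labels are kept, so 0 and 1 appear twice.
kept = pd.concat([df1, df2])

# ignore_index=True discards them and numbers the rows 0..n-1.
fresh = pd.concat([df1, df2], ignore_index=True)

print(kept.index.tolist())
print(fresh.index.tolist())
```

Columns are still matched by name, not by position, so the values of df2 land under the right headers even though its columns are in a different order.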
np.where( condition, arr1, arr2 ) : NumPy's way to combine data whose indexes overlap fully or partially; it works position by position and does not align on labels

    a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
                  index=['f', 'e', 'd', 'c', 'b', 'a'])
    b = pd.Series(np.arange(len(a), dtype=np.float64),
                  index=['f', 'e', 'd', 'c', 'b', 'a'])
    # use .iloc for positional assignment; b[-1] on a string index is ambiguous
    b.iloc[-1] = np.nan
    a, b

    (f    NaN
     e    2.5
     d    NaN
     c    3.5
     b    4.5
     a    NaN
     dtype: float64,
     f    0.0
     e    1.0
     d    2.0
     c    3.0
     b    4.0
     a    NaN
     dtype: float64)

    np.where(pd.notnull(a), a, b)

    array([0. , 2.5, 2. , 3.5, 4.5, nan])
obj1.combine_first( obj2 ) : pandas' way to combine data whose indexes overlap fully or partially; missing values in obj1 are filled from obj2

    a.combine_first(b)

    f    0.0
    e    2.5
    d    2.0
    c    3.5
    b    4.5
    a    NaN
    dtype: float64

    df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                        'b': [np.nan, 2., np.nan, 6.],
                        'c': range(2, 18, 4)})
    df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                        'b': [np.nan, 3., 4., 6., 8.]})
    df1, df2

    (     a    b   c
     0  1.0  NaN   2
     1  NaN  2.0   6
     2  5.0  NaN  10
     3  NaN  6.0  14,
          a    b
     0  5.0  NaN
     1  4.0  3.0
     2  NaN  4.0
     3  3.0  6.0
     4  7.0  8.0)

    df1.combine_first(df2)

         a    b     c
    0  1.0  NaN   2.0
    1  4.0  2.0   6.0
    2  5.0  4.0  10.0
    3  3.0  6.0  14.0
    4  7.0  8.0   NaN
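The practical difference between the two approaches is label alignment. This sketch first confirms that they agree when the indexes are identical, then shows combine_first handling indexes that only partly overlap. The slices used for the second case are an illustrative choice, not taken from the text above.

```python
import numpy as np
import pandas as pd

labels = ["f", "e", "d", "c", "b", "a"]
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=labels)
b = pd.Series(np.arange(len(a), dtype=np.float64), index=labels)
b.iloc[-1] = np.nan

# With identical indexes, np.where and combine_first give the same values.
via_where = pd.Series(np.where(pd.notnull(a), a, b), index=a.index)
via_combine = a.combine_first(b)

# combine_first aligns on labels, so partially overlapping indexes also work:
# the first four labels of b are patched with the last four labels of a.
partial = b.iloc[:-2].combine_first(a.iloc[2:])
print(partial)
```

np.where would not be usable for the second case as written, since the two slices cover different labels and it only compares values position by position.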
Reshaping and pivoting

When you call unstack or stack, the level being pivoted (the level argument) becomes the lowest (innermost) level in the result
frame.stack( level ) : pivots the columns into the rows, DataFrame → Series

    frame = pd.DataFrame(np.arange(6).reshape((2, 3)),
                         index=pd.Index(['Ohio', 'Colorado'], name='state'),
                         columns=pd.Index(['one', 'two', 'three'], name='number'))
    frame

    number    one  two  three
    state
    Ohio        0    1      2
    Colorado    3    4      5

    series = frame.stack()
    series

    state     number
    Ohio      one       0
              two       1
              three     2
    Colorado  one       3
              two       4
              three     5
    dtype: int32
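Since stack and unstack are inverse operations, a round trip is a simple way to see what each one does. The sketch below stacks the same frame and then unstacks the result.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(["Ohio", "Colorado"], name="state"),
    columns=pd.Index(["one", "two", "three"], name="number"),
)

# stack moves the column labels into a new innermost index level.
series = frame.stack()

# unstack pivots the innermost index level back into columns.
restored = series.unstack()

print(series.index.names)
print(restored)
```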
level : the level to pivot, given by name or number; it becomes the lowest (innermost) level in the result

    df1 = pd.DataFrame({'left': series, 'right': series + 5},
                       columns=pd.Index(['left', 'right'], name='side'))
    df1

    side             left  right
    state    number
    Ohio     one        0      5
             two        1      6
             three      2      7
    Colorado one        3      8
             two        4      9
             three      5     10

    df2 = df1.unstack('state')
    df2

    side   left          right
    state  Ohio Colorado  Ohio Colorado
    number
    one       0        3     5        8
    two       1        4     6        9
    three     2        5     7       10

    df3 = df2.stack('side')
    df3

    state         Colorado  Ohio
    number side
    one    left          3     0
           right         8     5
    two    left          4     1
           right         9     6
    three  left          5     2
           right        10     7
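The claim that the pivoted level always ends up innermost can be verified by inspecting the level names after each step. This sketch repeats the same pipeline and prints them.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(["Ohio", "Colorado"], name="state"),
    columns=pd.Index(["one", "two", "three"], name="number"),
)
series = frame.stack()

df1 = pd.DataFrame({"left": series, "right": series + 5},
                   columns=pd.Index(["left", "right"], name="side"))

# 'state' leaves the row index and becomes the innermost column level.
df2 = df1.unstack("state")

# 'side' leaves the columns and becomes the innermost row level.
df3 = df2.stack("side")

print(list(df2.columns.names))  # ['side', 'state']
print(list(df3.index.names))    # ['number', 'side']
```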
series.unstack( level ) : pivots the rows into the columns, Series → DataFrame

    series

    state     number
    Ohio      one       0
              two       1
              three     2
    Colorado  one       3
              two       4
              three     5
    dtype: int32

    series.unstack()

    number    one  two  three
    state
    Ohio        0    1      2
    Colorado    3    4      5

    series.unstack(0)

    state   Ohio  Colorado
    number
    one        0         3
    two        1         4
    three      2         5

    series.unstack(level='state')

    state   Ohio  Colorado
    number
    one        0         3
    two        1         4
    three      2         5
If not every index value can be found in each group, unstack may introduce missing values

    s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
    s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
    s = pd.concat([s1, s2], keys=['one', 'two'])
    s

    one  a    0
         b    1
         c    2
         d    3
    two  c    4
         d    5
         e    6
    dtype: int64

    df = s.unstack()
    df

           a    b    c    d    e
    one  0.0  1.0  2.0  3.0  NaN
    two  NaN  NaN  4.0  5.0  6.0
stack filters out missing values by default

    df.stack()

    one  a    0.0
         b    1.0
         c    2.0
         d    3.0
    two  c    4.0
         d    5.0
         e    6.0
    dtype: float64
Use dropna to control whether missing values are filtered out

    df.stack(dropna=False)

    one  a    0.0
         b    1.0
         c    2.0
         d    3.0
         e    NaN
    two  a    NaN
         b    NaN
         c    4.0
         d    5.0
         e    6.0
    dtype: float64
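How stack treats missing values has been changing in recent pandas releases, so the dropna keyword may warn or behave differently depending on the installed version; treat that as an assumption to check against your own release. A sketch that should give the same result either way is to drop the missing values explicitly after stacking.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=["a", "b", "c", "d"])
s2 = pd.Series([4, 5, 6], index=["c", "d", "e"])
s = pd.concat([s1, s2], keys=["one", "two"])

# unstack introduces NaN where a (key, label) pair is absent,
# which also upcasts the integer values to float.
df = s.unstack()

# Dropping NaN explicitly after stacking yields the seven original values,
# whether or not stack itself already filtered them.
round_trip = df.stack().dropna()

print(df)
print(round_trip)
```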