Pandas - Data Reshaping Pt.3

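This post covers hierarchical indexing, merging data, and reshaping data with Pandas
Pt.1 mainly covers hierarchical indexing with Pandas
Pt.2 covers merging data with Pandas in detail
Pt.3 mainly covers merging and reshaping data with Pandas
Pt.4 mainly covers reshaping data with Pandas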

import pandas as pd
import numpy as np

Merging Data

Pt.2 recap: merging data

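pd.merge( left, right, on, how, suffixes )
pd.merge( left, right, left_on, right_on, how, suffixes ) : merge on separately specified columns of the left and right objects
pd.merge( left, right, left_index, right_index, how, suffixes ) : merge on the indexes of the left and right objects
left.join( right, how, on ) : a more convenient way to merge on the index, but the column names must not overlap (unless lsuffix / rsuffix is supplied)
np.concatenate( arrs, axis ) : NumPy's concatenation along an axis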
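
As a quick refresher on the calls listed above, here is a minimal sketch. The tables left and right and their column names (lkey, rkey, x, y) are made up for illustration and do not come from Pt.2.

```python
import pandas as pd

# Hypothetical tables, invented for this sketch.
left = pd.DataFrame({'lkey': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'rkey': ['a', 'b', 'd'], 'y': [10, 20, 40]})

# Merge on differently named key columns; how='inner' keeps only shared keys.
inner = pd.merge(left, right, left_on='lkey', right_on='rkey', how='inner')

# Merge on the index of both sides; how='outer' keeps the union of labels.
left_i = left.set_index('lkey')
right_i = right.set_index('rkey')
outer = pd.merge(left_i, right_i, left_index=True, right_index=True, how='outer')

# join aligns on the index and defaults to how='left'.
joined = left_i.join(right_i)
```

Here inner keeps only the keys a and b, outer has one row per label found on either side, and joined keeps every row of left_i with a missing y for the label c.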

pd.concat( objs, axis, join, keys, ignore_index ) : pandas' concatenation along an axis; values and index are concatenated together

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['a', 'b', 'e'])
s3 = pd.Series([5, 6], index=['a', 'b'])
s1, s2, s3

(a    0
 b    1
 dtype: int64,
 a    2
 b    3
 e    4
 dtype: int64,
 a    5
 b    6
 dtype: int64)

pd.concat([s1, s2, s3])

a    0
b    1
a    2
b    3
e    4
a    5
b    6
dtype: int64

pd.concat([s1, s2, s3], axis=1)

     0  1    2
a  0.0  2  5.0
b  1.0  3  6.0
e  NaN  4  NaN

join : defaults to 'outer' (union of the index labels); 'inner' keeps only the labels shared by all objects

pd.concat([s1, s2, s3], axis=1, join='inner')

   0  1  2
a  0  2  5
b  1  3  6

pd.concat([s1, s2, s3], axis=1, join='outer')

     0  1    2
a  0.0  2  5.0
b  1.0  3  6.0
e  NaN  4  NaN

keys

When Series are concatenated along axis=0, keys creates a hierarchical index on the concatenation axis

pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])

one    a    0
       b    1
two    a    0
       b    1
three  a    5
       b    6
dtype: int64

When Series are concatenated along axis=1, the keys become the column headers

pd.concat([s1, s1, s3], axis=1, keys=['one', 'two', 'three'])

   one  two  three
a    0    0      5
b    1    1      6
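
To make the role of keys concrete, the following sketch rebuilds both results and reads values back out through the keys. The variable names stacked and wide are chosen here for illustration.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['a', 'b'])

# axis=0: the keys become the outer level of a hierarchical row index.
stacked = pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])
first_piece = stacked['one']            # select one input by its key
single = stacked.loc[('three', 'b')]    # select one value by (key, label)

# axis=1: the same keys become column labels instead.
wide = pd.concat([s1, s1, s3], axis=1, keys=['one', 'two', 'three'])
same_value = wide.loc['b', 'three']
```

Both single and same_value refer to the same element of s3, reached through the row index in one case and through the column label in the other.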

The same logic applies to DataFrame

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])
df1, df2

(   one  two
 a    0    1
 b    2    3
 c    4    5,
    three  four
 a      5     6
 c      7     8)

pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])

  level1     level2
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0

If objs is passed as a dict rather than a list, the dict keys are used as the keys values

pd.concat({'level1': df1, 'level2': df2}, axis=1)

  level1     level2
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
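
Because the keys form the outer column level, each input can be pulled back out of the combined frame by its key. A minimal sketch, reusing the two frames from above (the names wide, part1 and part2 are illustrative):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])

# The dict keys become the outer level of the column index.
wide = pd.concat({'level1': df1, 'level2': df2}, axis=1)

# Selecting by key returns the columns that came from that input.
part1 = wide['level1']
# df2 had no row 'b', so that row is all NaN and can be dropped again.
part2 = wide['level2'].dropna()
```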

ignore_index : do not keep the index along the concatenation axis; generate a new index instead

df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])
df1, df2

(          a         b         c         d
 0 -0.384417 -0.394028  0.098734  0.325203
 1  2.274706 -1.128029 -2.126771  0.412944
 2  1.957194  1.825529  0.009230 -0.479239,
           b         d         a
 0 -0.757975 -1.809006 -0.396801
 1  0.112866  2.343804  0.815144)

pd.concat([df1, df2], ignore_index=False)

          a         b         c         d
0 -0.384417 -0.394028  0.098734  0.325203
1  2.274706 -1.128029 -2.126771  0.412944
2  1.957194  1.825529  0.009230 -0.479239
0 -0.396801 -0.757975       NaN -1.809006
1  0.815144  0.112866       NaN  2.343804

pd.concat([df1, df2], ignore_index=True)

          a         b         c         d
0 -0.384417 -0.394028  0.098734  0.325203
1  2.274706 -1.128029 -2.126771  0.412944
2  1.957194  1.825529  0.009230 -0.479239
3 -0.396801 -0.757975       NaN -1.809006
4  0.815144  0.112866       NaN  2.343804
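
Since the random values above change on every run, here is a small deterministic sketch of the same behaviour. The frames top and bottom are invented for illustration.

```python
import pandas as pd

top = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
bottom = pd.DataFrame({'b': [5], 'a': [6]})   # same columns, different order

# Default: the original row labels are kept, so the label 0 appears twice.
kept = pd.concat([top, bottom])

# ignore_index=True: the row labels are replaced by 0, 1, 2, ...
fresh = pd.concat([top, bottom], ignore_index=True)
```

In both cases the columns are matched by name rather than by position, so the 6 from bottom lands in column a.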

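np.where( condition, arr1, arr2 ) : NumPy's way to combine data whose indexes overlap fully or partially (values are paired by position; index labels are not aligned)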

a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series(np.arange(len(a), dtype=np.float64),
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b.iloc[-1] = np.nan  # set the last element by position; b[-1] is ambiguous with a string index
a, b

(f    NaN
 e    2.5
 d    NaN
 c    3.5
 b    4.5
 a    NaN
 dtype: float64,
 f    0.0
 e    1.0
 d    2.0
 c    3.0
 b    4.0
 a    NaN
 dtype: float64)

np.where(pd.notnull(a), a, b)

array([0. , 2.5, 2. , 3.5, 4.5, nan])
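
One thing worth keeping in mind is that np.where returns a plain ndarray and pairs the two inputs by position. The sketch below uses a shortened version of the Series above to show what happens when the second Series is ordered differently; b_reversed is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, np.nan], index=['f', 'e', 'd'])
b = pd.Series([0.0, 1.0, 2.0], index=['f', 'e', 'd'])

# Same order on both sides: each gap in a is filled from the same label in b.
merged = np.where(pd.notnull(a), a, b)

# The index is lost, so it has to be attached again by hand.
as_series = pd.Series(merged, index=a.index)

# Different order: np.where still pairs by position, not by label,
# so the gap at 'f' is filled with the value that b stores under 'd'.
b_reversed = b.iloc[::-1]
mismatched = np.where(pd.notnull(a), a, b_reversed)
```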

obj1.combine_first( obj2 ) : pandas' way to combine data whose indexes overlap fully or partially

a.combine_first(b)
# Patch a with b: values already present in a are kept, missing values are filled from b

f    0.0
e    2.5
d    2.0
c    3.5
b    4.5
a    NaN
dtype: float64
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})
df1, df2

(     a    b   c
 0  1.0  NaN   2
 1  NaN  2.0   6
 2  5.0  NaN  10
 3  NaN  6.0  14,
      a    b
 0  5.0  NaN
 1  4.0  3.0
 2  NaN  4.0
 3  3.0  6.0
 4  7.0  8.0)

df1.combine_first(df2)
# Patch df1 with df2: values already present in df1 are kept, missing values are filled from df2

     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN
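
combine_first matches values by label, so the two objects need neither the same order nor the same set of labels. A minimal sketch with made-up Series (a, b and patched are illustrative names):

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5], index=['x', 'y'])
b = pd.Series([10.0, 20.0, 30.0], index=['y', 'x', 'z'])

# Labels are aligned first: the gap at 'x' is filled from b['x'],
# the existing value at 'y' is kept, and 'z' is added from b.
patched = a.combine_first(b)
```

The result covers the union of both indexes, which is also why row 4 appears in the DataFrame example above.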

Reshaping and Pivoting

With unstack or stack, the level being pivoted (i.e. the level given by level) becomes the lowest level in the result

frame.stack( level ) : pivots the columns into the rows, DataFrame → Series

frame = pd.DataFrame(np.arange(6).reshape((2, 3)),
                     index=pd.Index(['Ohio', 'Colorado'], name='state'),
                     columns=pd.Index(['one', 'two', 'three'],
                                      name='number'))
frame

number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5

series = frame.stack()
series

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32

level : the level to pivot; with unstack or stack it becomes the lowest level in the result

df1 = pd.DataFrame(
    {'left': series, 'right': series + 5},
    columns=pd.Index(['left', 'right'], name='side')
)
df1

side             left  right
state    number
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10

df2 = df1.unstack('state')
df2

side   left          right
state  Ohio Colorado  Ohio Colorado
number
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10

df3 = df2.stack('side')
df3

state         Colorado  Ohio
number side
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7
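
The rule that the pivoted level ends up as the lowest level can be checked by inspecting the level names after each step. This sketch repeats the steps above with the names unstacked and stacked chosen for illustration; some pandas versions may emit a FutureWarning for stack.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape((2, 3)),
                     index=pd.Index(['Ohio', 'Colorado'], name='state'),
                     columns=pd.Index(['one', 'two', 'three'], name='number'))
series = frame.stack()
df1 = pd.DataFrame({'left': series, 'right': series + 5},
                   columns=pd.Index(['left', 'right'], name='side'))

# 'state' leaves the rows and becomes the innermost column level.
unstacked = df1.unstack('state')
column_levels = list(unstacked.columns.names)

# 'side' leaves the columns and becomes the innermost row level.
stacked = unstacked.stack('side')
row_levels = list(stacked.index.names)
```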

series.unstack( level ) : pivots the rows into the columns, Series → DataFrame

series

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32

series.unstack()

number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5

series.unstack(level=0)

state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5

series.unstack(level='state')

state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5
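
stack and unstack are inverse operations when no values are missing. The sketch below goes from the frame to the Series and back; restored and by_state are names introduced here.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape((2, 3)),
                     index=pd.Index(['Ohio', 'Colorado'], name='state'),
                     columns=pd.Index(['one', 'two', 'three'], name='number'))

series = frame.stack()               # columns -> innermost row level
restored = series.unstack()          # innermost row level -> columns again
by_state = series.unstack('state')   # pivot the outer level instead
```

restored has the same shape and values as frame, while by_state is its transpose with number on the rows and state on the columns.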

unstack may introduce missing values if not every index label is found in each group

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
s = pd.concat([s1, s2], keys=['one', 'two'])
s

one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

df = s.unstack()
# The 'two' group has no entries for the labels a and b (and 'one' has none for e)
df

       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0
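
If NaN is not wanted in the result, unstack accepts a fill_value argument that is used for the missing combinations. A minimal sketch; the choice of 0 as the fill value is an assumption made for illustration only.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
s = pd.concat([s1, s2], keys=['one', 'two'])

# Default: combinations that do not exist in s become NaN.
with_nan = s.unstack()
missing_count = int(with_nan.isna().sum().sum())

# fill_value replaces those missing combinations with the given value.
filled = s.unstack(fill_value=0)
```

Whether a fill value is appropriate depends on the data, because afterwards a filled 0 can no longer be told apart from a real 0.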

stack drops missing values by default (in the pandas version used here)

df.stack()

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

The dropna argument controls whether missing values are dropped

df.stack(dropna=False)

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64