Pandas - 数据重塑 Pt.2

这篇随笔主要介绍如何利用 Pandas 进行层次化索引、合并数据和重塑数据
Pt.1 部分主要介绍利用 Pandas 进行层次化索引
Pt.2 部分详细介绍利用 Pandas 合并数据
Pt.3 部分主要介绍利用 Pandas 合并数据和重塑数据
Pt.4 部分主要介绍利用 Pandas 重塑数据

1
2
import pandas as pd
import numpy as np

合并数据

pd.merge( left, right, on, how, suffixes )

on : 根据指定列进行合并, 默认为重叠列
how : 连接效果, 默认为 ‘inner’, 其他方式还有 ‘outer’, ‘left’, ‘right’
suffixes : 控制重复列的列名

1
2
3
4
5
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 
'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
'data2': range(3)})
df1, df2
1
2
3
4
5
6
7
8
9
10
11
12
(  key  data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6,
key data2
0 a 0
1 b 1
2 d 2)

on : 控制连接的列

1
2
pd.merge(df1, df2)
# 并没有指明要用哪个列进行连接, 如果没有指定, merge 就会将重叠列的列名当做键
keydata1data2
0b01
1b11
2b61
3a20
4a40
5a50
1
pd.merge(df1, df2, on = 'key')
keydata1data2
0b01
1b11
2b61
3a20
4a40
5a50

根据多个列进行合并

1
2
3
4
5
6
7
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
'key2': ['one', 'two', 'one'],
'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
'key2': ['one', 'one', 'one', 'two'],
'rval': [4, 5, 6, 7]})
left, right
1
2
3
4
5
6
7
8
9
(  key1 key2  lval
0 foo one 1
1 foo two 2
2 bar one 3,
key1 key2 rval
0 foo one 4
1 foo one 5
2 bar one 6
3 bar two 7)
1
pd.merge(left, right, on=['key1', 'key2'], how='outer')
key1key2lvalrval
0fooone1.04.0
1fooone1.05.0
2footwo2.0NaN
3barone3.06.0
4bartwoNaN7.0

suffixes : 控制合并后重复列的命名

1
pd.merge(left, right, on='key1')
key1key2_xlvalkey2_yrval
0fooone1one4
1fooone1one5
2footwo2one4
3footwo2one5
4barone3one6
5barone3two7
1
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
key1key2_leftlvalkey2_rightrval
0fooone1one4
1fooone1one5
2footwo2one4
3footwo2one5
4barone3one6
5barone3two7

pd.merge( left, right, left_on, right_on, how, suffixes ) : 分别根据左右指定列进行合并

1
2
3
4
5
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
'data2': range(3)})
df3, df4
1
2
3
4
5
6
7
8
9
10
11
12
(  lkey  data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6,
rkey data2
0 a 0
1 b 1
2 d 2)
1
pd.merge(df3, df4, left_on='lkey', right_on='rkey')
lkeydata1rkeydata2
0b0b1
1b1b1
2b6b1
3a2a0
4a4a0
5a5a0
1
2
pd.merge(df3, df4, left_on='lkey', right_on='rkey', how = 'outer')
### 出现了 c 和 d
lkeydata1rkeydata2
0b0.0b1.0
1b1.0b1.0
2b6.0b1.0
3a2.0a0.0
4a4.0a0.0
5a5.0a0.0
6c3.0NaNNaN
7NaNNaNd2.0
1
2
pd.merge(df3, df4, left_on='lkey', right_on='rkey', how = 'left')
# 只有 c
lkeydata1rkeydata2
0b0b1.0
1b1b1.0
2a2a0.0
3c3NaNNaN
4a4a0.0
5a5a0.0
6b6b1.0

pd.merge( left, right, left_index, right_index, how, suffixes ) : 分别根据左右索引进行合并

1
2
3
4
left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 
'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
left1, right1
1
2
3
4
5
6
7
8
9
10
(  key  value
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5,
group_val
a 3.5
b 7.0)
1
pd.merge(left1, right1, left_index=True, right_index=True, how='outer')
keyvaluegroup_val
0a0.0NaN
1b1.0NaN
2a2.0NaN
3a3.0NaN
4b4.0NaN
5c5.0NaN
aNaNNaN3.5
bNaNNaN7.0
1
pd.merge(left1, right1, left_on='key', right_index=True, how='outer')
keyvaluegroup_val
0a03.5
2a23.5
3a33.5
1b17.0
4b47.0
5c5NaN

层次化索引数据的合并 : 索引的合并默认是多键合并

1
2
3
4
5
6
7
8
9
lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio',
'Nevada', 'Nevada'],
'key2': [2000, 2001, 2002, 2001, 2002],
'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
index=[['Nevada', 'Nevada', 'Ohio', 'Ohio','Ohio', 'Ohio'],
[2001, 2000, 2000, 2000, 2001, 2002]],
columns=['event1', 'event2'])
lefth, righth
1
2
3
4
5
6
7
8
9
10
11
12
13
(     key1  key2  data
0 Ohio 2000 0.0
1 Ohio 2001 1.0
2 Ohio 2002 2.0
3 Nevada 2001 3.0
4 Nevada 2002 4.0,
event1 event2
Nevada 2001 0 1
2000 2 3
Ohio 2000 4 5
2000 6 7
2001 8 9
2002 10 11)
1
2
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True, how='outer')
# righth 索引的合并默认是多键合并, 所以 lefth 必须以列表的形式指明多个列
key1key2dataevent1event2
0Ohio20000.04.05.0
0Ohio20000.06.07.0
1Ohio20011.08.09.0
2Ohio20022.010.011.0
3Nevada20013.00.01.0
4Nevada20024.0NaNNaN
4Nevada2000NaN2.03.0

left.join( right, how, on ) : 更方便地实现按索引合并, 但要求没有重叠的列

on : 合并时, left 按 on 指定的列合并, right 按索引进行合并
right :

1
2
3
4
5
6
7
left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
index=['a', 'c', 'e'],
columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
index=['b', 'c', 'd', 'e'],
columns=['Missouri', 'Alabama'])
left2, right2
1
2
3
4
5
6
7
8
9
(   Ohio  Nevada
a 1.0 2.0
c 3.0 4.0
e 5.0 6.0,
Missouri Alabama
b 7.0 8.0
c 9.0 10.0
d 11.0 12.0
e 13.0 14.0)
1
left2.join(right2)
OhioNevadaMissouriAlabama
a1.02.0NaNNaN
c3.04.09.010.0
e5.06.013.014.0
1
left1, right1
1
2
3
4
5
6
7
8
9
10
(  key  value
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5,
group_val
a 3.5
b 7.0)
1
left1.join(right1, on='key')
keyvaluegroup_val
0a03.5
1b17.0
2a23.5
3a33.5
4b47.0
5c5NaN

np.concatenate( arrs, axis ) : NumPy 的轴向连接

1
2
arr = np.arange(12).reshape((3, 4))
arr
1
2
3
array([[ 0,  1,  2,  3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
1
np.concatenate([arr, arr], axis=1)
1
2
3
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
[ 4, 5, 6, 7, 4, 5, 6, 7],
[ 8, 9, 10, 11, 8, 9, 10, 11]])