Pandas - 数据重塑 Pt.2
这篇随笔主要介绍如何利用 Pandas 进行层次化索引、合并数据和重塑数据
Pt.1 部分主要介绍利用 Pandas 进行层次化索引
Pt.2 部分详细介绍利用 Pandas 合并数据
Pt.3 部分主要介绍利用 Pandas 合并数据和重塑数据
Pt.4 部分主要介绍利用 Pandas 重塑数据
1 | import pandas as pd |
合并数据
pd.merge( left, right, on, how, suffixes )
on : 根据指定列进行合并, 默认为重叠列
how : 连接效果, 默认为 ‘inner’, 其他方式还有 ‘outer’, ‘left’, ‘right’
suffixes : 控制重复列的列名
1 | df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], |
1 | ( key data1 |
on : 控制连接的列
1 | pd.merge(df1, df2) |
key | data1 | data2 | |
---|---|---|---|
0 | b | 0 | 1 |
1 | b | 1 | 1 |
2 | b | 6 | 1 |
3 | a | 2 | 0 |
4 | a | 4 | 0 |
5 | a | 5 | 0 |
1 | pd.merge(df1, df2, on = 'key') |
key | data1 | data2 | |
---|---|---|---|
0 | b | 0 | 1 |
1 | b | 1 | 1 |
2 | b | 6 | 1 |
3 | a | 2 | 0 |
4 | a | 4 | 0 |
5 | a | 5 | 0 |
根据多个列进行合并
1 | left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'], |
1 | ( key1 key2 lval |
1 | pd.merge(left, right, on=['key1', 'key2'], how='outer') |
key1 | key2 | lval | rval | |
---|---|---|---|---|
0 | foo | one | 1.0 | 4.0 |
1 | foo | one | 1.0 | 5.0 |
2 | foo | two | 2.0 | NaN |
3 | bar | one | 3.0 | 6.0 |
4 | bar | two | NaN | 7.0 |
suffixes : 控制合并后重复列的命名
1 | pd.merge(left, right, on='key1') |
key1 | key2_x | lval | key2_y | rval | |
---|---|---|---|---|---|
0 | foo | one | 1 | one | 4 |
1 | foo | one | 1 | one | 5 |
2 | foo | two | 2 | one | 4 |
3 | foo | two | 2 | one | 5 |
4 | bar | one | 3 | one | 6 |
5 | bar | one | 3 | two | 7 |
1 | pd.merge(left, right, on='key1', suffixes=('_left', '_right')) |
key1 | key2_left | lval | key2_right | rval | |
---|---|---|---|---|---|
0 | foo | one | 1 | one | 4 |
1 | foo | one | 1 | one | 5 |
2 | foo | two | 2 | one | 4 |
3 | foo | two | 2 | one | 5 |
4 | bar | one | 3 | one | 6 |
5 | bar | one | 3 | two | 7 |
pd.merge( left, right, left_on, right_on, how, suffixes ) : 分别根据左右指定列进行合并
1 | df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], |
1 | ( lkey data1 |
1 | pd.merge(df3, df4, left_on='lkey', right_on='rkey') |
lkey | data1 | rkey | data2 | |
---|---|---|---|---|
0 | b | 0 | b | 1 |
1 | b | 1 | b | 1 |
2 | b | 6 | b | 1 |
3 | a | 2 | a | 0 |
4 | a | 4 | a | 0 |
5 | a | 5 | a | 0 |
1 | pd.merge(df3, df4, left_on='lkey', right_on='rkey', how = 'outer') |
lkey | data1 | rkey | data2 | |
---|---|---|---|---|
0 | b | 0.0 | b | 1.0 |
1 | b | 1.0 | b | 1.0 |
2 | b | 6.0 | b | 1.0 |
3 | a | 2.0 | a | 0.0 |
4 | a | 4.0 | a | 0.0 |
5 | a | 5.0 | a | 0.0 |
6 | c | 3.0 | NaN | NaN |
7 | NaN | NaN | d | 2.0 |
1 | pd.merge(df3, df4, left_on='lkey', right_on='rkey', how = 'left') |
lkey | data1 | rkey | data2 | |
---|---|---|---|---|
0 | b | 0 | b | 1.0 |
1 | b | 1 | b | 1.0 |
2 | a | 2 | a | 0.0 |
3 | c | 3 | NaN | NaN |
4 | a | 4 | a | 0.0 |
5 | a | 5 | a | 0.0 |
6 | b | 6 | b | 1.0 |
pd.merge( left, right, left_index, right_index, how, suffixes ) : 分别根据左右索引进行合并
1 | left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], |
1 | ( key value |
1 | pd.merge(left1, right1, left_index=True, right_index=True, how='outer') |
key | value | group_val | |
---|---|---|---|
0 | a | 0.0 | NaN |
1 | b | 1.0 | NaN |
2 | a | 2.0 | NaN |
3 | a | 3.0 | NaN |
4 | b | 4.0 | NaN |
5 | c | 5.0 | NaN |
a | NaN | NaN | 3.5 |
b | NaN | NaN | 7.0 |
1 | pd.merge(left1, right1, left_on='key', right_index=True, how='outer') |
key | value | group_val | |
---|---|---|---|
0 | a | 0 | 3.5 |
2 | a | 2 | 3.5 |
3 | a | 3 | 3.5 |
1 | b | 1 | 7.0 |
4 | b | 4 | 7.0 |
5 | c | 5 | NaN |
层次化索引数据的合并 : 索引的合并默认是多键合并
1 | lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', |
1 | ( key1 key2 data |
1 | pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True, how='outer') |
key1 | key2 | data | event1 | event2 | |
---|---|---|---|---|---|
0 | Ohio | 2000 | 0.0 | 4.0 | 5.0 |
0 | Ohio | 2000 | 0.0 | 6.0 | 7.0 |
1 | Ohio | 2001 | 1.0 | 8.0 | 9.0 |
2 | Ohio | 2002 | 2.0 | 10.0 | 11.0 |
3 | Nevada | 2001 | 3.0 | 0.0 | 1.0 |
4 | Nevada | 2002 | 4.0 | NaN | NaN |
4 | Nevada | 2000 | NaN | 2.0 | 3.0 |
left.join( right, how, on ) : 更方便地实现按索引合并, 但要求没有重叠的列
on : 合并时, left 按 on 指定的列合并, right 按索引进行合并
right :
1 | left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]], |
1 | ( Ohio Nevada |
1 | left2.join(right2) |
Ohio | Nevada | Missouri | Alabama | |
---|---|---|---|---|
a | 1.0 | 2.0 | NaN | NaN |
c | 3.0 | 4.0 | 9.0 | 10.0 |
e | 5.0 | 6.0 | 13.0 | 14.0 |
1 | left1, right1 |
1 | ( key value |
1 | left1.join(right1, on='key') |
key | value | group_val | |
---|---|---|---|
0 | a | 0 | 3.5 |
1 | b | 1 | 7.0 |
2 | a | 2 | 3.5 |
3 | a | 3 | 3.5 |
4 | b | 4 | 7.0 |
5 | c | 5 | NaN |
np.concatenate( arrs, axis ) : NumPy 的轴向连接
1 | arr = np.arange(12).reshape((3, 4)) |
1 | array([[ 0, 1, 2, 3], |
1 | np.concatenate([arr, arr], axis=1) |
1 | array([[ 0, 1, 2, 3, 0, 1, 2, 3], |