Pandas - 数据重塑 Pt.4

这篇随笔主要介绍如何利用 Pandas 进行层次化索引、合并数据和重塑数据
Pt.1 部分主要介绍利用 Pandas 进行层次化索引
Pt.2 部分详细介绍利用 Pandas 合并数据
Pt.3 部分主要介绍利用 Pandas 合并数据和重塑数据
Pt.4 部分主要介绍利用 Pandas 重塑数据

1
2
import pandas as pd
import numpy as np

数据重塑和轴向旋转

Pt.3 数据重塑和轴向旋转回顾

frame.stack( level ) : 将横轴旋转为竖轴, DataFrame → Series
series.unstack( level ) : 将竖轴旋转为横轴, Series → DataFrame

frame.pivot( index, columns, values ) : 将长格式旋转为宽格式

index 和 columns 分别用作行和列索引, values 用于填充的数据列(可选)

1
2
3
4
5
6
7
8
data = pd.read_csv('pydata-book-2nd-edition/examples/macrodata.csv')
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter,
name='date')
columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item')
data = data.reindex(columns=columns)
data.index = periods.to_timestamp('D', 'end')
ldata = data.stack().reset_index().rename(columns={0: 'value'})
ldata
dateitemvalue
01959-03-31 23:59:59.999999999realgdp2710.349
11959-03-31 23:59:59.999999999infl0.000
21959-03-31 23:59:59.999999999unemp5.800
31959-06-30 23:59:59.999999999realgdp2778.801
41959-06-30 23:59:59.999999999infl2.340
............
6042009-06-30 23:59:59.999999999infl3.370
6052009-06-30 23:59:59.999999999unemp9.200
6062009-09-30 23:59:59.999999999realgdp12990.341
6072009-09-30 23:59:59.999999999infl3.560
6082009-09-30 23:59:59.999999999unemp9.600

609 rows × 3 columns

1
2
ldata.pivot('date', 'item', 'value')
# 以 date 为索引(去除共同项), 将 item 进行拆分旋转, 将 value 填到对应的位置中
iteminflrealgdpunemp
date
1959-03-31 23:59:59.9999999990.002710.3495.8
1959-06-30 23:59:59.9999999992.342778.8015.1
1959-09-30 23:59:59.9999999992.742775.4885.3
1959-12-31 23:59:59.9999999990.272785.2045.6
1960-03-31 23:59:59.9999999992.312847.6995.2
............
2008-09-30 23:59:59.999999999-3.1613324.6006.0
2008-12-31 23:59:59.999999999-8.7913141.9206.9
2009-03-31 23:59:59.9999999990.9412925.4108.1
2009-06-30 23:59:59.9999999993.3712901.5049.2
2009-09-30 23:59:59.9999999993.5612990.3419.6

203 rows × 3 columns

假设有两个需要同时重塑的数据列

1
2
ldata['value2'] = np.random.randn(len(ldata))
ldata
dateitemvaluevalue2
01959-03-31 23:59:59.999999999realgdp2710.3490.176200
11959-03-31 23:59:59.999999999infl0.000-0.194493
21959-03-31 23:59:59.999999999unemp5.800-0.363329
31959-06-30 23:59:59.999999999realgdp2778.801-1.510863
41959-06-30 23:59:59.999999999infl2.3400.144886
...............
6042009-06-30 23:59:59.999999999infl3.370-0.936350
6052009-06-30 23:59:59.999999999unemp9.2001.452993
6062009-09-30 23:59:59.999999999realgdp12990.341-0.137653
6072009-09-30 23:59:59.999999999infl3.560-0.423567
6082009-09-30 23:59:59.999999999unemp9.6000.086320

609 rows × 4 columns

1
ldata.pivot('date', 'item')
valuevalue2
iteminflrealgdpunempinflrealgdpunemp
date
1959-03-31 23:59:59.9999999990.002710.3495.8-0.1944930.176200-0.363329
1959-06-30 23:59:59.9999999992.342778.8015.10.144886-1.5108630.619092
1959-09-30 23:59:59.9999999992.742775.4885.3-1.5198181.400742-0.669686
1959-12-31 23:59:59.9999999990.272785.2045.6-0.0270591.053187-1.069302
1960-03-31 23:59:59.9999999992.312847.6995.2-1.157763-1.174653-0.562152
.....................
2008-09-30 23:59:59.999999999-3.1613324.6006.00.632464-0.4470701.267958
2008-12-31 23:59:59.999999999-8.7913141.9206.9-1.030839-0.3594840.606553
2009-03-31 23:59:59.9999999990.9412925.4108.1-1.4476970.711061-0.873898
2009-06-30 23:59:59.9999999993.3712901.5049.2-0.9363500.3955911.452993
2009-09-30 23:59:59.9999999993.5612990.3419.6-0.423567-0.1376530.086320

203 rows × 6 columns

1
ldata.pivot('date', 'item', ['value', 'value2'])
valuevalue2
iteminflrealgdpunempinflrealgdpunemp
date
1959-03-31 23:59:59.9999999990.002710.3495.8-0.1944930.176200-0.363329
1959-06-30 23:59:59.9999999992.342778.8015.10.144886-1.5108630.619092
1959-09-30 23:59:59.9999999992.742775.4885.3-1.5198181.400742-0.669686
1959-12-31 23:59:59.9999999990.272785.2045.6-0.0270591.053187-1.069302
1960-03-31 23:59:59.9999999992.312847.6995.2-1.157763-1.174653-0.562152
.....................
2008-09-30 23:59:59.999999999-3.1613324.6006.00.632464-0.4470701.267958
2008-12-31 23:59:59.999999999-8.7913141.9206.9-1.030839-0.3594840.606553
2009-03-31 23:59:59.9999999990.9412925.4108.1-1.4476970.711061-0.873898
2009-06-30 23:59:59.9999999993.3712901.5049.2-0.9363500.3955911.452993
2009-09-30 23:59:59.9999999993.5612990.3419.6-0.423567-0.1376530.086320

203 rows × 6 columns

pivot 其实就是用 set_index 创建层次化索引, 再用 unstack 重塑

1
ldata.set_index(['date', 'item']).unstack('item')
valuevalue2
iteminflrealgdpunempinflrealgdpunemp
date
1959-03-31 23:59:59.9999999990.002710.3495.8-0.1944930.176200-0.363329
1959-06-30 23:59:59.9999999992.342778.8015.10.144886-1.5108630.619092
1959-09-30 23:59:59.9999999992.742775.4885.3-1.5198181.400742-0.669686
1959-12-31 23:59:59.9999999990.272785.2045.6-0.0270591.053187-1.069302
1960-03-31 23:59:59.9999999992.312847.6995.2-1.157763-1.174653-0.562152
.....................
2008-09-30 23:59:59.999999999-3.1613324.6006.00.632464-0.4470701.267958
2008-12-31 23:59:59.999999999-8.7913141.9206.9-1.030839-0.3594840.606553
2009-03-31 23:59:59.9999999990.9412925.4108.1-1.4476970.711061-0.873898
2009-06-30 23:59:59.9999999993.3712901.5049.2-0.9363500.3955911.452993
2009-09-30 23:59:59.9999999993.5612990.3419.6-0.423567-0.1376530.086320

203 rows × 6 columns

pd.melt( frame, id_vars, value_vars ) : 将宽格式旋转为长格式

id_vars 为分组指标, value_vars 为合并到一起的列

1
2
3
4
5
df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9]})
df
keyABC
0foo147
1bar258
2baz369
1
pd.melt(df, ['key'], ['A', 'B'])
keyvariablevalue
0fooA1
1barA2
2bazA3
3fooB4
4barB5
5bazB6
1
pd.melt(df, ['A', 'B'], ['C', 'key'])
ABvariablevalue
014C7
125C8
236C9
314keyfoo
425keybar
536keybaz