这篇随笔主要介绍如何利用 Pandas 进行层次化索引、合并数据和重塑数据
Pt.1 部分主要介绍利用 Pandas 进行层次化索引
Pt.2 部分详细介绍利用 Pandas 合并数据
Pt.3 部分主要介绍利用 Pandas 合并数据和重塑数据
Pt.4 部分主要介绍利用 Pandas 重塑数据
1 2
| import pandas as pd import numpy as np
|
数据重塑和轴向旋转
frame.stack( level ) : 将横轴旋转为竖轴, DataFrame → Series
series.unstack( level ) : 将竖轴旋转为横轴, Series → DataFrame
frame.pivot( index, columns, values ) : 将长格式旋转为宽格式
index 和 columns 分别用作行和列索引, values 用于填充的数据列(可选)
1 2 3 4 5 6 7 8
| data = pd.read_csv('pydata-book-2nd-edition/examples/macrodata.csv') periods = pd.PeriodIndex(year=data.year, quarter=data.quarter, name='date') columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item') data = data.reindex(columns=columns) data.index = periods.to_timestamp('D', 'end') ldata = data.stack().reset_index().rename(columns={0: 'value'}) ldata
|
| date | item | value |
---|
0 | 1959-03-31 23:59:59.999999999 | realgdp | 2710.349 |
---|
1 | 1959-03-31 23:59:59.999999999 | infl | 0.000 |
---|
2 | 1959-03-31 23:59:59.999999999 | unemp | 5.800 |
---|
3 | 1959-06-30 23:59:59.999999999 | realgdp | 2778.801 |
---|
4 | 1959-06-30 23:59:59.999999999 | infl | 2.340 |
---|
... | ... | ... | ... |
---|
604 | 2009-06-30 23:59:59.999999999 | infl | 3.370 |
---|
605 | 2009-06-30 23:59:59.999999999 | unemp | 9.200 |
---|
606 | 2009-09-30 23:59:59.999999999 | realgdp | 12990.341 |
---|
607 | 2009-09-30 23:59:59.999999999 | infl | 3.560 |
---|
608 | 2009-09-30 23:59:59.999999999 | unemp | 9.600 |
---|
609 rows × 3 columns
1 2
| ldata.pivot('date', 'item', 'value')
|
item | infl | realgdp | unemp |
---|
date | | | |
---|
1959-03-31 23:59:59.999999999 | 0.00 | 2710.349 | 5.8 |
---|
1959-06-30 23:59:59.999999999 | 2.34 | 2778.801 | 5.1 |
---|
1959-09-30 23:59:59.999999999 | 2.74 | 2775.488 | 5.3 |
---|
1959-12-31 23:59:59.999999999 | 0.27 | 2785.204 | 5.6 |
---|
1960-03-31 23:59:59.999999999 | 2.31 | 2847.699 | 5.2 |
---|
... | ... | ... | ... |
---|
2008-09-30 23:59:59.999999999 | -3.16 | 13324.600 | 6.0 |
---|
2008-12-31 23:59:59.999999999 | -8.79 | 13141.920 | 6.9 |
---|
2009-03-31 23:59:59.999999999 | 0.94 | 12925.410 | 8.1 |
---|
2009-06-30 23:59:59.999999999 | 3.37 | 12901.504 | 9.2 |
---|
2009-09-30 23:59:59.999999999 | 3.56 | 12990.341 | 9.6 |
---|
203 rows × 3 columns
假设有两个需要同时重塑的数据列
1 2
| ldata['value2'] = np.random.randn(len(ldata)) ldata
|
| date | item | value | value2 |
---|
0 | 1959-03-31 23:59:59.999999999 | realgdp | 2710.349 | 0.176200 |
---|
1 | 1959-03-31 23:59:59.999999999 | infl | 0.000 | -0.194493 |
---|
2 | 1959-03-31 23:59:59.999999999 | unemp | 5.800 | -0.363329 |
---|
3 | 1959-06-30 23:59:59.999999999 | realgdp | 2778.801 | -1.510863 |
---|
4 | 1959-06-30 23:59:59.999999999 | infl | 2.340 | 0.144886 |
---|
... | ... | ... | ... | ... |
---|
604 | 2009-06-30 23:59:59.999999999 | infl | 3.370 | -0.936350 |
---|
605 | 2009-06-30 23:59:59.999999999 | unemp | 9.200 | 1.452993 |
---|
606 | 2009-09-30 23:59:59.999999999 | realgdp | 12990.341 | -0.137653 |
---|
607 | 2009-09-30 23:59:59.999999999 | infl | 3.560 | -0.423567 |
---|
608 | 2009-09-30 23:59:59.999999999 | unemp | 9.600 | 0.086320 |
---|
609 rows × 4 columns
1
| ldata.pivot('date', 'item')
|
| value | value2 |
---|
item | infl | realgdp | unemp | infl | realgdp | unemp |
---|
date | | | | | | |
---|
1959-03-31 23:59:59.999999999 | 0.00 | 2710.349 | 5.8 | -0.194493 | 0.176200 | -0.363329 |
---|
1959-06-30 23:59:59.999999999 | 2.34 | 2778.801 | 5.1 | 0.144886 | -1.510863 | 0.619092 |
---|
1959-09-30 23:59:59.999999999 | 2.74 | 2775.488 | 5.3 | -1.519818 | 1.400742 | -0.669686 |
---|
1959-12-31 23:59:59.999999999 | 0.27 | 2785.204 | 5.6 | -0.027059 | 1.053187 | -1.069302 |
---|
1960-03-31 23:59:59.999999999 | 2.31 | 2847.699 | 5.2 | -1.157763 | -1.174653 | -0.562152 |
---|
... | ... | ... | ... | ... | ... | ... |
---|
2008-09-30 23:59:59.999999999 | -3.16 | 13324.600 | 6.0 | 0.632464 | -0.447070 | 1.267958 |
---|
2008-12-31 23:59:59.999999999 | -8.79 | 13141.920 | 6.9 | -1.030839 | -0.359484 | 0.606553 |
---|
2009-03-31 23:59:59.999999999 | 0.94 | 12925.410 | 8.1 | -1.447697 | 0.711061 | -0.873898 |
---|
2009-06-30 23:59:59.999999999 | 3.37 | 12901.504 | 9.2 | -0.936350 | 0.395591 | 1.452993 |
---|
2009-09-30 23:59:59.999999999 | 3.56 | 12990.341 | 9.6 | -0.423567 | -0.137653 | 0.086320 |
---|
203 rows × 6 columns
1
| ldata.pivot('date', 'item', ['value', 'value2'])
|
| value | value2 |
---|
item | infl | realgdp | unemp | infl | realgdp | unemp |
---|
date | | | | | | |
---|
1959-03-31 23:59:59.999999999 | 0.00 | 2710.349 | 5.8 | -0.194493 | 0.176200 | -0.363329 |
---|
1959-06-30 23:59:59.999999999 | 2.34 | 2778.801 | 5.1 | 0.144886 | -1.510863 | 0.619092 |
---|
1959-09-30 23:59:59.999999999 | 2.74 | 2775.488 | 5.3 | -1.519818 | 1.400742 | -0.669686 |
---|
1959-12-31 23:59:59.999999999 | 0.27 | 2785.204 | 5.6 | -0.027059 | 1.053187 | -1.069302 |
---|
1960-03-31 23:59:59.999999999 | 2.31 | 2847.699 | 5.2 | -1.157763 | -1.174653 | -0.562152 |
---|
... | ... | ... | ... | ... | ... | ... |
---|
2008-09-30 23:59:59.999999999 | -3.16 | 13324.600 | 6.0 | 0.632464 | -0.447070 | 1.267958 |
---|
2008-12-31 23:59:59.999999999 | -8.79 | 13141.920 | 6.9 | -1.030839 | -0.359484 | 0.606553 |
---|
2009-03-31 23:59:59.999999999 | 0.94 | 12925.410 | 8.1 | -1.447697 | 0.711061 | -0.873898 |
---|
2009-06-30 23:59:59.999999999 | 3.37 | 12901.504 | 9.2 | -0.936350 | 0.395591 | 1.452993 |
---|
2009-09-30 23:59:59.999999999 | 3.56 | 12990.341 | 9.6 | -0.423567 | -0.137653 | 0.086320 |
---|
203 rows × 6 columns
pivot 其实就是用 set_index 创建层次化索引, 再用 unstack 重塑
1
| ldata.set_index(['date', 'item']).unstack('item')
|
| value | value2 |
---|
item | infl | realgdp | unemp | infl | realgdp | unemp |
---|
date | | | | | | |
---|
1959-03-31 23:59:59.999999999 | 0.00 | 2710.349 | 5.8 | -0.194493 | 0.176200 | -0.363329 |
---|
1959-06-30 23:59:59.999999999 | 2.34 | 2778.801 | 5.1 | 0.144886 | -1.510863 | 0.619092 |
---|
1959-09-30 23:59:59.999999999 | 2.74 | 2775.488 | 5.3 | -1.519818 | 1.400742 | -0.669686 |
---|
1959-12-31 23:59:59.999999999 | 0.27 | 2785.204 | 5.6 | -0.027059 | 1.053187 | -1.069302 |
---|
1960-03-31 23:59:59.999999999 | 2.31 | 2847.699 | 5.2 | -1.157763 | -1.174653 | -0.562152 |
---|
... | ... | ... | ... | ... | ... | ... |
---|
2008-09-30 23:59:59.999999999 | -3.16 | 13324.600 | 6.0 | 0.632464 | -0.447070 | 1.267958 |
---|
2008-12-31 23:59:59.999999999 | -8.79 | 13141.920 | 6.9 | -1.030839 | -0.359484 | 0.606553 |
---|
2009-03-31 23:59:59.999999999 | 0.94 | 12925.410 | 8.1 | -1.447697 | 0.711061 | -0.873898 |
---|
2009-06-30 23:59:59.999999999 | 3.37 | 12901.504 | 9.2 | -0.936350 | 0.395591 | 1.452993 |
---|
2009-09-30 23:59:59.999999999 | 3.56 | 12990.341 | 9.6 | -0.423567 | -0.137653 | 0.086320 |
---|
203 rows × 6 columns
pd.melt( frame, id_vars, value_vars ) : 将宽格式旋转为长格式
id_vars 为分组指标, value_vars 为合并到一起的列
1 2 3 4 5
| df = pd.DataFrame({'key': ['foo', 'bar', 'baz'], 'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}) df
|
| key | A | B | C |
---|
0 | foo | 1 | 4 | 7 |
---|
1 | bar | 2 | 5 | 8 |
---|
2 | baz | 3 | 6 | 9 |
---|
1
| pd.melt(df, ['key'], ['A', 'B'])
|
| key | variable | value |
---|
0 | foo | A | 1 |
---|
1 | bar | A | 2 |
---|
2 | baz | A | 3 |
---|
3 | foo | B | 4 |
---|
4 | bar | B | 5 |
---|
5 | baz | B | 6 |
---|
1
| pd.melt(df, ['A', 'B'], ['C', 'key'])
|
| A | B | variable | value |
---|
0 | 1 | 4 | C | 7 |
---|
1 | 2 | 5 | C | 8 |
---|
2 | 3 | 6 | C | 9 |
---|
3 | 1 | 4 | key | foo |
---|
4 | 2 | 5 | key | bar |
---|
5 | 3 | 6 | key | baz |
---|