Pandas - Data Reshaping Pt.1

This post covers hierarchical indexing, merging data, and reshaping data with Pandas.
Pt.1 covers hierarchical indexing.
Pt.2 covers merging data in detail.
Pt.3 covers merging and reshaping data.
Pt.4 covers reshaping data.

import pandas as pd
import numpy as np

Hierarchical indexing

Hierarchical indexing on a Series

data = pd.Series(np.random.randn(6),
                 index=[['a', 'a', 'b', 'b', 'c', 'c'],
                        [1, 2, 1, 3, 2, 3]])
data
a  1    0.971509
   2    1.042921
b  1    0.316566
   3    1.730938
c  2   -0.457496
   3    1.183176
dtype: float64
data.index
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 3),
            ('c', 2),
            ('c', 3)],
           )
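The index printed above can also be inspected level by level. The following is a minimal sketch, assuming fixed values in place of the random ones so the results are reproducible; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Same index layout as above; fixed values replace the random ones.
data = pd.Series(np.arange(6.0),
                 index=[['a', 'a', 'b', 'b', 'c', 'c'],
                        [1, 2, 1, 3, 2, 3]])

idx = data.index
n_levels = idx.nlevels                          # number of index levels
outer_labels = list(idx.levels[0])              # unique labels of the outer level
inner_labels = list(idx.levels[1])              # unique labels of the inner level
outer_per_row = list(idx.get_level_values(0))   # outer label of every row
```

`levels` holds the unique labels of each level, while `get_level_values` returns one label per row.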

Slicing

data['b']
1    0.316566
3    1.730938
dtype: float64
data['b':'c']
b  1    0.316566
   3    1.730938
c  2   -0.457496
   3    1.183176
dtype: float64
data.loc[['b', 'a']]
b  1    0.316566
   3    1.730938
a  1    0.971509
   2    1.042921
dtype: float64
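Label slices such as 'b':'c' work in the examples above because the index is already in sorted order. When it is not, sorting first is the safe route. Here is a small sketch with a hypothetical unsorted series, not taken from the data above:

```python
import pandas as pd

# Hypothetical series whose outer labels are deliberately out of order.
unsorted = pd.Series([10, 20, 30, 40],
                     index=[['c', 'a', 'b', 'a'],
                            [1, 1, 1, 2]])

ordered = unsorted.sort_index()   # sort by outer level, then inner level
subset = ordered.loc['b':'c']     # label slice, both endpoints included
```

After sorting, the slice returns every row whose outer label falls between 'b' and 'c', inclusive.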

Slicing on the inner level

data.loc[:, 2]
a    1.042921
c   -0.457496
dtype: float64
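The same inner-level selection can be written with `Series.xs`, which names the level explicitly. This is a sketch with fixed values in place of the random ones:

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(6.0),
                 index=[['a', 'a', 'b', 'b', 'c', 'c'],
                        [1, 2, 1, 3, 2, 3]])

by_loc = data.loc[:, 2]        # all outer labels, inner label 2
by_xs = data.xs(2, level=1)    # cross-section on level 1; that level is dropped
```

Both forms keep only the outer labels that have an inner label of 2; `xs` can be easier to read when an index has more than two levels.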

series.unstack( ), frame.stack( ) : reshaping data, Series → DataFrame and DataFrame → Series

frame = data.unstack()
frame
          1         2         3
a  0.971509  1.042921       NaN
b  0.316566       NaN  1.730938
c       NaN -0.457496  1.183176
frame.stack()
a  1    0.971509
   2    1.042921
b  1    0.316566
   3    1.730938
c  2   -0.457496
   3    1.183176
dtype: float64
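The round trip can be checked with fixed values. A minimal sketch follows; the explicit `dropna()` is an addition here, made on the assumption that whether `stack()` drops the NaN cells on its own depends on the pandas version.

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(6.0),
                 index=[['a', 'a', 'b', 'b', 'c', 'c'],
                        [1, 2, 1, 3, 2, 3]])

wide = data.unstack()               # inner level becomes columns; missing pairs are NaN
restored = wide.stack().dropna()    # back to a Series, without the NaN cells
```

`wide` has one row per outer label and one column per inner label, and `restored` carries the same six (outer, inner) pairs as `data`.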

Hierarchical indexing on a DataFrame

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'],
                            [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11

The names attribute

frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
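Level names can also be supplied when the indexes are built, rather than assigned afterwards. This sketch builds the same frame with `pd.MultiIndex.from_arrays`; building the indexes separately is an alternative shown for illustration, not the approach used in the post.

```python
import numpy as np
import pandas as pd

# Build both MultiIndex objects with their level names up front.
index = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'],
                                   [1, 2, 1, 2]],
                                  names=['key1', 'key2'])
columns = pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                     ['Green', 'Red', 'Green']],
                                    names=['state', 'color'])

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=index, columns=columns)
```

Named levels matter later on, because methods such as `swaplevel`, `sort_index`, and `groupby` accept a level name in place of a position.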

Selecting columns

frame['Ohio'], frame['Ohio']['Green']
(color      Green  Red
 key1 key2
 a    1         0    1
      2         3    4
 b    1         6    7
      2         9   10,
 key1 key2
 a    1       0
      2       3
 b    1       6
      2       9
 Name: Green, dtype: int32)
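A single column under hierarchical columns can also be reached in one step by passing a tuple, instead of chaining two lookups. This sketch assumes the same frame as above, rebuilt here so the block runs on its own.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'],
                            [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

chained = frame['Ohio']['Green']             # two lookups, as in the post
direct = frame[('Ohio', 'Green')]            # one lookup with a (state, color) tuple
via_loc = frame.loc[:, ('Ohio', 'Green')]    # the same selection through .loc
```

All three return the same values. The tuple forms are usually the better choice when assigning, since chained indexing may operate on a temporary copy.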

frame.set_index( keys, drop ), frame.reset_index( level ) : moving DataFrame columns into the index (and back)

frame2 = frame.reset_index()
frame2
state key1 key2  Ohio     Colorado
color           Green Red    Green
0        a    1     0   1        2
1        a    2     3   4        5
2        b    1     6   7        8
3        b    2     9  10       11
frame2.set_index(['key1', 'key2'])
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
frame2.set_index(['key1', 'key2'], drop=False)
state     key1 key2  Ohio     Colorado
color               Green Red    Green
key1 key2
a    1       a    1     0   1        2
     2       a    2     3   4        5
b    1       b    1     6   7        8
     2       b    2     9  10       11
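The effect of `drop` is easier to see on a frame with ordinary, single-level columns. The frame `flat` below is a hypothetical example, not data from the post.

```python
import pandas as pd

# Hypothetical flat frame: two key columns and one value column.
flat = pd.DataFrame({'key1': ['a', 'a', 'b', 'b'],
                     'key2': [1, 2, 1, 2],
                     'value': [0, 3, 6, 9]})

indexed = flat.set_index(['key1', 'key2'])             # keys move into the index
kept = flat.set_index(['key1', 'key2'], drop=False)    # keys stay as columns too
back = indexed.reset_index()                           # index levels return as columns
```

With the default `drop=True` the key columns leave the frame; with `drop=False` they appear both in the index and as columns. `reset_index` reverses `set_index`.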

frame.swaplevel( level1, level2 ), frame.sort_index( level ) : reordering levels and sorting by level

frame
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
frame.sort_index(level=1)
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11
frame.swaplevel('key1', 'key2')
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11
frame.swaplevel('key1', 'key2').sort_index(level='key2')
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
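`swaplevel` only exchanges the levels and leaves the row order alone, which is why a `sort_index` usually follows it. This sketch checks that on the same frame, rebuilt here so the block runs on its own.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'],
                            [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

swapped = frame.swaplevel('key1', 'key2')            # levels exchanged, rows not moved
sorted_swapped = swapped.sort_index(level='key2')    # rows reordered by key2, then key1

was_sorted = swapped.index.is_monotonic_increasing
is_sorted = sorted_swapped.index.is_monotonic_increasing
```

Before sorting, the swapped index is not in sorted order, so `was_sorted` is False; afterwards `is_sorted` is True.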

frame.groupby( axis, level ).sum( ) : summary statistics by level

frame.groupby(level='key2').sum()
state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16
frame.groupby(axis=1, level='color').sum()
color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10
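As far as I know, the `axis=1` argument of `groupby` is deprecated in recent pandas releases (2.1 and later). Assuming that is the case, the column-wise sum can be obtained by transposing, grouping on the rows, and transposing back. This is a sketch of that workaround, not the form used in the post.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'],
                            [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Row-wise grouping: sum rows that share the same key2 label.
by_key2 = frame.groupby(level='key2').sum()

# Column-wise grouping without axis=1: transpose, group rows by color, transpose back.
by_color = frame.T.groupby(level='color').sum().T
```

`by_color` has one column per color, and its values match the table above.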