Pandas - Syntax Basics Pt.2

Posted on 2022-07-31, updated on 2024-03-30, filed under pandas

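These notes cover the two main pandas data types, Series and DataFrame.
Pt.1 covers the basic operations on both types: creation, indexing, and modification.
Pt.2 covers pandas index operations in detail.
Pt.3 covers computation and functions in pandas.
Pt.4 covers the statistical functions in pandas in detail.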


import numpy as np
import pandas as pd
from pandas import Series, DataFrame

The pandas index object: Index

The Index object

Index([ ... ], dtype='object')


obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index, obj.index

(Index(['a', 'b', 'c'], dtype='object'), Index(['a', 'b', 'c'], dtype='object'))

index[1], index[1:]

('b', Index(['b', 'c'], dtype='object'))

An Index object is immutable, so slice assignment is not allowed

# index[1:] = pd.Index(['d', 'e'])  # raises TypeError
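To make the immutability rule concrete, here is a minimal sketch. The names `mutated` and `new_index` are illustrative only; the usual way to "change" labels is to build a new Index rather than edit the old one.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

# Item or slice assignment on an Index raises TypeError.
try:
    index[1:] = pd.Index(['d', 'e'])
    mutated = True
except TypeError:
    mutated = False

# Build a new Index from the pieces instead; the original is untouched.
new_index = index[:1].append(pd.Index(['d', 'e']))
print(mutated, list(index), list(new_index))
```

Because an Index cannot change underneath them, several Series and DataFrame objects can safely share one.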

pd.Index( object )


labels = pd.Index(np.arange(3))
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2.index is labels

True

frame.index, frame.columns

print(frame3)
frame3.index, frame3.columns

state  Nevada  Ohio
year               
2001      2.4   1.7
2002      2.9   3.6
2000      NaN   1.5

(Int64Index([2001, 2002, 2000], dtype='int64', name='year'),
 Index(['Nevada', 'Ohio'], dtype='object', name='state'))
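`frame3` is not defined in this part. The sketch below shows one way a frame of this shape could be built, assuming a nested dict of values; the name `pop` and the whole construction are assumptions, not taken from the text. The row order of the result can differ between pandas versions.

```python
import pandas as pd

# Hypothetical source data: outer keys become columns, inner keys become rows.
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = pd.DataFrame(pop)
frame3.index.name = 'year'      # name shown above the row labels
frame3.columns.name = 'state'   # name shown above the column labels

print(frame3)
print(frame3.index.name, frame3.columns.name)
```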

index1.append( index2 ): concatenates index1 and index2 into a new Index [index1, index2]

index1 = pd.Index(list(range(2, 5)))
index2 = pd.Index(list(range(0, 3)))
index3 = index1.append(index2)
index1, index2, index3


(Int64Index([2, 3, 4], dtype='int64'),
 Int64Index([0, 1, 2], dtype='int64'),
 Int64Index([2, 3, 4, 0, 1, 2], dtype='int64'))

index1.difference( index2 ): computes the set difference index1 - index2 and returns it as a new Index

index1 = pd.Index(list(range(2, 5)))
index2 = pd.Index(list(range(0, 3)))
index3 = index1.difference(index2)  # index1-index2
index1, index2, index3


(Int64Index([2, 3, 4], dtype='int64'),
 Int64Index([0, 1, 2], dtype='int64'),
 Int64Index([3, 4], dtype='int64'))

index1.intersection( index2 ): computes the intersection

index1 = pd.Index(list(range(2, 5)))
index2 = pd.Index(list(range(0, 3)))
index3 = index1.intersection(index2)  
index1, index2, index3


(Int64Index([2, 3, 4], dtype='int64'),
 Int64Index([0, 1, 2], dtype='int64'),
 Int64Index([2], dtype='int64'))

index1.union( index2 ): computes the union

index1 = pd.Index(list(range(2, 5)))
index2 = pd.Index(list(range(0, 3)))
index3 = index1.union(index2)  
index1, index2, index3


(Int64Index([2, 3, 4], dtype='int64'),
 Int64Index([0, 1, 2], dtype='int64'),
 Int64Index([0, 1, 2, 3, 4], dtype='int64'))
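append and union look similar but behave differently: the first concatenates, the second is a set operation. A small sketch of the contrast, reusing the values from the examples above (the names `appended` and `united` are illustrative):

```python
import pandas as pd

index1 = pd.Index([2, 3, 4])
index2 = pd.Index([0, 1, 2])

appended = index1.append(index2)  # concatenation: order and duplicates are kept
united = index1.union(index2)     # set union: duplicates removed, sorted here

print(list(appended), appended.is_unique)
print(list(united), united.is_unique)
```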

index.isin( object ): checks, element by element, whether each value of index is contained in object

index = pd.Index(list(range(2, 5)))
obj = list(range(4))
index, obj, index.isin(obj)


(Int64Index([2, 3, 4], dtype='int64'),
 [0, 1, 2, 3],
 array([ True,  True, False]))
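The boolean array returned by isin is typically used as a mask. A minimal sketch, with illustrative names `mask`, `kept` and `dropped`:

```python
import pandas as pd

index = pd.Index([2, 3, 4])
mask = index.isin([0, 1, 2, 3])  # one boolean per label in index

kept = index[mask]      # labels that appear in the list
dropped = index[~mask]  # labels that do not

print(list(kept), list(dropped))
```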

index.delete( loc ): removes the element at position loc and returns a new Index

index = pd.Index(list(range(2, 5)))
index2 = index.delete(1)
index, index2

(Int64Index([2, 3, 4], dtype='int64'), Int64Index([2, 4], dtype='int64'))

index.drop( labels ): removes the given values labels and returns a new Index

index = pd.Index(list(range(2, 5)))
index2 = index.drop([2, 3])
index, index2

(Int64Index([2, 3, 4], dtype='int64'), Int64Index([4], dtype='int64'))
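delete works by position while drop works by label, which is easy to mix up when the labels are themselves integers. A short sketch of the difference, including what happens when a label is missing (variable names are illustrative):

```python
import pandas as pd

index = pd.Index([2, 3, 4])

by_position = index.delete(1)  # removes whatever sits at position 1 (the value 3)
by_label = index.drop([2])     # removes the label 2, wherever it sits

# drop raises KeyError for a label that is not present ...
try:
    index.drop([99])
    missing_ok = True
except KeyError:
    missing_ok = False

# ... unless errors='ignore' is passed.
ignored = index.drop([99], errors='ignore')

print(list(by_position), list(by_label), missing_ok, list(ignored))
```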

index.insert( loc, item ): inserts item at position loc and returns a new Index

index = pd.Index(list(range(2, 5)))
index2 = index.insert(2, 2)
index, index2

(Int64Index([2, 3, 4], dtype='int64'), Int64Index([2, 3, 2, 4], dtype='int64'))

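index.is_monotonic_increasing: whether the values of index are in non-decreasing order (older pandas releases also exposed this as index.is_monotonic, which pandas 2.0 removed)

index = pd.Index(list(range(2, 5)))
index, index.is_monotonic_increasing

(Int64Index([2, 3, 4], dtype='int64'), True)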

index.is_unique: True when index contains no duplicate values

index = pd.Index(list(range(2, 5)))
index, index.is_unique

(Int64Index([2, 3, 4], dtype='int64'), True)

index.unique( ): returns the unique values of index

index = pd.Index([1, 1, 2, 2, 3, 3, 5])
index, index.unique()

(Int64Index([1, 1, 2, 2, 3, 3, 5], dtype='int64'), Int64Index([1, 2, 3, 5], dtype='int64'))
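The examples above only show the True cases. A small sketch of how the ordering and uniqueness checks respond to repeated and unsorted labels (the three sample indexes are made up for illustration):

```python
import pandas as pd

sorted_unique = pd.Index([2, 3, 4])
with_repeats = pd.Index([1, 1, 2, 2, 3, 3, 5])
unsorted = pd.Index([3, 1, 2])

# Repeated values still count as non-decreasing, but not as unique.
print(with_repeats.is_monotonic_increasing, with_repeats.is_unique)

# An unsorted index is unique here but not monotonic.
print(unsorted.is_monotonic_increasing, unsorted.is_unique)
```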

Reindexing: reindex

obj.reindex( index ): rearranges the data to match the new index; labels that were not present before get NaN

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj, obj2

(d    4.5
 b    7.2
 a   -5.3
 c    3.6
 dtype: float64,
 a   -5.3
 b    7.2
 c    3.6
 d    4.5
 e    NaN
 dtype: float64)

obj.reindex( index, method ): how to fill the gaps

method = 'ffill': forward fill, carrying the previous value forward

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3, obj3.reindex(range(7), method='ffill')

(0      blue
 2    purple
 4    yellow
 dtype: object,
 0      blue
 1      blue
 2    purple
 3    purple
 4    yellow
 5    yellow
 6    yellow
 dtype: object)

method = 'bfill': backward fill, pulling the next value back

obj3.reindex(range(7), method='bfill')

0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
6       NaN
dtype: object
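One detail the fill examples do not show: filling during reindex compares the old and new labels only and does not look at the stored values. A minimal sketch, assuming a Series `s` (made up for illustration) that already holds a NaN before reindexing:

```python
import numpy as np
import pandas as pd

# The value at label 2 is already missing before reindexing.
s = pd.Series([1.0, np.nan, 3.0], index=[0, 2, 4])
filled = s.reindex(range(5), method='ffill')

# Label 1 is new and takes the value stored at label 0.
# Label 3 is new and takes the value stored at label 2, which is NaN.
# Label 2 existed already, so its NaN is left as it is.
print(filled)
```

To fill missing values that were already in the data, fillna or ffill on the result is the usual tool.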

frame.reindex( index, columns ): reindexes the rows and, when columns is given, the columns as well

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
frame

	Ohio	Texas	California
a	0	1	2
c	3	4	5
d	6	7	8

frame.reindex(['a', 'b', 'c', 'd'])

	Ohio	Texas	California
a	0.0	1.0	2.0
b	NaN	NaN	NaN
c	3.0	4.0	5.0
d	6.0	7.0	8.0

frame.reindex(['a', 'b', 'c', 'd'], columns=['California', 'Los', 'Ohio', 'Texas'])

	California	Los	Ohio	Texas
a	2.0	NaN	0.0	1.0
b	NaN	NaN	NaN	NaN
c	5.0	NaN	3.0	4.0
d	8.0	NaN	6.0	7.0

frame.reindex( ..., fill_value ): the value to use for entries that are missing after reindexing

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'e'],
                     columns=['Ohio', 'Texas', 'California'])
frame.reindex(index=['a', 'b', 'c', 'd', 'e'],
              columns=['California', 'Los', 'Ohio', 'Texas'],
              fill_value=-9999)

	California	Los	Ohio	Texas
a	2	-9999	0	1
b	-9999	-9999	-9999	-9999
c	5	-9999	3	4
d	-9999	-9999	-9999	-9999
e	8	-9999	6	7

frame.reindex( ..., method, limit ): limit caps how many consecutive missing entries a forward or backward fill may fill

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'd', 'e'],
                     columns=['Ohio', 'Texas', 'California'])
frame.reindex(index=['a', 'b', 'c', 'd', 'e'],
              method='ffill',
              limit=1)

	Ohio	Texas	California
a	0.0	1.0	2.0
b	0.0	1.0	2.0
c	NaN	NaN	NaN
d	3.0	4.0	5.0
e	6.0	7.0	8.0

Dropping entries from an axis: drop

obj.drop( index ): removes the rows with the given index labels

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj, obj.drop(index=['b', 'd'])

(a    0.0
 b    1.0
 c    2.0
 d    3.0
 e    4.0
 dtype: float64,
 a    0.0
 c    2.0
 e    4.0
 dtype: float64)

frame = pd.DataFrame(np.arange(16).reshape((4, 4)), 
                     index=['Ohio', 'Colorado', 'Utah', 'New York'], 
                     columns=['one', 'two', 'three', 'four'])
frame

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

frame.drop( labels, axis ): removes the entries labels from the given axis

frame.drop(labels=['Colorado', 'Ohio'], axis='index')

	one	two	three	four
Utah	8	9	10	11
New York	12	13	14	15

frame.drop(labels=['two', 'four'], axis='columns')

	one	three
Ohio	0	2
Colorado	4	6
Utah	8	10
New York	12	14

frame.drop( index, columns ): removes the given rows (index) and columns (columns) in one call

frame.drop(index=['Colorado'], columns=['two'])

	one	three	four
Ohio	0	2	3
Utah	8	10	11
New York	12	14	15

frame.drop( ..., inplace ): with inplace=True the object is modified in place instead of a new one being returned; the example below uses the Series obj from above

print(obj)
obj.drop('c', inplace=True)
print(obj)

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
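A common slip with inplace=True is assigning the result back to a variable. A minimal sketch of the two calling styles (the names `trimmed` and `returned` are illustrative):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

trimmed = obj.drop('c')                 # returns a new Series; obj is unchanged
returned = obj.drop('d', inplace=True)  # modifies obj itself and returns None

print(list(trimmed.index), returned, list(obj.index))
```

Writing `obj = obj.drop('d', inplace=True)` would therefore replace obj with None.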

Selection and slicing: obj[ ... ], loc and iloc, integer indexing

Direct indexing: obj[ ... ]

Series:

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj[['b', 'a']]

b    1.0
a    0.0
dtype: float64

obj[2:4]

c    2.0
d    3.0
dtype: float64

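obj[[1, 3]]  # integers are treated as positions here; recent pandas versions warn about this and recommend obj.iloc[[1, 3]]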

b    1.0
d    3.0
dtype: float64

obj['b':'c']  # label slices include the end point

b    1.0
c    2.0
dtype: float64

obj[obj < 2]

a    0.0
b    1.0
dtype: float64
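Since obj[ ... ] switches between labels and positions depending on the key, it can help to state the intent with loc and iloc. A minimal sketch of the Series lookups above written explicitly (variable names are illustrative):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']  # label slice: both end points are included
by_position = obj.iloc[1:3]  # positional slice: the end is excluded
picked = obj.iloc[[1, 3]]    # positions 1 and 3, i.e. labels 'b' and 'd'

print(list(by_label.index), list(by_position.index), list(picked.index))
```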

DataFrame:

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
frame

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

Passing a single label or a list of labels selects columns

frame[['three', 'one']]  # selects columns

	three	one
Ohio	2	0
Colorado	6	4
Utah	10	8
New York	14	12

Integers can select rows, but only as a slice (with a colon); a lone integer is not treated as a row position

frame[2:3]  # selects rows

	one	two	three	four
Utah	8	9	10	11
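To illustrate why the colon is needed: a lone integer inside frame[ ... ] is looked up as a column label, so with string column names it fails. A minimal sketch, using iloc for the single-row case (the names `found`, `row` and `rows` are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
                     columns=['one', 'two', 'three', 'four'])

# frame[2] looks for a column labelled 2 and raises KeyError.
try:
    frame[2]
    found = True
except KeyError:
    found = False

row = frame.iloc[2]     # one row by position, returned as a Series
rows = frame.iloc[2:3]  # the same row as a one-row DataFrame

print(found, row.name, list(row), list(rows.index))
```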

Boolean indexing

frame[frame['three'] > 6]

	one	two	three	four
Utah	8	9	10	11
New York	12	13	14	15

Indexing with loc and iloc

frame.iloc[ row_positions, column_positions ]
frame.loc[ row_labels, column_labels ]

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
frame

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

frame.iloc[2:4, [3, 0, 1]]  # rows at positions 2 and 3; columns at positions 3, 0 and 1

	four	one	two
Utah	11	8	9
New York	15	12	13

frame.loc['Utah':'New York', ['four', 'one', 'two']]

	four	one	two
Utah	11	8	9
New York	15	12	13

Both indexers accept slices as well as a single label or a list of labels

frame.iloc[:, :][frame.three > 5]

	one	two	three	four
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15
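In the expression above, iloc[:, :] selects everything, so the filtering is done by the boolean mask alone. A sketch of the same filter written with loc, which can also restrict the columns in the same call (the names `mask` and `subset` are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
                     columns=['one', 'two', 'three', 'four'])

mask = frame['three'] > 5                 # boolean Series aligned on the row labels
subset = frame.loc[mask, ['one', 'two']]  # filter rows and pick columns together

print(subset)
```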

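Summary of indexing options

Type	Description
df[ col_labs ]	Selects the columns with labels col_labs
df.loc[ ind_labs ]	Selects the rows with labels ind_labs
df.loc[ :, col_labs ]	Selects the columns with labels col_labs
df.loc[ ind_labs, col_labs ]	Selects rows ind_labs and columns col_labs by label
df.iloc[ ind_nums ]	Selects the rows at integer positions ind_nums
df.iloc[ :, col_nums ]	Selects the columns at integer positions col_nums
df.iloc[ ind_nums, col_nums ]	Selects rows ind_nums and columns col_nums by integer position
df.at[ ind_lab, col_lab ]	Selects a single scalar by row and column label
df.iat[ i, j ]	Selects a single scalar by row and column position (integers)
reindex	Conforms the rows or columns to a new set of labels
get_value, set_value	Select or set a single value by row and column label (removed in pandas 1.0; use at / iat instead)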
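The scalar accessors at and iat from the table have no example in the text. A minimal sketch, reusing the frame from the earlier sections (the value 90 is arbitrary):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
                     columns=['one', 'two', 'three', 'four'])

by_label = frame.at['Utah', 'two']  # scalar lookup by row and column label
by_position = frame.iat[2, 1]       # the same cell by row and column position

frame.at['Utah', 'two'] = 90        # at and iat also support assignment
updated = frame.iat[2, 1]

print(by_label, by_position, updated)
```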