Pandas - 数据清洗 Pt.1

这篇随笔主要介绍如何利用 Pandas 对数据进行清洗
Pt.1 部分主要介绍利用 Pandas 处理缺失数据
Pt.2 部分详细介绍利用 Pandas 处理重复数据、替换数据和划分数据
Pt.3 部分主要介绍利用 Pandas 与 正则表达式的结合

1
2
import pandas as pd
import numpy as np

处理缺失数据

pandas 中缺失值的表示 : NaN, None

obj.isnull( ), obj.isna( )

1
2
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data, string_data.isnull()
1
2
3
4
5
6
7
8
9
10
(0     aardvark
1 artichoke
2 NaN
3 avocado
dtype: object,
0 False
1 False
2 True
3 False
dtype: bool)
1
2
string_data[0] = None
string_data, string_data.isnull()
1
2
3
4
5
6
7
8
9
10
(0         None
1 artichoke
2 NaN
3 avocado
dtype: object,
0 True
1 False
2 True
3 False
dtype: bool)

obj.notnull( ), obj.notna( )

1
string_data.notna()
1
2
3
4
5
0    False
1 True
2 False
3 True
dtype: bool

series.dropna( ), frame.dropna( axis, how, thresh )

series.dropna( ) 等价于 series[ series.notna( ) ]

1
string_data, string_data.dropna()
1
2
3
4
5
6
7
8
(0         None
1 artichoke
2 NaN
3 avocado
dtype: object,
1 artichoke
3 avocado
dtype: object)
1
string_data[string_data.notna()]
1
2
3
1    artichoke
3 avocado
dtype: object

how : 过滤缺失值的方式

1
2
3
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan], 
[np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data
012
01.06.53.0
11.0NaNNaN
2NaNNaNNaN
3NaN6.53.0
1
data.dropna(how='any') # 去除的行至少有一个缺失值
012
01.06.53.0
1
data.dropna(how='all') # 去除的行所有的值都是缺失值
012
01.06.53.0
11.0NaNNaN
3NaN6.53.0

thresh = n : 过滤缺失值时, 对应行或列的剩下的非缺失值的个数大于等于n

1
2
3
4
5
6
7
8
9
df = pd.DataFrame(np.random.randn(7,7))
df.iloc[:7, 0] = np.nan
df.iloc[:6, 1] = np.nan
df.iloc[:5, 2] = np.nan
df.iloc[:4, 3] = np.nan
df.iloc[:3, 4] = np.nan
df.iloc[:2, 5] = np.nan
df.iloc[:1, 6] = np.nan
df
0123456
0NaNNaNNaNNaNNaNNaNNaN
1NaNNaNNaNNaNNaNNaN0.548996
2NaNNaNNaNNaNNaN-0.448875-0.028554
3NaNNaNNaNNaN1.973543-0.927488-0.622286
4NaNNaNNaN1.329444-0.430934-0.9573481.737727
5NaNNaN-0.497778-0.801027-0.138910-0.5993820.839175
6NaN-0.6386121.4379670.1084820.436201-0.003544-0.953735
1
2
df.dropna(thresh=1) 
# 保留的行中, 至少有一个不是缺失值, 即去除的行所有值都是缺失值
0123456
1NaNNaNNaNNaNNaNNaN0.548996
2NaNNaNNaNNaNNaN-0.448875-0.028554
3NaNNaNNaNNaN1.973543-0.927488-0.622286
4NaNNaNNaN1.329444-0.430934-0.9573481.737727
5NaNNaN-0.497778-0.801027-0.138910-0.5993820.839175
6NaN-0.6386121.4379670.1084820.436201-0.003544-0.953735
1
df.dropna(thresh=3) # 保留的行中, 至少有一个不是缺失值
0123456
3NaNNaNNaNNaN1.973543-0.927488-0.622286
4NaNNaNNaN1.329444-0.430934-0.9573481.737727
5NaNNaN-0.497778-0.801027-0.138910-0.5993820.839175
6NaN-0.6386121.4379670.1084820.436201-0.003544-0.953735

series.fillna( value, method ), frame.fillna( value, method, axis )

1
string_data.fillna(-9999)
1
2
3
4
5
0        -9999
1 artichoke
2 -9999
3 avocado
dtype: object

value = dict : 通过传递字典到 fillna 可以实现对不同的列填充不同的值

1
df
0123456
0NaNNaNNaNNaNNaNNaNNaN
1NaNNaNNaNNaNNaNNaN0.548996
2NaNNaNNaNNaNNaN-0.448875-0.028554
3NaNNaNNaNNaN1.973543-0.927488-0.622286
4NaNNaNNaN1.329444-0.430934-0.9573481.737727
5NaNNaN-0.497778-0.801027-0.138910-0.5993820.839175
6NaN-0.6386121.4379670.1084820.436201-0.003544-0.953735
1
df.fillna( {1: 0.5, 2: 0} )
0123456
0NaN0.5000000.000000NaNNaNNaNNaN
1NaN0.5000000.000000NaNNaNNaN0.548996
2NaN0.5000000.000000NaNNaN-0.448875-0.028554
3NaN0.5000000.000000NaN1.973543-0.927488-0.622286
4NaN0.5000000.0000001.329444-0.430934-0.9573481.737727
5NaN0.500000-0.497778-0.801027-0.138910-0.5993820.839175
6NaN-0.6386121.4379670.1084820.436201-0.003544-0.953735

method : 填充方式 , limit : 限制填充个数

1
2
3
4
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[1:5, 1] = np.nan
df.iloc[2:4, 2] = np.nan
df
012
0-0.2912620.2008772.589814
1-0.337112NaN0.186565
2-2.075923NaNNaN
3-0.109867NaNNaN
40.807568NaN-0.173979
50.0592230.8186210.742860

method = 'ffill' : 用前一个非缺失值去填充该缺失值

1
df.fillna(method='ffill', limit=2)
012
0-0.2912620.2008772.589814
1-0.3371120.2008770.186565
2-2.0759230.2008770.186565
3-0.109867NaN0.186565
40.807568NaN-0.173979
50.0592230.8186210.742860

method = 'bfill' : 用后一个非缺失值去填充该缺失值

1
df.fillna(method='bfill', limit=2)
012
0-0.2912620.2008772.589814
1-0.337112NaN0.186565
2-2.075923NaN-0.173979
3-0.1098670.818621-0.173979
40.8075680.818621-0.173979
50.0592230.8186210.742860