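Code Collector (代码收藏家) Technical Tutorials 2024-01-19

A Basic Guide to the Python Pandas Library

Practical Python Development Tutorials

Pandas is a tool built on top of NumPy and created for data analysis tasks. Much as NumPy is centered on its array structure, Pandas is centered on two data structures: Series (one-dimensional data) and DataFrame (two-dimensional data). DataFrame in particular lets developers work with two-dimensional tabular data as flexibly and conveniently as in Excel.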

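Basic data structures

Series

A Series is a labeled one-dimensional array that can hold integers, floats, strings, Python objects, and other types of data. The axis labels are collectively called the index. Because every value can be looked up by its label, a Series resembles the named tuple (collections.namedtuple) covered earlier and, like a dictionary, maps labels to values.

Creating a Series

Call pd.Series to create a Series: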

import pandas as pd

s=pd.Series( data, index, dtype, copy)

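data supports the following types:

A list

A Python dict

A one-dimensional NumPy array (ndarray)

A scalar value (for example, 5)

index is the list of axis labels. How it is handled depends on the type of data:

When data is an ndarray, index must have the same length as data. If index is not given, a numeric index [0, ..., len(data) - 1] is created.

When data is a dict and index is not given: with Python >= 3.6 and Pandas >= 0.23, the Series index follows the dict's insertion order; with Python < 3.6 or Pandas < 0.23, the Series index is the sorted list of dict keys. If index is given, the values in data are pulled out by matching the index labels.

When data is a scalar value, an index must be provided. The Series repeats the scalar to match the length of the index.

dtype is the data type; if omitted, it is inferred from data.

copy controls whether data is copied; the default is False.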

import numpy as np
import pandas as pd

s = pd.Series([10,20,30])
print(s)
'''
0    10
1    20
2    30
dtype: int64
'''

s = pd.Series({'Name':'John', 'Age':10, 'Score':98})
print(s)
'''
Name     John
Age        10
Score      98
dtype: object
'''

s = pd.Series(5, index=['First', 'Second', 'Third'])
print(s)
'''
First     5
Second    5
Third     5
dtype: int64
'''

s = pd.Series(np.asarray(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)

'''
a    5
b    5
c    5
d    5
e    5
dtype: int64
'''

print(s.array)
'''
<NumpyExtensionArray>
[5, 5, 5, 5, 5]
Length: 5, dtype: int64
'''

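The output above shows that a Series also carries a dtype. The underlying array is available through the array attribute; Pandas wraps NumPy-typed data in an extension array.

Accessing Series data

Series data can be accessed in two ways: by position and by index label.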

s = pd.Series([1,2,3,4,5], index=['a', 'b', 'c', 'd', 'e'])
print(s[0]) #1
print(s[-1])#5
print(s['b']) #2

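With a custom index, the positional lookups in the example above raise a warning: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`

Without a custom index, the default labels 0, 1, 2, ... coincide with the positions, so s[0] works directly.

With string labels, s[-1] still counts from the end as in a NumPy array, but only through the deprecated positional fallback; s.iloc[-1] is the reliable form. With the default integer index, s[-1] is looked up as a label and raises KeyError.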
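
The difference between positional and label access can be sketched as follows. This is a minimal illustration, not part of the original tutorial; the variable names (first, last, by_label, default_last, raised_key_error) are made up for the example.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

first = s.iloc[0]       # by position: 1
last = s.iloc[-1]       # negative positions count from the end: 5
by_label = s.loc['b']   # by label: 2
print(first, last, by_label)

t = pd.Series([10, 20, 30])   # default integer index 0, 1, 2
default_last = t.iloc[-1]     # 30
try:
    t[-1]                     # -1 is treated as a label, and no such label exists
    raised_key_error = False
except KeyError:
    raised_key_error = True
print(default_last, raised_key_error)
```

Using iloc for positions and loc for labels avoids the FutureWarning and keeps the code unambiguous whatever the index type is.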

A Series also supports slicing:

s = pd.Series([1,2,3,4,5])
print(s[0]) #1
print(s[2:3]) # the element at position 2 (value 3)
print(s[::2]) # values 1, 3, 5

'''
1
2    3
dtype: int64
0    1
2    3
4    5
dtype: int64
'''

To access several elements by index label, pass the labels as a list (hence the double square brackets):

s = pd.Series([1,2,3,4,5], index=['a', 'b', 'c', 'd', 'e'])
print(s[['b', 'c', 'a']])

'''
b    2
c    3
a    1
dtype: int64
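'''

Common Series attributes

Name	Description
axes	Returns a list containing the row index labels.
dtype	Returns the data type of the object.
empty	Returns a boolean indicating whether the object is empty.
ndim	The number of dimensions. A Series is one-dimensional by definition, so this is always 1.
size	Returns the number of elements.
values	Returns the Series data as an ndarray.
array	Returns the extension array that backs the Series (a thin wrapper around the NumPy array for NumPy dtypes).
index	Returns the Index object holding the labels (a RangeIndex when no index was specified).
iloc[…]	Accesses elements by integer position.
hasnans	Returns whether any element is missing (NaN).
is_unique	Returns True if all values in s are unique.
is_monotonic_increasing	Returns True if the values in s are monotonically increasing (equal neighbors are allowed).
is_monotonic_decreasing	Returns True if the values in s are monotonically decreasing (equal neighbors are allowed).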

s = pd.Series([1,2,3,4,5], index=['a', 'b', 'c', 'd', 'e'])
print(f'{s.axes=},{s.dtype=},{s.ndim=},{s.empty=}, {s.size=}')
print(f'{s.values=}')
print(f'{s.array=}')
print(f'{s.index=}')
print(f'{s.shape=}')
print(f'{s.hasnans=}')

'''
s.axes=[Index(['a', 'b', 'c', 'd', 'e'], dtype='object')],s.dtype=dtype('int64'),s.ndim=1,s.empty=False, s.size=5
s.values=array([1, 2, 3, 4, 5])
s.array=<NumpyExtensionArray>
[1, 2, 3, 4, 5]
Length: 5, dtype: int64
s.index=Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
s.shape=(5,)
s.hasnans=False
'''

s = pd.Series([1,2,3,4,5], index=['a', 'b', 'c', 'd', 'e'])
print(s.iloc[0]) #1
print(s.iloc[2:])
print(s.iloc[::-1])

'''
1
c    3
d    4
e    5
dtype: int64
e    5
d    4
c    3
b    2
a    1
dtype: int64
'''

# all values in s are unique
>>> s = pd.Series([1, 2, 3])
>>> s.is_unique
True

>>> s = pd.Series([1, 2, 3, 1])
>>> s.is_unique
False

# whether the values in s are monotonically increasing
>>> s = pd.Series([1, 2, 2])
>>> s.is_monotonic_increasing
True

>>> s = pd.Series([3, 2, 1])
>>> s.is_monotonic_increasing
False

# whether the values in s are monotonically decreasing
>>> s = pd.Series([3, 2, 2, 1])
>>> s.is_monotonic_decreasing
True

>>> s = pd.Series([1, 2, 3])
>>> s.is_monotonic_decreasing
False

Operations supported by Series

Operation	Description
s.add(other[, level, fill_value, axis])	s+other
s.sub(other[, level, fill_value, axis])	s-other
s.mul(other[, level, fill_value, axis])	s*other
s.div(other[, level, fill_value, axis])	s/other
s.truediv(other[, level, fill_value, axis])	s/other
s.floordiv(other[, level, fill_value, axis])	s//other
s.mod(other[, level, fill_value, axis])	s%other
s.pow(other[, level, fill_value, axis])	s**other
s.radd(other[, level, fill_value, axis])	other+s
s.rsub(other[, level, fill_value, axis])	other-s
s.rmul(other[, level, fill_value, axis])	other*s
s.rdiv(other[, level, fill_value, axis])	other/s
s.rtruediv(other[, level, fill_value, axis])	other/s
s.rfloordiv(other[, level, fill_value, axis])	other//s
s.rmod(other[, level, fill_value, axis])	other%s
s.rpow(other[, level, fill_value, axis])	other**s
s.combine(other, func[, fill_value])	Calls func on each pair of elements from s and other and returns a Series built from the results.
s.combine_first(other)	Fills the missing values of s with the corresponding values of other.
s.round(decimals=0, *args, **kwargs)	Rounds each element to the given number of decimals.
s.lt(other[, level, fill_value, axis])	s<other
s.gt(other[, level, fill_value, axis])	s>other
s.le(other[, level, fill_value, axis])	s<=other
s.ge(other[, level, fill_value, axis])	s>=other
s.ne(other[, level, fill_value, axis])	s!=other
s.eq(other[, level, fill_value, axis])	s==other
s.product(axis=None, skipna=True, numeric_only=False, min_count=0, **kwargs)	Product of all elements. skipna: whether to skip missing values; numeric_only: use only numeric data; min_count: the minimum number of valid values required.
s.dot(other)	Dot product (inner product) of the two Series.
s.abs()	Absolute value of each element.

Simple operator example

import pandas as pd

s1 = pd.Series([1,2,3,4,5])
s2 = pd.Series([10,20,30,40,50])
s3 = s2 - s1
print(f's3=s2 - s1,s3:\n', s3)
s4 = s2.sub(s1) # sub is equivalent to the - operator
print(f's4=s2.sub(s1),s4:\n',s4)
print(f's3 == s4 :\n', s3 == s4)

'''
s3=s2 - s1,s3:
 0     9
1    18
2    27
3    36
4    45
dtype: int64
s4=s2.sub(s1),s4:
 0     9
1    18
2    27
3    36
4    45
dtype: int64
s3 == s4 :
 0    True
1    True
2    True
3    True
4    True
dtype: bool
'''
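
The method forms differ from the plain operators mainly through fill_value and the reflected variants. The sketch below is an illustration under assumed data (the Series a and b and the result names are made up): it shows index alignment, fill_value, a reflected operation, and the dot product.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20], index=['x', 'y'])

plain = a + b                     # 'z' has no partner in b, so the result there is NaN
filled = a.add(b, fill_value=0)   # the missing partner is treated as 0: 11.0, 22.0, 3.0
reflected = a.rsub(10)            # operands swapped, 10 - a: 9, 8, 7
inner = a.dot(a)                  # 1*1 + 2*2 + 3*3 = 14

print(plain)
print(filled)
print(reflected)
print(inner)
```

Operands are aligned on index labels first, which is why the plain operator produces NaN for the unmatched label.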

Combining with a function: combine

import numpy as np
import pandas as pd
import operator

s1 = pd.Series([11,12,33,24,51])
s2 = pd.Series([10,20,30,40,50])
s5 = s1.combine(s2, max)
print('s5 = s1.combine(s2, max)\n', s5)

s6 = s1.combine(s2, operator.add) # func takes two arguments
print('s6 = s1.combine(s2, operator.add)\n', s6)

# test fill_value
s3 = pd.Series([10,20,30,None,50])
print('s3:\n', s3)
s4 = s1.combine(s3, max) # max returns 24 for (24, nan), with or without fill_value=0
print('s4 = s1.combine(s3, max):\n', s4)
s4 = s1.combine(s3, max, fill_value=0)
print('s4 = s1.combine(s3, max, fill_value=0):\n', s4)

s7 = s1.combine(s3, operator.add, fill_value=0) # add still yields NaN: fill_value does not replace an existing NaN
print('s7 = s1.combine(s3,operator.add, fill_value=0)\n', s7)

def foo(*args):
    print(f'foo {args=}')
    return 0
s7 = s1.combine(s3,foo, fill_value=0) # the NaN passed to func is not replaced by 0
print('s7 = s1.combine(s3,foo, fill_value=0)\n', s7)


'''
s5 = s1.combine(s2, max)
 0    11
1    20
2    33
3    40
4    51
dtype: int64
s6 = s1.combine(s2, operator.add)
 0     21
1     32
2     63
3     64
4    101
dtype: int64
s3:
 0    10.0
1    20.0
2    30.0
3     NaN
4    50.0
dtype: float64
s4 = s1.combine(s3, max):
 0    11
1    20
2    33
3    24
4    51
dtype: int64
s4 = s1.combine(s3, max, fill_value=0):
 0    11
1    20
2    33
3    24
4    51
dtype: int64
s7 = s1.combine(s3,operator.add, fill_value=0)
 0     21.0
1     32.0
2     63.0
3      NaN
4    101.0
dtype: float64
foo args=(11, 10.0)
foo args=(12, 20.0)
foo args=(33, 30.0)
foo args=(24, nan)
foo args=(51, 50.0)
s7 = s1.combine(s3,foo, fill_value=0)
 0    0
1    0
2    0
3    0
4    0
dtype: int64

'''
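
The output above shows that fill_value does not replace NaN values that are already present in the data. In combine, fill_value stands in for labels that are missing from one of the two Series. A minimal sketch with made-up data (the names result and summed are illustrative):

```python
import operator

import pandas as pd

s1 = pd.Series([11, 12], index=['a', 'b'])
s2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Label 'c' is missing from s1, so fill_value=0 is passed to max in its place.
result = s1.combine(s2, max, fill_value=0)
print(result)   # a 11, b 20, c 30

# To neutralize an existing NaN, fill it explicitly before combining.
s3 = pd.Series([10, None], index=['a', 'b'])
summed = s1.combine(s3.fillna(0), operator.add)
print(summed)   # a 21.0, b 12.0
```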

Filling missing values: combine_first

import pandas as pd

s1 = pd.Series([10,None,30,None,50])
s2 = pd.Series([1,2,None,4,5])
s3 = s1.combine_first(s2)
print(s3)

'''
0    10.0
1     2.0
2    30.0
3     4.0
4    50.0
dtype: float64
'''

Product of all elements: product

import pandas as pd


s1 = pd.Series([1,2,3,4,5])
print(s1.product()) #120
s2 = pd.Series([1,2,3,None,5])
print(s2.product()) #30.0 (the NaN is skipped: 1*2*3*5)
print(s2.product(skipna=False)) #nan
print(s2.product(skipna=False, min_count=1)) #nan
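
The effect of min_count is easier to see when skipna is left at its default. The following sketch uses made-up data and variable names; min_count is the number of non-missing values required before a result is produced.

```python
import pandas as pd

s = pd.Series([1, 2, 3, None, 5])

enough = s.product(min_count=4)      # 4 valid values are available: 30.0
too_few = s.product(min_count=5)     # only 4 valid values: NaN

empty = pd.Series([], dtype='float64')
empty_default = empty.product()              # the empty product is 1.0
empty_strict = empty.product(min_count=1)    # NaN, because no valid value exists

print(enough, too_few, empty_default, empty_strict)
```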

Other methods supported by Series

Operation	Description
s.abs()
s.all([axis, bool_only, skipna])
s.any(*[, axis, bool_only, skipna])
s.autocorr([lag])
s.between(left, right[, inclusive])
s.clip([lower, upper, axis, inplace])
s.corr(other[, method, min_periods])
s.count()
s.cov(other[, min_periods, ddof])
s.cummax([axis, skipna])
s.cummin([axis, skipna])
s.cumprod([axis, skipna])
s.cumsum([axis, skipna])
s.describe([percentiles, include, exclude])
s.diff([periods])
s.factorize([sort, use_na_sentinel])
s.kurt([axis, skipna, numeric_only])
s.max([axis, skipna, numeric_only])
s.mean([axis, skipna, numeric_only])
s.median([axis, skipna, numeric_only])
s.min([axis, skipna, numeric_only])
s.mode([dropna])
s.nlargest([n, keep])
s.pct_change([periods, fill_method, …])
s.prod([axis, skipna, numeric_only, …])
s.quantile([q, interpolation])
s.rank([axis, method, numeric_only, …])
s.sem([axis, skipna, ddof, numeric_only])
s.skew([axis, skipna, numeric_only])
s.std([axis, skipna, ddof, numeric_only])
s.sum([axis, skipna, numeric_only, …])
s.var([axis, skipna, ddof, numeric_only])
s.kurtosis([axis, skipna, numeric_only])
s.unique()
s.nunique([dropna])
s.value_counts([normalize, sort, …])

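Common Series methods

Series offers a great many methods; some of the common ones are listed here.

Element lookup methods

Method	Description
s.head(n)	Returns the first n rows; 5 by default.
s.tail(n)	Returns the last n rows; 5 by default.
pd.isnull(s)	Detects missing values in the Series; returns a boolean Series that is True where a value is missing (NaN).
pd.notnull(s)	Detects missing values in the Series; returns a boolean Series that is False where a value is missing (NaN).
s.get(key[,default])	Gets a value by index label; returns default if the label does not exist.
s.at[index]	Accesses a single value by index label.
s.iat[iloc]	Accesses a single value by integer position.
s.loc[index]	Accesses values by index label.
s.iloc[iloc]	Accesses values by integer position.
s.__iter__()	Returns an iterator over the values.
s.items()	Returns a lazy iterable of (index, value) pairs; pass it to list() to get a list.
s.keys()	Returns the index object.
s.isin(values)	Checks each element of s for membership in values and returns a new boolean Series.
s.where(cond[, other, inplace, axis, level])	Keeps the values where cond is true; where it is false, substitutes other (NaN by default).
s.mask(cond[, other, inplace, axis, level])	Keeps the values where cond is false; where it is true, substitutes other (NaN by default).
s.filter([items, like, regex, axis])	Filters by index label.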

import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
print(s1.get('b')) #2
print(s1.at['c']) #3
print(s1.iat[4]) #5
print(s1.iat[-2]) #4
print(s1.loc['a']) #1
print(s1.iloc[3]) #4
print(list(s1.items())) #[('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)]
print(s1.keys()) #Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
bv = s1.pop('b')
print(f's1 after s1.pop(b):{bv=}\n', s1)


'''
2
3
5
4
1
4
[('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)]
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
s1 after s1.pop(b):bv=2
 a    1
c    3
d    4
e    5
dtype: int64
'''

import pandas as pd

s1 = pd.Series([10,20,30,40,50], index=['A', 'B', 'C', 'D', 'E'])

print("s1>30:\n", s1.where(s1>30))
print("s1>30,other=[1]:\n", s1.where(s1>30, other=[1]))
print("s1>30,other=[1]:\n", s1.mask(s1>30, other=[1]))

print("s1.filter(items=['A', 'B']):\n", s1.filter(items=['A', 'B']))
print("s1.filter(regex=['ABC']:\n", s1.filter(regex="['ABC']"))

'''
s1>30:
 A     NaN
B     NaN
C     NaN
D    40.0
E    50.0
dtype: float64
s1>30,other=[1]:
 A     1
B     1
C     1
D    40
E    50
dtype: int64
s1>30,other=[1]:
 A    10
B    20
C    30
D     1
E     1
dtype: int64
s1.filter(items=['A', 'B']):
 A    10
B    20
dtype: int64
s1.filter(regex=['ABC']:
 A    10
B    20
C    30
dtype: int64
'''
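
The lookup methods in the table combine naturally with boolean indexing. The following is a small sketch on made-up data (all variable names are illustrative) showing isin, where, mask, and filter side by side:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=['A', 'B', 'C', 'D', 'E'])

member = s.isin([20, 40, 99])      # elementwise membership test
selected = s[member]               # boolean indexing keeps B and D

kept = s.where(s > 30).dropna()    # where() leaves NaN behind, dropna() removes it
capped = s.mask(s > 30, 30)        # replace every value above 30 with 30
by_label = s.filter(like='A')      # keep labels containing 'A'

print(selected)
print(kept)
print(capped)
print(by_label)
```

Note that where() without other turns an integer Series into float64, because NaN cannot be stored in an int64 column.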

Copying and type conversion

Method	Description
s.copy(deep=True)	Deep copy: returns a copied Series; deep=False gives a shallow copy.
s.to_list()	Returns the Series as a list.
s.apply(func[, convert_dtype, args, by_row])	Calls func on each value of the Series.
s.astype(dtype, copy=None, errors='raise')	Converts the elements of s to the type given by dtype.
s.to_numpy(dtype=None, copy=False, na_value=_NoDefault.no_default, **kwargs)	Converts s to a NumPy array.
s.__array__(dtype=None)	Returns the NumPy array underlying s; changing that array may also change the values in s.
s.to_pickle(path[, compression, …])	Serializes (pickles) s to a file.
s.to_csv([path_or_buf, sep, na_rep, …])	Writes s to a CSV file.
s.to_dict([into])	Converts s to a dict.
s.to_excel(excel_writer[, sheet_name, …])	Writes s to an Excel file.
s.to_frame([name])	Converts s to a DataFrame.
s.to_xarray()	Converts s to an xarray object.
s.to_hdf(path_or_buf, key[, mode, …])	Writes s to an HDF5 file.
s.to_sql(name, con, *[, schema, …])	Writes s to a table in a SQL database.
s.to_json([path_or_buf, orient, …])	Converts s to a JSON string.
s.to_string([buf, na_rep, …])	Renders s as a string.
s.to_clipboard([excel, sep])	Copies s to the system clipboard.
s.to_latex([buf, columns, header, …])	Renders s as LaTeX.
s.to_markdown([buf, mode, index, …])	Renders s as Markdown.

import pandas as pd

s1 = pd.Series([1, 2], dtype='int32')
print('s1:\n', s1)
s2 = s1.astype('float32')
print('s2:\n',s2)
s3 = s1.astype('int16', copy=False) # the dtype changes, so new data is allocated even with copy=False
print('s3:\n',s3)
s3[0] = 10
print('s1 after modifying s3:\n',s1)

'''
s1:
 0    1
1    2
dtype: int32
s2:
 0    1.0
1    2.0
dtype: float32
s3:
 0    1
1    2
dtype: int16
s1 after modifying s3:
 0    1
1    2
dtype: int32
'''

import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5])
a = s1.to_numpy()
print(f'{type(a)=}',a)

a2 = s1.__array__()
a2[1] = 10
print(s1)

'''
type(a)=<class 'numpy.ndarray'> [1 2 3 4 5]

0     1
1    10
2     3
3     4
4     5
dtype: int64
'''

s1 = pd.Series([None, None, 3, 4, None], index=['A', 'B', 'C', 'D', 'E'])
print(s1.to_dict())

#{'A': nan, 'B': nan, 'C': 3.0, 'D': 4.0, 'E': nan}

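Sorting

Method	Description
s.argsort([axis, kind, order])	Returns the integer positions that would sort s, as a new Series.
s.argmin([axis, skipna])	Returns the position of the smallest value; with ties, only the first one.
s.argmax([axis, skipna])	Returns the position of the largest value; with ties, only the first one.
s.sort_values(*[, axis, ascending, …])	Sorts by value and returns a new Series.
s.sort_index(*[, axis, level, …])	Sorts by index and returns a new Series.
s.reorder_levels(order)	For a Series with a multi-level index (MultiIndex): rearranges the index levels; order lists the level numbers in their new order.
s.swaplevel([i, j, copy])	For a Series with a MultiIndex: swaps two index levels; i and j are level numbers.
s.unstack([level, fill_value, sort])	For a Series with a MultiIndex: converts s into a DataFrame; level is the number of the index level that becomes the columns.
s.explode([ignore_index])	Flattens a Series whose elements are list-like into a one-dimensional Series, one row per item.
s.searchsorted(value[, side, sorter])	Returns the position at which value would have to be inserted into a sorted Series to keep it sorted. If the Series is not sorted, the returned position is not meaningful.
s.ravel([order])	Returns the underlying array.
s.repeat(repeats[, axis])	Repeats each element of s repeats times and returns a new Series.
s.view([dtype])	Creates a view of s.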

import pandas as pd

s1 = pd.Series([5,4,3,2,1], index=['A','B','C','D','E'])
s2 = s1.argsort()
print('s2 = s1.argsort():\n', s2)

s1 = pd.Series([5,5,3,1,1], index=['A','B','C','D','E'])
print(f'{s1.argmin()=}')
print(f'{s1.argmax()=}')

s1 = pd.Series([5,5,3,1,1], index=['A','B','A','B','A'])
print('s1.sort_values():\n', s1.sort_values())
print('s1.sort_index():\n', s1.sort_index())

s2 = pd.Series([5,5,3,1,1], index=[['A','B','A','B','A'],['S5','S2','S4','S3','S1'],['C1','C2','C3','C4','C5']])
print('s2.reorder_levels([1,0,2]:\n', s2.reorder_levels([1,0,2]))
print('s2.swaplevel(0):\n', s2.swaplevel(0))
print('s2.swaplevel(1,2):\n', s2.swaplevel(1,2))
print('s2.unstack(level=1,fill_value=0):\n', s2.unstack(level=1,fill_value=0))

s3 = pd.Series([[1,2,3], 'foo', [], [5,6]])
print(f's3:\n{s3}')
print('s3.explode():\n', s3.explode())

s1 = pd.Series([1,2,3,4,5],index=['A','B','A','B','A'])
print(f'{s1.searchsorted(4)=}')
s1 = pd.Series([3,4,5,2,1],index=['A','B','A','B','A'])
print(f'{s1.searchsorted(4)=}')
print('s1.repeat(2):\n', s1.repeat(2))
s2 = s1.view('int64')
s2['A'] = 1234567890
print(f'{s1=}')

'''
s2 = s1.argsort():
 A    4
B    3
C    2
D    1
E    0
dtype: int64
s1.argmin()=3
s1.argmax()=0
s1.sort_values():
 B    1
A    1
A    3
A    5
B    5
dtype: int64
s1.sort_index():
 A    5
A    3
A    1
B    5
B    1
dtype: int64
s2.reorder_levels([1,0,2]:
 S5  A  C1    5
S2  B  C2    5
S4  A  C3    3
S3  B  C4    1
S1  A  C5    1
dtype: int64
s2.swaplevel(0):
 C1  S5  A    5
C2  S2  B    5
C3  S4  A    3
C4  S3  B    1
C5  S1  A    1
dtype: int64
s2.swaplevel(1,2):
 A  C1  S5    5
B  C2  S2    5
A  C3  S4    3
B  C4  S3    1
A  C5  S1    1
dtype: int64
s2.unstack(level=1,fill_value=0):
       S1  S2  S3  S4  S5
A C1   0   0   0   0   5
  C3   0   0   0   3   0
  C5   1   0   0   0   0
B C2   0   5   0   0   0
  C4   0   0   1   0   0
s3:
0    [1, 2, 3]
1          foo
2           []
3       [5, 6]
dtype: object
s3.explode():
 0      1
0      2
0      3
1    foo
2    NaN
3      5
3      6
dtype: object
s1.searchsorted(4)=3
s1.searchsorted(4)=1
s1.repeat(2):
 A    3
A    3
B    4
B    4
A    5
A    5
B    2
B    2
A    1
A    1
dtype: int64
s1=A    1234567890
B             4
A    1234567890
B             2
A    1234567890
dtype: int64

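'''

Operating on individual elements

Method	Description
s.drop([labels, axis, index, columns, …])	Returns a new Series with the elements at the given index labels removed; s is unchanged.
s.drop_duplicates(*[, keep, inplace, …])	Returns a new Series with duplicate values removed; s is unchanged.
s.duplicated([keep])	Flags each element as a duplicate or not and returns a new boolean Series.
s.pop(index)	Returns the element at index and removes it from s.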
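
The methods in this table can be tried on a small Series. This is a sketch with made-up data and variable names, not an example from the original tutorial:

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 1], index=['a', 'b', 'c', 'd', 'e'])

dropped = s.drop(['a', 'e'])                   # new Series without labels a and e
flags = s.duplicated()                         # True for the second and later occurrences
unique_first = s.drop_duplicates()             # keeps the first occurrence: a, b, d
unique_last = s.drop_duplicates(keep='last')   # keeps the last occurrence: c, d, e

popped = s.pop('d')                            # returns 3 and removes 'd' from s itself

print(dropped)
print(flags)
print(unique_first)
print(unique_last)
print(popped, list(s.index))
```

Only pop modifies s in place; the other three return new objects.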

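Batch operations on elements

Method	Description
s.all([axis, bool_only, skipna])	Checks whether all elements are True.
s.any(*[, axis, bool_only, skipna])	Checks whether any element is True.
s.between(left, right[, inclusive])	Checks whether each element lies between left and right (boundaries included by default) and returns a boolean Series; NaN counts as False.
s.count()	Counts the non-missing values in the Series.
s.cov(other[, min_periods, ddof])	Computes the covariance of s and other; the two need not have the same length, since they are aligned on their index first.
s.cummax([axis, skipna])	Cumulative maximum of s: each position holds the largest value seen up to and including that position.
s.cummin([axis, skipna])	Cumulative minimum of s.
s.cumprod([axis, skipna])	Cumulative product of s, returned as a new Series ns: ns[0] = s[0], ns[1] = ns[0] * s[1], ns[2] = ns[1] * s[2], and so on.
s.cumsum([axis, skipna])	Cumulative sum of s.
s.describe([percentiles, include, exclude])	Summary statistics of s; for numeric data these include count, mean, std, min, the quartiles, and max.
s.diff([periods])	Difference between each element and the element periods positions earlier (1 by default): result[i] = s[i] - s[i - periods]. Positions without a partner are NaN. A negative periods subtracts a later element instead; for example, periods=-1 gives s[i] - s[i + 1].
s.max([axis, skipna, numeric_only])	Returns the largest element of s.
s.min([axis, skipna, numeric_only])	Returns the smallest element of s.
s.mean([axis, skipna, numeric_only])	Returns the arithmetic mean of s.
s.median([axis, skipna, numeric_only])	Returns the median of s (the middle value by size, not the average).
s.mode([dropna])	Returns the most frequent value; if several values tie for most frequent, all of them are returned.
s.nlargest([n, keep])	Returns the n largest elements.
s.nsmallest([n, keep])	Returns the n smallest elements.
s.pct_change([periods, fill_method, …])	Fractional change: (current element - previous element) / previous element.
s.prod([axis, skipna, numeric_only, …])	Returns the product of all elements.
s.std([axis, skipna, ddof, numeric_only])	Standard deviation of the elements of s.
s.sum([axis, skipna, numeric_only, …])	Sum of the elements of s.
s.var([axis, skipna, ddof, numeric_only])	Unbiased variance of the elements of s.
s.unique()	Returns the distinct values of s (duplicates removed).
s.nunique([dropna])	Returns the number of distinct values in s.
s.equals(other)	Checks whether s and other contain the same elements; order and index must match as well.
s.truncate([before, after, axis, copy])	Drops the elements before before and after after and returns a new Series.
s.replace([to_replace, value, inplace, …])	Value replacement: to_replace is the value to replace, value is its replacement.
s.compare(other[, align_axis, …])	Compares the elements of s and other and returns the differing elements as a DataFrame.
s.update(other)	Updates s in place with the non-missing values of other, aligned on the index.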

import pandas as pd

s1 = pd.Series([10,20,30,40,50])
s2 = s1.between(20,40) # checks 20 <= element <= 40
print(s2)

s1 = pd.Series([10,20,30,None,50])
print(f'{s1.count()=}')

s2 = pd.Series([10,20,60,50,70,66])
print('s2.cummax():\n', s2.cummax())

s1 = pd.Series([1,2,3,4,5])
print('s1.cumprod():\n', s1.cumprod())
print('s1.cumsum():\n', s1.cumsum())
print('s1.describe():\n', s1.describe())
print('s1.diff():\n', s1.diff())
print('s1.diff(periods=0):\n', s1.diff(periods=0))
print('s1.diff(periods=-1):\n', s1.diff(periods=-1))
print(f'{s1.max()=}')
print(f'{s1.min()=}')
print(f'{s1.mean()=}')
s2 = pd.Series([1,2,3,40,50])
print(f'{s2.median()=}')
print(f'{s2.mode()=}')
s3 = pd.Series([2,4,2,4,3,2,5])
print(f'{s3.mode()=}')
print(f'{s1.nlargest(2)=}')
print(f'{s1.nsmallest(2)=}')
print(f'{s1.pct_change()=}')
print(f'{s1.prod()=}')
print(f'{s1.std()=}')
print(f'{s1.var()=}')
print(f'{s3.unique()=}')
print(f'{s1.nunique()=}')

'''
0    False
1     True
2     True
3     True
4    False
dtype: bool
s1.count()=4
s2.cummax():
 0    10
1    20
2    60
3    60
4    70
5    70
dtype: int64
s1.cumprod():
 0      1
1      2
2      6
3     24
4    120
dtype: int64
s1.cumsum():
 0     1
1     3
2     6
3    10
4    15
dtype: int64
s1.describe():
 count    5.000000
mean     3.000000
std      1.581139
min      1.000000
25%      2.000000
50%      3.000000
75%      4.000000
max      5.000000
dtype: float64
s1.diff():
 0    NaN
1    1.0
2    1.0
3    1.0
4    1.0
dtype: float64
s1.diff(periods=0):
 0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
dtype: float64
s1.diff(periods=-1):
 0   -1.0
1   -1.0
2   -1.0
3   -1.0
4    NaN
dtype: float64
s1.max()=5
s1.min()=1
s1.mean()=3.0
s2.median()=3.0
s2.mode()=0     1
1     2
2     3
3    40
4    50
dtype: int64
s3.mode()=0    2
dtype: int64
s1.nlargest(2)=4    5
3    4
dtype: int64
s1.nsmallest(2)=0    1
1    2
dtype: int64
s1.pct_change()=0         NaN
1    1.000000
2    0.500000
3    0.333333
4    0.250000
dtype: float64
s1.prod()=120
s1.std()=1.5811388300841898
s1.var()=2.5
s3.unique()=array([2, 4, 3, 5])
s1.nunique()=5
'''

s7 = pd.Series([20,20,30,30,20])
s8 = s7.replace(20, 100)
print('s8 = s7.replace(20, 100):\n', s8)

'''
s8 = s7.replace(20, 100):
 0    100
1    100
2     30
3     30
4    100
dtype: int64
'''

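Handling missing values

Method	Description
s.backfill(*[, axis, inplace, limit, …])	Fills each missing value with the next non-missing value and returns a new Series.
s.bfill(*[, axis, inplace, limit, downcast])	Fills each missing value with the next non-missing value and returns a new Series.
s.dropna(*[, axis, inplace, how, …])	Removes the missing values and returns a new Series.
s.ffill(*[, axis, inplace, limit, downcast]) s.pad(*[, axis, inplace, limit, downcast])	Fills each missing value with the immediately preceding valid value and returns a new Series.
s.fillna([value, method, axis, …])	Fills missing values with value or by method. If value is a scalar, every missing value is filled with it; if value is a dict, its keys are index labels of s and a missing value at such a label is replaced by the corresponding dict value; method names one of the fill methods above (bfill, backfill, ffill, and so on). Returns a new Series.
s.interpolate([method, axis, limit, …])	Fills missing values by interpolation and returns a new Series.
s.isna() s.isnull()	Detects missing values; returns a new boolean Series that is True where the element is missing and False otherwise.
s.notna() s.notnull()	The inverse of isna(): False where the element is missing.
s.first_valid_index()	Returns the index label of the first non-missing value.
s.last_valid_index()	Returns the index label of the last non-missing value.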

s1 = pd.Series([None, None, 3, 4, None], index=['A', 'B', 'C', 'D', 'E'])
print(s1.first_valid_index()) #C
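
The fill methods in the table behave differently at the edges of the data. The sketch below compares them on one made-up Series (all names are illustrative); it assumes the default settings of each method.

```python
import pandas as pd

s = pd.Series([None, 2.0, None, None, 8.0], index=['A', 'B', 'C', 'D', 'E'])

forward = s.ffill()               # A stays NaN: there is no earlier value to copy
backward = s.bfill()              # 2, 2, 8, 8, 8
constant = s.fillna(0)            # every NaN becomes 0
per_label = s.fillna({'A': -1})   # only the NaN at label A is filled
linear = s.interpolate()          # C and D become 4 and 6; the leading NaN remains
cleaned = s.dropna()              # only B and E remain
missing = int(s.isna().sum())     # number of missing values: 3

print(forward, backward, constant, per_label, linear, cleaned, missing, sep='\n')
```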

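Higher-order functions

Method	Description
s.apply(func[, convert_dtype, args, by_row])	Calls func on each value of the Series. func takes one argument; when args supplies n extra arguments, func receives n+1 arguments, and the first is always the element value.
s.agg([func, axis])	Aggregates the Series; func can be a function, the name of an aggregation as a string, or a list of these.
s.aggregate([func, axis])	Alias of agg.
s.transform(func[, axis])	Applies func to the elements of s and returns a new Series of the same length.
s.map(arg[, na_action])	If arg is a dict, each value of s is looked up as a key in the dict and replaced by the corresponding dict value; values with no matching key become NaN. If arg is a function, it is applied to each element, much like transform.
s.groupby([by, axis, level, as_index, …])	Groups the elements by the given keys; by supplies the group keys.
s.rolling(window[, min_periods, …])	Rolling-window calculation; window is the number of elements in each window.
s.expanding([min_periods, axis, method])	Expanding-window calculation.
s.ewm([com, span, halflife, alpha, …])	Exponentially weighted calculation.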

s1 = pd.Series([2,30,4,31,32], index=['A','B','C','D','E'])
sg = s1.groupby(['1','2','1','2','2']) # split the 5 values into groups '1' and '2'
print(sg.groups) # print the index labels in each group
#{'1': ['A', 'C'], '2': ['B', 'D', 'E']}
print(sg.mean()) #分组求平均
# 1     3.0
# 2    31.0
# dtype: float64

s1 = pd.Series([1,2,3,4,5,6])
print(s1.rolling(2).sum())

'''
0     NaN
1     3.0
2     5.0
3     7.0
4     9.0
5    11.0
dtype: float64
'''

print(s1.expanding(2).sum())
'''
0     NaN
1     3.0
2     6.0
3    10.0
4    15.0
5    21.0
dtype: float64
'''
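
The element-wise functions in the table (apply, map, agg, transform) have no example above, so here is a minimal sketch. The helper scale and all variable names are hypothetical and exist only for this illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

def scale(x, factor):
    # x is the element value; factor comes from args
    return x * factor

scaled = s.apply(scale, args=(10,))        # 10, 20, 30, 40
labels = s.map({1: 'one', 2: 'two'})       # values without a matching key become NaN
squared = s.map(lambda x: x ** 2)          # 1, 4, 9, 16
total = s.agg('sum')                       # a single value: 10
summary = s.agg(['min', 'max', 'mean'])    # a Series indexed by the function names
doubled = s.transform(lambda x: x * 2)     # same length as s: 2, 4, 6, 8

print(scaled, labels, squared, total, summary, doubled, sep='\n')
```

Passing a list to agg returns one result per function, which is a convenient way to build a custom summary.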

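DataFrame

A DataFrame is a tabular data structure, similar to an Excel sheet or a SQL table, with both row labels (index) and column labels (columns). It is also called a heterogeneous table: the columns can have different data types, such as strings, integers, or floats.

Each row of a DataFrame can be viewed as a Series whose values are labeled by the column names, and each column is itself a Series that shares the row index. In that sense a DataFrame builds on Series. DataFrames are used very widely in data analysis because they describe data clearly and intuitively.

The characteristics of the DataFrame structure can be summarized as follows:

Each column of a DataFrame may hold a different data type;

A DataFrame is a tabular data structure with rows and columns;

Every value in a DataFrame can be modified;

Rows and columns of a DataFrame can be added or deleted;

A DataFrame has two labeled axes: row labels and column labels;

Arithmetic can be performed on the rows and columns of a DataFrame.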

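Defining a DataFrame object

pd.DataFrame( data, index, columns, dtype, copy)

Parameters:

data: the input data; it can be an ndarray, Series, list, dict, scalar, or another DataFrame.

index: the row labels; if no index is passed, the row labels default to np.arange(n), where n is the number of rows in data.

columns: the column labels; if no columns are passed, the column labels default to np.arange(n), where n is the number of columns.

dtype: the data type to force on the columns; only a single dtype is allowed, and if omitted the type of each column is inferred.

copy: whether to copy the input data. The default is False for ndarray or DataFrame input; recent Pandas versions copy dict input by default.

Creating an empty DataFrame: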

import pandas as pd
df = pd.DataFrame()
print(df)
'''
Empty DataFrame
Columns: []
Index: []
'''

Creating a DataFrame from a list

A flat list creates a simple DataFrame with a single column, for example:

import pandas as pd

df = pd.DataFrame([1,2,3,4,5,6])
print(df)
'''
   0
0  1
1  2
2  3
3  4
4  5
5  6
'''

df = pd.DataFrame([1,2,3,4,5,6], columns=['No']) # specify the column name
print(df)
'''
   No
0   1
1   2
2   3
3   4
4   5
5   6
'''

Nested lists create a DataFrame with several columns:

df = pd.DataFrame([['Alex', 10], ['John', 13], ['Rose', 8]], columns=['Name', 'Age'])
print(df)

'''
   Name  Age
0  Alex   10
1  John   13
2  Rose    8
'''

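Creating a DataFrame from a dict

When a DataFrame is created from a dict, each key becomes a column and each value holds that column's data (usually a list); the value lists must all have the same length.

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)

'''
    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42
'''

A list of dicts works too: each list element is one row and the dict keys are the column names. The dicts should share the same keys; a key missing from a row produces NaN in that column.

df = pd.DataFrame([{'Name':'Alex', 'Age':10}, {'Name':'John', 'Age':13}, {'Name': 'Rose', 'Age': 8}])
print(df)

'''
   Name  Age
0  Alex   10
1  John   13
2  Rose    8
'''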

Creating a DataFrame from Series

You can pass a dict of Series to create a DataFrame; the row index of the result is the union of all the Series indexes.

df = pd.DataFrame({'Name':pd.Series(['Tom', 'Jack', 'Steve', 'Ricky','Bob'], index=['A', 'B', 'C', 'D', 'E']),
                   'Age':pd.Series([28,34,29,42], index=['A', 'B', 'C', 'D'])})
print(df)

'''
    Name   Age
A    Tom  28.0
B   Jack  34.0
C  Steve  29.0
D  Ricky  42.0
E    Bob   NaN
'''

Note: the Series indexes should be identical or largely overlapping. The rows of the resulting DataFrame are the union of the Series indexes, and only elements with the same index label end up in the same row.

df = pd.DataFrame({'Name':pd.Series(['Tom', 'Jack', 'Steve', 'Ricky','Bob'], index=['A', 'B', 'C', 'D', 'E']),
                   'Age':pd.Series([28,34,29,42])})
print(df)

'''
    Name   Age
A    Tom   NaN
B   Jack   NaN
C  Steve   NaN
D  Ricky   NaN
E    Bob   NaN
0    NaN  28.0
1    NaN  34.0
2    NaN  29.0
3    NaN  42.0
'''

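Other constructors

Function or method	Description
DataFrame.from_dict(dict)	Accepts a dict of dicts or a dict of array-like sequences and builds a DataFrame.
DataFrame.from_records	Accepts a list of tuples or an ndarray with a structured dtype.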
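
Both constructors can be sketched briefly. The data and variable names below are made up for illustration; orient='index' tells from_dict to treat the dict keys as row labels instead of column names.

```python
import pandas as pd

# Keys become column names (the default, orient='columns').
by_columns = pd.DataFrame.from_dict({'Name': ['Tom', 'Jack'], 'Age': [28, 34]})

# Keys become row labels; the column names are then given separately.
by_rows = pd.DataFrame.from_dict({'r1': [1, 2], 'r2': [3, 4]},
                                 orient='index', columns=['x', 'y'])

# A list of tuples, one tuple per row.
records = [('Tom', 28), ('Jack', 34)]
from_rec = pd.DataFrame.from_records(records, columns=['Name', 'Age'])

print(by_columns)
print(by_rows)
print(from_rec)
```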

Column operations

Selecting a column

A column can be fetched directly by its column label:

df = pd.DataFrame({'Name':pd.Series(['Tom', 'Jack', 'Steve', 'Ricky','Bob'], index=['A', 'B', 'C', 'D', 'E']),
                                    'Age':pd.Series([28,34,29,42], index=['A', 'B', 'C', 'D'])})
print('df:\n', df)
print('df["Name"]:\n', df["Name"])
print('df["Age"]:\n', df["Age"])

'''
df:
     Name   Age
A    Tom  28.0
B   Jack  34.0
C  Steve  29.0
D  Ricky  42.0
E    Bob   NaN
df["Name"]:
 A      Tom
B     Jack
C    Steve
D    Ricky
E      Bob
Name: Name, dtype: object
df["Age"]:
 A    28.0
B    34.0
C    29.0
D    42.0
E     NaN
Name: Age, dtype: float64
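'''

Adding a column

A column can also be added directly by column label. Note that the index of the new column must match the DataFrame's index; otherwise the new column is filled entirely with NaN: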

df = pd.DataFrame({'Name':pd.Series(['Tom', 'Jack', 'Steve', 'Ricky','Bob'], index=['A', 'B', 'C', 'D', 'E']),
                                    'Age':pd.Series([28,34,29,42], index=['A', 'B', 'C', 'D'])})
df['Score'] = pd.Series([90, 58, 99, 100, 48], index=['A', 'B', 'C', 'D', 'E'])
print(df)

'''
    Name   Age  Score
A    Tom  28.0     90
B   Jack  34.0     58
C  Steve  29.0     99
D  Ricky  42.0    100
E    Bob   NaN     48
'''

df['English'] = pd.Series([100, 100, 80, 100, 70])
print(df)
'''
    Name   Age  Score  English
A    Tom  28.0     90      NaN
B   Jack  34.0     58      NaN
C  Steve  29.0     99      NaN
D  Ricky  42.0    100      NaN
E    Bob   NaN     48      NaN
'''

You can also compute a new column from existing DataFrame columns:

df = pd.DataFrame({'Name':pd.Series(['Tom', 'Jack', 'Steve', 'Ricky','Bob'], index=['A', 'B', 'C', 'D', 'E']),
                                    'Age':pd.Series([28,34,29,42], index=['A', 'B', 'C', 'D'])})
df['Math'] = pd.Series([90, 58, 99, 100, 48], index=['A', 'B', 'C', 'D', 'E'])
df['English'] = pd.Series([100, 100, 80, 100, 70], index=['A', 'B', 'C', 'D', 'E'])
df['Total'] = df['Math'] + df['English']
print(df)

'''
    Name   Age  Math  English  Total
A    Tom  28.0    90      100    190
B   Jack  34.0    58      100    158
C  Steve  29.0    99       80    179
D  Ricky  42.0   100      100    200
E    Bob   NaN    48       70    118
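'''

Inserting a column

The insert method inserts a column at a given position:

DataFrame.insert(loc, column, value, allow_duplicates=_NoDefault.no_default)

Parameters:

loc: the insertion position; it must satisfy 0 <= loc <= len(columns).

column: the name of the column to insert.

value: the values of the inserted column, usually a Series or something that can be converted to one.

allow_duplicates: whether a column name that already exists may be inserted again.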

df = pd.DataFrame({'Name':pd.Series(['Tom', 'Jack', 'Steve', 'Ricky','Bob'], index=['A', 'B', 'C', 'D', 'E']),
                                    'Age':pd.Series([28,34,29,42], index=['A', 'B', 'C', 'D'])})
df['Math'] = pd.Series([90, 58, 99, 100, 48], index=['A', 'B', 'C', 'D', 'E'])
df['English'] = pd.Series([100, 100, 80, 100, 70], index=['A', 'B', 'C', 'D', 'E'])
df.insert(2, 'Chinese', [100,99,98,96,90])
print(df)

'''
   Name   Age  Chinese  Math  English
A    Tom  28.0      100    90      100
B   Jack  34.0       99    58      100
C  Steve  29.0       98    99       80
D  Ricky  42.0       96   100      100
E    Bob   NaN       90    48       70
'''

Deleting a column

Both del and pop() can delete a column from a DataFrame.

df = pd.DataFrame({'Name':pd.Series(['Tom', 'Jack', 'Steve', 'Ricky','Bob'], index=['A', 'B', 'C', 'D', 'E']),
                                    'Age':pd.Series([28,34,29,42], index=['A', 'B', 'C', 'D'])})
df['Math'] = pd.Series([90, 58, 99, 100, 48], index=['A', 'B', 'C', 'D', 'E'])
df['English'] = pd.Series([100, 100, 80, 100, 70], index=['A', 'B', 'C', 'D', 'E'])
del df['Age']
print(df)

'''
   Name  Math  English
A    Tom    90      100
B   Jack    58      100
C  Steve    99       80
D  Ricky   100      100
E    Bob    48       70
'''

pop() is defined as follows:

DataFrame.pop(item)

Parameters:

item: the name of the column to remove; pop() returns the removed column as a Series.

df = pd.DataFrame({'Name':pd.Series(['Tom', 'Jack', 'Steve', 'Ricky','Bob'], index=['A', 'B', 'C', 'D', 'E']),
                                    'Age':pd.Series([28,34,29,42], index=['A', 'B', 'C', 'D'])})
df['Math'] = pd.Series([90, 58, 99, 100, 48], index=['A', 'B', 'C', 'D', 'E'])
df['English'] = pd.Series([100, 100, 80, 100, 70], index=['A', 'B', 'C', 'D', 'E'])
df.pop('Age')
print(df)

'''
    Name  Math  English
A    Tom    90      100
B   Jack    58      100
C  Steve    99       80
D  Ricky   100      100
E    Bob    48       70
'''

Row operations

Selecting rows

Rows are selected through the loc attribute: put the row index label inside the square brackets.

df = pd.DataFrame({'Name':pd.Series(['Tom', 'Jack', 'Steve', 'Ricky','Bob'], index=['A', 'B', 'C', 'D', 'E']),
                                    'Age':pd.Series([28,34,29,42], index=['A', 'B', 'C', 'D'])})
df['Math'] = pd.Series([90, 58, 99, 100, 48], index=['A', 'B', 'C', 'D', 'E'])
df['English'] = pd.Series([100, 100, 80, 100, 70], index=['A', 'B', 'C', 'D', 'E'])
print(df.loc['B'])

'''
Name       Jack
Age        34.0
Math         58
English     100
Name: B, dtype: object
'''

loc also accepts two arguments inside the brackets: the first is the row index label and the second is the column name:

df = pd.DataFrame({'Name':pd.Series(['Tom', 'Jack', 'Steve', 'Ricky','Bob'], index=['A', 'B', 'C', 'D', 'E']),
                                    'Age':pd.Series([28,34,29,42], index=['A', 'B', 'C', 'D'])})
df['Math'] = pd.Series([90, 58, 99, 100, 48], index=['A', 'B', 'C', 'D', 'E'])
df['English'] = pd.Series([100, 100, 80, 100, 70], index=['A', 'B', 'C', 'D', 'E'])

print(df.loc['B', 'Age']) #34.0

print(df.loc['B':'D'])
'''
    Name   Age  Math  English
B   Jack  34.0    58      100
C  Steve  29.0    99       80
D  Ricky  42.0   100      100
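'''

Label slices also work, as in the example above; unlike positional slices, they include both endpoints.

Integer positions are supported as well, through the iloc attribute: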

df = pd.DataFrame({'Name':pd.Series(['Tom', 'Jack', 'Steve', 'Ricky','Bob'], index=['A', 'B', 'C', 'D', 'E']),
                                    'Age':pd.Series([28,34,29,42], index=['A', 'B', 'C', 'D'])})
df['Math'] = pd.Series([90, 58, 99, 100, 48], index=['A', 'B', 'C', 'D', 'E'])
df['English'] = pd.Series([100, 100, 80, 100, 70], index=['A', 'B', 'C', 'D', 'E'])
print(df.iloc[1])
'''
Name       Jack
Age        34.0
Math         58
English     100
Name: B, dtype: object
'''

print(df.iloc[1:3])
'''
    Name   Age  Math  English
B   Jack  34.0    58      100
C  Steve  29.0    99       80
'''

print(df.iloc[1, 2]) #58

Adding rows

A row can be added by assigning to loc, much like adding a column:

df = pd.DataFrame({'Name':pd.Series(['Tom', 'Jack', 'Steve', 'Ricky','Bob'], index=['A', 'B', 'C', 'D', 'E']),
                                    'Age':pd.Series([28,34,29,42], index=['A', 'B', 'C', 'D'])})
df['Math'] = pd.Series([90, 58, 99, 100, 48], index=['A', 'B', 'C', 'D', 'E'])
df['English'] = pd.Series([100, 100, 80, 100, 70], index=['A', 'B', 'C', 'D', 'E'])

df.loc['F'] = ['John', 51, 88, 89]
print(df)

'''
   Name   Age  Math  English
A    Tom  28.0    90      100
B   Jack  34.0    58      100
C  Steve  29.0    99       80
D  Ricky  42.0   100      100
E    Bob   NaN    48       70
F   John  51.0    88       89
'''

iloc cannot be used to add a row; it raises IndexError: iloc cannot enlarge its target object
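
When several rows need to be added at once, one option is pd.concat, which the original text does not cover. The sketch below uses made-up data; df, new_rows, combined, and renumbered are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack'], 'Math': [90, 58]}, index=['A', 'B'])
new_rows = pd.DataFrame({'Name': ['John', 'Rose'], 'Math': [88, 75]}, index=['F', 'G'])

combined = pd.concat([df, new_rows])                       # keeps the labels A, B, F, G
renumbered = pd.concat([df, new_rows], ignore_index=True)  # relabels the rows 0..3

print(combined)
print(renumbered)
```

concat returns a new DataFrame and leaves df unchanged, unlike the loc assignment shown earlier.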

Deleting rows

DataFrame.drop(labels=None, *, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

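This method can delete rows or columns. Unless inplace=True is set, it returns a new DataFrame with the data removed and leaves the original unchanged.

Parameters:

labels: the row or column labels to drop; rows by default, used together with axis.

axis: rows by default; with axis=1, labels are column labels.

index: specifies row labels directly.

columns: specifies column labels directly.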

import pandas as pd

df = pd.DataFrame({'Name':pd.Series(['Tom', 'Jack', 'Steve', 'Ricky','Bob'], index=['A', 'B', 'C', 'D', 'E']),
                                    'Age':pd.Series([28,34,29,42], index=['A', 'B', 'C', 'D'])})
df['Math'] = pd.Series([90, 58, 99, 100, 48], index=['A', 'B', 'C', 'D', 'E'])
df['English'] = pd.Series([100, 100, 80, 100, 70], index=['A', 'B', 'C', 'D', 'E'])

df2 = df.drop(['B','C'], axis=0)
print('df:\n', df, '\ndf2:\n', df2)
'''
df:
     Name   Age  Math  English
A    Tom  28.0    90      100
B   Jack  34.0    58      100
C  Steve  29.0    99       80
D  Ricky  42.0   100      100
E    Bob   NaN    48       70 
df2:
     Name   Age  Math  English
A    Tom  28.0    90      100
D  Ricky  42.0   100      100
E    Bob   NaN    48       70
'''

print("df.drop(index=['A', 'D']):\n", df.drop(index=['A', 'D']))
'''
df.drop(index=['A', 'D']):
     Name   Age  Math  English
B   Jack  34.0    58      100
C  Steve  29.0    99       80
E    Bob   NaN    48       70
'''
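
The examples above only drop rows. The same method drops columns through axis=1 or the columns parameter, as this sketch on made-up data shows (all variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve'],
                   'Age': [28, 34, 29],
                   'Math': [90, 58, 99]}, index=['A', 'B', 'C'])

no_age = df.drop('Age', axis=1)                  # labels are column names when axis=1
no_scores = df.drop(columns=['Age', 'Math'])     # same idea, more explicit
trimmed = df.drop(index='B', columns='Age')      # rows and columns in one call
ignored = df.drop(index='Z', errors='ignore')    # a missing label is skipped, not raised

df.drop(index='C', inplace=True)                 # modifies df itself and returns None

print(no_age, no_scores, trimmed, ignored, df, sep='\n')
```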

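Other common attributes and methods

Name	Description
T	Transposes rows and columns.
axes	Returns a list whose only members are the row axis labels and the column axis labels.
dtypes	Returns the data type of each column.
empty	Returns True if the DataFrame holds no data, that is, if any axis has length 0.
ndim	The number of axes, also the number of array dimensions.
shape	Returns a tuple giving the dimensions of the DataFrame.
size	The number of elements in the DataFrame.
values	The element values of the DataFrame as a NumPy array.
head()	Returns the first n rows.
tail()	Returns the last n rows.
shift()	Shifts the rows or columns by the given number of periods.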
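
A few of these attributes and methods in action, as a minimal sketch on made-up data (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve'], 'Age': [28, 34, 29]},
                  index=['A', 'B', 'C'])

print(df.shape, df.size, df.ndim, df.empty)   # (3, 2) 6 2 False
print(df.dtypes)

transposed = df.T              # rows become columns and vice versa
first_two = df.head(2)         # rows A and B
last_one = df.tail(1)          # row C
shifted = df['Age'].shift(1)   # values move down one row: NaN, 28.0, 34.0

print(transposed, first_two, last_one, shifted, sep='\n')
```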