A Detailed Guide to the pandas Module, Python's Data Analysis Workhorse

The pandas module

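pandas provides two core data structures: Series (one-dimensional) and DataFrame (two-dimensional). A third structure, Panel, appears in older tutorials but has been removed from current pandas releases.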

1. Module

1.1 Series

1.1.1 Creation

(1) Creating a `Series` from a list

import pandas as pd

data = [10, 20, 30]
s = pd.Series(data)

print(s)
# 0    10
# 1    20
# 2    30
# dtype: int64

(2) Custom index

s = pd.Series([100, 200, 300], index=["a", "b", "c"])
print(s)
# a    100
# b    200
# c    300
# dtype: int64

(3) Creating a `Series` from a dict

data = {"Math": 90, "English": 85, "Python": 95}
s = pd.Series(data)
print(s)
# Math       90
# English    85
# Python     95
# dtype: int64

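1.1.2 Common methods

Method / attribute	Description
`s.index`	View the index
`s.values`	View the data (as an `ndarray`)
`s.dtype`	Data type
`s.head(n)`	First `n` elements
`s.tail(n)`	Last `n` elements
`s.isnull()`	Test for missing values
`s.notnull()`	Test for non-missing values
`s.sum()` / `s.mean()`	Sum, mean, and similar statistics
`s.sort_index()`	Sort by index
`s.sort_values()`	Sort by value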
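As a rough illustration of the table above, the sketch below applies a few of these attributes and methods to a small Series. The sample data and the names `scores`, `top_two`, and `by_value` are made up for this example and do not come from the original article.

```python
import pandas as pd

# Hypothetical sample data; None is stored as NaN, so the dtype becomes float64
scores = pd.Series([90, None, 95, 85], index=["Math", "English", "Python", "Art"])

print(list(scores.index))      # index labels
print(scores.isnull().sum())   # number of missing values
print(scores.mean())           # NaN is skipped by default

top_two = scores.head(2)          # first two elements
by_value = scores.sort_values()   # ascending order; NaN is placed last
print(by_value)
```

Note that `sort_values()` returns a new Series and leaves `scores` unchanged.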

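1.2 DataFrame

(1) Creation

A DataFrame is created much like a Series: pass a dict of columns (or a list of records) to `pd.DataFrame`, optionally with an `index` argument.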
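Since the article gives no DataFrame creation example, here is a minimal sketch of two common ways to build one. The variable names `df_from_dict` and `df_from_records` are illustrative assumptions.

```python
import pandas as pd

# From a dict of columns: each key becomes a column name
df_from_dict = pd.DataFrame(
    {"Name": ["Alice", "Bob"], "Score": [85, 90]},
    index=["a", "b"],  # optional custom row labels
)

# From a list of records: each dict becomes one row
df_from_records = pd.DataFrame([
    {"Name": "Alice", "Score": 85},
    {"Name": "Bob", "Score": 90},
])

print(df_from_dict)
print(df_from_records)
```

Without an explicit `index`, the rows are labeled 0, 1, 2, and so on, just as with a Series.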

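(2) Common methods

Attribute / method	Purpose
`df.shape`	Number of rows and columns, as (rows, columns)
`df.columns`	All column names
`df.index`	The row index
`df.values`	The data as a two-dimensional array
`df.dtypes`	Data type of each column
`df.info()`	Overview: row count, column count, non-null counts, and so on
`df.describe()`	Summary statistics for numeric columns
`df.head(n)`	First `n` rows
`df.tail(n)`	Last `n` rows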
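The following sketch tries out a few of the inspection tools listed above on a tiny frame. The name `df_demo` and its contents are assumptions for illustration only.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [20, 21, 19],
})

print(df_demo.shape)            # (rows, columns)
print(list(df_demo.columns))    # column names
print(df_demo.dtypes)           # one dtype per column

summary = df_demo.describe()    # by default only numeric columns are summarized
print(summary.loc["mean", "Age"])
```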

(3) Accessing data

import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [20, 21, 19],
    "Score": [85, 90, 95]
}
index_labels = ["a", "b", "c"]

df = pd.DataFrame(data, index=index_labels)
print(df)
#       Name  Age  Score
# a    Alice   20     85
# b      Bob   21     90
# c  Charlie   19     95

Access by column name

df["Age"]           # returns a Series
# a    20
# b    21
# c    19
# Name: Age, dtype: int64

df[["Age", "Score"]] # returns a new DataFrame
#    Age  Score
# a   20     85
# b   21     90
# c   19     95

Row access: `loc` and `iloc`

df.loc["a"]      # access by label
df.iloc[0]       # access by position

# Name     Alice
# Age         20
# Score       85
# Name: a, dtype: object

Combined row and column access

df.loc["a", "Name"]
# Alice
df.iloc[0, 1]
# 20

(4) Adding and deleting columns

Adding a new column:

df["Passed"] = df["Score"] >= 90
#       Name  Age  Score  Passed
# a    Alice   20     85   False
# b      Bob   21     90    True
# c  Charlie   19     95    True

Deleting a column:

axis=1 means a column is dropped.

axis=0 means a row is dropped.

df.drop("Age", axis=1, inplace=True)
#       Name  Score  Passed
# a    Alice     85   False
# b      Bob     90    True
# c  Charlie     95    True
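As a complement to the `axis` explanation above, this sketch uses the `columns=` and `index=` keywords of `drop()`, which some readers find easier to remember than axis numbers. It also omits `inplace=True`, so the original frame is left untouched. The names `df_demo`, `no_age`, and `no_row_a` are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Name": ["Alice", "Bob"], "Age": [20, 21], "Score": [85, 90]},
    index=["a", "b"],
)

no_age = df_demo.drop(columns="Age")   # equivalent to drop("Age", axis=1)
no_row_a = df_demo.drop(index="a")     # equivalent to drop("a", axis=0)

print(no_age.columns.tolist())
print(no_row_a.index.tolist())
print(df_demo.shape)                   # the original frame is unchanged
```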

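2. Data selection and indexing

Operation	Method
Select by label	`df.loc[row, column]`
Select by position	`df.iloc[row_number, column_number]`
Conditional filtering	`df[condition]`
Slicing	`df[start:stop]` or a `loc[]` slice
Modify values	`df.loc[row, column] = new_value`
Drop rows or columns	`df.drop(..., axis=...)`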

import pandas as pd

df = pd.DataFrame({
    "Name": ["Tom", "Jerry", "Alice"],
    "Age": [20, 22, 19],
    "Score": [85, 90, 95]
}, index=["a", "b", "c"])

print(df)
#     Name  Age  Score
# a    Tom   20     85
# b  Jerry   22     90
# c  Alice   19     95

(1) Using `loc[]` (label-based)

# Single element
print(df.loc["a", "Age"])       # prints 20

# Single row
print(df.loc["b"])              # all fields of the second row

# Multiple rows and columns
print(df.loc[["a", "c"], ["Name", "Score"]])
#     Name  Score
# a    Tom     85
# c  Alice     95

(2) Using `iloc[]` (position-based)

# Row 0, column 1
print(df.iloc[0, 1])            # prints 20

# Row 1
print(df.iloc[1])

# First two columns of the first two rows
print(df.iloc[0:2, 0:2])

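(3) Slicing

Row slicing by position includes the start and excludes the end, whereas label slicing with `.loc[]` includes the end label.

So `df.loc["a":"b"]` contains row "b"; this is a deliberate feature of `.loc[]`.

print(df[0:2])         # first two rows (positional, same as iloc)
print(df.loc["a":"b"]) # label slicing includes the end label (also prints the first two rows)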
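To make the endpoint rule concrete, the sketch below takes one positional slice and one label slice from a three-row frame and compares their lengths. The names `df_demo`, `by_position`, and `by_label` are assumptions for this example.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Score": [85, 90, 95]},
    index=["a", "b", "c"],
)

by_position = df_demo.iloc[0:2]    # positions 0 and 1; position 2 is excluded
by_label = df_demo.loc["a":"c"]    # labels "a" through "c"; "c" is included

print(len(by_position))
print(len(by_label))
```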

Column slicing (`.loc[]` is recommended)

print(df.loc[:, "Name":"Score"])  # all rows, columns from Name through Score (end included)

(4) Conditional filtering (boolean indexing)

# Filtering on multiple conditions (each condition must be parenthesized)
print(df[(df["Age"] > 19) & (df["Score"] > 85)])
#     Name  Age  Score
# b  Jerry   22     90
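Besides `&`, boolean indexing also supports `|` (or), `~` (not), and `isin()` for membership tests. The sketch below shows one of each on data similar to the example above; the variable names are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Tom", "Jerry", "Alice"],
    "Age": [20, 22, 19],
    "Score": [85, 90, 95],
})

# OR: younger than 20, or a score of at least 90
either = df_demo[(df_demo["Age"] < 20) | (df_demo["Score"] >= 90)]
# NOT: everyone except Tom
not_tom = df_demo[~(df_demo["Name"] == "Tom")]
# Membership: names found in a given list
in_list = df_demo[df_demo["Name"].isin(["Tom", "Alice"])]

print(either)
print(not_tom)
print(in_list)
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the symbolic operators and the parentheses are needed.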

(5) Modifying data

# Modify a single value (direct assignment)
df.loc["a", "Score"] = 88

# Modify an entire column
df["Age"] = df["Age"] + 1 # add 1 to everyone's age

# Modify selected rows
df.loc[df["Name"] == "Tom", "Age"] = 100

(6) Deleting data

Deleting a column:

df.drop("Age", axis=1, inplace=True)

Deleting a row:

df.drop("b", axis=0, inplace=True)

3. Data cleaning and preprocessing

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Name": ["Tom", "Jerry", "Tom", " Alice ", np.nan],
    "Age": [20, np.nan, 20, 19, 22],
    "Score": [85, 90, 0, 95, np.nan],
    "Email": ["tom@example.com", "jerry@cat.com", "tom@example.com", "alice@example.com", "bob@unknown.com"],
    "Date": ["2024-01-01", "2024-01-03", "2024-01-01", "2024-01-05", "2024-01-07"]
})

print(df)

Output:

      Name   Age  Score              Email        Date
0      Tom  20.0   85.0    tom@example.com  2024-01-01
1    Jerry   NaN   90.0      jerry@cat.com  2024-01-03
2      Tom  20.0    0.0    tom@example.com  2024-01-01
3   Alice   19.0   95.0  alice@example.com  2024-01-05
4      NaN  22.0    NaN    bob@unknown.com  2024-01-07

(1) Handling missing values (NaN)

Detecting missing values

df.isnull()        # boolean table: True where a value is missing
#     Name    Age  Score  Email   Date
# 0  False  False  False  False  False
# 1  False   True  False  False  False
# 2  False  False  False  False  False
# 3  False  False  False  False  False
# 4   True  False   True  False  False

df.isnull().sum()  # number of missing values per column
# Name     1
# Age      1
# Score    1
# Email    0
# Date     0
# dtype: int64

df.notnull()       # True where a value is present

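Dropping missing values

df.dropna()                      # drop rows that contain any NaN
df.dropna(axis=1)               # drop columns that contain any NaN
df.dropna(subset=["Age"])       # drop only the rows where "Age" is NaN

Filling missing values

df.fillna(0)                    # fill every NaN with 0
df["Age"].fillna(df["Age"].mean()) # fill one column with its mean
df.ffill()                     # forward fill with the previous value (fillna(method="ffill") is deprecated)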
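The fill calls above return new objects rather than modifying `df`. As a sketch of a slightly more targeted approach, `fillna()` also accepts a dict that maps column names to fill values, so each column can get its own default. The frame `df_demo` and the chosen defaults are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Tom", np.nan],
    "Age": [20.0, np.nan],
    "Score": [np.nan, 90.0],
})

# One fill value per column; the result is a new DataFrame
filled = df_demo.fillna({
    "Name": "Unknown",
    "Age": df_demo["Age"].mean(),
    "Score": 0,
})

print(filled)
print(df_demo.isnull().sum().sum())   # the original still has its missing values
```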

(2) Handling duplicates

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Name": ["Tom", "Jerry", "Tom", " Alice ", np.nan],
    "Age": [20, np.nan, 20, 19, 22],
    "Score": [85, 90, 85, 95, np.nan],
    "Email": ["tom@example.com", "jerry@cat.com", "tom@example.com", "alice@example.com", "bob@unknown.com"],
    "Date": ["2024-01-01", "2024-01-03", "2024-01-01", "2024-01-05", "2024-01-07"]
})

Finding duplicate rows

df.duplicated()                # flags each row as a duplicate or not (returns booleans)
# 0    False
# 1    False
# 2     True
# 3    False
# 4    False
# dtype: bool

df[df.duplicated()]            # show the duplicate rows
#   Name   Age  Score            Email        Date
# 2  Tom  20.0   85.0  tom@example.com  2024-01-01

Dropping duplicate rows

df.drop_duplicates(inplace=True)
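By default `drop_duplicates()` compares entire rows and keeps the first occurrence. The sketch below shows the `subset` and `keep` parameters, which restrict the comparison to chosen columns and select which occurrence survives. The data and variable names are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Tom", "Jerry", "Tom"],
    "Score": [85, 90, 70],
})

# Rows 0 and 2 differ in Score, so no row is a full duplicate
full = df_demo.drop_duplicates()

# Compare on Name only and keep the last occurrence of each name
by_name = df_demo.drop_duplicates(subset=["Name"], keep="last")

print(full)
print(by_name)
```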

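(3) String handling (very common)

df["Name"].str.lower()         # convert to lowercase
df["Name"].str.strip()         # strip leading and trailing whitespace
df["Name"].str.contains("Tom", na=False) # test whether the value contains a substring
# Returns True where the value contains "Tom" and False otherwise; na=False makes rows with NaN yield False instead of NaN
df["Email"].str.split("@")     # split the address at "@"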
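The `.str` calls above return new Series, so the results need to be assigned back to take effect. A small sketch of chaining string methods and of splitting into separate columns with `expand=True` follows; the `Domain` column and the sample data are assumptions for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": [" Alice ", "TOM"],
    "Email": ["alice@example.com", "tom@example.com"],
})

# Chain string methods, then assign the cleaned column back
df_demo["Name"] = df_demo["Name"].str.strip().str.title()

# expand=True returns a DataFrame with one column per piece
parts = df_demo["Email"].str.split("@", expand=True)
df_demo["Domain"] = parts[1]

print(df_demo)
```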

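(4) Data type conversion

Inspecting data types

df.dtypes

Forcing a type conversion

# "Age" still contains NaN, so astype(int) would raise; the nullable "Int64" dtype keeps the gap as <NA>
df["Age"] = df["Age"].astype("Int64")   # to (nullable) integer
df["Date"] = pd.to_datetime(df["Date"]) # to datetime
df["Name"] = df["Name"].astype("category") # to categorical
# Resulting dtypes: Name -> category, Age -> Int64, Date -> datetime64.
# Score stays float64 because it still contains NaN.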
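When a column arrives as text and may contain unparseable entries, `pd.to_numeric` with `errors="coerce"` is a common first step before casting. The sketch below uses made-up input (`raw`) to show the idea.

```python
import pandas as pd

# Hypothetical raw input: numbers stored as text, with one bad entry
raw = pd.Series(["20", "abc", "19"])

ages = pd.to_numeric(raw, errors="coerce")   # "abc" cannot be parsed and becomes NaN
ages_int = ages.astype("Int64")              # nullable integer keeps the gap as <NA>

print(ages)
print(ages_int)
```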

(5) Renaming rows and columns

df.rename(columns={"Age": "Age (years)"}, inplace=True)
df.rename(index={0: "first_row"}, inplace=True)

(6) Replacing data

df["Score"] = df["Score"].replace(0, 60)    # replace a score of 0 with 60
df.replace({"Tom": "Thomas"}, inplace=True) # replace the value everywhere, across all columns

4. Statistics, grouping, and aggregation

import pandas as pd

# Create a simple DataFrame
data = {
    "Name": ["Tom", "Jerry", "Alice", "Bob", "Eve"],
    "Score": [85, 90, 95, 75, 88],
    "Age": [20, 22, 19, 21, 23]
}

df = pd.DataFrame(data)

#     Name  Score  Age
# 0    Tom     85   20
# 1  Jerry     90   22
# 2  Alice     95   19
# 3    Bob     75   21
# 4    Eve     88   23

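(1) Basic statistical methods

These apply to a Series or to the numeric columns of a DataFrame:

df["Score"].mean()        # mean
df["Score"].sum()         # sum
df["Score"].max()         # maximum
df["Score"].min()         # minimum
df["Score"].std()         # standard deviation
df["Score"].count()       # number of non-null values
df["Score"].value_counts()# number of occurrences of each value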

(2) describe(): all summary statistics at once

df.describe()
#            Score        Age
# count   5.000000   5.000000
# mean   86.600000  21.000000
# std     7.436397   1.581139
# min    75.000000  19.000000
# 25%    85.000000  20.000000
# 50%    88.000000  21.000000
# 75%    90.000000  22.000000
# max    95.000000  23.000000

It reports count, mean, std, min, the 25% quantile, the median (50%), the 75% quantile, and max, which makes it one of the most frequently used methods.

(3) Grouping: groupby()

Basic usage:

df.groupby("column")["other_column"].operation()

Example:

data = {
    "Class": ["A", "A", "B", "B", "B"],
    "Name": ["Tom", "Jerry", "Alice", "Bob", "Eve"],
    "Score": [85, 90, 95, 75, 88]
}

df = pd.DataFrame(data)

# Mean score per class
df.groupby("Class")["Score"].mean()
# Class
# A    87.5
# B    86.0
# Name: Score, dtype: float64

# Number of students per class
df.groupby("Class")["Name"].count()
# Class
# A    2
# B    3
# Name: Name, dtype: int64

(4) Grouping by multiple columns

data = {
    "Year": [2020, 2020, 2021, 2021, 2021],
    "Class": ["A", "B", "A", "B", "A"],
    "Score": [85, 90, 95, 75, 88]
}

df = pd.DataFrame(data)

# Group by year and class, then compute the mean score
grouped_score = df.groupby(["Year", "Class"])["Score"].mean()
print(grouped_score)
# Year  Class
# 2020  A        85.0
#       B        90.0
# 2021  A        91.5
#       B        75.0
# Name: Score, dtype: float64

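(5) Aggregating with several functions: agg()

# Mean and maximum score per class
df.groupby("Class")["Score"].agg(["mean", "max"])
#             mean  max
# Class                
# A      89.333333   95
# B      82.500000   90

# Mean score plus number of students per class
# (this DataFrame has no "Name" column, so the rows are counted through "Year")
df.groupby("Class").agg({
    "Score": "mean",
    "Year": "count"
})
#            Score  Year
# Class                 
# A      89.333333     3
# B      82.500000     2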
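When the result columns should carry descriptive names, `agg()` also accepts keyword arguments of the form `new_name=(column, function)`, often called named aggregation. The sketch below is one possible way to write the per-class summary; the output column names `mean_score`, `best_score`, and `students` are chosen for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Class": ["A", "B", "A", "B", "A"],
    "Score": [85, 90, 95, 75, 88],
})

# Each keyword becomes an output column: name=(source column, function)
summary = df_demo.groupby("Class").agg(
    mean_score=("Score", "mean"),
    best_score=("Score", "max"),
    students=("Score", "count"),
)

print(summary)
```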

(6) Sorting within groups

# Sort by score within each class (class ascending, score descending)
df.sort_values(["Class", "Score"], ascending=[True, False])
#    Year Class  Score
# 2  2021     A     95
# 4  2021     A     88
# 0  2020     A     85
# 1  2020     B     90
# 3  2021     B     75

(7) Pivot tables: pivot_table

pivot_table() summarizes data quickly along row and column dimensions, much like a pivot table in Excel.

# Mean score per class
df.pivot_table(values="Score", index="Class", aggfunc="mean")
#            Score
# Class           
# A      89.333333
# B      82.500000

# Count and mean score together
df.pivot_table(values="Score", index="Class", aggfunc=["count", "mean"])
#       count       mean
#       Score      Score
# Class                 
# A         3  89.333333
# B         2  82.500000
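The examples above only use the row dimension (`index`). As a sketch of the column dimension, passing `columns="Year"` spreads the years across the top so that each cell holds the mean score for one class in one year. The frame is rebuilt here as `df_demo` so the snippet runs on its own.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Year": [2020, 2020, 2021, 2021, 2021],
    "Class": ["A", "B", "A", "B", "A"],
    "Score": [85, 90, 95, 75, 88],
})

# Rows are classes, columns are years, cells are mean scores
table = df_demo.pivot_table(values="Score", index="Class", columns="Year", aggfunc="mean")

print(table)
```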

5. Merging and joining data

(1) `concat()`: concatenating multiple DataFrames by row or by column

Row-wise concatenation (the default, `axis=0`)

import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Tom", "Jerry"],
    "Score": [85, 90]
})

df2 = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Score": [95, 80]
})

result = pd.concat([df1, df2])
#     Name  Score
# 0    Tom     85
# 1  Jerry     90
# 0  Alice     95
# 1    Bob     80

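Row-wise concatenation keeps each frame's original index, so the labels 0 and 1 repeat above. Pass `ignore_index=True` to renumber the rows: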

result = pd.concat([df1, df2], ignore_index=True)
#     Name  Score
# 0    Tom     85
# 1  Jerry     90
# 2  Alice     95
# 3    Bob     80

Column-wise concatenation (set `axis=1`)

df3 = pd.DataFrame({
    "Age": [20, 22, 21, 23]
})

result = pd.concat([result, df3], axis=1)
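One detail worth knowing is that column-wise concatenation aligns rows by index label, not by position. The sketch below uses two small frames with partly different indexes (`left` and `right`, both made up for this example) to show where NaN appears.

```python
import pandas as pd

left = pd.DataFrame({"Score": [85, 90]}, index=[0, 1])
right = pd.DataFrame({"Age": [20, 22]}, index=[1, 2])

# Rows are matched on index labels; labels missing on one side yield NaN
combined = pd.concat([left, right], axis=1)

print(combined)
```

If the frames should be glued together purely by position, resetting both indexes first (for example with `reset_index(drop=True)`) avoids the misalignment.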

(2) `merge()`: combining tables on a key column

Merging on a single column (inner join by default)

students = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["Tom", "Jerry", "Alice"]
})

scores = pd.DataFrame({
    "ID": [1, 2, 4],
    "Score": [85, 90, 88]
})

result = pd.merge(students, scores, on="ID")

#    ID   Name  Score
# 0   1    Tom     85
# 1   2  Jerry     90

Left join (keeps every record from students)

result_left = pd.merge(students, scores, on="ID", how="left")
print(result_left)

Output:

   ID   Name  Score
0   1    Tom   85.0
1   2  Jerry   90.0
2   3  Alice    NaN

Right join (keeps every record from scores)

result_right = pd.merge(students, scores, on="ID", how="right")
print(result_right)

Output:

   ID   Name  Score
0   1    Tom     85
1   2  Jerry     90
2   4    NaN     88

Outer join (keeps every record from both DataFrames; where one side has no match, the gap is filled with NaN)

result_outer = pd.merge(students, scores, on="ID", how="outer")
print(result_outer)

Output:

   ID   Name  Score
0   1    Tom   85.0
1   2  Jerry   90.0
2   3  Alice    NaN
3   4    NaN   88.0

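Join types (the `how` parameter)

how	Meaning
`inner`	Intersection (default)
`left`	Left table takes priority
`right`	Right table takes priority
`outer`	Union (everything is kept)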
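To see which side each row of a merge came from, `merge()` accepts `indicator=True`, which adds a `_merge` column with the values `both`, `left_only`, or `right_only`. The sketch below reuses the students and scores data from this section; `only_left` is an illustrative name.

```python
import pandas as pd

students = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Tom", "Jerry", "Alice"]})
scores = pd.DataFrame({"ID": [1, 2, 4], "Score": [85, 90, 88]})

# The _merge column records the origin of every row
merged = pd.merge(students, scores, on="ID", how="outer", indicator=True)

# Students without a score record
only_left = merged[merged["_merge"] == "left_only"]

print(merged)
print(only_left)
```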

(3) Joining on multiple columns

pd.merge(df1, df2, on=["ID", "Year"])

import pandas as pd

# Create the first DataFrame, df1, with student ID, name, and school year
df1 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Name": ["Tom", "Jerry", "Alice", "Bob"],
    "Year": [2021, 2021, 2022, 2022]
})

# Create the second DataFrame, df2, with student ID, school year, course, and score
df2 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Year": [2021, 2021, 2022, 2022],
    "Course": ["Math", "English", "History", "Science"],
    "Score": [85, 90, 88, 92]
})

# Merge the two DataFrames on "ID" and "Year" with pd.merge
result = pd.merge(df1, df2, on=["ID", "Year"])

print(result)

Output:

   ID   Name  Year   Course  Score
0   1    Tom  2021     Math     85
1   2  Jerry  2021  English     90
2   3  Alice  2022  History     88
3   4    Bob  2022  Science     92

(4) `join()`: joining two tables on their index

df1 = pd.DataFrame({"Score": [85, 90]}, index=["Tom", "Jerry"])
df2 = pd.DataFrame({"Age": [20, 21]}, index=["Tom", "Jerry"])

result = df1.join(df2) # attach the columns of df2 to df1

Output:

       Score  Age
Tom       85   20
Jerry     90   21
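The example above joins two frames with identical indexes. As a sketch of what happens when the indexes differ, `join()` defaults to a left join and also accepts `how="outer"`. The frames `scores` and `ages` below are made up for this illustration.

```python
import pandas as pd

scores = pd.DataFrame({"Score": [85, 90]}, index=["Tom", "Jerry"])
ages = pd.DataFrame({"Age": [20, 19]}, index=["Tom", "Alice"])

left_joined = scores.join(ages)                 # default how="left": keeps Tom and Jerry
outer_joined = scores.join(ages, how="outer")   # keeps Tom, Jerry, and Alice

print(left_joined)
print(outer_joined)
```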

Author: TY-2025
