Code Collector (代码收藏家) Technical Tutorials 2024-02-18

Practical Pandas Tips

DataFrame

Removing duplicates with drop_duplicates

Reference: official documentation, pandas.DataFrame.drop_duplicates

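Parameters:
subset: deduplicate on the columns given in subset; defaults to all columns.
keep: {'first', 'last', False}; 'first' or 'last' keeps the first or last occurrence, while False drops every duplicated record. Defaults to 'first'.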

Creating a DataFrame

import pandas as pd

df_tmp = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})
"""
>>> df_tmp
     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0
"""

Called with no arguments, it removes records that are duplicated across all columns

>>> df_tmp.drop_duplicates()
     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

With the subset argument, the first of each set of duplicate records is kept by default

>>> df_tmp.drop_duplicates(subset="brand")
     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
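
The examples above only use the default keep='first'. As a minimal sketch of the other two keep values described in the parameter list, the snippet below rebuilds the same df_tmp and compares keep='last' with keep=False. The variable names last_kept and none_kept are illustrative only.

```python
import pandas as pd

# Same sample data as above.
df_tmp = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

# keep='last': for each (brand, style) pair, keep the last matching row.
last_kept = df_tmp.drop_duplicates(subset=['brand', 'style'], keep='last')
print(last_kept.index.tolist())   # [1, 2, 4]

# keep=False: drop every row that has a duplicate (here rows 0 and 1).
none_kept = df_tmp.drop_duplicates(keep=False)
print(none_kept.index.tolist())   # [2, 3, 4]
```

Note that the original index labels are preserved; chain .reset_index(drop=True) if a fresh 0..n-1 index is wanted.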

Loading a Parquet file

import pandas as pd

# Read the Parquet file
df = pd.read_parquet('file.parquet')

You can also use pyarrow:

import pyarrow.parquet as pq

# Read the Parquet file
table = pq.read_table('file.parquet')

# Convert the data to a pandas DataFrame
df = table.to_pandas()

Error when loading an XLSX file

df = pd.read_excel("test.xlsx", engine='openpyxl')

ValueError: Value must be either numerical or a string containing a wildcard
Reference: openpyxl Value must be either numerical or a string containing a wildcard

Starting with version 3.1.0, openpyxl introduced a new bug; see issue-1959 for details. If a wildcard in the Excel file you are reading contains no digits, the error above is raised.

Method 1: downgrade openpyxl to 3.0.10, which resolves the problem
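
To tell whether an environment is likely affected before downgrading, the installed version can be checked first. This is a rough sketch, not an official check: the helper openpyxl_is_affected is hypothetical, and it assumes, as stated above, that releases from 3.1.0 onward carry the bug. Whether a later release fixes it is not covered here.

```python
from importlib.metadata import PackageNotFoundError, version


def openpyxl_is_affected(installed: str) -> bool:
    """Return True for 3.1 or newer (assumption taken from the text above)."""
    major, minor = (int(part) for part in installed.split('.')[:2])
    return (major, minor) >= (3, 1)


try:
    installed = version('openpyxl')
except PackageNotFoundError:
    print('openpyxl is not installed')
else:
    if openpyxl_is_affected(installed):
        # Downgrade as described above: pip install openpyxl==3.0.10
        print(f'openpyxl {installed} may hit this error')
    else:
        print(f'openpyxl {installed} predates 3.1.0')
```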