Python Basics (11) | A Comprehensive 30,000-Word Guide to the Pandas Library

⭐ This column walks through Python's fundamentals in detail, distilling the key points of the syntax and explaining the hard parts. It is aimed at beginners and newcomers; by working through it you can become proficient in Python and build a solid foundation for the coding skills needed later in data analysis, machine learning, and deep learning.

🔥 This article is part of the Python Basics column: Python Basics Tutorial Series. Subscriptions are welcome; the column is continuously updated.


Table of Contents

  • Introduction
  • 11.1 Creating Objects
  • 11.1.1 The Pandas Series Object
  • 11.1.2 The Pandas DataFrame Object
  • 11.2 Properties of DataFrame
  • 11.3 Numerical Operations and Statistical Analysis
  • 11.4 Handling Missing Values
  • 11.5 Merging Data
  • 11.6 Grouping and Pivot Tables
  • 11.7 Miscellaneous

    Introduction

    NumPy excels at vectorized numerical computation.

    But for more flexible, complex data tasks,

    such as attaching labels to data, handling missing values, grouping, and building pivot tables,

    NumPy falls short.

    Pandas, built on top of NumPy, provides high-level data structures and tools that make data analysis faster and simpler.

    11.1 Creating Objects


    11.1.1 The Pandas Series Object

    A Series is a one-dimensional array of labeled data.

    Creating a Series object

    General form: pd.Series(data, index=index, dtype=dtype)

    data: the data; can be a list, dictionary, or NumPy array

    index: the index, optional

    dtype: the data type, optional

    1. Creating from a list

  • If index is omitted, it defaults to an integer sequence
  • import pandas as pd
    
    data = pd.Series([1.5, 3, 4.5, 6])
    data
    
    0    1.5
    1    3.0
    2    4.5
    3    6.0
    dtype: float64
    
  • Specifying an index
  • data = pd.Series([1.5, 3, 4.5, 6], index=["a", "b", "c", "d"])
    data
    
    a    1.5
    b    3.0
    c    4.5
    d    6.0
    dtype: float64
    
  • Specifying the data type

    If omitted, the dtype is inferred from the data

  • data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])    
    data
    
    a    1
    b    2
    c    3
    d    4
    dtype: int64
    
    data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"], dtype="float")
    data
    
    a    1.0
    b    2.0
    c    3.0
    d    4.0
    dtype: float64
    

    Note: multiple data types are supported

  • When types are mixed, the dtype becomes object
  • data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"])
    data
    
    a    1
    b    2
    c    3
    d    4
    dtype: object
    
    data["a"]
    
    1
    
    data["c"]
    
    '3'
    

    The data type can be coerced explicitly

    data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"], dtype=float)
    data
    
    a    1.0
    b    2.0
    c    3.0
    d    4.0
    dtype: float64
    
    data["c"]
    
    3.0
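An existing Series can also be converted after creation with astype. This is a small sketch (the data here mirrors the example above but is otherwise made up):

```python
import pandas as pd

data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"])
# astype returns a NEW Series with the requested dtype; the original is unchanged
converted = data.astype(float)
print(converted["c"])   # 3.0
```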
    

    Values that cannot be converted to float raise an error

    data = pd.Series([1, 2, "a", 4], index=["a", "b", "c", "d"], dtype=float)
    data
    
    ---------------------------------------------------------------------------
    
    ValueError                                Traceback (most recent call last)
    
    ~\AppData\Local\Temp/ipykernel_9236/4046912764.py in <module>
    ----> 1 data = pd.Series([1, 2, "a", 4], index=["a", "b", "c", "d"], dtype=float)
          2 data
    
    
    ValueError: could not convert string to float: 'a'
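If you want unparseable entries to become NaN instead of raising an error, one option (a sketch, not covered in the original text) is pd.to_numeric with errors="coerce":

```python
import pandas as pd

# Build the Series with its natural object dtype first, then convert;
# errors="coerce" replaces unparseable entries with NaN instead of raising
data = pd.Series([1, 2, "a", 4], index=["a", "b", "c", "d"])
converted = pd.to_numeric(data, errors="coerce")
print(converted)
```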
    

    2. Creating from a one-dimensional NumPy array

    import numpy as np
    
    x = np.arange(5)
    pd.Series(x)
    
    0    0
    1    1
    2    2
    3    3
    4    4
    dtype: int32
    

    3. Creating from a dictionary

  • By default, the keys become the index and the values become the data
  • population_dict = {"BeiJing": 2154,
                       "ShangHai": 2424,
                       "ShenZhen": 1303,
                       "HangZhou": 981 }
    population = pd.Series(population_dict)    
    population
    
    BeiJing     2154
    ShangHai    2424
    ShenZhen    1303
    HangZhou     981
    dtype: int64
    
  • When creating from a dictionary with an explicit index, values are looked up among the dictionary keys; keys that are not found get the value NaN
  • population = pd.Series(population_dict, index=["BeiJing", "HangZhou", "c", "d"])    
    population
    
    BeiJing     2154.0
    HangZhou     981.0
    c              NaN
    d              NaN
    dtype: float64
    

    4. When data is a scalar

    pd.Series(5, index=[100, 200, 300])
    
    100    5
    200    5
    300    5
    dtype: int64
    

    11.1.2 The Pandas DataFrame Object

    A DataFrame is a multi-dimensional array of labeled data.

    Creating a DataFrame object

    General form: pd.DataFrame(data, index=index, columns=columns)

    data: the data; can be a list, dictionary, or NumPy array

    index: the row index, optional

    columns: the column labels, optional

    1. Creating from a Series object

    population_dict = {"BeiJing": 2154,
                       "ShangHai": 2424,
                       "ShenZhen": 1303,
                       "HangZhou": 981 }
    
    population = pd.Series(population_dict)    
    pd.DataFrame(population)
    
    0
    BeiJing 2154
    ShangHai 2424
    ShenZhen 1303
    HangZhou 981
    pd.DataFrame(population, columns=["population"])
    
    population
    BeiJing 2154
    ShangHai 2424
    ShenZhen 1303
    HangZhou 981

    2. Creating from a dictionary of Series objects

    GDP_dict = {"BeiJing": 30320,
                "ShangHai": 32680,
                "ShenZhen": 24222,
                "HangZhou": 13468 }
    
    GDP = pd.Series(GDP_dict)
    GDP
    
    BeiJing     30320
    ShangHai    32680
    ShenZhen    24222
    HangZhou    13468
    dtype: int64
    
    pd.DataFrame({"population": population,
                  "GDP": GDP})
    
    population GDP
    BeiJing 2154 30320
    ShangHai 2424 32680
    ShenZhen 1303 24222
    HangZhou 981 13468

    Note: a scalar value is automatically broadcast to fill the whole column

    pd.DataFrame({"population": population,
                  "GDP": GDP,
                  "country": "China"})
    
    population GDP country
    BeiJing 2154 30320 China
    ShangHai 2424 32680 China
    ShenZhen 1303 24222 China
    HangZhou 981 13468 China

    3. Creating from a list of dictionaries

  • The list positions become the index and the dictionary keys become the columns
  • import numpy as np
    import pandas as pd
    
    data = [{"a": i, "b": 2*i} for i in range(3)]
    data
    
    [{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]
    
    data = pd.DataFrame(data)
    data
    
    a b
    0 0 0
    1 1 2
    2 2 4

    No row labels were given, so the rows are numbered from 0; the column labels carry over from the dictionary keys.

  • Extracting a single column
  • data1 = data["a"].copy()
    data1
    
    0    0
    1    1
    2    2
    Name: a, dtype: int64
    
    data1[0] = 10
    data1
    
    0    10
    1     1
    2     2
    Name: a, dtype: int64
    
    data
    
    a b
    0 0 0
    1 1 2
    2 2 4
  • Keys that are absent from a dictionary default to NaN
  • data = [{"a": 1, "b":1},{"b": 3, "c":4}]
    data
    
    [{'a': 1, 'b': 1}, {'b': 3, 'c': 4}]
    
    pd.DataFrame(data)
    
    a b c
    0 1.0 1 NaN
    1 NaN 3 4.0

    4. Creating from a two-dimensional NumPy array

    data = np.random.randint(10, size=(3, 2))
    data
    
    array([[1, 6],
           [2, 9],
           [4, 0]])
    
    pd.DataFrame(data, columns=["foo", "bar"], index=["a", "b", "c"])
    
    foo bar
    a 1 6
    b 2 9
    c 4 0

    11.2 Properties of DataFrame

    1. Attributes

    data = pd.DataFrame({"pop": population, "GDP": GDP})
    data
    
    pop GDP
    BeiJing 2154 30320
    ShangHai 2424 32680
    ShenZhen 1303 24222
    HangZhou 981 13468

    (1) df.values returns the data as a NumPy array

    data.values
    
    array([[ 2154, 30320],
           [ 2424, 32680],
           [ 1303, 24222],
           [  981, 13468]], dtype=int64)
    

    (2) df.index returns the row index

    data.index
    
    Index(['BeiJing', 'ShangHai', 'ShenZhen', 'HangZhou'], dtype='object')
    

    (3) df.columns returns the column labels

    data.columns
    
    Index(['pop', 'GDP'], dtype='object')
    

    (4) df.shape returns the shape

    data.shape
    
    (4, 2)
    

    (5) df.size returns the number of elements

    data.size
    
    8
    

    (6) df.dtypes returns the data type of each column

    data.dtypes
    
    pop    int64
    GDP    int64
    dtype: object
    

    2. Indexing

    data
    
    pop GDP
    BeiJing 2154 30320
    ShangHai 2424 32680
    ShenZhen 1303 24222
    HangZhou 981 13468

    (1) Selecting columns

  • Dictionary-style
  • data["pop"]
    
    BeiJing     2154
    ShangHai    2424
    ShenZhen    1303
    HangZhou     981
    Name: pop, dtype: int64
    
    data[["GDP", "pop"]]
    
    GDP pop
    BeiJing 30320 2154
    ShangHai 32680 2424
    ShenZhen 24222 1303
    HangZhou 13468 981
  • Attribute-style
  • data.GDP
    
    BeiJing     30320
    ShangHai    32680
    ShenZhen    24222
    HangZhou    13468
    Name: GDP, dtype: int64
    

    (2) Selecting rows

  • Label-based (absolute) indexing with df.loc
  • data.loc["BeiJing"]
    
    pop     2154
    GDP    30320
    Name: BeiJing, dtype: int64
    
    data.loc[["BeiJing", "HangZhou"]]
    
    pop GDP
    BeiJing 2154 30320
    HangZhou 981 13468
  • Position-based (relative) indexing with df.iloc
  • data
    
    pop GDP
    BeiJing 2154 30320
    ShangHai 2424 32680
    ShenZhen 1303 24222
    HangZhou 981 13468
    data.iloc[0]
    
    pop     2154
    GDP    30320
    Name: BeiJing, dtype: int64
    
    data.iloc[[1, 3]]
    
    pop GDP
    ShangHai 2424 32680
    HangZhou 981 13468

    (3) Selecting a scalar

    data
    
    pop GDP
    BeiJing 2154 30320
    ShangHai 2424 32680
    ShenZhen 1303 24222
    HangZhou 981 13468
    data.loc["BeiJing", "GDP"]
    
    30320
    
    data.iloc[0, 1]
    
    30320
    
    data.values[0][1]
    
    30320
    

    (4) Indexing the resulting Series object

    type(data.GDP)
    
    pandas.core.series.Series
    
    GDP
    
    BeiJing     30320
    ShangHai    32680
    ShenZhen    24222
    HangZhou    13468
    dtype: int64
    
    GDP["BeiJing"]
    
    30320
    

    3. Slicing

    dates = pd.date_range(start='2019-01-01', periods=6)
    dates
    
    DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
                   '2019-01-05', '2019-01-06'],
                  dtype='datetime64[ns]', freq='D')
    
    df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
    df
    
    A B C D
    2019-01-01 -0.935378 -0.190742 0.925984 -0.818969
    2019-01-02 -0.234414 -1.194674 1.080779 -2.294395
    2019-01-03 -0.141572 0.058118 1.102248 1.207726
    2019-01-04 0.305088 0.535920 -0.978434 0.177251
    2019-01-05 0.313383 0.234041 0.163155 -0.296649
    2019-01-06 0.250613 -0.904400 -0.858240 -1.573342

    (1) Row slicing

    df["2019-01-01": "2019-01-03"]
    
    A B C D
    2019-01-01 -0.935378 -0.190742 0.925984 -0.818969
    2019-01-02 -0.234414 -1.194674 1.080779 -2.294395
    2019-01-03 -0.141572 0.058118 1.102248 1.207726
    df.loc["2019-01-01": "2019-01-03"]
    
    A B C D
    2019-01-01 -0.935378 -0.190742 0.925984 -0.818969
    2019-01-02 -0.234414 -1.194674 1.080779 -2.294395
    2019-01-03 -0.141572 0.058118 1.102248 1.207726
    df.iloc[0: 3]
    
    A B C D
    2019-01-01 -0.935378 -0.190742 0.925984 -0.818969
    2019-01-02 -0.234414 -1.194674 1.080779 -2.294395
    2019-01-03 -0.141572 0.058118 1.102248 1.207726

    Note: with iloc, position 3 is excluded; label slices with loc include both endpoints.
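The difference in endpoint handling can be checked directly. This is a small sketch using the same shape of data as above (the values are random, so only the row counts and labels are compared):

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="2019-01-01", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=["A", "B", "C", "D"])

# loc label slices include both endpoints; iloc position slices exclude the stop
by_label = df.loc["2019-01-01":"2019-01-03"]   # 3 rows: 01-01 through 01-03
by_position = df.iloc[0:3]                     # also 3 rows: stop position 3 excluded
print(len(by_label), len(by_position))
```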

    (2) Column slicing

    df
    
    A B C D
    2019-01-01 -0.935378 -0.190742 0.925984 -0.818969
    2019-01-02 -0.234414 -1.194674 1.080779 -2.294395
    2019-01-03 -0.141572 0.058118 1.102248 1.207726
    2019-01-04 0.305088 0.535920 -0.978434 0.177251
    2019-01-05 0.313383 0.234041 0.163155 -0.296649
    2019-01-06 0.250613 -0.904400 -0.858240 -1.573342
    df.loc[:, "A": "C"]
    
    A B C
    2019-01-01 -0.935378 -0.190742 0.925984
    2019-01-02 -0.234414 -1.194674 1.080779
    2019-01-03 -0.141572 0.058118 1.102248
    2019-01-04 0.305088 0.535920 -0.978434
    2019-01-05 0.313383 0.234041 0.163155
    2019-01-06 0.250613 -0.904400 -0.858240
    df.iloc[:, 0: 3]
    
    A B C
    2019-01-01 -0.935378 -0.190742 0.925984
    2019-01-02 -0.234414 -1.194674 1.080779
    2019-01-03 -0.141572 0.058118 1.102248
    2019-01-04 0.305088 0.535920 -0.978434
    2019-01-05 0.313383 0.234041 0.163155
    2019-01-06 0.250613 -0.904400 -0.858240

    (3) Various selection combinations

    df
    
    A B C D
    2019-01-01 -0.935378 -0.190742 0.925984 -0.818969
    2019-01-02 -0.234414 -1.194674 1.080779 -2.294395
    2019-01-03 -0.141572 0.058118 1.102248 1.207726
    2019-01-04 0.305088 0.535920 -0.978434 0.177251
    2019-01-05 0.313383 0.234041 0.163155 -0.296649
    2019-01-06 0.250613 -0.904400 -0.858240 -1.573342
  • Slicing rows and columns simultaneously
  • df.loc["2019-01-02": "2019-01-03", "C":"D"]
    
    C D
    2019-01-02 1.080779 -2.294395
    2019-01-03 1.102248 1.207726
    df.iloc[1: 3, 2:]
    
    C D
    2019-01-02 1.080779 -2.294395
    2019-01-03 1.102248 1.207726
  • Row slice with scattered column selection
  • df.loc["2019-01-04": "2019-01-06", ["A", "C"]]
    
    A C
    2019-01-04 0.305088 -0.978434
    2019-01-05 0.313383 0.163155
    2019-01-06 0.250613 -0.858240
    df.iloc[3:, [0, 2]]
    
    A C
    2019-01-04 0.305088 -0.978434
    2019-01-05 0.313383 0.163155
    2019-01-06 0.250613 -0.858240
  • Scattered row selection with a column slice
  • df.loc[["2019-01-02", "2019-01-06"], "C": "D"]
    

    This does not work: unlike slices, date strings inside a list are not parsed against a DatetimeIndex, so .loc raises a KeyError.

    df.iloc[[1, 5], 0: 3]
    
    A B C
    2019-01-02 -0.234414 -1.194674 1.080779
    2019-01-06 0.250613 -0.904400 -0.858240
  • Scattered selection of both rows and columns
  • df.loc[["2019-01-04", "2019-01-06"], ["A", "D"]]
    

    Likewise, this does not work, for the same reason.

    df.iloc[[1, 5], [0, 3]]
    
    A D
    2019-01-02 -0.234414 -2.294395
    2019-01-06 0.250613 -1.573342
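One workaround (my own suggestion, not from the original text): convert the date strings to pd.Timestamp objects, which .loc can match against a DatetimeIndex even inside a list. The data below is a fresh random frame of the same shape:

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="2019-01-01", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=["A", "B", "C", "D"])

# Strings in a list are not parsed as dates, but Timestamp objects match directly
rows = [pd.Timestamp("2019-01-04"), pd.Timestamp("2019-01-06")]
subset = df.loc[rows, "A":"D"]
print(subset.shape)   # (2, 4)
```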

    4. Boolean indexing

    This is equivalent to masking in NumPy.

    df
    
    A B C D
    2019-01-01 -0.935378 -0.190742 0.925984 -0.818969
    2019-01-02 -0.234414 -1.194674 1.080779 -2.294395
    2019-01-03 -0.141572 0.058118 1.102248 1.207726
    2019-01-04 0.305088 0.535920 -0.978434 0.177251
    2019-01-05 0.313383 0.234041 0.163155 -0.296649
    2019-01-06 0.250613 -0.904400 -0.858240 -1.573342
    df > 0
    
    A B C D
    2019-01-01 False False True False
    2019-01-02 False False True False
    2019-01-03 False True True True
    2019-01-04 True True False True
    2019-01-05 True True True False
    2019-01-06 True False False False
    df[df > 0]
    
    A B C D
    2019-01-01 NaN NaN 0.925984 NaN
    2019-01-02 NaN NaN 1.080779 NaN
    2019-01-03 NaN 0.058118 1.102248 1.207726
    2019-01-04 0.305088 0.535920 NaN 0.177251
    2019-01-05 0.313383 0.234041 0.163155 NaN
    2019-01-06 0.250613 NaN NaN NaN

    Notice that entries where the mask is True are kept, while the rest become NaN.

    df.A > 0
    
    2019-01-01    False
    2019-01-02    False
    2019-01-03    False
    2019-01-04     True
    2019-01-05     True
    2019-01-06     True
    Freq: D, Name: A, dtype: bool
    
    df[df.A > 0]
    
    A B C D
    2019-01-04 0.305088 0.535920 -0.978434 0.177251
    2019-01-05 0.313383 0.234041 0.163155 -0.296649
    2019-01-06 0.250613 -0.904400 -0.858240 -1.573342
  • The isin() method
  • df2 = df.copy()
    df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
    df2
    
    A B C D E
    2019-01-01 -0.935378 -0.190742 0.925984 -0.818969 one
    2019-01-02 -0.234414 -1.194674 1.080779 -2.294395 one
    2019-01-03 -0.141572 0.058118 1.102248 1.207726 two
    2019-01-04 0.305088 0.535920 -0.978434 0.177251 three
    2019-01-05 0.313383 0.234041 0.163155 -0.296649 four
    2019-01-06 0.250613 -0.904400 -0.858240 -1.573342 three
    ind = df2["E"].isin(["two", "four"])
    ind     
    
    2019-01-01    False
    2019-01-02    False
    2019-01-03     True
    2019-01-04    False
    2019-01-05     True
    2019-01-06    False
    Freq: D, Name: E, dtype: bool
    
    df2[ind]
    
    A B C D E
    2019-01-03 -0.141572 0.058118 1.102248 1.207726 two
    2019-01-05 0.313383 0.234041 0.163155 -0.296649 four
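A related filtering tool is DataFrame.query, which the text does not cover. This is a brief sketch with made-up data (the column names A and E only echo the example above):

```python
import pandas as pd

df2 = pd.DataFrame({
    "A": [-0.9, 0.3, 0.25],
    "E": ["one", "three", "three"],
})
# query() evaluates a boolean expression written in terms of column names
result = df2.query("A > 0 and E == 'three'")
print(result)
```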

    5. Assignment

    df
    
  • Adding a new column to a DataFrame
  • s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20190101', periods=6))
    s1
    
    2019-01-01    1
    2019-01-02    2
    2019-01-03    3
    2019-01-04    4
    2019-01-05    5
    2019-01-06    6
    Freq: D, dtype: int64
    
    df["E"] = s1
    df
    
    A B C D E
    2019-01-01 -0.935378 -0.190742 0.925984 -0.818969 1
    2019-01-02 -0.234414 -1.194674 1.080779 -2.294395 2
    2019-01-03 -0.141572 0.058118 1.102248 1.207726 3
    2019-01-04 0.305088 0.535920 -0.978434 0.177251 4
    2019-01-05 0.313383 0.234041 0.163155 -0.296649 5
    2019-01-06 0.250613 -0.904400 -0.858240 -1.573342 6
  • Modifying existing values
  • df.loc["2019-01-01", "A"] = 0
    df
    
    A B C D E
    2019-01-01 0.000000 -0.190742 0.925984 -0.818969 1
    2019-01-02 -0.234414 -1.194674 1.080779 -2.294395 2
    2019-01-03 -0.141572 0.058118 1.102248 1.207726 3
    2019-01-04 0.305088 0.535920 -0.978434 0.177251 4
    2019-01-05 0.313383 0.234041 0.163155 -0.296649 5
    2019-01-06 0.250613 -0.904400 -0.858240 -1.573342 6
    df.iloc[0, 1] = 0
    df
    
    A B C D E
    2019-01-01 0.000000 0.000000 0.925984 -0.818969 1
    2019-01-02 -0.234414 -1.194674 1.080779 -2.294395 2
    2019-01-03 -0.141572 0.058118 1.102248 1.207726 3
    2019-01-04 0.305088 0.535920 -0.978434 0.177251 4
    2019-01-05 0.313383 0.234041 0.163155 -0.296649 5
    2019-01-06 0.250613 -0.904400 -0.858240 -1.573342 6
    df["D"] = np.array([5]*len(df))   # can be shortened to df["D"] = 5
    df
    
    A B C D E
    2019-01-01 0.000000 0.000000 0.925984 5 1
    2019-01-02 -0.234414 -1.194674 1.080779 5 2
    2019-01-03 -0.141572 0.058118 1.102248 5 3
    2019-01-04 0.305088 0.535920 -0.978434 5 4
    2019-01-05 0.313383 0.234041 0.163155 5 5
    2019-01-06 0.250613 -0.904400 -0.858240 5 6
  • Replacing the index and columns
  • df.index = [i for i in range(len(df))]
    df
    
    A B C D E
    0 0.000000 0.000000 0.925984 5 1
    1 -0.234414 -1.194674 1.080779 5 2
    2 -0.141572 0.058118 1.102248 5 3
    3 0.305088 0.535920 -0.978434 5 4
    4 0.313383 0.234041 0.163155 5 5
    5 0.250613 -0.904400 -0.858240 5 6
    df.columns = [i for i in range(df.shape[1])]
    df
    
    0 1 2 3 4
    0 0.000000 0.000000 0.925984 5 1
    1 -0.234414 -1.194674 1.080779 5 2
    2 -0.141572 0.058118 1.102248 5 3
    3 0.305088 0.535920 -0.978434 5 4
    4 0.313383 0.234041 0.163155 5 5
    5 0.250613 -0.904400 -0.858240 5 6
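Instead of overwriting df.index or df.columns wholesale, individual labels can be renamed with DataFrame.rename. A small sketch with assumed column names:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
# rename maps old labels to new ones and returns a new DataFrame;
# labels not mentioned in the mapping are left unchanged
renamed = df.rename(columns={"A": "alpha"}, index={0: "first"})
print(renamed.columns.tolist())   # ['alpha', 'B']
```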

    11.3 Numerical Operations and Statistical Analysis


    1. Inspecting data

    import pandas as pd
    import numpy as np
    
    dates = pd.date_range(start='2019-01-01', periods=6)
    df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
    df
    
    A B C D
    2019-01-01 -0.854043 0.412345 -2.296051 -0.048964
    2019-01-02 1.371364 -0.121454 -0.299653 1.095375
    2019-01-03 -0.714591 -1.103224 0.979250 0.319455
    2019-01-04 -1.397557 0.426008 0.233861 -1.651887
    2019-01-05 0.434026 0.459830 -0.095444 1.220302
    2019-01-06 -0.133876 0.074500 -1.028147 0.605402

    (1) Viewing the first rows

    df.head()    # defaults to 5 rows; a different count can be passed
    
    A B C D
    2019-01-01 -0.854043 0.412345 -2.296051 -0.048964
    2019-01-02 1.371364 -0.121454 -0.299653 1.095375
    2019-01-03 -0.714591 -1.103224 0.979250 0.319455
    2019-01-04 -1.397557 0.426008 0.233861 -1.651887
    2019-01-05 0.434026 0.459830 -0.095444 1.220302
    df.head(2)
    
    A B C D
    2019-01-01 -0.854043 0.412345 -2.296051 -0.048964
    2019-01-02 1.371364 -0.121454 -0.299653 1.095375

    (2) Viewing the last rows

    df.tail()    # defaults to 5 rows
    
    A B C D
    2019-01-02 1.371364 -0.121454 -0.299653 1.095375
    2019-01-03 -0.714591 -1.103224 0.979250 0.319455
    2019-01-04 -1.397557 0.426008 0.233861 -1.651887
    2019-01-05 0.434026 0.459830 -0.095444 1.220302
    2019-01-06 -0.133876 0.074500 -1.028147 0.605402
    df.tail(3) 
    
    A B C D
    2019-01-04 -1.397557 0.426008 0.233861 -1.651887
    2019-01-05 0.434026 0.459830 -0.095444 1.220302
    2019-01-06 -0.133876 0.074500 -1.028147 0.605402

    (3) Viewing summary information

    df.iloc[0, 3] = np.nan
    df
    
    A B C D
    2019-01-01 -0.854043 0.412345 -2.296051 NaN
    2019-01-02 1.371364 -0.121454 -0.299653 1.095375
    2019-01-03 -0.714591 -1.103224 0.979250 0.319455
    2019-01-04 -1.397557 0.426008 0.233861 -1.651887
    2019-01-05 0.434026 0.459830 -0.095444 1.220302
    2019-01-06 -0.133876 0.074500 -1.028147 0.605402
    df.info()
    
    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 6 entries, 2019-01-01 to 2019-01-06
    Freq: D
    Data columns (total 4 columns):
    A    6 non-null float64
    B    6 non-null float64
    C    6 non-null float64
    D    5 non-null float64
    dtypes: float64(4)
    memory usage: 240.0 bytes
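Alongside info(), df.describe() reports summary statistics for each numeric column. A brief sketch with simple made-up data so the numbers are easy to check:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(6, 2), columns=["A", "B"])
# describe() returns count, mean, std, min, quartiles, and max per column
summary = df.describe()
print(summary.loc["mean"])
```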
    

    2. NumPy universal functions also work on Pandas objects

    (1) Vectorized operations

    x = pd.DataFrame(np.arange(4).reshape(1, 4))
    x
    
    0 1 2 3
    0 0 1 2 3
    x+5
    
    0 1 2 3
    0 5 6 7 8
    np.exp(x)
    
    0 1 2 3
    0 1.0 2.718282 7.389056 20.085537
    y = pd.DataFrame(np.arange(4,8).reshape(1, 4))
    y
    
    0 1 2 3
    0 4 5 6 7
    x*y
    
    0 1 2 3
    0 0 5 12 21

    (2) Matrix operations

    np.random.seed(42)
    x = pd.DataFrame(np.random.randint(10, size=(30, 30)))
    x
    
    0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
    0 6 3 7 4 6 9 2 6 7 4 4 0 9 5 8 0 9 2 6 3
    1 8 2 4 2 6 4 8 6 1 3 2 0 3 1 7 3 1 5 5 9
    2 3 5 1 9 1 9 3 7 6 8 6 8 7 0 7 7 2 0 7 2
    3 2 0 4 9 6 9 8 6 8 7 0 2 4 2 0 4 9 6 6 8
    4 9 9 2 6 0 3 3 4 6 6 9 6 8 6 0 0 8 8 3 8
    5 2 6 5 7 8 4 0 2 9 7 2 0 4 0 7 0 0 1 1 5
    6 6 4 0 0 2 1 4 9 5 6 5 0 8 5 2 3 3 2 9 2
    7 2 3 6 3 8 0 7 6 1 7 3 0 1 0 4 4 6 8 8 2
    8 2 2 3 7 5 7 0 7 3 0 1 1 5 2 8 3 0 3 0 4
    9 3 7 7 6 2 0 0 2 5 6 4 2 3 2 0 0 4 5 2 8
    10 4 7 0 4 2 0 3 4 6 0 5 6 1 9 1 9 0 7 0 8
    11 5 6 9 6 9 2 1 8 7 9 6 5 2 8 9 5 9 9 5 0
    12 3 9 5 5 4 0 7 4 4 6 0 7 2 9 6 9 4 9 4 6
    13 8 4 0 9 9 0 1 5 8 7 5 8 4 0 3 4 9 9 4 6
    14 3 0 4 6 9 9 5 4 3 1 6 1 0 3 7 1 2 0 0 2
    15 4 2 0 0 7 9 1 2 1 2 6 3 9 4 1 7 3 8 4 8
    16 3 9 4 8 7 2 0 2 3 1 8 0 0 3 8 5 2 0 3 8
    17 2 8 6 3 2 9 4 4 2 8 6 9 4 2 6 1 8 9 9 0
    18 5 6 7 9 8 1 9 1 4 4 3 5 2 5 6 9 9 2 6 2
    19 1 9 3 7 8 6 0 2 8 0 4 3 2 2 3 8 1 8 0 0
    20 4 5 5 2 6 8 9 7 5 7 3 5 0 8 0 4 3 2 5 1
    21 2 4 8 1 9 7 1 4 6 7 0 1 8 2 0 4 6 5 0 4
    22 4 5 2 4 6 4 4 4 9 9 1 7 6 9 9 1 5 5 2 1
    23 0 5 4 8 0 6 4 4 1 2 8 5 0 7 6 9 2 0 4 3
    24 9 7 0 9 0 3 7 4 1 5 3 7 8 2 2 1 9 2 2 4
    25 4 1 9 5 4 5 0 4 8 9 9 3 0 7 0 2 3 7 5 9
    26 6 7 1 9 7 2 6 2 6 1 0 6 5 9 8 0 3 8 3 9
    27 2 8 1 3 5 1 7 7 0 2 8 0 4 5 4 5 5 6 3 7
    28 6 8 6 2 2 7 4 3 7 5 1 7 9 2 4 5 9 5 3 2
    29 3 0 3 0 0 9 5 4 3 2 1 3 0 4 8 0 8 7 5 6

    30 rows × 30 columns

  • Transpose
  • z = x.T
    z
    
    0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
    0 6 8 3 2 9 2 6 2 2 3 4 2 4 0 9 4 6 2 6 3
    1 3 2 5 0 9 6 4 3 2 7 5 4 5 5 7 1 7 8 8 0
    2 7 4 1 4 2 5 0 6 3 7 5 8 2 4 0 9 1 1 6 3
    3 4 2 9 9 6 7 0 3 7 6 2 1 4 8 9 5 9 3 2 0
    4 6 6 1 6 0 8 2 8 5 2 6 9 6 0 0 4 7 5 2 0
    5 9 4 9 9 3 4 1 0 7 0 8 7 4 6 3 5 2 1 7 9
    6 2 8 3 8 3 0 4 7 0 0 9 1 4 4 7 0 6 7 4 5
    7 6 6 7 6 4 2 9 6 7 2 7 4 4 4 4 4 2 7 3 4
    8 7 1 6 8 6 9 5 1 3 5 5 6 9 1 1 8 6 0 7 3
    9 4 3 8 7 6 7 6 7 0 6 7 7 9 2 5 9 1 2 5 2
    10 3 8 7 1 3 5 3 0 7 5 4 0 2 6 4 1 9 9 1 0
    11 7 1 4 0 6 7 6 8 3 5 7 5 0 5 1 0 5 8 3 5
    12 7 9 1 6 2 8 7 8 5 5 9 0 4 1 2 9 2 4 3 1
    13 2 8 4 6 5 3 0 1 7 2 3 1 8 5 8 8 2 5 5 7
    14 5 9 7 7 1 0 5 6 3 5 9 0 0 1 6 9 8 3 5 9
    15 4 4 9 4 9 0 7 9 2 7 7 4 2 1 6 8 6 9 0 4
    16 1 1 8 2 8 9 4 2 8 1 9 9 3 1 5 8 4 1 7 6
    17 7 3 8 7 4 3 3 6 2 4 1 8 0 2 7 5 9 7 5 9
    18 5 6 0 5 5 6 1 9 8 0 4 5 0 1 3 7 6 5 2 1
    19 1 7 8 2 3 1 5 8 1 0 8 0 7 3 7 0 8 4 8 7
    20 4 2 6 0 9 2 5 3 1 4 3 0 1 8 3 9 0 8 1 1
    21 0 0 8 2 6 0 0 0 1 2 5 1 7 5 7 3 6 0 7 3
    22 9 3 7 4 8 4 8 1 5 3 0 8 6 0 8 0 5 4 9 0
    23 5 1 0 2 6 0 5 0 2 2 8 2 9 7 2 7 9 5 2 4
    24 8 7 7 0 0 7 2 4 8 0 0 0 9 6 2 0 8 4 4 8
    25 0 3 7 4 0 0 3 4 3 0 4 4 1 9 1 2 0 5 5 0
    26 9 1 2 9 8 0 3 6 0 4 3 6 5 2 9 3 3 5 9 8
    27 2 5 0 6 8 1 2 8 3 5 2 5 5 0 2 7 8 6 5 7
    28 6 5 7 6 3 1 9 8 0 2 5 0 2 4 2 5 3 3 3 5
    29 3 9 2 8 8 5 2 2 4 8 1 4 1 3 4 9 9 7 2 6

    30 rows × 30 columns

    np.random.seed(1)
    y = pd.DataFrame(np.random.randint(10, size=(30, 30)))
    y
    
    0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
    0 5 8 9 5 0 0 1 7 6 9 1 7 0 6 9 9 7 6 9 1
    1 0 1 8 8 3 9 8 7 3 6 9 2 0 4 9 2 7 7 9 8
    2 6 9 3 7 7 4 5 9 3 6 7 7 1 1 3 0 8 6 4 5
    3 6 2 5 7 8 4 4 7 7 4 0 1 9 8 2 3 1 2 7 2
    4 6 0 9 2 6 6 2 7 7 0 1 5 4 0 7 8 9 5 7 0
    5 9 3 9 1 4 4 6 8 8 9 1 8 7 0 3 4 2 0 3 5
    6 1 2 4 3 0 6 0 7 2 8 4 3 3 6 7 3 5 3 2 4
    7 4 0 3 3 8 3 5 6 7 5 1 7 3 1 6 6 9 6 9 6
    8 0 0 2 9 6 0 6 7 0 3 6 7 9 5 4 9 5 2 5 6
    9 6 8 7 7 7 2 6 0 5 2 7 0 6 2 4 3 6 7 6 3
    10 0 6 4 7 6 2 9 5 9 9 4 9 3 9 1 2 5 4 0 8
    11 2 3 9 9 4 4 8 2 1 6 0 5 9 8 6 6 0 4 7 3
    12 0 1 6 0 6 1 6 4 2 5 8 8 0 7 2 0 7 1 1 9
    13 5 1 5 9 6 4 9 8 7 5 2 4 3 2 0 0 4 2 5 0
    14 0 3 8 5 3 1 4 7 3 2 8 5 5 7 5 9 1 3 9 3
    15 3 3 6 1 3 0 5 0 5 2 7 1 7 7 3 8 3 0 6 3
    16 0 6 5 9 6 4 6 6 2 2 3 6 8 6 5 1 3 2 6 3
    17 6 7 2 8 0 1 8 6 0 0 5 6 2 5 4 3 0 6 2 1
    18 9 4 4 0 9 8 7 7 6 1 7 9 9 7 1 1 4 6 5 6
    19 4 1 1 5 1 2 6 2 3 3 0 0 0 9 8 5 9 3 4 0
    20 9 8 6 3 9 9 0 8 1 6 2 9 0 1 3 9 4 8 8 8
    21 2 8 6 4 9 0 5 5 6 1 6 7 5 6 8 7 4 2 4 0
    22 0 3 5 9 0 3 6 5 1 1 6 2 5 3 9 3 9 5 1 9
    23 7 7 0 8 6 1 2 0 4 4 1 9 6 0 2 8 3 7 2 5
    24 6 0 4 2 3 1 0 5 7 0 1 1 2 7 5 2 9 4 7 3
    25 5 0 2 1 4 9 4 6 9 3 5 5 3 5 9 2 7 4 1 6
    26 9 8 1 8 1 6 2 6 1 8 2 5 1 2 5 3 3 6 1 8
    27 1 8 6 4 6 9 5 4 7 2 9 3 1 5 1 1 7 1 2 6
    28 0 7 7 4 3 2 7 8 5 2 0 2 8 3 7 3 9 2 3 8
    29 8 0 2 6 8 3 6 4 9 7 6 7 8 5 7 2 5 3 4 5

    30 rows × 30 columns

    x.dot(y)
    
    0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
    0 616 560 723 739 612 457 681 799 575 590 523 739 613 580 668 602 733 585 657 700
    1 520 438 691 600 612 455 666 764 707 592 555 681 503 679 641 506 779 494 633 590
    2 557 570 786 807 690 469 804 828 704 573 563 675 712 758 793 672 754 550 756 638
    3 605 507 664 701 660 496 698 806 651 575 582 685 668 586 629 534 678 484 591 626
    4 599 681 753 873 721 563 754 770 620 654 633 747 661 677 726 649 716 610 735 706
    5 422 354 602 627 613 396 617 627 489 423 456 572 559 537 499 384 589 436 574 507
    6 359 446 599 599 481 357 577 572 451 464 449 550 495 532 633 554 663 476 565 602
    7 531 520 698 590 607 537 665 696 571 472 576 588 551 665 652 527 742 528 650 599
    8 449 322 547 533 593 399 584 638 587 424 402 596 523 523 447 362 561 386 529 484
    9 373 433 525 601 522 345 551 521 434 447 508 498 438 478 459 418 488 407 503 496
    10 500 427 574 607 667 477 652 656 615 477 622 702 531 610 558 532 598 471 582 561
    11 664 694 772 841 779 574 730 810 711 608 591 760 616 638 721 676 846 678 754 708
    12 545 547 687 701 721 576 689 724 710 532 674 684 648 694 710 564 757 571 671 656
    13 574 586 723 750 691 494 696 787 667 523 618 681 568 682 715 644 756 557 690 604
    14 502 382 645 557 570 403 538 677 500 501 369 650 507 576 546 531 554 437 616 463
    15 510 505 736 651 649 510 719 733 694 557 605 717 574 642 678 576 755 455 598 654
    16 567 376 614 612 643 514 598 724 547 464 456 639 520 560 569 442 596 517 659 532
    17 626 716 828 765 740 603 809 852 692 591 664 716 655 721 742 612 819 593 744 712
    18 600 559 667 664 641 556 624 815 638 564 581 701 559 677 710 554 748 597 614 657
    19 445 431 661 681 641 552 690 719 602 474 515 637 576 620 572 512 599 455 622 538
    20 523 569 784 725 713 501 740 772 638 640 589 775 664 686 726 672 747 548 723 645
    21 487 465 553 639 517 449 592 609 454 398 492 567 534 404 554 417 561 466 498 492
    22 479 449 574 686 583 377 566 614 563 455 453 539 491 501 596 520 722 478 565 501
    23 483 386 476 526 550 426 492 585 536 482 322 541 438 456 487 408 502 426 474 481
    24 523 551 658 767 537 444 663 731 576 577 522 590 525 664 691 548 635 526 641 538
    25 652 656 738 753 853 508 752 815 669 576 694 833 693 606 575 616 704 559 728 672
    26 578 577 744 856 699 497 779 800 733 587 630 754 704 834 760 680 765 592 731 629
    27 554 494 665 689 630 574 695 703 636 599 554 685 532 658 649 554 693 577 634 668
    28 498 552 659 784 552 492 690 775 544 551 567 636 518 599 742 521 733 533 605 604
    29 513 491 563 642 477 367 589 647 516 484 428 574 504 548 553 483 540 407 547 455

    30 rows × 30 columns

    %timeit x.dot(y)
    
    218 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    %timeit np.dot(x, y)
    
    81.1 µs ± 2.85 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
  • NumPy vs. Pandas on the same operation
  • x1 = np.array(x)
    x1
    
    y1 = np.array(y)
    y1
    
    %timeit x1.dot(y1)
    
    22.1 µs ± 992 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    %timeit np.dot(x1, y1)
    
    22.6 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    %timeit np.dot(x.values, y.values)
    
    42.9 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    x2 = list(x1)
    y2 = list(y1)
    x3 = []
    y3 = []
    for i in x2:
        res = []
        for j in i:
            res.append(int(j))
        x3.append(res)
    for i in y2:
        res = []
        for j in i:
            res.append(int(j))
        y3.append(res)
    
    def f(x, y):
        res = []
        for i in range(len(x)):
            row = []
            for j in range(len(y[0])):
                sum_row = 0
                for k in range(len(x[0])):
                    sum_row += x[i][k]*y[k][j]
                row.append(sum_row)
            res.append(row)
        return res          
    
    %timeit f(x3, y3)
    
    4.29 ms ± 207 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    In general, pure computation runs faster in NumPy.

    NumPy is geared toward computation; Pandas is geared toward data handling.

    (3) Broadcasting

    np.random.seed(42)
    x = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=list("ABC"))
    x
    
    A B C
    0 6 3 7
    1 4 6 9
    2 2 6 7
  • Broadcasting across rows
  • x.iloc[0]
    
    A    6
    B    3
    C    7
    Name: 0, dtype: int32
    
    x/x.iloc[0]
    
    A B C
    0 1.000000 1.0 1.000000
    1 0.666667 2.0 1.285714
    2 0.333333 2.0 1.000000
  • Broadcasting across columns
  • x.A
    
    0    6
    1    4
    2    2
    Name: A, dtype: int32
    
    x.div(x.A, axis=0)             # add sub div mul
    
    A B C
    0 1.0 0.5 1.166667
    1 1.0 1.5 2.250000
    2 1.0 3.0 3.500000
    x.div(x.iloc[0], axis=1)
    
    A B C
    0 1.000000 1.0 1.000000
    1 0.666667 2.0 1.285714
    2 0.333333 2.0 1.000000

    3. Additional features

    (1) Index alignment

    A = pd.DataFrame(np.random.randint(0, 20, size=(2, 2)), columns=list("AB"))
    A
    
    A B
    0 3 7
    1 2 1
    B = pd.DataFrame(np.random.randint(0, 10, size=(3, 3)), columns=list("ABC"))
    B
    
    A B C
    0 7 5 1
    1 4 0 9
    2 5 8 0
  • pandas automatically aligns the indices of the two objects; entries missing from either become np.nan
  • A+B
    
    A B C
    0 10.0 12.0 NaN
    1 6.0 1.0 NaN
    2 NaN NaN NaN
  • Missing entries can instead be filled with fill_value
  • A.add(B, fill_value=0)
    
    A B C
    0 10.0 12.0 1.0
    1 6.0 1.0 9.0
    2 5.0 8.0 0.0
    A*B
    
    A B C
    0 21.0 35.0 NaN
    1 8.0 0.0 NaN
    2 NaN NaN NaN

    (2) Statistics

  • Counting distinct values
  • y = np.random.randint(3, size=20)
    y
    
    array([2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 1])
    
    np.unique(y)
    
    array([0, 1, 2])
    

    Counting occurrences with Counter

    from collections import Counter
    Counter(y)
    
    Counter({2: 11, 1: 5, 0: 4})
    
    y1 = pd.DataFrame(y, columns=["A"])
    y1
    
    A
    0 2
    1 2
    2 2
    3 1
    4 2
    5 1
    6 1
    7 2
    8 1
    9 2
    10 2
    11 0
    12 2
    13 0
    14 2
    15 2
    16 0
    17 0
    18 2
    19 1

    np.unique(y1)

    Series also provides the value_counts() method:

    y1["A"].value_counts()
    
    2    11
    1     5
    0     4
    Name: A, dtype: int64
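value_counts can also report proportions rather than raw counts via normalize=True. A small sketch with made-up data:

```python
import pandas as pd

s = pd.Series([2, 2, 2, 1, 0, 1])
counts = s.value_counts()               # frequencies, sorted descending
props = s.value_counts(normalize=True)  # relative frequencies summing to 1
print(counts.to_dict())
```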
    
  • Deriving a new column and sorting by it
  • population_dict = {"BeiJing": 2154,
                       "ShangHai": 2424,
                       "ShenZhen": 1303,
                       "HangZhou": 981 }
    population = pd.Series(population_dict) 
    
    GDP_dict = {"BeiJing": 30320,
                "ShangHai": 32680,
                "ShenZhen": 24222,
                "HangZhou": 13468 }
    GDP = pd.Series(GDP_dict)
    
    city_info = pd.DataFrame({"population": population,"GDP": GDP})
    city_info
    
    population GDP
    BeiJing 2154 30320
    ShangHai 2424 32680
    ShenZhen 1303 24222
    HangZhou 981 13468
    city_info["per_GDP"] = city_info["GDP"]/city_info["population"]
    city_info
    
    population GDP per_GDP
    BeiJing 2154 30320 14.076137
    ShangHai 2424 32680 13.481848
    ShenZhen 1303 24222 18.589409
    HangZhou 981 13468 13.728848

    Ascending sort

    city_info.sort_values(by="per_GDP")
    
    population GDP per_GDP
    ShangHai 2424 32680 13.481848
    HangZhou 981 13468 13.728848
    BeiJing 2154 30320 14.076137
    ShenZhen 1303 24222 18.589409

    Descending sort

    city_info.sort_values(by="per_GDP", ascending=False)
    
    population GDP per_GDP
    ShenZhen 1303 24222 18.589409
    BeiJing 2154 30320 14.076137
    HangZhou 981 13468 13.728848
    ShangHai 2424 32680 13.481848
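sort_values also accepts multiple keys, breaking ties on the first column with the second. A sketch with made-up data (the column names are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"group": ["b", "a", "b", "a"], "score": [3, 2, 1, 4]})
# Sort by group ascending, then by score descending within each group
ordered = df.sort_values(by=["group", "score"], ascending=[True, False])
print(ordered.to_string(index=False))
```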

    Sorting along an axis

    data = pd.DataFrame(np.random.randint(20, size=(3, 4)), index=[2, 1, 0], columns=list("CBAD"))
    data
    
    C B A D
    2 3 13 17 8
    1 1 19 14 6
    0 11 7 14 2

    Sorting by row index

    data.sort_index()
    
    C B A D
    0 11 7 14 2
    1 1 19 14 6
    2 3 13 17 8

    Sorting by column index

    data.sort_index(axis=1)
    
    A B C D
    2 17 13 3 8
    1 14 19 1 6
    0 14 7 11 2
    data.sort_index(axis=1, ascending=False)
    
    D C B A
    2 8 3 13 17
    1 6 1 19 14
    0 2 11 7 14
  • Statistical methods
  • df = pd.DataFrame(np.random.normal(2, 4, size=(6, 4)),columns=list("ABCD"))
    df
    
    A B C D
    0 1.082198 3.557396 -3.060476 6.367969
    1 13.113252 6.774559 2.874553 5.527044
    2 -2.036341 -4.333177 5.094802 -0.152567
    3 -3.386712 -1.522365 -2.522209 2.537716
    4 4.328491 5.550994 5.577329 5.019991
    5 1.171336 -0.493910 -4.032613 6.398588

    Number of non-null values

    df.count()
    
    A    6
    B    6
    C    6
    D    6
    dtype: int64
    

    Sum

    df.sum()
    
    A    14.272224
    B     9.533497
    C     3.931385
    D    25.698741
    dtype: float64
    
    df.sum(axis=1)
    
    0     7.947086
    1    28.289408
    2    -1.427283
    3    -4.893571
    4    20.476806
    5     3.043402
    dtype: float64
    

    Maximum and minimum

    df.min()
    
    A   -3.386712
    B   -4.333177
    C   -4.032613
    D   -0.152567
    dtype: float64
    
    df.max(axis=1)
    
    0     6.367969
    1    13.113252
    2     5.094802
    3     2.537716
    4     5.577329
    5     6.398588
    dtype: float64
    
    df
    
    A B C D
    0 1.082198 3.557396 -3.060476 6.367969
    1 13.113252 6.774559 2.874553 5.527044
    2 -2.036341 -4.333177 5.094802 -0.152567
    3 -3.386712 -1.522365 -2.522209 2.537716
    4 4.328491 5.550994 5.577329 5.019991
    5 1.171336 -0.493910 -4.032613 6.398588
    df.idxmax()
    
    A    1
    B    1
    C    4
    D    5
    dtype: int64
    

    Mean

    df.mean()
    
    A    2.378704
    B    1.588916
    C    0.655231
    D    4.283124
    dtype: float64
    

    Variance

    df.var()
    
    A    34.980702
    B    19.110656
    C    18.948144
    D     6.726776
    dtype: float64
    

    Standard deviation

    df.std()
    
    A    5.914449
    B    4.371574
    C    4.352947
    D    2.593603
    dtype: float64
    

    Median

    df.median()
    
    A    1.126767
    B    1.531743
    C    0.176172
    D    5.273518
    dtype: float64
    

    Mode

    data = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list("AB"))
    data
    
    A B
    0 4 2
    1 3 2
    2 2 0
    3 2 4
    4 2 0
    5 4 1
    6 2 0
    7 1 1
    8 3 4
    9 2 0
    data.mode()
    
    A B
    0 2 0

    75th percentile

    df.quantile(0.75)
    
    A    3.539202
    B    5.052594
    C    4.539740
    D    6.157738
    Name: 0.75, dtype: float64
    
  • describe() retrieves all of these summary statistics at once
  • df.describe()
    
    A B C D
    count 6.000000 6.000000 6.000000 6.000000
    mean 2.378704 1.588916 0.655231 4.283124
    std 5.914449 4.371574 4.352947 2.593603
    min -3.386712 -4.333177 -4.032613 -0.152567
    25% -1.256706 -1.265251 -2.925910 3.158284
    50% 1.126767 1.531743 0.176172 5.273518
    75% 3.539202 5.052594 4.539740 6.157738
    max 13.113252 6.774559 5.577329 6.398588
    data_2 = pd.DataFrame([["a", "a", "c", "d"],
                           ["c", "a", "c", "b"],
                           ["a", "a", "d", "c"]], columns=list("ABCD"))
    data_2
    
    A B C D
    0 a a c d
    1 c a c b
    2 a a d c
  • describe() for string-typed data
  • data_2.describe()
    
    A B C D
    count 3 3 3 3
    unique 2 1 2 3
    top a a c d
    freq 2 3 2 1

    Correlation coefficients and covariance

    df.corr()
    
    A B C D
    A 1.000000 0.831063 0.331060 0.510821
    B 0.831063 1.000000 0.179244 0.719112
    C 0.331060 0.179244 1.000000 -0.450365
    D 0.510821 0.719112 -0.450365 1.000000
    df.corrwith(df["A"])
    
    A    1.000000
    B    0.831063
    C    0.331060
    D    0.510821
    dtype: float64
    

    Custom output

    Usage of apply(method): by default it applies method to each column.

    df
    
    A B C D
    0 1.082198 3.557396 -3.060476 6.367969
    1 13.113252 6.774559 2.874553 5.527044
    2 -2.036341 -4.333177 5.094802 -0.152567
    3 -3.386712 -1.522365 -2.522209 2.537716
    4 4.328491 5.550994 5.577329 5.019991
    5 1.171336 -0.493910 -4.032613 6.398588
    df.apply(np.cumsum)
    
    A B C D
    0 1.082198 3.557396 -3.060476 6.367969
    1 14.195450 10.331955 -0.185923 11.895013
    2 12.159109 5.998778 4.908878 11.742447
    3 8.772397 4.476413 2.386669 14.280162
    4 13.100888 10.027406 7.963999 19.300153
    5 14.272224 9.533497 3.931385 25.698741
    df.apply(np.cumsum, axis=1)
    
    A B C D
    0 1.082198 4.639594 1.579117 7.947086
    1 13.113252 19.887811 22.762364 28.289408
    2 -2.036341 -6.369518 -1.274717 -1.427283
    3 -3.386712 -4.909077 -7.431287 -4.893571
    4 4.328491 9.879485 15.456814 20.476806
    5 1.171336 0.677427 -3.355186 3.043402
    df.apply(sum)
    
    A    14.272224
    B     9.533497
    C     3.931385
    D    25.698741
    dtype: float64
    
    df.sum()
    
    A    14.272224
    B     9.533497
    C     3.931385
    D    25.698741
    dtype: float64
    
    df.apply(lambda x: x.max()-x.min())
    
    A    16.499965
    B    11.107736
    C     9.609942
    D     6.551155
    dtype: float64
    
    def my_describe(x):
        return pd.Series([x.count(), x.mean(), x.max(), x.idxmin(), x.std()], \
                         index=["Count", "mean", "max", "idxmin", "std"])
    df.apply(my_describe)
    
    A B C D
    Count 6.000000 6.000000 6.000000 6.000000
    mean 2.378704 1.588916 0.655231 4.283124
    max 13.113252 6.774559 5.577329 6.398588
    idxmin 3.000000 2.000000 5.000000 2.000000
    std 5.914449 4.371574 4.352947 2.593603

    11.4 Handling Missing Values

    1. Detecting missing values

    import pandas as pd
    import numpy as np
    
    data = pd.DataFrame(np.array([[1, np.nan, 2],
                                  [np.nan, 3, 4],
                                  [5, 6, None]]), columns=["A", "B", "C"])
    data
    
    A B C
    0 1 NaN 2
    1 NaN 3 4
    2 5 6 None

    Note: when None, strings, etc. are mixed in, the dtype of every column becomes object, which consumes more resources than int or float.

    np.nan is a special floating-point value, so NaN is the preferred way to represent missing data.

    data.dtypes
    
    A    object
    B    object
    C    object
    dtype: object
    
    data.isnull()
    
    A B C
    0 False True False
    1 True False False
    2 False False True
    data.notnull()
    
    A B C
    0 True False True
    1 False True True
    2 True True False
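    Since isnull() returns a boolean DataFrame, summing it is a common way to count missing values per column. A small sketch with illustrative data:

    ```python
    import numpy as np
    import pandas as pd

    data = pd.DataFrame({"A": [1, np.nan, 5],
                         "B": [np.nan, 3, 6],
                         "C": [2, 4, None]})

    # isnull() yields a boolean DataFrame; summing counts missing values per column
    missing_per_col = data.isnull().sum()
    # summing again gives the total number of missing cells
    total_missing = int(missing_per_col.sum())
    ```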

    2. Dropping missing values

    data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                                  [np.nan, 4, 5, 6],
                                  [7, 8, np.nan, 9],
                                  [10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
    data
    
    A B C D
    0 1.0 NaN 2.0 3.0
    1 NaN 4.0 5.0 6.0
    2 7.0 8.0 NaN 9.0
    3 10.0 11.0 12.0 13.0

    Note: np.nan is a special floating-point value, so these columns are float64.

    data.dtypes
    
    A    float64
    B    float64
    C    float64
    D    float64
    dtype: object
    

    (1) Dropping whole rows

    data.dropna()
    
    A B C D
    3 10.0 11.0 12.0 13.0

    (2) Dropping whole columns

    data.dropna(axis="columns")
    
    D
    0 3.0
    1 6.0
    2 9.0
    3 13.0
    data["D"] = np.nan
    data
    
    A B C D
    0 1.0 NaN 2.0 NaN
    1 NaN 4.0 5.0 NaN
    2 7.0 8.0 NaN NaN
    3 10.0 11.0 12.0 NaN
    data.dropna(axis="columns", how="all")
    
    A B C
    0 1.0 NaN 2.0
    1 NaN 4.0 5.0
    2 7.0 8.0 NaN
    3 10.0 11.0 12.0

    how="all" drops a column only when every value in it is missing.

    data.dropna(axis="columns", how="any")
    
    0
    1
    2
    3
    data.loc[3] = np.nan
    data
    
    A B C D
    0 1.0 NaN 2.0 NaN
    1 NaN 4.0 5.0 NaN
    2 7.0 8.0 NaN NaN
    3 NaN NaN NaN NaN
    data.dropna(how="all")
    
    A B C D
    0 1.0 NaN 2.0 NaN
    1 NaN 4.0 5.0 NaN
    2 7.0 8.0 NaN NaN
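    Between how="any" and how="all" there is a middle ground: the thresh parameter keeps only rows (or columns) with at least a given number of non-missing values. A sketch with illustrative data:

    ```python
    import numpy as np
    import pandas as pd

    data = pd.DataFrame([[1, np.nan, 2, np.nan],
                         [np.nan, 4, 5, 6],
                         [7, 8, np.nan, 9]], columns=list("ABCD"))

    # thresh=3 keeps only rows with at least 3 non-missing values
    kept = data.dropna(thresh=3)
    ```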

    3. Filling missing values

    data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                                  [np.nan, 4, 5, 6],
                                  [7, 8, np.nan, 9],
                                  [10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
    data
    
    A B C D
    0 1.0 NaN 2.0 3.0
    1 NaN 4.0 5.0 6.0
    2 7.0 8.0 NaN 9.0
    3 10.0 11.0 12.0 13.0
    data.fillna(value=5)
    
    A B C D
    0 1.0 5.0 2.0 3.0
    1 5.0 4.0 5.0 6.0
    2 7.0 8.0 5.0 9.0
    3 10.0 11.0 12.0 13.0
  • Filling with the column means
  • fill = data.mean()
    fill
    
    A    6.000000
    B    7.666667
    C    6.333333
    D    7.750000
    dtype: float64
    
    data.fillna(value=fill)
    
    A B C D
    0 1.0 7.666667 2.000000 3.0
    1 6.0 4.000000 5.000000 6.0
    2 7.0 8.000000 6.333333 9.0
    3 10.0 11.000000 12.000000 13.0

    To fill with the mean of the entire table, first stack (flatten) the data, then fill.

    fill = data.stack().mean()
    fill
    
    7.0
    
    data.fillna(value=fill)
    
    A B C D
    0 1.0 7.0 2.0 3.0
    1 7.0 4.0 5.0 6.0
    2 7.0 8.0 7.0 9.0
    3 10.0 11.0 12.0 13.0
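    Besides filling with a constant or a mean, values can also be propagated from neighboring rows. A minimal sketch using ffill() and bfill() on an illustrative Series:

    ```python
    import numpy as np
    import pandas as pd

    s = pd.Series([1.0, np.nan, np.nan, 4.0])

    forward = s.ffill()   # propagate the last valid value forward
    backward = s.bfill()  # pull the next valid value backward
    ```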

    11.5 Combining Datasets

  • A helper function for producing DataFrames
  • import pandas as pd
    import numpy as np
    
    def make_df(cols, ind):
        "A simple DataFrame"
        data = {c: [str(c)+str(i) for i in ind]  for c in cols}
        return pd.DataFrame(data, ind)
    
    make_df("ABC", range(3))
    
    A B C
    0 A0 B0 C0
    1 A1 B1 C1
    2 A2 B2 C2
  • Vertical concatenation
  • df_1 = make_df("AB", [1, 2])
    df_2 = make_df("AB", [3, 4])
    print(df_1)
    print(df_2)
    
        A   B
    1  A1  B1
    2  A2  B2
        A   B
    3  A3  B3
    4  A4  B4
    
    pd.concat([df_1, df_2])
    
    A B
    1 A1 B1
    2 A2 B2
    3 A3 B3
    4 A4 B4
  • Horizontal concatenation
  • df_3 = make_df("AB", [0, 1])
    df_4 = make_df("CD", [0, 1])
    print(df_3)
    print(df_4)
    
        A   B
    0  A0  B0
    1  A1  B1
        C   D
    0  C0  D0
    1  C1  D1
    
    pd.concat([df_3, df_4], axis=1)
    
    A B C D
    0 A0 B0 C0 D0
    1 A1 B1 C1 D1
  • Overlapping indexes
  • Overlapping rows

    df_5 = make_df("AB", [1, 2])
    df_6 = make_df("AB", [1, 2])
    print(df_5)
    print(df_6)
    
        A   B
    1  A1  B1
    2  A2  B2
        A   B
    1  A1  B1
    2  A2  B2
    
    pd.concat([df_5, df_6])
    
    A B
    1 A1 B1
    2 A2 B2
    1 A1 B1
    2 A2 B2
    pd.concat([df_5, df_6],ignore_index=True)
    
    A B
    0 A1 B1
    1 A2 B2
    2 A1 B1
    3 A2 B2

    Overlapping columns

    df_7 = make_df("ABC", [1, 2])
    df_8 = make_df("BCD", [1, 2])
    print(df_7)
    print(df_8)
    
        A   B   C
    1  A1  B1  C1
    2  A2  B2  C2
        B   C   D
    1  B1  C1  D1
    2  B2  C2  D2
    
    pd.concat([df_7, df_8], axis=1)
    
    A B C B C D
    1 A1 B1 C1 B1 C1 D1
    2 A2 B2 C2 B2 C2 D2
    pd.concat([df_7, df_8],axis=1, ignore_index=True)
    
    0 1 2 3 4 5
    1 A1 B1 C1 B1 C1 D1
    2 A2 B2 C2 B2 C2 D2
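    When concatenating frames with partially overlapping columns vertically, join="inner" keeps only the columns common to all frames instead of padding with NaN. A sketch with hypothetical frames mimicking df_7 and df_8:

    ```python
    import pandas as pd

    df_a = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"], "C": ["C1", "C2"]}, index=[1, 2])
    df_b = pd.DataFrame({"B": ["B3", "B4"], "C": ["C3", "C4"], "D": ["D3", "D4"]}, index=[3, 4])

    # join="inner" keeps only the columns shared by both frames (B and C here)
    inner = pd.concat([df_a, df_b], join="inner")
    ```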
  • Aligned merging with merge()
  • df_9 = make_df("AB", [1, 2])
    df_10 = make_df("BC", [1, 2])
    print(df_9)
    print(df_10)
    
        A   B
    1  A1  B1
    2  A2  B2
        B   C
    1  B1  C1
    2  B2  C2
    
    pd.merge(df_9, df_10)
    
    A B C
    0 A1 B1 C1
    1 A2 B2 C2
    df_9 = make_df("AB", [1, 2])
    df_10 = make_df("CB", [2, 1])
    print(df_9)
    print(df_10)
    
        A   B
    1  A1  B1
    2  A2  B2
        C   B
    2  C2  B2
    1  C1  B1
    
    pd.merge(df_9, df_10)
    
    A B C
    0 A1 B1 C1
    1 A2 B2 C2

    [Example] Merging city information

    population_dict = {"city": ("BeiJing", "HangZhou", "ShenZhen"),
                       "pop": (2154, 981, 1303)}
    population = pd.DataFrame(population_dict)
    population
    
    city pop
    0 BeiJing 2154
    1 HangZhou 981
    2 ShenZhen 1303
    GDP_dict = {"city": ("BeiJing", "ShangHai", "HangZhou"),
                "GDP": (30320, 32680, 13468)}
    GDP = pd.DataFrame(GDP_dict)
    GDP
    
    city GDP
    0 BeiJing 30320
    1 ShangHai 32680
    2 HangZhou 13468
    city_info = pd.merge(population, GDP)
    city_info
    
    city pop GDP
    0 BeiJing 2154 30320
    1 HangZhou 981 13468

    By default merge keeps only the keys present in both frames (an inner join); here how="outer" takes the union instead.

    city_info = pd.merge(population, GDP, how="outer")
    city_info
    
    city pop GDP
    0 BeiJing 2154.0 30320.0
    1 HangZhou 981.0 13468.0
    2 ShenZhen 1303.0 NaN
    3 ShangHai NaN 32680.0
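    Besides "outer", merge also supports how="left" (keep every row of the left frame) and how="right". A sketch reusing the same city data, with the join key named explicitly via on=:

    ```python
    import pandas as pd

    population = pd.DataFrame({"city": ["BeiJing", "HangZhou", "ShenZhen"],
                               "pop": [2154, 981, 1303]})
    GDP = pd.DataFrame({"city": ["BeiJing", "ShangHai", "HangZhou"],
                        "GDP": [30320, 32680, 13468]})

    # how="left" keeps every row of the left frame; unmatched keys get NaN
    left = pd.merge(population, GDP, on="city", how="left")
    ```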

    11.6 Grouping and Pivot Tables

    image-20221002211059551

    df = pd.DataFrame({"key":["A", "B", "C", "C", "B", "A"],
                      "data1": range(6),
                      "data2": np.random.randint(0, 10, size=6)})
    df
    
    key data1 data2
    0 A 0 1
    1 B 1 4
    2 C 2 9
    3 C 3 9
    4 B 4 1
    5 A 5 9

    (1) Grouping

  • Lazy evaluation
  • df.groupby("key")
    
    <pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002276795A240>
    

    This means the grouping is ready; the result only appears once we apply a method to it. (Since data2 is generated randomly, the exact numbers in the outputs below differ between runs of the notebook.)

    df.groupby("key").sum()
    
    data1 data2
    key
    A 5 10
    B 5 6
    C 5 11
    df.groupby("key").mean()
    
    data1 data2
    key
    A 2.5 5.0
    B 2.5 3.0
    C 2.5 5.5

    We can print the groups to see what they contain:

    for i in df.groupby("key"):
        print(str(i))
    
    ('A',   key  data1  data2
    0   A      0      2
    5   A      5      8)
    ('B',   key  data1  data2
    1   B      1      2
    4   B      4      4)
    ('C',   key  data1  data2
    2   C      2      8
    3   C      3      3)
    
  • Selecting columns
  • df.groupby("key")["data2"].sum()
    
    key
    A    10
    B     6
    C    11
    Name: data2, dtype: int32
    
  • Iterating over groups
  • for data, group in df.groupby("key"):
        print("{0:5} shape={1}".format(data, group.shape))
    
    A     shape=(2, 3)
    B     shape=(2, 3)
    C     shape=(2, 3)
    
  • Calling methods
  • df.groupby("key")["data1"].describe()
    
    count mean std min 25% 50% 75% max
    key
    A 2.0 2.5 3.535534 0.0 1.25 2.5 3.75 5.0
    B 2.0 2.5 2.121320 1.0 1.75 2.5 3.25 4.0
    C 2.0 2.5 0.707107 2.0 2.25 2.5 2.75 3.0
  • More complex operations are supported
  • df.groupby("key").aggregate(["min", "median", "max"])
    
    data1 data2
    min median max min median max
    key
    A 0 2.5 5 2 5.0 8
    B 1 2.5 4 2 3.0 4
    C 2 2.5 3 3 5.5 8
  • Filtering
  • def filter_func(x):
        return x["data2"].std() > 3
    df.groupby("key")["data2"].std()
    
    key
    A    4.242641
    B    1.414214
    C    3.535534
    Name: data2, dtype: float64
    
    df.groupby("key").filter(filter_func)
    
    key data1 data2
    0 A 0 2
    2 C 2 8
    3 C 3 3
    5 A 5 8
  • Transformation
  • df
    
    key data1 data2
    0 A 0 2
    1 B 1 2
    2 C 2 8
    3 C 3 3
    4 B 4 4
    5 A 5 8
    df.groupby("key").transform(lambda x: x-x.mean())
    
    data1 data2
    0 -2.5 -3.0
    1 -1.5 -1.0
    2 -0.5 2.5
    3 0.5 -2.5
    4 1.5 1.0
    5 2.5 3.0
    df
    
    key data1 data2
    0 A 0 1
    1 B 1 4
    2 C 2 9
    3 C 3 9
    4 B 4 1
    5 A 5 9
    df.groupby("key").apply(lambda x: x-x.mean())
    
    data1 data2
    0 -2.5 -4.0
    1 -1.5 1.5
    2 -0.5 0.0
    3 0.5 0.0
    4 1.5 -1.5
    5 2.5 4.0
  • The apply() method
  • df
    
    key data1 data2
    0 A 0 2
    1 B 1 2
    2 C 2 8
    3 C 3 3
    4 B 4 4
    5 A 5 8
    def norm_by_data2(x):
        x["data1"] /= x["data2"].sum()
        return x
    
    df.groupby("key").apply(norm_by_data2)
    
    key data1 data2
    0 A 0.000000 2
    1 B 0.166667 2
    2 C 0.181818 8
    3 C 0.272727 3
    4 B 0.666667 4
    5 A 0.500000 8
  • Using a list or array as the grouping key
  • Here L serves as a new set of labels that replaces the original row index.

    L = [0, 1, 0, 1, 2, 0]
    df
    
    key data1 data2
    0 A 0 2
    1 B 1 2
    2 C 2 8
    3 C 3 3
    4 B 4 4
    5 A 5 8
    df.groupby(L).sum()
    
    data1 data2
    0 7 18
    1 4 5
    2 4 4
  • Mapping the index to groups with a dictionary
  • df2 = df.set_index("key")
    df2
    
    data1 data2
    key
    A 0 2
    B 1 2
    C 2 8
    C 3 3
    B 4 4
    A 5 8
    mapping = {"A": "first", "B": "constant", "C": "constant"}
    df2.groupby(mapping).sum()
    
    data1 data2
    constant 10 17
    first 5 10
  • Any Python function
  • df2.groupby(str.lower).mean()
    
    data1 data2
    a 2.5 5.0
    b 2.5 3.0
    c 2.5 5.5
  • A list combining several of the valid key types
    Rows end up in the same group only when both keys match.

    df2.groupby([str.lower, mapping]).mean()
    
    data1 data2
    a first 2.5 5.0
    b constant 2.5 3.0
    c constant 2.5 5.5
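    Beyond aggregate() with a list of functions, groupby also supports named aggregation, where each output column pairs a source column with a function. A sketch reusing the shape of the df above (data2 values fixed here for reproducibility):

    ```python
    import pandas as pd

    df = pd.DataFrame({"key": ["A", "B", "C", "C", "B", "A"],
                       "data1": range(6),
                       "data2": [2, 2, 8, 3, 4, 8]})

    # Named aggregation: each output column is a (source column, function) pair
    summary = df.groupby("key").agg(d1_sum=("data1", "sum"),
                                    d2_max=("data2", "max"))
    ```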

    [Example 1] Processing planet-observation data

    import seaborn as sns
    
    planets = sns.load_dataset("planets")
    
    planets.shape
    
    (1035, 6)
    
    planets.head()
    
    method number orbital_period mass distance year
    0 Radial Velocity 1 269.300 7.10 77.40 2006
    1 Radial Velocity 1 874.774 2.21 56.95 2008
    2 Radial Velocity 1 763.000 2.60 19.84 2011
    3 Radial Velocity 1 326.030 19.40 110.62 2007
    4 Radial Velocity 1 516.220 10.50 119.47 2009
    planets.describe()
    
    number orbital_period mass distance year
    count 1035.000000 992.000000 513.000000 808.000000 1035.000000
    mean 1.785507 2002.917596 2.638161 264.069282 2009.070531
    std 1.240976 26014.728304 3.818617 733.116493 3.972567
    min 1.000000 0.090706 0.003600 1.350000 1989.000000
    25% 1.000000 5.442540 0.229000 32.560000 2007.000000
    50% 1.000000 39.979500 1.260000 55.250000 2010.000000
    75% 2.000000 526.005000 3.040000 178.500000 2012.000000
    max 7.000000 730000.000000 25.000000 8500.000000 2014.000000
    planets.head()
    
    method number orbital_period mass distance year
    0 Radial Velocity 1 269.300 7.10 77.40 2006
    1 Radial Velocity 1 874.774 2.21 56.95 2008
    2 Radial Velocity 1 763.000 2.60 19.84 2011
    3 Radial Velocity 1 326.030 19.40 110.62 2007
    4 Radial Velocity 1 516.220 10.50 119.47 2009
    decade = 10 * (planets["year"] // 10)
    decade.head()
    
    0    2000
    1    2000
    2    2010
    3    2000
    4    2000
    Name: year, dtype: int64
    
    decade = decade.astype(str) + "s"
    decade.name = "decade"
    decade.head()
    
    0    2000s
    1    2000s
    2    2010s
    3    2000s
    4    2000s
    Name: decade, dtype: object
    
    planets.head()
    
    method number orbital_period mass distance year
    0 Radial Velocity 1 269.300 7.10 77.40 2006
    1 Radial Velocity 1 874.774 2.21 56.95 2008
    2 Radial Velocity 1 763.000 2.60 19.84 2011
    3 Radial Velocity 1 326.030 19.40 110.62 2007
    4 Radial Velocity 1 516.220 10.50 119.47 2009
    planets.groupby(["method", decade]).sum()
    
    number orbital_period mass distance year
    method decade
    Astrometry 2010s 2 1.262360e+03 0.00000 35.75 4023
    Eclipse Timing Variations 2000s 5 1.930800e+04 6.05000 261.44 6025
    2010s 10 2.345680e+04 4.20000 1000.00 12065
    Imaging 2000s 29 1.350935e+06 0.00000 956.83 40139
    2010s 21 6.803750e+04 0.00000 1210.08 36208
    Microlensing 2000s 12 1.732500e+04 0.00000 0.00 20070
    2010s 15 4.750000e+03 0.00000 41440.00 26155
    Orbital Brightness Modulation 2010s 5 2.127920e+00 0.00000 2360.00 6035
    Pulsar Timing 1990s 9 1.900153e+02 0.00000 0.00 5978
    2000s 1 3.652500e+04 0.00000 0.00 2003
    2010s 1 9.070629e-02 0.00000 1200.00 2011
    Pulsation Timing Variations 2000s 1 1.170000e+03 0.00000 0.00 2007
    Radial Velocity 1980s 1 8.388800e+01 11.68000 40.57 1989
    1990s 52 1.091561e+04 68.17820 723.71 55943
    2000s 475 2.633526e+05 945.31928 15201.16 619775
    2010s 424 1.809630e+05 316.47890 11382.67 432451
    Transit 2000s 64 2.897102e+02 0.00000 31823.31 124462
    2010s 712 8.087813e+03 1.47000 102419.46 673999
    Transit Timing Variations 2010s 9 2.393505e+02 0.00000 3313.00 8050

    Double brackets [[]] return a DataFrame, while single brackets [] return a Series; the former displays more nicely.

    planets.groupby(["method", decade])[["number"]].sum().unstack().fillna(0)
    
    number
    decade 1980s 1990s 2000s 2010s
    method
    Astrometry 0.0 0.0 0.0 2.0
    Eclipse Timing Variations 0.0 0.0 5.0 10.0
    Imaging 0.0 0.0 29.0 21.0
    Microlensing 0.0 0.0 12.0 15.0
    Orbital Brightness Modulation 0.0 0.0 0.0 5.0
    Pulsar Timing 0.0 9.0 1.0 1.0
    Pulsation Timing Variations 0.0 0.0 1.0 0.0
    Radial Velocity 1.0 52.0 475.0 424.0
    Transit 0.0 0.0 64.0 712.0
    Transit Timing Variations 0.0 0.0 0.0 9.0
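    The groupby + unstack pattern above is exactly what pivot_table computes in a single call. A minimal sketch with hypothetical data showing the equivalence:

    ```python
    import pandas as pd

    df = pd.DataFrame({"method": ["X", "X", "Y", "Y"],
                       "decade": ["2000s", "2010s", "2000s", "2010s"],
                       "number": [1, 2, 3, 4]})

    # pivot_table computes the same summary as groupby + unstack in one call
    by_groupby = df.groupby(["method", "decade"])["number"].sum().unstack()
    by_pivot = df.pivot_table("number", index="method", columns="decade", aggfunc="sum")
    ```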

    (2) Pivot tables

    [Example 2] Analyzing the Titanic passenger data

    import seaborn as sns
    
    titanic = sns.load_dataset("titanic")
    
    titanic.head()
    
    survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
    0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
    1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
    2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
    3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
    4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
    T = titanic[titanic.age.notnull()].copy()
    
    T.age = T.age.apply(lambda x: 60 if x >= 60 else x)  # cap ages at 60; assign back, otherwise the result is discarded
    T.age.value_counts()
    
    24.00    30
    22.00    27
    60.00    26
    18.00    26
    28.00    25
    30.00    25
    19.00    25
    21.00    24
    25.00    23
    36.00    22
    29.00    20
    35.00    18
    32.00    18
    27.00    18
    26.00    18
    31.00    17
    16.00    17
    34.00    15
    20.00    15
    33.00    15
    23.00    15
    39.00    14
    40.00    13
    17.00    13
    42.00    13
    45.00    12
    38.00    11
    4.00     10
    50.00    10
    2.00     10
             ..
    8.00      4
    5.00      4
    11.00     4
    6.00      3
    7.00      3
    46.00     3
    30.50     2
    57.00     2
    0.83      2
    55.00     2
    10.00     2
    59.00     2
    13.00     2
    28.50     2
    40.50     2
    45.50     2
    0.75      2
    32.50     2
    34.50     1
    55.50     1
    0.92      1
    36.50     1
    12.00     1
    53.00     1
    14.50     1
    0.67      1
    20.50     1
    23.50     1
    24.50     1
    0.42      1
    Name: age, Length: 77, dtype: int64
    
    Age = 10*(T["age"]//10)
    Age = Age.astype(int)
    Age.head()
    Age.value_counts()
    
    20    220
    30    167
    10    102
    40     89
    0      62
    50     48
    60     26
    Name: age, dtype: int64
    
    Age.astype(str)+"s"
    
    0      20s
    1      30s
    2      20s
    3      30s
    4      30s
    6      50s
    7       0s
    8      20s
    9      10s
    10      0s
    11     50s
    12     20s
    13     30s
    14     10s
    15     50s
    16      0s
    18     30s
    20     30s
    21     30s
    22     10s
    23     20s
    24      0s
    25     30s
    27     10s
    30     40s
    33     60s
    34     20s
    35     40s
    37     20s
    38     10s
          ... 
    856    40s
    857    50s
    858    20s
    860    40s
    861    20s
    862    40s
    864    20s
    865    40s
    866    20s
    867    30s
    869     0s
    870    20s
    871    40s
    872    30s
    873    40s
    874    20s
    875    10s
    876    20s
    877    10s
    879    50s
    880    20s
    881    30s
    882    20s
    883    20s
    884    20s
    885    30s
    886    20s
    887    10s
    889    20s
    890    30s
    Name: age, Length: 714, dtype: object
    
    T.groupby(["sex", Age])["survived"].mean().unstack()
    
    age 0 10 20 30 40 50 60
    sex
    female 0.633333 0.755556 0.722222 0.833333 0.687500 0.888889 1.000000
    male 0.593750 0.122807 0.168919 0.214953 0.210526 0.133333 0.136364
    T.age = Age
    T.pivot_table("survived", index="sex", columns="age")
    
    age 0 10 20 30 40 50 60
    sex
    female 0.633333 0.755556 0.722222 0.833333 0.687500 0.888889 1.000000
    male 0.593750 0.122807 0.168919 0.214953 0.210526 0.133333 0.136364
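    The floor-division trick used above (10 * (age // 10)) bins ages into decades; pandas also offers pd.cut for binning with arbitrary edges and labels. A sketch with illustrative ages, edges, and labels (all hypothetical, not from the Titanic data):

    ```python
    import pandas as pd

    ages = pd.Series([5, 17, 24, 36, 52, 71])

    # pd.cut bins values into labeled intervals; edges and labels here are illustrative
    age_group = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                       labels=["child", "young", "middle", "senior"])
    ```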
    titanic.describe()
    
    survived pclass age sibsp parch fare
    count 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
    mean 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
    std 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
    min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
    25% 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
    50% 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
    75% 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
    max 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
    titanic.groupby("sex")[["survived"]].mean()
    
    survived
    sex
    female 0.742038
    male 0.188908
    titanic.groupby("sex")["survived"].mean()
    
    sex
    female    0.742038
    male      0.188908
    Name: survived, dtype: float64
    
    titanic.groupby(["sex", "class"])["survived"].aggregate("mean").unstack()
    
    class First Second Third
    sex
    female 0.968085 0.921053 0.500000
    male 0.368852 0.157407 0.135447
  • Pivot tables: a more intuitive way to achieve the same result as above.
  • titanic.pivot_table("survived", index="sex", columns="class") # 默认返回平均值
    
    class First Second Third
    sex
    female 0.968085 0.921053 0.500000
    male 0.368852 0.157407 0.135447
    titanic.pivot_table("survived", index="sex", columns="class", aggfunc="mean", margins=True) # aggfunc="mean" is the default; margins=True adds an "All" row and column.
    
    class First Second Third All
    sex
    female 0.968085 0.921053 0.500000 0.742038
    male 0.368852 0.157407 0.135447 0.188908
    All 0.629630 0.472826 0.242363 0.383838
    titanic.pivot_table(index="sex", columns="class", aggfunc={"survived": "sum", "fare": "mean"}) # each key-value pair maps a column to its aggregation method.
    
    fare survived
    class First Second Third First Second Third
    sex
    female 106.125798 21.970121 16.118810 91 70 72
    male 67.226127 19.741782 12.661633 45 17 47

    11.7 Other Topics

    (1) Vectorized string operations

    (2) Working with time series

    (3) Multi-level indexes: for multidimensional data

    base_data = np.array([[1771, 11115 ],
                          [2154, 30320],
                          [2141, 14070],
                          [2424, 32680],
                          [1077, 7806],
                          [1303, 24222],
                          [798, 4789],
                          [981, 13468]]) 
    data = pd.DataFrame(base_data, index=[["BeiJing","BeiJing","ShangHai","ShangHai","ShenZhen","ShenZhen","HangZhou","HangZhou"]\
                                         , [2008, 2018]*4], columns=["population", "GDP"])
    data
    
    population GDP
    BeiJing 2008 1771 11115
    2018 2154 30320
    ShangHai 2008 2141 14070
    2018 2424 32680
    ShenZhen 2008 1077 7806
    2018 1303 24222
    HangZhou 2008 798 4789
    2018 981 13468
    data.index.names = ["city", "year"]
    data
    
    population GDP
    city year
    BeiJing 2008 1771 11115
    2018 2154 30320
    ShangHai 2008 2141 14070
    2018 2424 32680
    ShenZhen 2008 1077 7806
    2018 1303 24222
    HangZhou 2008 798 4789
    2018 981 13468
    data["GDP"]
    
    city      year
    BeiJing   2008    11115
              2018    30320
    ShangHai  2008    14070
              2018    32680
    ShenZhen  2008     7806
              2018    24222
    HangZhou  2008     4789
              2018    13468
    Name: GDP, dtype: int32
    
    data.loc["ShangHai", "GDP"]
    
    year
    2008    14070
    2018    32680
    Name: GDP, dtype: int32
    
    data.loc["ShangHai", 2018]["GDP"]
    
    32680
    
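    Selecting on the inner level directly (e.g. all cities in 2018) is done with xs(). A sketch rebuilding a smaller version of the table above (BeiJing and ShangHai only):

    ```python
    import pandas as pd

    idx = pd.MultiIndex.from_product([["BeiJing", "ShangHai"], [2008, 2018]],
                                     names=["city", "year"])
    data = pd.DataFrame([[1771, 11115], [2154, 30320],
                         [2141, 14070], [2424, 32680]],
                        index=idx, columns=["population", "GDP"])

    # xs() takes a cross-section on an inner index level
    year_2018 = data.xs(2018, level="year")
    ```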

    (4) High-performance Pandas: eval()

    df1, df2, df3, df4 = (pd.DataFrame(np.random.random((10000,100))) for i in range(4))
    
    %timeit (df1+df2)/(df3+df4)
    
    17.6 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
  • Reduces memory allocated for intermediates in compound algebraic expressions
  • %timeit pd.eval("(df1+df2)/(df3+df4)")
    
    10.5 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    np.allclose((df1+df2)/(df3+df4), pd.eval("(df1+df2)/(df3+df4)"))
    
    True
    
  • Operations between columns
  • df = pd.DataFrame(np.random.random((1000, 3)), columns=list("ABC"))
    df.head()
    
    A B C
    0 0.418071 0.381836 0.500556
    1 0.059432 0.749066 0.302429
    2 0.489147 0.739153 0.777161
    3 0.175441 0.016556 0.348979
    4 0.766534 0.559252 0.310635
    res_1 = pd.eval("(df.A+df.B)/(df.C-1)")
    
    res_2 = df.eval("(A+B)/(C-1)")
    
    np.allclose(res_1, res_2)
    
    True
    
    df["D"] = pd.eval("(df.A+df.B)/(df.C-1)")
    df.head()
    
    A B C D
    0 0.418071 0.381836 0.500556 -1.601593
    1 0.059432 0.749066 0.302429 -1.159019
    2 0.489147 0.739153 0.777161 -5.512052
    3 0.175441 0.016556 0.348979 -0.294917
    4 0.766534 0.559252 0.310635 -1.923199
    df.eval("D=(A+B)/(C-1)", inplace=True)
    df.head()
    
    A B C D
    0 0.418071 0.381836 0.500556 -1.601593
    1 0.059432 0.749066 0.302429 -1.159019
    2 0.489147 0.739153 0.777161 -5.512052
    3 0.175441 0.016556 0.348979 -0.294917
    4 0.766534 0.559252 0.310635 -1.923199
  • Using local variables
  • column_mean = df.mean(axis=1)
    res = df.eval("A+@column_mean")
    res.head()
    
    0    0.342788
    1    0.047409
    2   -0.387501
    3    0.236956
    4    0.694839
    dtype: float64
    

    (5) High-performance Pandas: query()

    df.head()
    
    A B C D
    0 0.418071 0.381836 0.500556 -1.601593
    1 0.059432 0.749066 0.302429 -1.159019
    2 0.489147 0.739153 0.777161 -5.512052
    3 0.175441 0.016556 0.348979 -0.294917
    4 0.766534 0.559252 0.310635 -1.923199
    %timeit df[(df.A < 0.5) & (df.B > 0.5)]
    
    1.11 ms ± 9.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    %timeit df.query("(A < 0.5)&(B > 0.5)")
    
    2.55 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    df.query("(A < 0.5)&(B > 0.5)").head()
    
    A B C D
    1 0.059432 0.749066 0.302429 -1.159019
    2 0.489147 0.739153 0.777161 -5.512052
    7 0.073950 0.730144 0.646190 -2.272672
    10 0.393200 0.610467 0.697096 -3.313485
    11 0.065734 0.764699 0.179380 -1.011958
    np.allclose(df[(df.A < 0.5) & (df.B > 0.5)], df.query("(A < 0.5)&(B > 0.5)"))
    
    True
    
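    query() also supports the @ prefix for pulling in local Python variables, just like eval(). A small sketch with random data (seeded here for reproducibility):

    ```python
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.random((1000, 3)), columns=list("ABC"))

    # @name references a Python variable from the surrounding scope
    threshold = 0.5
    res = df.query("A < @threshold and B > @threshold")
    equiv = df[(df.A < threshold) & (df.B > threshold)]
    ```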

    (6) When to use eval() and query()

    For small arrays, the ordinary methods are actually faster.

    df.values.nbytes
    
    32000
    
    df1.values.nbytes
    
    8000000
    