Contents

  • 1. Syntax and parameters
  • 2. Parameters in detail (with examples)
  • 2.1 bins
  • 2.2 retbins
  • 2.3 precision
  • 2.4 labels
  • 2.5 ordered
  • 2.6 right
  • 2.7 include_lowest
  • 2.8 duplicates

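  • pandas.cut() sorts values into discrete intervals (bins). Take a set of ages as an example: to analyze users by age group, you can use the boundaries of each group as bin edges, e.g. bins = [1, 28, 50, 150] with labels = ["youth", "middle-aged", "senior"]. Once the data is binned, selecting the users in any one age group is straightforward. The same approach is useful for any data that has to be split into intervals, not only ages.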
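
The age-group scenario described above might look like the following minimal sketch. The sample ages are made up for illustration, and the English label names are an assumption.

```python
import pandas as pd

# Hypothetical sample ages (not from the original article)
ages = pd.Series([5, 17, 28, 35, 50, 64, 80])

# Bin edges and labels as described in the introduction
age_group = pd.cut(ages, bins=[1, 28, 50, 150],
                   labels=["youth", "middle-aged", "senior"])

# Select the ages that fall into one band
seniors = ages[age_group == "senior"]

print(age_group.tolist())
print(seniors.tolist())
```

Because intervals are closed on the right by default, an age of exactly 28 lands in the first group and 50 in the second.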

    1. Syntax and parameters

    pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)
    

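    Parameters:

    x: the input array to be binned; must be one-dimensional.
    bins: the criteria to bin by; an int, a sequence of scalars, or an IntervalIndex.
    right: whether each bin includes its right edge. Defaults to True (intervals are closed on the right); with False the right edge is excluded.
    labels: the labels to return, one per bin.
    retbins: whether to also return the bin edges. Especially useful when bins is given as an int. Defaults to False.
    precision: an int; the precision at which the bin edges are stored and displayed. Defaults to 3.
    include_lowest: whether the first interval is left-inclusive (closed on the left). Defaults to False (not included); True includes it.
    duplicates: optional, 'raise' (the default) or 'drop'. If the bin edges are not unique, either raise a ValueError or drop the non-unique edges.
    ordered: whether the labels are ordered. Defaults to True, in which case the resulting categorical is ordered. With False the resulting categorical is unordered (labels must be provided).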
    

    2. Parameters in detail (with examples)

    import numpy as np
    import pandas as pd
    

    2.1 bins

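    The binning criteria: an int, a sequence of scalars, or an IntervalIndex.

    When bins is an integer, it is the number of equal-width bins to split the range of x into.

    # Split the data into 3 equal-width bins; the result is the interval each value falls into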
    pd.cut(np.array([2,6,4,8,1,5,9]),bins=3)  
    
    [(0.992, 3.667], (3.667, 6.333], (3.667, 6.333], (6.333, 9.0], (0.992, 3.667], (3.667, 6.333], (6.333, 9.0]]
    Categories (3, interval[float64]): [(0.992, 3.667] < (3.667, 6.333] < (6.333, 9.0]]
    

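    The one-dimensional input is automatically divided into three equal-width intervals, (0.992, 3.667], (3.667, 6.333] and (6.333, 9.0], and each value is replaced by the interval it falls into. For example, 2 belongs to (0.992, 3.667], so that interval is returned for it. The lowest edge is 0.992 rather than 1 because pandas extends the range of x by 0.1% on each side so that the minimum and maximum values are included.

    When bins is a sequence of scalars (a list in this example), it specifies the bin edges directly. Any value of x that falls outside the specified intervals is returned as NaN. In the example below, 1 becomes NaN because the first interval (1, 4] is open on the left.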

    pd.cut(np.array([2,6,4,8,1,5,9]),bins=[1,4,7,10])
    
    [(1.0, 4.0], (4.0, 7.0], (1.0, 4.0], (7.0, 10.0], NaN, (4.0, 7.0], (7.0, 10.0]]
    Categories (3, interval[int64]): [(1, 4] < (4, 7] < (7, 10]]
    

    When bins is an IntervalIndex, values not covered by any of its intervals are set to NaN.

    bins = pd.IntervalIndex.from_tuples([(0, 2), (3, 6), (7, 8)])  # build the IntervalIndex
    pd.cut(np.array([2,6,4,8,1,5,9]),bins)
    
    [(0.0, 2.0], (3.0, 6.0], (3.0, 6.0], (7.0, 8.0], (0.0, 2.0], (3.0, 6.0], NaN]
    Categories (3, interval[int64]): [(0, 2] < (3, 6] < (7, 8]]
    

    2.2 retbins

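    Whether to also return the bin edges. Especially useful when bins is given as an int, because the edges are then computed by pandas. Defaults to False.

    # retbins=True also returns the computed bin edges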
    pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,retbins=True)
    
    ([(0.992, 3.667], (3.667, 6.333], (3.667, 6.333], (6.333, 9.0], (0.992, 3.667], (3.667, 6.333], (6.333, 9.0]]
     Categories (3, interval[float64]): [(0.992, 3.667] < (3.667, 6.333] < (6.333, 9.0]],
     array([0.992     , 3.66666667, 6.33333333, 9.        ]))
    

    The call also returns a one-dimensional array, array([0.992, 3.66666667, 6.33333333, 9.]). This array holds the bin edges that were used: bins = [0.992, 3.66666667, 6.33333333, 9.].
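
One common reason to ask for the edges is to apply the same binning to other data later. The sketch below illustrates that idea under the assumption that the new values lie inside the original range; the `new` array is hypothetical.

```python
import numpy as np
import pandas as pd

data = np.array([2, 6, 4, 8, 1, 5, 9])

# Keep only the computed edges; the binned result is discarded here
_, edges = pd.cut(data, bins=3, retbins=True)

# Hypothetical new values, binned with the edges computed above
new = np.array([3, 7])
new_codes = pd.cut(new, bins=edges, labels=False)

print(edges)
print(new_codes)
```

Values of `new` outside the range covered by `edges` would come back as NaN, as described in section 2.1.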

    2.3 precision

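    An int that sets the precision at which the bin edges are stored and displayed. In the outputs below it determines how many decimal places the edges are rounded to. With precision=0 the edges are rounded to whole numbers, so 0 and 1 do not give the same result.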

    print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,precision=0))
    print("="*110)
    print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,precision=1))
    print("="*110)
    print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,precision=2))
    print("="*110)
    print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,precision=3))
    
    [(1.0, 4.0], (4.0, 6.0], (4.0, 6.0], (6.0, 9.0], (1.0, 4.0], (4.0, 6.0], (6.0, 9.0]]
    Categories (3, interval[float64]): [(1.0, 4.0] < (4.0, 6.0] < (6.0, 9.0]]
    ==============================================================================================================
    [(1.0, 3.7], (3.7, 6.3], (3.7, 6.3], (6.3, 9.0], (1.0, 3.7], (3.7, 6.3], (6.3, 9.0]]
    Categories (3, interval[float64]): [(1.0, 3.7] < (3.7, 6.3] < (6.3, 9.0]]
    ==============================================================================================================
    [(0.99, 3.67], (3.67, 6.33], (3.67, 6.33], (6.33, 9.0], (0.99, 3.67], (3.67, 6.33], (6.33, 9.0]]
    Categories (3, interval[float64]): [(0.99, 3.67] < (3.67, 6.33] < (6.33, 9.0]]
    ==============================================================================================================
    [(0.992, 3.667], (3.667, 6.333], (3.667, 6.333], (6.333, 9.0], (0.992, 3.667], (3.667, 6.333], (6.333, 9.0]]
    Categories (3, interval[float64]): [(0.992, 3.667] < (3.667, 6.333] < (6.333, 9.0]]
    

    2.4 labels

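    The labels for the returned bins; the number of labels must equal the number of resulting bins. If False, only the integer indicators of the bins are returned. This argument is ignored when bins is an IntervalIndex. Passing True raises an error.

    Here the equal-width intervals are replaced by labels. There must be as many labels as intervals: n bins need n labels.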

    print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3))
    print("="*110)
    print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,labels=["L","M","H"]))
    
    [(0.992, 3.667], (3.667, 6.333], (3.667, 6.333], (6.333, 9.0], (0.992, 3.667], (3.667, 6.333], (6.333, 9.0]]
    Categories (3, interval[float64]): [(0.992, 3.667] < (3.667, 6.333] < (6.333, 9.0]]
    ==============================================================================================================
    ['L', 'M', 'M', 'H', 'L', 'M', 'H']
    Categories (3, object): ['L' < 'M' < 'H']
    

    Each interval is replaced by the corresponding value in labels. In this example "L" = (0.992, 3.667], "M" = (3.667, 6.333], "H" = (6.333, 9.0].

    pd.cut(np.array([2,6,4,8,1,5,9]),bins=[1,4,7,10],labels=["L","M","H"])
    
    ['L', 'M', 'L', 'H', NaN, 'M', 'H']
    Categories (3, object): ['L' < 'M' < 'H']
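
The labels=False case mentioned at the start of this section has no example above, so here is a minimal sketch of it using the same data. Each value is replaced by the zero-based position of its bin.

```python
import numpy as np
import pandas as pd

data = np.array([2, 6, 4, 8, 1, 5, 9])

# labels=False returns integer bin indicators instead of intervals
codes = pd.cut(data, bins=3, labels=False)

print(codes)
```

The codes line up with the "L", "M", "H" result shown earlier: 0 corresponds to "L", 1 to "M" and 2 to "H".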
    

    2.5 ordered

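    Whether the labels are ordered. Defaults to True, in which case the resulting categorical is ordered. With False the resulting categorical is unordered.

    Note: ordered=False must be used together with the labels parameter, otherwise an error is raised.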

    print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,labels=["L","M","H"]))
    print("="*110)
    print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,labels=["L","M","H"],ordered=False))
    print("="*110)
    print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,labels=["L","M","H"],ordered=True))
    
    ['L', 'M', 'M', 'H', 'L', 'M', 'H']
    Categories (3, object): ['L' < 'M' < 'H']
    ==============================================================================================================
    ['L', 'M', 'M', 'H', 'L', 'M', 'H']
    Categories (3, object): ['L', 'M', 'H']
    ==============================================================================================================
    ['L', 'M', 'M', 'H', 'L', 'M', 'H']
    Categories (3, object): ['L' < 'M' < 'H']
    

    ['L' < 'M' < 'H'] is ordered; ['L', 'M', 'H'] is unordered.
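
As a rough illustration of what the ordering means in practice, the sketch below checks the `ordered` attribute of each result, uses an order-based comparison on the ordered one, and shows that ordered=False without labels is rejected. The variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

data = np.array([2, 6, 4, 8, 1, 5, 9])

ordered_cat = pd.cut(data, bins=3, labels=["L", "M", "H"])
unordered_cat = pd.cut(data, bins=3, labels=["L", "M", "H"], ordered=False)
print(ordered_cat.ordered, unordered_cat.ordered)

# An ordered categorical supports comparisons based on category order
above_low = ordered_cat > "L"
print(above_low)

# ordered=False without labels raises a ValueError
try:
    pd.cut(data, bins=3, ordered=False)
    raised = False
except ValueError:
    raised = True
print(raised)
```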

    2.6 right

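    Whether each bin includes its right edge. Defaults to True, which makes every interval closed on the right; with False the right edge is excluded.

    # right: whether each bin includes its right edge
    print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3)) # default is True: every interval is left-open, right-closed
    print("="*110)
    print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,right=True))  # left-open, right-closed: the right edge of each interval is included
    print("="*110)
    print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,right=False)) # left-closed, right-open: the right edge of each interval is excluded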
    
    [(0.992, 3.667], (3.667, 6.333], (3.667, 6.333], (6.333, 9.0], (0.992, 3.667], (3.667, 6.333], (6.333, 9.0]]
    Categories (3, interval[float64]): [(0.992, 3.667] < (3.667, 6.333] < (6.333, 9.0]]
    ==============================================================================================================
    [(0.992, 3.667], (3.667, 6.333], (3.667, 6.333], (6.333, 9.0], (0.992, 3.667], (3.667, 6.333], (6.333, 9.0]]
    Categories (3, interval[float64]): [(0.992, 3.667] < (3.667, 6.333] < (6.333, 9.0]]
    ==============================================================================================================
    [[1.0, 3.667), [3.667, 6.333), [3.667, 6.333), [6.333, 9.008), [1.0, 3.667), [3.667, 6.333), [6.333, 9.008)]
    Categories (3, interval[float64]): [[1.0, 3.667) < [3.667, 6.333) < [6.333, 9.008)]
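
The effect of right is easier to see with explicit bin edges and values that sit exactly on those edges. The following sketch assumes the edges [1, 4, 7, 10] used earlier; the `edge_values` array is hypothetical.

```python
import numpy as np
import pandas as pd

edge_values = np.array([1, 4, 7])  # values lying exactly on bin edges
bins = [1, 4, 7, 10]

# Default right=True: intervals are (1, 4], (4, 7], (7, 10]
right_closed = pd.cut(edge_values, bins=bins, labels=["L", "M", "H"])

# right=False: intervals are [1, 4), [4, 7), [7, 10)
left_closed = pd.cut(edge_values, bins=bins, labels=["L", "M", "H"], right=False)

print(list(right_closed))
print(list(left_closed))
```

With right=True the value 1 falls outside (1, 4] and becomes NaN, while with right=False each edge value moves into the interval that starts at it.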
    

    2.7 include_lowest

    Whether the first interval is left-inclusive. Defaults to False (not included); True includes the left edge.

    print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3)) 
    print("="*110)
    print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,include_lowest=False)) 
    print("="*110)
    print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,include_lowest=True)) 
    
    [(0.992, 3.667], (3.667, 6.333], (3.667, 6.333], (6.333, 9.0], (0.992, 3.667], (3.667, 6.333], (6.333, 9.0]]
    Categories (3, interval[float64]): [(0.992, 3.667] < (3.667, 6.333] < (6.333, 9.0]]
    ==============================================================================================================
    [(0.992, 3.667], (3.667, 6.333], (3.667, 6.333], (6.333, 9.0], (0.992, 3.667], (3.667, 6.333], (6.333, 9.0]]
    Categories (3, interval[float64]): [(0.992, 3.667] < (3.667, 6.333] < (6.333, 9.0]]
    ==============================================================================================================
    [(0.991, 3.667], (3.667, 6.333], (3.667, 6.333], (6.333, 9.0], (0.991, 3.667], (3.667, 6.333], (6.333, 9.0]]
    Categories (3, interval[float64]): [(0.991, 3.667] < (3.667, 6.333] < (6.333, 9.0]]
    

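    With include_lowest=True, the first interval changes from (0.992, 3.667] to (0.991, 3.667]. The displayed left edge is moved slightly lower so that 0.992 itself is now included.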
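
With an integer bins the minimum of the data is already covered, so the difference is mostly cosmetic. The sketch below uses explicit edges instead, where include_lowest decides whether a value equal to the first edge gets a bin at all. It reuses the article's data and the edges [1, 4, 7, 10] from section 2.1.

```python
import numpy as np
import pandas as pd

data = np.array([2, 6, 4, 8, 1, 5, 9])
bins = [1, 4, 7, 10]

# Default: the first interval is (1, 4], so the value 1 is not binned
default_codes = pd.cut(data, bins=bins, labels=False)

# include_lowest=True: the first interval also contains its left edge, 1
inclusive_codes = pd.cut(data, bins=bins, labels=False, include_lowest=True)

print(default_codes)
print(inclusive_codes)
```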

    2.8 duplicates

    Either 'raise' (the default) or 'drop'. If the bin edges are not unique, the default raises a ValueError, as the following statement does:

    pd.cut(np.array([2,6,4,8,1,9,9]),bins=[0,3,6,9,9])
    

    The error message is as follows:

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-81-e463bd85b4bf> in <module>
          1 # duplicates {default 'raise', 'drop'}: if the bin edges are not unique, raise a ValueError or drop the non-unique ones.
    ----> 2 print(pd.cut(np.array([2,6,4,8,1,9,9]),bins=[0,3,6,9,9]))
    
    F:\Anaconda_all\Anaconda\lib\site-packages\pandas\core\reshape\tile.py in cut(x, bins, right, labels, retbins, precision, include_lowest, duplicates, ordered)
        271             raise ValueError("bins must increase monotonically.")
        272 
    --> 273     fac, bins = _bins_to_cuts(
        274         x,
        275         bins,
    
    F:\Anaconda_all\Anaconda\lib\site-packages\pandas\core\reshape\tile.py in _bins_to_cuts(x, bins, right, labels, precision, include_lowest, dtype, duplicates, ordered)
        397     if len(unique_bins) < len(bins) and len(bins) != 2:
        398         if duplicates == "raise":
    --> 399             raise ValueError(
        400                 f"Bin edges must be unique: {repr(bins)}.\n"
        401                 f"You can drop duplicate edges by setting the 'duplicates' kwarg"
    
    ValueError: Bin edges must be unique: array([0, 3, 6, 9, 9]).
    You can drop duplicate edges by setting the 'duplicates' kwarg
    

    Fix: pass duplicates="drop" to remove the duplicate edges.

    print(pd.cut(np.array([2,6,4,8,1,9,9]),bins=[0,3,6,9,9],duplicates="drop")) 
    
    [(0, 3], (3, 6], (3, 6], (6, 9], (0, 3], (6, 9], (6, 9]]
    Categories (3, interval[int64]): [(0, 3] < (3, 6] < (6, 9]]
    

    Several duplicated edges can be dropped in the same way:

    pd.cut(np.array([2,6,4,8,1,9,9]),bins=[0,3,6,6,9,9],duplicates="drop")
    
    [(0, 3], (3, 6], (3, 6], (6, 9], (0, 3], (6, 9], (6, 9]]
    Categories (3, interval[int64]): [(0, 3] < (3, 6] < (6, 9]]
    

    Source: 芒果去核
