天池竞赛——工业蒸汽量预测(完整代码详细解析)

目录

  • 1 赛题理解
  • 1.1 赛题背景
  • 1.2 赛题目标
  • 2 数据探索
  • 2.1 导库
  • 2.2 获取数据
  • 2.3 查看数据
  • 2.4 可视化数据分布
  • 3 特征工程
  • 3.1 异常值分析
  • 3.2 归一化处理
  • 3.3 特征降维
  • 3.5 PCA处理
  • 4 模型训练
  • 4.1 切分数据
  • 4.2 多元线性回归
  • 4.3 随机森林回归
  • 4.4 LGB模型回归
  • 5 调参
  • 5.1 RandomForest网格搜索调参
  • 5.2 RandomForest随机参数优化调参
  • 5.3 LGB调参
  • 6.1 模型融合
  • 完整代码:
  • 1 赛题理解

    1.1 赛题背景

    火力发电的基本原理是:燃料在燃烧时加热水生成蒸汽,蒸汽压力推动汽轮机旋转,然后汽轮机带动发电机旋转,产生电能。在这一系列的能量转化中,影响发电效率的核心是锅炉的燃烧效率,即燃料燃烧加热水产生高温高压蒸汽。锅炉的燃烧效率的影响因素很多,包括锅炉的可调参数,如燃烧给量,一二次风,引风,返料风,给水水量;以及锅炉的工况,比如锅炉床温、床压,炉膛温度、压力,过热器的温度等。

    赛事链接:https://tianchi.aliyun.com/competition/entrance/231693/information

    1.2 赛题目标

    经脱敏后的锅炉传感器采集的数据(采集频率是分钟级别),根据锅炉的工况,预测产生的蒸汽量。

    2 数据探索

    2.1 导库

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import warnings
    warnings.filterwarnings("ignore")
    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import RandomForestRegressor # 随机森林回归
    # from sklearn.svm import SVR       # 支持向量机
    import lightgbm as lgb
    from sklearn.model_selection import train_test_split   # 切分数据
    from sklearn.metrics import mean_absolute_error        # 评价指标
    from sklearn.metrics import mean_squared_error
    

    2.2 获取数据

    train_data_file = "D:/download/zhengqi_train.txt"
    test_data_file = "D:/download/zhengqi_test.txt"
    
    train_data = pd.read_csv(train_data_file, sep='\t', encoding='utf-8')
    test_data = pd.read_csv(test_data_file, sep='\t',encoding='utf-8')
    

    2.3 查看数据

    train_data.info() 
    train_data.describe()
    
  • info()与describe()的区别介绍
  • 2.4 可视化数据分布

    # KDE图 对比训练集与数据集中的数据分布
    train_cols=6
    train_rows=len(column)
    plt.figure(figsize=(4*train_cols,4*train_rows))
    
    i=0
    for col in test_data.columns:
        i+=1
        ax = plt.subplot(train_rows,train_cols,i)
        ax = sns.kdeplot(train_data[col], color='red',shade=True)
        ax = sns.kdeplot(test_data[col], color='blue',shade=True)
        plt.ylabel('Frequency')
        ax.legend(['train','test'])
    plt.tight_layout()
    

  • 根据上面KDE图对比可知:V2,V5,V9,V11,v13,V14,V17,V19,V20,V21,V22,V27,这12个训练集和测试集的特征差异较大,予以删除
  • # 删除训练集和测试集的特征差异较大的
    train_data_X_new = train_data_X.drop(['V2','V5','V9','V11','V13','V14','V17','V19','V20','V21','V22','V27'], axis = 1)
    test_data_new = test_data.drop(['V2','V5','V9','V11','V13','V14','V17','V19','V20','V21','V22','V27'], axis = 1)
    all_data_X = pd.concat([train_data_X_new,test_data_new]) 
    

    3 特征工程

  • 特征工程介绍
  • 3.1 异常值分析

  • 以箱线图展示
  • # 异常值分析
    plt.figure(figsize=(18,10))
    plt.boxplot(x=train_data.values, labels=train_data.columns )
    plt.hlines([-7.5,7.5], 0, 40, colors='red')    # 上下界限
    

  • 从箱线图可看出,V9变量明显存在异常,予以删除训练集和测试集中的异常值
  • #  删除异常值
    train_data=train_data[train_data['V9']>-7.5]
    test_data=test_data[test_data['V9']>-9.5]
    

    3.2 归一化处理

    #  归一化处理
    from sklearn import preprocessing
    
    feature_columns = [col for col in test_data.columns]
    min_max_scaler = preprocessing.MinMaxScaler()
    train_data_scaler = min_max_scaler.fit_transform(train_data[feature_columns])   # 进行标准化处理
    test_data_scaler = min_max_scaler.fit_transform(test_data[feature_columns])
    
    train_data_scaler = pd.DataFrame(train_data_scaler)  # 数组转换成表格
    train_data_scaler.columns = feature_columns
    test_data_scaler = pd.DataFrame(test_data_scaler)
    test_data_scaler.columns = feature_columns
    
    train_data_scaler['target']=train_data['target']
    
    display(train_data_scaler.describe())
    display(test_data_scaler.describe())
    

    3.3 特征降维

    #  特征相关性
    plt.figure(figsize=(20,16))
    column = train_data_scaler.columns
    
    mcorr = train_data_scaler[column].corr(method='spearman')  # 相关性
    
    # 特征降维       (相关性筛选)
    mcorr = mcorr.abs()
    numerical_corr = mcorr[mcorr['target']>0.1]['target']   # 筛选>0.1的特征变量, 并只显示特征变量
    numerical_corr.sort_values(ascending=False)  # 从大到小排序
    

    3.5 PCA处理

    #  PCA 处理    (除去数据的多重共线性)
    from sklearn.decomposition import PCA
    
    pca = PCA(n_components=0.9)   # 保持90%的信息
    
    new_train_pca = pca.fit_transform(train_data_scaler.iloc[:,0:-1])
    new_test_pca = pca.fit_transform(test_data_scaler)
    # pd.DataFrame(new_train_pca).describe()
    
  • PCA 处理后保留16个主要成分
  • pca = PCA(n_components=16)
    new_train_pca_16 = pca.fit_transform(train_data_scaler.iloc[:,0:-1])
    new_train_pca_16 = pd.DataFrame(new_train_pca_16)
    new_test_pca_16 = pca.fit_transform(test_data_scaler)
    new_test_pca_16 = pd.DataFrame(new_test_pca_16)
    new_train_pca_16['target']=train_data_scaler['target']
    

    4 模型训练

    4.1 切分数据

    # 切分数据
    # 用PCA保留16维特征数据
    new_train_pca_16 = new_train_pca_16.fillna(0)
    train = new_train_pca_16[new_test_pca_16.columns]
    target = train_data['target']
    
    # 切分数据
    train_data,test_data,train_target, test_target = train_test_split(train,target, test_size=0.2, random_state=0)
    

    采用以下几个模型进行训练和融合:

  • 多元线性回归
  • 随机森林回归
  • LGB模型回归
  • 4.2 多元线性回归

    # 多元线性回归
    clf = LinearRegression()
    clf.fit(train_data, train_target)
    mse = mean_absolute_error(test_target, clf.predict(test_data))
    

    4.3 随机森林回归

    # 随机森林回归
    clf = RandomForestRegressor(n_estimators=400)
    clf.fit(train_data,train_target)
    mse2 = mean_absolute_error(test_target, clf.predict(test_data))
    

    4.4 LGB模型回归

    # LGB模型回归
    clf = lgb.LGBMRegressor(learning_rate=0.01,
                           max_depth=-1,
                           n_estimators=5000,
                           boosting_type='gbdt',
                           random_state=2022,
                           objective='regression')
    clf.fit(X=train_data, y=train_target,eval_metric='MSE',verbose=50)
    mse3 = mean_absolute_error(test_target, clf.predict(test_data))
    
    print('LinearRegression的测试集的MSE得分为:{}'.format(mse))
    print('RandomForestRegressor的测试集的MSE得分为:{}'.format(mse2))
    print('LGBMRegressor的测试集的MSE得分为:{}'.format(mse3))
    

    LinearRegression的测试集的MSE得分为:0.27154696439540776
    RandomForestRegressor的测试集的MSE得分为:0.33357155112651654
    LGBMRegressor的测试集的MSE得分为:0.2925846323943153

    5 调参

    5.1 RandomForest网格搜索调参

    #  # 使用数据训练随机森林模型,采用网格搜索方法调参
    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    
    train_data, test_data, train_target, test_target = train_test_split(train, target, test_size=0.2, random_state=0)
    randomForestRegression = RandomForestRegressor()
    parameters = {'n_estimators':[50,100,200], 'max_depth':[1,2,3]}
    clf = GridSearchCV(randomForestRegression, parameters, cv=5)
    clf.fit(train_data, train_target)
    score_test = mean_squared_error(test_target, clf.predict(test_data))
    
    print('调参后的RandomForest_Regressor的训练集得分:{}'.format(clf.score(train_data,train_target)))
    print('调参后的RandomForest_Regressor的测试集得分:{}'.format(clf.score(test_data,test_target)))
    print("RandomForest模型调参前MSE:{}".format(mse))
    print("RandomForest模型调参后MSE:{}".format(score_test))
    

    调参后的RandomForest_Regressor的训练集得分:0.7511256945888011
    调参后的RandomForest_Regressor的测试集得分:0.7536945206333742
    RandomForest模型调参前MSE:0.2715462476084652
    RandomForest模型调参后MSE:0.25594319639915

    5.2 RandomForest随机参数优化调参

    # 使用数据训练随机森林模型,采用随机参数优化方法调参
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split 
    
    train_data, test_data, train_target, test_target =train_test_split(train, target, test_size=0.2, random_state=0)
    randomForestRegressior = RandomForestRegressor()
    parameters = {'n_estimators':[50, 100, 200, 300], 'max_depth':[1,2,3,4,5]}
    clf = RandomizedSearchCV(randomForestRegressior, parameters, cv=5)
    clf.fit(train_data, train_target)
    score_test = mean_squared_error(test_target, clf.predict(test_data))
    
    print('调参后的RandomForest_Regressor的训练集得分:{}'.format(clf.score(train_data,train_target)))
    print('调参后的RandomForest_Regressor的测试集得分:{}'.format(clf.score(test_data,test_target)))
    print("RandomForest模型调参前MSE:{}".format(mse))
    print("RandomForest模型调参后MSE:{}".format(score_test))
    

    调参后的RandomForest_Regressor的训练集得分:0.8403572920031047
    调参后的RandomForest_Regressor的测试集得分:0.8108811667658115
    RandomForest模型调参前MSE:0.2715386496432197
    RandomForest模型调参后MSE:0.19651888704102724

    5.3 LGB调参

    # lgb模型调参
    clf = lgb.LGBMRegressor(num_leaves=31)
    parameters = {'learning_rate':[0.01,0.1,1],'n_estimators':[20,40]}
    clf= GridSearchCV(clf, parameters, cv=5)
    clf.fit(train_data, train_target)
    score_test = mean_squared_error(test_target, clf.predict(test_data))
    
    print('调参后的LGB的训练集得分:{}'.format(clf.score(train_data,train_target)))
    print('调参后的LGB的测试集得分:{}'.format(clf.score(test_data,test_target)))
    print("LGB模型调参前MSE:{}".format(mse))
    print("LGB模型调参后MSE:{}".format(score_test))
    
    

    调参后的LGB的训练集得分:0.9323247311228453
    调参后的LGB的测试集得分:0.8634907871306278
    LGB模型调参前MSE:0.2651442640764948
    LGB模型调参后MSE:0.15026337772469497

    6.1 模型融合

  • 将LinearRegression,LGB,RandomForestRegressor三个模型融合
  • # 3个模型融合
    def model_mix(pred_1, pred_2, pred_3):
        result = pd.DataFrame(columns=['LinearRegression', 'LGB', 'RandomForestRegressor', 'Combine'])
    
        for a in range(10):
            for b in range(10):
                for c in range(1,10):
                    test_pred = (a * pred_1 + b * pred_2 + c * pred_3) / (a + b + c)
    
                    mse = mean_squared_error(test_target, test_pred)
    
                    result = result.append([{'LinearRegression': a,
                                             'LGB': b,
                                             'RandomForestRegressor': c,
                                             'Combine': mse}],
                                           ignore_index=True)
        return result
    
    
    model_combine = model_mix(linear_predict, LGB_predict, RandomForest_predict)
    
    model_combine.sort_values(by='Combine', inplace=True)
    print(model_combine.head())
    
  • a, b , c = 10的结果:
  • a, b , c = 30的结果:

    通过上述两次改变权重的实验,发现权重从10加大到30,对最终的combine值有些提高
  • 完整代码:

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import warnings
    warnings.filterwarnings("ignore")
    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import RandomForestRegressor # 随机森林回归
    # from sklearn.svm import SVR       # 支持向量机
    import lightgbm as lgb
    from sklearn.model_selection import train_test_split   # 切分数据
    from sklearn.metrics import mean_absolute_error        # 评价指标
    from sklearn.metrics import mean_squared_error
    # from xgboost import XGBRegressor
    
    train_data_file = "D:/download/zhengqi_train.txt"
    test_data_file = "D:/download/zhengqi_test.txt"
    
    train_data = pd.read_csv(train_data_file, sep='\t', encoding='utf-8')
    test_data = pd.read_csv(test_data_file, sep='\t',encoding='utf-8')
    
        # train_data.info()
        # train_data.describe()
    
    train_cols=6
    train_rows=len(train_data.columns)
    plt.figure(figsize=(4*train_cols,4*train_rows))
    i = 0
    for col in test_data.columns:
        i += 1
        ax = plt.subplot(train_rows,train_cols,i)
        ax = sns.kdeplot(train_data[col], color='red',shade=True)
        ax = sns.kdeplot(test_data[col], color='blue',shade=True)
        plt.ylabel('Frequency')
        ax.legend(['train','test'])
    plt.tight_layout()
    
    train_data_y = train_data['target']
    train_data_new = train_data.drop(['V2','V5','V9','V11','V13','V14','V17','V19','V20','V21','V22','V27','target'], axis = 1)
    test_data_new = test_data.drop(['V2','V5','V9','V11','V13','V14','V17','V19','V20','V21','V22','V27'], axis = 1)
    all_data_X = pd.concat([train_data_new,test_data_new])
    
    # 异常值分析
    plt.figure(figsize=(18,10))
    plt.boxplot(x=train_data.values, labels=train_data.columns )
    plt.hlines([-7.5,7.5], 0, 40, colors='red')    # 上下界限
    #  删除异常值
    train_data=train_data[train_data['V9']>-7.5]
    test_data=test_data[test_data['V9']>-9.5]
    
    #  归一化处理
    from sklearn import preprocessing
    
    feature_columns = [col for col in test_data.columns]
    min_max_scaler = preprocessing.MinMaxScaler()
    train_data_scaler = min_max_scaler.fit_transform(train_data[feature_columns])   # 进行标准化处理
    test_data_scaler = min_max_scaler.fit_transform(test_data[feature_columns])
    
    train_data_scaler = pd.DataFrame(train_data_scaler)  # 数组转换成表格
    train_data_scaler.columns = feature_columns
    test_data_scaler = pd.DataFrame(test_data_scaler)
    test_data_scaler.columns = feature_columns
    
    train_data_scaler['target']=train_data['target']
    
    #  特征相关性
    plt.figure(figsize=(20,16))
    column = train_data_scaler.columns
    
    mcorr = train_data_scaler[column].corr(method='spearman')  # 相关性
    mcorr = mcorr.abs()
    numerical_corr = mcorr[mcorr['target']>0.1]['target']   # 筛选>0.1的特征变量, 并只显示特征变量
    numerical_corr.sort_values(ascending=False)  # 从大到小排序
    
    #  PCA 处理    (除去数据的多重共线性)
    from sklearn.decomposition import PCA
    
    pca = PCA(n_components=0.9)   # 保持90%的信息
    
    new_train_pca = pca.fit_transform(train_data_scaler.iloc[:,0:-1])
    new_test_pca = pca.fit_transform(test_data_scaler)
    
    pca = PCA(n_components=16)
    new_train_pca_16 = pca.fit_transform(train_data_scaler.iloc[:,0:-1])
    new_train_pca_16 = pd.DataFrame(new_train_pca_16)
    new_test_pca_16 = pca.fit_transform(test_data_scaler)
    new_test_pca_16 = pd.DataFrame(new_test_pca_16)
    new_train_pca_16['target']=train_data_scaler['target']
    
    
    # 用PCA保留16维特征数据
    new_train_pca_16 = new_train_pca_16.fillna(0)
    train = new_train_pca_16[new_test_pca_16.columns]
    target = train_data['target']
    
    # 切分数据
    train_data,test_data,train_target, test_target = train_test_split(train,target, test_size=0.2, random_state=0)
    
    # 多元线性回归
    clf = LinearRegression()
    clf.fit(train_data, train_target)
    mse = mean_absolute_error(test_target, clf.predict(test_data))
    linear_predict = clf.predict(test_data)
    
    # LGB模型回归
    clf2 = lgb.LGBMRegressor(learning_rate=0.01,
                           max_depth=-1,
                           n_estimators=5000,
                           boosting_type='gbdt',
                           random_state=2022,
                           objective='regression')
    clf2.fit(X=train_data, y=train_target,eval_metric='MSE',verbose=50)
    mse2 = mean_absolute_error(test_target, clf2.predict(test_data))
    LGB_predict = clf2.predict(test_data)
    
    # 随机森林回归
    clf = RandomForestRegressor(n_estimators=400)
    clf.fit(train_data,train_target)
    mse3 = mean_absolute_error(test_target, clf.predict(test_data))
    # RandomForest_predict = clf.predict(test_data)
    
    #  # 使用数据训练随机森林模型,采用网格搜索方法调参
    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    
    train_data, test_data, train_target, test_target = train_test_split(train, target, test_size=0.2, random_state=0)
    randomForestRegression = RandomForestRegressor()
    parameters = {'n_estimators':[50,100,200], 'max_depth':[1,2,3]}
    clf = GridSearchCV(randomForestRegression, parameters, cv=5)
    clf.fit(train_data, train_target)
    score_test = mean_squared_error(test_target, clf.predict(test_data))
    
    
    # 使用数据训练随机森林模型,采用随机参数优化方法调参
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    
    train_data, test_data, train_target, test_target =train_test_split(train, target, test_size=0.2, random_state=0)
    randomForestRegressior = RandomForestRegressor()
    parameters = {'n_estimators':[50, 100, 200, 300], 'max_depth':[1,2,3,4,5]}
    clf = RandomizedSearchCV(randomForestRegressior, parameters, cv=5)
    clf.fit(train_data, train_target)
    score_test = mean_squared_error(test_target, clf.predict(test_data))
    
    # lgb模型调参
    clf3 = lgb.LGBMRegressor(num_leaves=31)
    parameters = {'learning_rate':[0.01,0.1,1],'n_estimators':[20,40]}
    clf3= GridSearchCV(clf3, parameters, cv=5)
    clf3.fit(train_data, train_target)
    score_test = mean_squared_error(test_target, clf3.predict(test_data))
    RandomForest_predict = clf3.predict(test_data)
    # print('调参后的LGB的训练集得分:{}'.format(clf.score(train_data,train_target)))
    # print('调参后的LGB的测试集得分:{}'.format(clf.score(test_data,test_target)))
    # print("LGB模型调参前MSE:{}".format(mse))
    # print("LGB模型调参后MSE:{}".format(score_test))
    
    # 3个模型融合
    def model_mix(pred_1, pred_2, pred_3):
        result = pd.DataFrame(columns=['LinearRegression', 'LGB', 'RandomForestRegressor','Combine'])
    
        for a in range(30):
            for b in range(30):
                for c in range(1,30):
                    test_pred = (a * pred_1 + b * pred_2 + c * pred_3) / (a + b + c)
    
                    mse = mean_squared_error(test_target, test_pred)
    
                    result = result.append([{'LinearRegression': a,
                                            'LGB': b,
                                             'RandomForestRegressor': c,
                                             'Combine': mse}],
                                            ignore_index=True)
        return result
    
    
    model_combine = model_mix(linear_predict, LGB_predict, RandomForest_predict)
    
    model_combine.sort_values(by='Combine', inplace=True)
    print(model_combine.head())
    
    

    来源:Joker_咖啡逗

    物联沃分享整理
    物联沃-IOTWORD物联网 » 天池竞赛——工业蒸汽量预测(完整代码详细解析)

    发表评论