代码收藏家 Technical Tutorials 2022-07-26

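Python: A Summary of Multiple Linear Regression

A recent project of mine needed multiple linear regression, so here is a short summary of the code and steps for doing it in Python:

Data: dependent variable y, independent variables x

1. Import libraries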

# Import packages
import os
import pandas as pd
import numpy as np
import seaborn as sns
import warnings
import matplotlib.pyplot as plt
from datetime import datetime
import math
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Global settings configured up front: plot style and warning filter
sns.set(style='darkgrid') 
warnings.filterwarnings('ignore')

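2. Build the multiple linear regression model

# Model: y = b0 + b1*X1 + b2*X2 + b3*X3
# Lowercase ols (formula interface) adds the intercept automatically;
# uppercase sm.OLS (array interface) has no intercept unless you add one with sm.add_constant
lm = smf.ols('y ~ x1+x2+x3', data=data).fit()
# Inspect the model results
print(lm.summary())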

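3. Multicollinearity check:

# Multicollinearity check (variance inflation factor)
def vif(df, col_i):
    """
    df: DataFrame holding the predictor columns only (leave y out)
    col_i: name of the column being checked
    """
    cols = list(df.columns)
    cols.remove(col_i)
    formula = col_i + ' ~ ' + ' + '.join(cols)
    # Same multiple-regression fit as before, with col_i as the response;
    # only the R-squared is taken from it
    r2 = smf.ols(formula, df).fit().rsquared
    return 1. / (1. - r2)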

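4. Significance and goodness-of-fit tests based on the summary table

The summary table contains several statistics worth checking:

F-statistic: tests whether the independent variables x1, x2, …, xp taken together have a significant effect on y.

R-squared: goodness of fit, with a value between 0 and 1. The closer it is to 1, the better the regression fits; the closer to 0, the worse. However, R² only gives an intuitive picture of the fit and cannot replace the F-test as a rigorous significance test.

P>|t|: a significance test for each independent variable, showing whether it has a significant effect on y. Common thresholds are 0.05 and 0.1; a p-value above the threshold means the variable is not significant.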
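
The same statistics can also be read from the fitted result object instead of the printed table. A minimal sketch on synthetic data (the data, seed, and the 0.05 threshold are assumptions for illustration; `x3` deliberately has no true effect):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration: x3 does not enter the true model
rng = np.random.default_rng(2)
n = 200
data = pd.DataFrame(rng.normal(size=(n, 3)), columns=['x1', 'x2', 'x3'])
data['y'] = 1 + 2 * data['x1'] - 3 * data['x2'] + rng.normal(size=n)

lm = smf.ols('y ~ x1 + x2 + x3', data=data).fit()

alpha = 0.05  # significance threshold chosen for this example
print('F-statistic:', lm.fvalue, 'p-value:', lm.f_pvalue)   # overall significance
print('R-squared:', lm.rsquared, 'adjusted:', lm.rsquared_adj)
print(lm.pvalues)                                           # the P>|t| column

# Terms whose individual t-test is significant at the chosen threshold
significant = lm.pvalues[lm.pvalues < alpha].index.tolist()
print('Significant terms:', significant)
```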

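5. Graphical model diagnostics

For this part I pulled together a fairly complete account of the principles and code from existing online resources.

Residuals vs. fitted plot

This is a scatter plot with the residuals e_i on the vertical axis and another suitable variable (such as the fitted values) on the horizontal axis. It is used mainly to check for heteroscedasticity. In general, when the regression model satisfies all of its assumptions, the n points on the residual plot should be scattered randomly, with no pattern. If the points show a trend (the residuals growing or shrinking as the horizontal coordinate increases), the model can be judged to exhibit heteroscedasticity.

Scale-location plot (square root of standardized residuals)

Similar to the residual plot, but with the square root of the absolute standardized residuals on the vertical axis.

Q-Q plot

Checks whether the residuals follow a normal distribution.

[Figure: several common examples of Q-Q plots for residuals that are not normally distributed]

Cook's distance plot

Cook's distance is used to judge whether a high-influence point is an outlier in y. A common rule of thumb treats a point with D < 0.5 as not an outlier and a point with D > 0.5 as an outlier (some texts reserve the label for D > 1).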
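
The Cook's distance values themselves can be obtained from a fitted model through `get_influence()`. A minimal sketch, assuming synthetic data with one deliberately planted extreme observation and the 0.5 cutoff mentioned above:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration: a clean linear relation plus one planted extreme point
rng = np.random.default_rng(3)
n = 50
data = pd.DataFrame({'x1': rng.normal(size=n)})
data['y'] = 1 + 2 * data['x1'] + rng.normal(scale=0.5, size=n)
data.loc[0, ['x1', 'y']] = [6.0, -10.0]  # high leverage and far from the true line

lm = smf.ols('y ~ x1', data=data).fit()

# cooks_distance returns a pair: (distances, p-values)
cooks_d, _ = lm.get_influence().cooks_distance
cooks_d = np.asarray(cooks_d)

# Row positions whose Cook's distance exceeds the cutoff
flagged = np.where(cooks_d > 0.5)[0]
print('Flagged rows:', flagged)
print('Largest Cook\'s distance at row:', int(np.argmax(cooks_d)))
```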

In practice, the Q-Q plot is the one people look at most often.

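def regression_diagnostics(reg, lm):
  '''
  reg: regression data (DataFrame)
  lm: fitted model
  '''
  results = pd.DataFrame({'index': reg['log'],  # observed y (stored in the 'log' column here)
              'resids': lm.resid,  # raw residuals
              'std_resids': lm.resid_pearson,  # residuals scaled by the residual standard deviation
              'fitted': lm.predict()  # fitted (predicted) y
              })
  print(results.head())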
  
  # 1. Show each plot separately
  ## raw residuals vs. fitted
  # Residuals vs. fitted: fitted values on the x-axis, residuals on the y-axis.
  plt.plot(results['fitted'], results['resids'], 'o')
  plt.axhline(y=0, color='grey', linestyle='dashed')  # horizontal reference line at y = 0
  plt.xlabel('Fitted values')
  plt.ylabel('Residuals')
  plt.title('Residuals vs Fitted')
  plt.show()
  
  
  ## q-q plot
  # Residual Q-Q plot: shows whether the residuals follow a normal distribution.
  sm.qqplot(results['std_resids'], line='s')
  plt.xlabel('Theoretical quantiles')
  plt.ylabel('Sample quantiles')
  plt.title('Normal Q-Q')
  plt.show()
  
  
  ## scale-location
  # Scale-location: fitted values on the x-axis, square root of the absolute standardized residuals on the y-axis.
  plt.plot(results['fitted'], abs(results['std_resids'])**.5, 'o')
  plt.xlabel('Fitted values')
  plt.ylabel('Square Root of |standardized residuals|')
  plt.title('Scale-Location')
  plt.show()
  
  
  ## residuals vs. leverage
  # Studentized residuals vs. leverage; bubble size reflects Cook's distance, a measure of influence.
  # influence_plot labels its own axes (leverage and studentized residuals), so only the title is set here.
  sm.graphics.influence_plot(lm, criterion='Cooks', size=2)
  plt.title("Residuals vs Leverage (bubble size: Cook's distance)")
  plt.show()
  plt.close()
  
  
  # 2. Draw all four plots on one canvas
  fig = plt.figure(figsize=(10, 10), dpi=100)

  ax1 = fig.add_subplot(2, 2, 1)
  ax1.plot(results['fitted'], results['resids'], 'o')
  ax1.axhline(y=0, color='grey', linestyle='dashed')
  ax1.set_xlabel('Fitted values')
  ax1.set_ylabel('Residuals')
  ax1.set_title('Residuals vs Fitted')

  ax2 = fig.add_subplot(2, 2, 2)
  sm.qqplot(results['std_resids'], line='s', ax=ax2)
  ax2.set_title('Normal Q-Q')

  ax3 = fig.add_subplot(2, 2, 3)
  ax3.plot(results['fitted'], abs(results['std_resids'])**.5, 'o')
  ax3.set_xlabel('Fitted values')
  ax3.set_ylabel('Sqrt(|standardized residuals|)')
  ax3.set_title('Scale-Location')

  ax4 = fig.add_subplot(2, 2, 4)
  sm.graphics.influence_plot(lm, criterion='Cooks', size=2, ax=ax4)

  plt.tight_layout()
  plt.show()


Source: 努力的椰椰