
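Machine Learning in Practice [1]: Industrial Steam Volume Prediction (Latest Edition, Part 1), Covering Data Exploration and Feature Engineering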

Author: 汀丶
  • 2023-03-30
    Zhejiang

Machine Learning in Practice [1]: Industrial Steam Volume Prediction

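  • Background


The basic principle of thermal power generation is as follows: burning fuel heats water to produce steam, the steam pressure drives a turbine, and the turbine in turn drives a generator that produces electricity. In this chain of energy conversions, the key factor in generation efficiency is the combustion efficiency of the boiler, that is, how effectively burning fuel heats water into high-temperature, high-pressure steam. Many factors affect boiler combustion efficiency. They include the boiler's adjustable parameters, such as fuel feed rate, primary and secondary air, induced draft air, return air, and feedwater flow, as well as the boiler's operating conditions, such as bed temperature, bed pressure, furnace temperature and pressure, and superheater temperature.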


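  • Task description


Using desensitized data collected by boiler sensors (sampled at minute-level frequency), predict the amount of steam produced from the boiler's operating conditions.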


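  • Data description


The data is split into training data (train.txt) and test data (test.txt). The 38 fields "V0" through "V37" are the feature variables, and "target" is the target variable. Participants train a model on the training data and predict the target variable for the test data. The ranking is based on the MSE (mean squared error) of the predictions.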


  • Evaluation


Predictions are scored by mean squared error (MSE).
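As a minimal sketch of the metric, assuming plain Python lists of equal length (the function name `mean_squared_error` here is a local helper written for illustration, not the scikit-learn function imported later in the article), MSE is the average of the squared differences between true and predicted values:

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only, not taken from the competition data
y_true = [0.0, 0.0]
y_pred = [1.0, 3.0]
print(mean_squared_error(y_true, y_pred))  # (1 + 9) / 2
```

Because the errors are squared, a few large misses weigh more heavily than many small ones, which is one reason the later outlier handling matters for this score.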


Original project link: https://www.heywhale.com/home/column/64141d6b1c8c8b518ba97dcc

1. Exploratory Data Analysis

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline


# Download the datasets used below
!wget http://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/Industrial_Steam_Forecast/zhengqi_test.txt
!wget http://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/Industrial_Steam_Forecast/zhengqi_train.txt


--2023-03-23 18:10:23--  http://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/Industrial_Steam_Forecast/zhengqi_test.txt
Resolving tianchi-media.oss-cn-beijing.aliyuncs.com (tianchi-media.oss-cn-beijing.aliyuncs.com)... 49.7.22.39
Connecting to tianchi-media.oss-cn-beijing.aliyuncs.com (tianchi-media.oss-cn-beijing.aliyuncs.com)|49.7.22.39|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 466959 (456K) [text/plain]
Saving to: "zhengqi_test.txt.1"

zhengqi_test.txt.1 100%[===================>] 456.01K --.-KB/s in 0.04s

2023-03-23 18:10:23 (10.0 MB/s) - "zhengqi_test.txt.1" saved [466959/466959]

--2023-03-23 18:10:23-- http://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/Industrial_Steam_Forecast/zhengqi_train.txt
Resolving tianchi-media.oss-cn-beijing.aliyuncs.com (tianchi-media.oss-cn-beijing.aliyuncs.com)... 49.7.22.39
Connecting to tianchi-media.oss-cn-beijing.aliyuncs.com (tianchi-media.oss-cn-beijing.aliyuncs.com)|49.7.22.39|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 714370 (698K) [text/plain]
Saving to: "zhengqi_train.txt.1"

zhengqi_train.txt.1 100%[===================>] 697.63K --.-KB/s in 0.04s

2023-03-23 18:10:24 (17.9 MB/s) - "zhengqi_train.txt.1" saved [714370/714370]


# Read the data files
# Use the pandas `read_csv()` function to read the data; the separator is '\t'
train_data_file = "./zhengqi_train.txt"
test_data_file = "./zhengqi_test.txt"

train_data = pd.read_csv(train_data_file, sep='\t', encoding='utf-8')
test_data = pd.read_csv(test_data_file, sep='\t', encoding='utf-8')

1.1 Inspecting the Data

# View information about the feature variables
train_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2888 entries, 0 to 2887
Data columns (total 39 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   V0      2888 non-null   float64
 1   V1      2888 non-null   float64
 2   V2      2888 non-null   float64
 3   V3      2888 non-null   float64
 4   V4      2888 non-null   float64
 5   V5      2888 non-null   float64
 6   V6      2888 non-null   float64
 7   V7      2888 non-null   float64
 8   V8      2888 non-null   float64
 9   V9      2888 non-null   float64
 10  V10     2888 non-null   float64
 11  V11     2888 non-null   float64
 12  V12     2888 non-null   float64
 13  V13     2888 non-null   float64
 14  V14     2888 non-null   float64
 15  V15     2888 non-null   float64
 16  V16     2888 non-null   float64
 17  V17     2888 non-null   float64
 18  V18     2888 non-null   float64
 19  V19     2888 non-null   float64
 20  V20     2888 non-null   float64
 21  V21     2888 non-null   float64
 22  V22     2888 non-null   float64
 23  V23     2888 non-null   float64
 24  V24     2888 non-null   float64
 25  V25     2888 non-null   float64
 26  V26     2888 non-null   float64
 27  V27     2888 non-null   float64
 28  V28     2888 non-null   float64
 29  V29     2888 non-null   float64
 30  V30     2888 non-null   float64
 31  V31     2888 non-null   float64
 32  V32     2888 non-null   float64
 33  V33     2888 non-null   float64
 34  V34     2888 non-null   float64
 35  V35     2888 non-null   float64
 36  V36     2888 non-null   float64
 37  V37     2888 non-null   float64
 38  target  2888 non-null   float64
dtypes: float64(39)
memory usage: 880.1 KB


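The training set contains 2,888 samples with 38 feature variables, V0 through V37. All variables are numeric, and no feature has missing values. Because the fields were desensitized, the concrete meaning of each feature has been removed. The target field is the label variable.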


test_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1925 entries, 0 to 1924
Data columns (total 38 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   V0      1925 non-null   float64
 1   V1      1925 non-null   float64
 2   V2      1925 non-null   float64
 3   V3      1925 non-null   float64
 4   V4      1925 non-null   float64
 5   V5      1925 non-null   float64
 6   V6      1925 non-null   float64
 7   V7      1925 non-null   float64
 8   V8      1925 non-null   float64
 9   V9      1925 non-null   float64
 10  V10     1925 non-null   float64
 11  V11     1925 non-null   float64
 12  V12     1925 non-null   float64
 13  V13     1925 non-null   float64
 14  V14     1925 non-null   float64
 15  V15     1925 non-null   float64
 16  V16     1925 non-null   float64
 17  V17     1925 non-null   float64
 18  V18     1925 non-null   float64
 19  V19     1925 non-null   float64
 20  V20     1925 non-null   float64
 21  V21     1925 non-null   float64
 22  V22     1925 non-null   float64
 23  V23     1925 non-null   float64
 24  V24     1925 non-null   float64
 25  V25     1925 non-null   float64
 26  V26     1925 non-null   float64
 27  V27     1925 non-null   float64
 28  V28     1925 non-null   float64
 29  V29     1925 non-null   float64
 30  V30     1925 non-null   float64
 31  V31     1925 non-null   float64
 32  V32     1925 non-null   float64
 33  V33     1925 non-null   float64
 34  V34     1925 non-null   float64
 35  V35     1925 non-null   float64
 36  V36     1925 non-null   float64
 37  V37     1925 non-null   float64
dtypes: float64(38)
memory usage: 571.6 KB


The test set contains 1,925 samples with 38 feature variables, V0 through V37, all of them numeric.


# View summary statistics
train_data.describe()



8 rows × 39 columns


test_data.describe()



8 rows × 38 columns


The output above shows summary statistics for the data, such as the sample count, mean, standard deviation (std), minimum, and maximum.


# View the first rows of the data
train_data.head()



5 rows × 39 columns


The output above shows the first 5 rows of the training set. All values are floating-point, and every feature is a continuous numeric feature.


test_data.head()



5 rows × 38 columns

1.2 Visual Data Exploration

fig = plt.figure(figsize=(4, 6))  # set the figure width and height
sns.boxplot(train_data['V0'], orient="v", width=0.5)


<matplotlib.axes._subplots.AxesSubplot at 0x7faf89f46950>



# Draw box plots
# column = train_data.columns.tolist()[:39]  # column names
# fig = plt.figure(figsize=(20, 40))  # set the figure width and height
# for i in range(38):
#     plt.subplot(13, 3, i + 1)  # 13-row, 3-column subplot grid
#     sns.boxplot(train_data[column[i]], orient="v", width=0.5)  # box plot
#     plt.ylabel(column[i], fontsize=8)
# plt.show()
# The box plots are commented out; uncomment to generate them


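Viewing the data distributions


  • View the histogram of feature variable 'V0' and draw a Q-Q plot to check whether the data is approximately normally distributed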


plt.figure(figsize=(10,5))

ax=plt.subplot(1,2,1)
sns.distplot(train_data['V0'],fit=stats.norm)
ax=plt.subplot(1,2,2)
res = stats.probplot(train_data['V0'], plot=plt)



View the histograms and Q-Q plots of all columns to check whether the training data is approximately normally distributed.


# train_cols = 6
# train_rows = len(train_data.columns)
# plt.figure(figsize=(4*train_cols,4*train_rows))

# i=0
# for col in train_data.columns:
#     i+=1
#     ax=plt.subplot(train_rows,train_cols,i)
#     sns.distplot(train_data[col],fit=stats.norm)
#
#     i+=1
#     ax=plt.subplot(train_rows,train_cols,i)
#     res = stats.probplot(train_data[col], plot=plt)
# plt.show()
# The Q-Q plots are commented out; uncomment to generate them


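The distribution plots above show that many feature variables (such as 'V1', 'V9', 'V24', and 'V28') are not normally distributed: their points do not follow the diagonal of the Q-Q plot. A data transformation can be applied to them later.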
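A numeric companion to the visual Q-Q check is sample skewness, which is roughly 0 for a symmetric distribution and moves away from 0 as one tail grows. The sketch below is a plain-Python illustration with made-up values; `sample_skewness` is a hypothetical helper that uses the moment-based (biased) definition, which to my understanding is what `scipy.stats.skew` computes by default.

```python
def sample_skewness(values):
    """Moment-based (biased) skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant data has no asymmetry to measure
    return m3 / m2 ** 1.5


# Illustrative values only, not taken from the dataset
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_skewed = [0.0, 0.1, 0.2, 0.3, 5.0]

print(sample_skewness(symmetric))
print(sample_skewness(right_skewed))
```

A feature whose skewness is far from 0 is a candidate for the transformations discussed later in the article.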


Compare the distributions of the same feature variable, 'V0', in the training set and the test set to check whether they are consistent.


ax = sns.kdeplot(train_data['V0'], color="Red", shade=True)
ax = sns.kdeplot(test_data['V0'], color="Blue", shade=True)
ax.set_xlabel('V0')
ax.set_ylabel("Frequency")
ax = ax.legend(["train","test"])



View the training-set and test-set distributions of every feature variable, and look for the features whose distributions differ.


# dist_cols = 6
# dist_rows = len(test_data.columns)
# plt.figure(figsize=(4*dist_cols,4*dist_rows))

# i=1
# for col in test_data.columns:
#     ax=plt.subplot(dist_rows,dist_cols,i)
#     ax = sns.kdeplot(train_data[col], color="Red", shade=True)
#     ax = sns.kdeplot(test_data[col], color="Blue", shade=True)
#     ax.set_xlabel(col)
#     ax.set_ylabel("Frequency")
#     ax = ax.legend(["train","test"])
#
#     i+=1
# plt.show()
# Commented out; uncomment to generate the plots


View the distributions of features 'V5', 'V17', 'V28', 'V22', 'V11', and 'V9'.


drop_col = 6
drop_row = 1

plt.figure(figsize=(5*drop_col,5*drop_row))

i=1
for col in ["V5","V9","V11","V17","V22","V28"]:
    ax = plt.subplot(drop_row,drop_col,i)
    ax = sns.kdeplot(train_data[col], color="Red", shade=True)
    ax = sns.kdeplot(test_data[col], color="Blue", shade=True)
    ax.set_xlabel(col)
    ax.set_ylabel("Frequency")
    ax = ax.legend(["train","test"])

    i+=1
plt.show()



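The plots above show that the training-set and test-set distributions of features 'V5', 'V9', 'V11', 'V17', 'V22', and 'V28' are inconsistent. This can hurt the model's ability to generalize, so these features are removed.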
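The article judges the train/test mismatch by eye from KDE plots. A numeric complement, which is not the author's method and is shown only as a sketch, is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical cumulative distribution functions. The helper name `ks_statistic` and the sample values are hypothetical.

```python
def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    points = sorted(set(a) | set(b))  # the gap can only change at observed values
    max_gap = 0.0
    ia = ib = 0
    for x in points:
        # Advance each pointer past all values <= x, so ia / len(a) is the ECDF at x
        while ia < len(a) and a[ia] <= x:
            ia += 1
        while ib < len(b) and b[ib] <= x:
            ib += 1
        max_gap = max(max_gap, abs(ia / len(a) - ib / len(b)))
    return max_gap


# Illustrative values only, not taken from the dataset
train_like = [1.0, 2.0, 3.0, 4.0]
test_like = [3.0, 4.0, 5.0, 6.0]
print(ks_statistic(train_like, test_like))
```

The statistic lies between 0 (identical empirical distributions) and 1 (no overlap), so ranking features by it gives a rough ordering of how far train and test have drifted apart. Any cut-off for dropping a feature would still be a judgment call.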


drop_columns = ['V5','V9','V11','V17','V22','V28']
# Merge the training and test sets, and visualize the feature distributions of both


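Visualizing linear regression relationships


  • View the linear regression relationship between feature variable 'V0' and the 'target' variable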


fcols = 2
frows = 1

plt.figure(figsize=(8,4))

ax=plt.subplot(1,2,1)
sns.regplot(x='V0', y='target', data=train_data, ax=ax,
            scatter_kws={'marker':'.','s':3,'alpha':0.3},
            line_kws={'color':'k'});
plt.xlabel('V0')
plt.ylabel('target')

ax=plt.subplot(1,2,2)
sns.distplot(train_data['V0'].dropna())
plt.xlabel('V0')

plt.show()


1.2.1 Linear Regression Relationships Between Variables

# fcols = 6
# frows = len(test_data.columns)
# plt.figure(figsize=(5*fcols,4*frows))

# i=0
# for col in test_data.columns:
#     i+=1
#     ax=plt.subplot(frows,fcols,i)
#     sns.regplot(x=col, y='target', data=train_data, ax=ax,
#                 scatter_kws={'marker':'.','s':3,'alpha':0.3},
#                 line_kws={'color':'k'});
#     plt.xlabel(col)
#     plt.ylabel('target')
#
#     i+=1
#     ax=plt.subplot(frows,fcols,i)
#     sns.distplot(train_data[col].dropna())
#     plt.xlabel(col)
# Plot generation is commented out; uncomment to generate the plots

1.2.2 Feature Correlations


data_train1 = train_data.drop(['V5','V9','V11','V17','V22','V28'],axis=1)
train_corr = data_train1.corr()
train_corr



33 rows × 33 columns


# Draw the correlation heatmap
ax = plt.subplots(figsize=(20, 16))  # set the canvas size

ax = sns.heatmap(train_corr, vmax=.8, square=True, annot=True)  # heatmap; annot=True shows the coefficients



# Find the degree of correlation
data_train1 = train_data.drop(['V5','V9','V11','V17','V22','V28'],axis=1)

plt.figure(figsize=(20, 16))  # set the figure width and height
colnm = data_train1.columns.tolist()  # column names
mcorr = data_train1[colnm].corr(method="spearman")  # correlation matrix: the coefficient between every pair of variables
mask = np.zeros_like(mcorr, dtype=bool)  # boolean matrix with the same shape as mcorr (the np.bool alias is gone from recent NumPy; builtin bool is equivalent)
mask[np.triu_indices_from(mask)] = True  # mark the upper triangle, including the diagonal, as True
cmap = sns.diverging_palette(220, 10, as_cmap=True)  # returns a matplotlib colormap object
g = sns.heatmap(mcorr, mask=mask, cmap=cmap, square=True, annot=True, fmt='0.2f')  # heatmap of pairwise correlations
plt.show()



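The figure above shows the pairwise correlation coefficients among all feature variables and the target variable. It reveals both the correlations among the feature variables V0-V37 and the correlation of each feature variable with target.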
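The second heatmap uses `method="spearman"`. Spearman correlation is the Pearson correlation computed on the ranks of the values, so it measures monotonic rather than strictly linear association. The sketch below illustrates the idea in plain Python; the helper names and the sample data are made up for illustration and are not part of the article's code.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative values only: y grows monotonically but not linearly with x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [v ** 3 for v in x]
print(pearson(x, y), spearman(x, y))
```

For this monotonic but curved relationship the Spearman value is essentially 1 while the Pearson value is lower. This is why the two heatmaps in this section can disagree for features that relate to target non-linearly.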

1.2.3 Finding Important Variables

Find the feature variables whose correlation coefficient with the target variable is greater than 0.5.


# Find the K features most correlated with target
k = 10  # number of variables for heatmap
cols = train_corr.nlargest(k, 'target')['target'].index

cm = np.corrcoef(train_data[cols].values.T)
hm = plt.subplots(figsize=(10, 10))  # set the canvas size
#hm = sns.heatmap(cm, cbar=True, annot=True, square=True)
#g = sns.heatmap(train_data[cols].corr(),annot=True,square=True,cmap="RdYlGn")
hm = sns.heatmap(train_data[cols].corr(),annot=True,square=True)

plt.show()


threshold = 0.5

corrmat = train_data.corr()
top_corr_features = corrmat.index[abs(corrmat["target"])>threshold]
plt.figure(figsize=(10,10))
g = sns.heatmap(train_data[top_corr_features].corr(),annot=True,cmap="RdYlGn")


drop_columns.clear()
drop_columns = ['V5','V9','V11','V17','V22','V28']


# Threshold for removing correlated variables
threshold = 0.5

# Absolute value correlation matrix
corr_matrix = data_train1.corr().abs()
drop_col=corr_matrix[corr_matrix["target"]<threshold].index
#data_all.drop(drop_col, axis=1, inplace=True)


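Because the correlation coefficients of features 'V14', 'V21', 'V25', 'V26', 'V32', 'V33', and 'V34' are below 0.5, these features are considered uncorrelated with the final predicted target value and are marked for removal. Note that the corresponding `data_all.drop(drop_col, ...)` call in the code above is commented out.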
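The thresholding step can be sketched without pandas as follows. This is an illustrative, assumption-laden example: the column names "A", "B", "C" and their values are invented, `select_by_target_correlation` is a hypothetical helper, and it uses Pearson correlation (the default of `DataFrame.corr()`), keeping features whose absolute correlation with the target exceeds the threshold.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def select_by_target_correlation(columns, target, threshold=0.5):
    """Return {name: r} for features whose |r| with target exceeds threshold."""
    selected = {}
    for name, values in columns.items():
        r = pearson(values, target)
        if abs(r) > threshold:
            selected[name] = r
    return selected


# Made-up data for illustration only
target = [1.0, 2.0, 3.0, 4.0, 5.0]
columns = {
    "A": [2.0, 4.1, 5.9, 8.2, 9.8],  # nearly linear in target
    "B": [5.0, 1.0, 4.0, 2.0, 3.0],  # weak relationship
    "C": [9.0, 7.5, 6.1, 4.0, 2.2],  # strongly negative
}
print(select_by_target_correlation(columns, target))
```

Taking the absolute value matters: a strongly negative feature such as "C" carries as much linear signal as a strongly positive one. A low linear correlation, on the other hand, does not rule out a non-linear relationship, so a filter like this is best treated as a coarse first pass.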


# Merge train_set and test_set
train_x = train_data.drop(['target'], axis=1)

#data_all=pd.concat([train_data,test_data],axis=0,ignore_index=True)
data_all = pd.concat([train_x,test_data])


data_all.drop(drop_columns,axis=1,inplace=True)
# View data
data_all.head()



5 rows × 32 columns


# Normalise numeric columns
cols_numeric=list(data_all.columns)

def scale_minmax(col):
    return (col-col.min())/(col.max()-col.min())

data_all[cols_numeric] = data_all[cols_numeric].apply(scale_minmax,axis=0)
data_all[cols_numeric].describe()



8 rows × 32 columns


#col_data_process = cols_numeric.append('target')
train_data_process = train_data[cols_numeric]
train_data_process = train_data_process[cols_numeric].apply(scale_minmax,axis=0)

test_data_process = test_data[cols_numeric]
test_data_process = test_data_process[cols_numeric].apply(scale_minmax,axis=0)


cols_numeric_left = cols_numeric[0:13]
cols_numeric_right = cols_numeric[13:]


## Check effect of Box-Cox transforms on distributions of continuous variables

train_data_process = pd.concat([train_data_process, train_data['target']], axis=1)

fcols = 6
frows = len(cols_numeric_left)
plt.figure(figsize=(4*fcols,4*frows))
i=0

for var in cols_numeric_left:
    dat = train_data_process[[var, 'target']].dropna()

    # Original feature: histogram with fitted normal curve
    i+=1
    plt.subplot(frows,fcols,i)
    sns.distplot(dat[var] , fit=stats.norm);
    plt.title(var+' Original')
    plt.xlabel('')

    # Original feature: Q-Q plot and skewness
    i+=1
    plt.subplot(frows,fcols,i)
    _=stats.probplot(dat[var], plot=plt)
    plt.title('skew='+'{:.4f}'.format(stats.skew(dat[var])))
    plt.xlabel('')
    plt.ylabel('')

    # Original feature: scatter against target and correlation
    i+=1
    plt.subplot(frows,fcols,i)
    plt.plot(dat[var], dat['target'],'.',alpha=0.5)
    plt.title('corr='+'{:.2f}'.format(np.corrcoef(dat[var], dat['target'])[0][1]))

    # Box-Cox transformed feature: histogram with fitted normal curve
    i+=1
    plt.subplot(frows,fcols,i)
    trans_var, lambda_var = stats.boxcox(dat[var].dropna()+1)
    trans_var = scale_minmax(trans_var)
    sns.distplot(trans_var , fit=stats.norm);
    plt.title(var+' Transformed')
    plt.xlabel('')

    # Transformed feature: Q-Q plot and skewness
    i+=1
    plt.subplot(frows,fcols,i)
    _=stats.probplot(trans_var, plot=plt)
    plt.title('skew='+'{:.4f}'.format(stats.skew(trans_var)))
    plt.xlabel('')
    plt.ylabel('')

    # Transformed feature: scatter against target and correlation
    i+=1
    plt.subplot(frows,fcols,i)
    plt.plot(trans_var, dat['target'],'.',alpha=0.5)
    plt.title('corr='+'{:.2f}'.format(np.corrcoef(trans_var,dat['target'])[0][1]))
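The transform applied by `stats.boxcox` can be written out directly. The sketch below is a plain-Python illustration for a fixed, caller-supplied `lmbda`; `box_cox` is a hypothetical helper. When SciPy is called without a `lmbda` argument, as in the loop above, it also estimates that parameter from the data and returns it, which this sketch does not attempt. The transform needs strictly positive inputs, which is presumably why the article adds 1 to the min-max scaled features before transforming them.

```python
import math


def box_cox(values, lmbda):
    """Box-Cox transform of strictly positive values for a fixed lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive inputs")
    if lmbda == 0:
        # The limit of (x**lmbda - 1) / lmbda as lmbda -> 0 is log(x)
        return [math.log(v) for v in values]
    return [(v ** lmbda - 1.0) / lmbda for v in values]


# Illustrative values only: min-max scaled data lies in [0, 1], so shift by 1 first
scaled = [0.0, 0.25, 1.0]
shifted = [v + 1.0 for v in scaled]
print(box_cox(shifted, 0))    # logarithm
print(box_cox(shifted, 1.0))  # lambda = 1 only shifts the data
```

With `lmbda = 1` the shape of the distribution is unchanged. Values below 1 compress the right tail and values above 1 stretch it, which is how a fitted lambda can reduce skewness.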


# ## Check effect of Box-Cox transforms on distributions of continuous variables

# Plot generation is commented out; uncomment to generate the plots

# fcols = 6
# frows = len(cols_numeric_right)
# plt.figure(figsize=(4*fcols,4*frows))
# i=0

# for var in cols_numeric_right:
#     dat = train_data_process[[var, 'target']].dropna()
#
#     i+=1
#     plt.subplot(frows,fcols,i)
#     sns.distplot(dat[var] , fit=stats.norm);
#     plt.title(var+' Original')
#     plt.xlabel('')
#
#     i+=1
#     plt.subplot(frows,fcols,i)
#     _=stats.probplot(dat[var], plot=plt)
#     plt.title('skew='+'{:.4f}'.format(stats.skew(dat[var])))
#     plt.xlabel('')
#     plt.ylabel('')
#
#     i+=1
#     plt.subplot(frows,fcols,i)
#     plt.plot(dat[var], dat['target'],'.',alpha=0.5)
#     plt.title('corr='+'{:.2f}'.format(np.corrcoef(dat[var], dat['target'])[0][1]))
#
#     i+=1
#     plt.subplot(frows,fcols,i)
#     trans_var, lambda_var = stats.boxcox(dat[var].dropna()+1)
#     trans_var = scale_minmax(trans_var)
#     sns.distplot(trans_var , fit=stats.norm);
#     plt.title(var+' Transformed')
#     plt.xlabel('')
#
#     i+=1
#     plt.subplot(frows,fcols,i)
#     _=stats.probplot(trans_var, plot=plt)
#     plt.title('skew='+'{:.4f}'.format(stats.skew(trans_var)))
#     plt.xlabel('')
#     plt.ylabel('')
#
#     i+=1
#     plt.subplot(frows,fcols,i)
#     plt.plot(trans_var, dat['target'],'.',alpha=0.5)
#     plt.title('corr='+'{:.2f}'.format(np.corrcoef(trans_var,dat['target'])[0][1]))

2. Feature Engineering

2.1 Data Preprocessing and Feature Processing

# Import the data analysis packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

# Read the data
train_data_file = "./zhengqi_train.txt"
test_data_file = "./zhengqi_test.txt"

train_data = pd.read_csv(train_data_file, sep='\t', encoding='utf-8')
test_data = pd.read_csv(test_data_file, sep='\t', encoding='utf-8')


train_data.describe()  # data overview



8 rows × 39 columns

2.1.1 Outlier Analysis

# Outlier analysis
plt.figure(figsize=(18, 10))
plt.boxplot(x=train_data.values,labels=train_data.columns)
plt.hlines([-7.5, 7.5], 0, 40, colors='r')
plt.show()



## Remove outliers
train_data = train_data[train_data['V9']>-7.5]
train_data.describe()



8 rows × 39 columns


test_data.describe()



8 rows × 38 columns

2.1.2 Normalization


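from sklearn import preprocessing

features_columns = [col for col in train_data.columns if col not in ['target']]

min_max_scaler = preprocessing.MinMaxScaler()

# Fit the scaler on the training features only
min_max_scaler = min_max_scaler.fit(train_data[features_columns])

train_data_scaler = min_max_scaler.transform(train_data[features_columns])
test_data_scaler = min_max_scaler.transform(test_data[features_columns])

train_data_scaler = pd.DataFrame(train_data_scaler)
train_data_scaler.columns = features_columns

test_data_scaler = pd.DataFrame(test_data_scaler)
test_data_scaler.columns = features_columns

# train_data has gaps in its index after the outlier rows were removed, while
# train_data_scaler has a fresh 0..n-1 index. Assigning the Series directly would
# align on index labels and leave NaN or shifted targets, so copy the values
# positionally instead.
train_data_scaler['target'] = train_data['target'].values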
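The scaler above is fitted on the training features and then reused for the test features, unlike the `scale_minmax` helper in section 1, which was applied to each set separately. A plain-Python sketch of that fit-then-apply pattern follows; the helper names `fit_min_max` and `apply_min_max` and the numbers are illustrative assumptions, not part of the article's code.

```python
def fit_min_max(train_column):
    """Learn the scaling parameters from the training column only."""
    return min(train_column), max(train_column)


def apply_min_max(column, lo, hi):
    """Scale a column with previously learned parameters."""
    if hi == lo:
        return [0.0 for _ in column]  # constant column: avoid division by zero
    return [(v - lo) / (hi - lo) for v in column]


# Illustrative values only
train_col = [2.0, 4.0, 6.0]
test_col = [1.0, 5.0, 8.0]

lo, hi = fit_min_max(train_col)
train_scaled = apply_min_max(train_col, lo, hi)
test_scaled = apply_min_max(test_col, lo, hi)
print(train_scaled)
print(test_scaled)
```

Training values land exactly in [0, 1], while test values can fall below 0 or above 1 when they lie outside the training range. That is expected: it keeps both sets on the same scale, so the same raw value maps to the same scaled value in train and test.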


train_data_scaler.describe()
test_data_scaler.describe()



8 rows × 38 columns


# Compare the distributions of the two datasets
dist_cols = 6
dist_rows = len(test_data_scaler.columns)

plt.figure(figsize=(4*dist_cols,4*dist_rows))


for i, col in enumerate(test_data_scaler.columns):
    ax=plt.subplot(dist_rows,dist_cols,i+1)
    ax = sns.kdeplot(train_data_scaler[col], color="Red", shade=True)
    ax = sns.kdeplot(test_data_scaler[col], color="Blue", shade=True)
    ax.set_xlabel(col)
    ax.set_ylabel("Frequency")
    ax = ax.legend(["train","test"])

# plt.show()
# Plot display is commented out; uncomment to show the figure


View the distributions of features 'V5', 'V17', 'V28', 'V22', 'V11', and 'V9'.


drop_col = 6
drop_row = 1

plt.figure(figsize=(5*drop_col,5*drop_row))

for i, col in enumerate(["V5","V9","V11","V17","V22","V28"]):
    ax = plt.subplot(drop_row,drop_col,i+1)
    ax = sns.kdeplot(train_data_scaler[col], color="Red", shade=True)
    ax = sns.kdeplot(test_data_scaler[col], color="Blue", shade=True)
    ax.set_xlabel(col)
    ax.set_ylabel("Frequency")
    ax = ax.legend(["train","test"])
plt.show()


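For these features, the training-set and test-set distributions are inconsistent, which affects the model's ability to generalize, so these features are removed.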

2.1.3 Feature Correlation

plt.figure(figsize=(20, 16))
column = train_data_scaler.columns.tolist()
mcorr = train_data_scaler[column].corr(method="spearman")
mask = np.zeros_like(mcorr, dtype=bool)  # builtin bool replaces the np.bool alias, which is gone from recent NumPy
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)
g = sns.heatmap(mcorr, mask=mask, cmap=cmap, square=True, annot=True, fmt='0.2f')
plt.show()

2.2 Feature Dimensionality Reduction

mcorr=mcorr.abs()
numerical_corr=mcorr[mcorr['target']>0.1]['target']
print(numerical_corr.sort_values(ascending=False))

index0 = numerical_corr.sort_values(ascending=False).index
print(train_data_scaler[index0].corr('spearman'))


target    1.000000
V0        0.712403
V31       0.711636
V1        0.682909
V8        0.679469
V27       0.657398
V2        0.585850
V16       0.545793
V3        0.501622
V4        0.478683
V12       0.460300
V10       0.448682
V36       0.425991
V37       0.376443
V24       0.305526
V5        0.286076
V6        0.280195
V20       0.278381
V11       0.234551
V15       0.221290
V29       0.190109
V7        0.185321
V19       0.180111
V18       0.149741
V13       0.149199
V17       0.126262
V22       0.112743
V30       0.101378
Name: target, dtype: float64
          target        V0       V31        V1        V8       V27        V2  \
target  1.000000  0.712403  0.711636  0.682909  0.679469  0.657398  0.585850
V0      0.712403  1.000000  0.739116  0.894116  0.832151  0.763128  0.516817
V31     0.711636  0.739116  1.000000  0.807585  0.841469  0.765750  0.589890
V1      0.682909  0.894116  0.807585  1.000000  0.849034  0.807102  0.490239
V8      0.679469  0.832151  0.841469  0.849034  1.000000  0.887119  0.676417
V27     0.657398  0.763128  0.765750  0.807102  0.887119  1.000000  0.709534
V2      0.585850  0.516817  0.589890  0.490239  0.676417  0.709534  1.000000
V16     0.545793  0.388852  0.642309  0.396122  0.642156  0.620981  0.783643
V3      0.501622  0.401150  0.420134  0.363749  0.400915  0.402468  0.417190
V4      0.478683  0.697430  0.521226  0.651615  0.455801  0.424260  0.062134
V12     0.460300  0.640696  0.471528  0.596173  0.368572  0.336190  0.055734
V10     0.448682  0.279350  0.445335  0.255763  0.351127  0.203066  0.292769
V36     0.425991  0.214930  0.390250  0.192985  0.263291  0.186131  0.259475
V37    -0.376443 -0.472200 -0.301906 -0.397080 -0.507057 -0.557098 -0.731786
V24    -0.305526 -0.336325 -0.267968 -0.289742 -0.148323 -0.153834  0.018458
V5     -0.286076 -0.356704 -0.162304 -0.242776 -0.188993 -0.222596 -0.324464
V6      0.280195  0.131507  0.340145  0.147037  0.355064  0.356526  0.546921
V20     0.278381  0.444939  0.349530  0.421987  0.408853  0.361040  0.293635
V11    -0.234551 -0.333101 -0.131425 -0.221910 -0.161792 -0.190952 -0.271868
V15     0.221290  0.334135  0.110674  0.230395  0.054701  0.007156 -0.206499
V29     0.190109  0.334603  0.121833  0.240964  0.050211  0.006048 -0.255559
V7      0.185321  0.075732  0.277283  0.082766  0.278231  0.290620  0.378984
V19    -0.180111 -0.144295 -0.183185 -0.146559 -0.170237 -0.228613 -0.179416
V18     0.149741  0.132143  0.094678  0.093688  0.079592  0.091660  0.114929
V13     0.149199  0.173861  0.071517  0.134595  0.105380  0.126831  0.180477
V17     0.126262  0.055024  0.115056  0.081446  0.102544  0.036520 -0.050935
V22    -0.112743 -0.076698 -0.106450 -0.072848 -0.078333 -0.111196 -0.241206
V30     0.101378  0.099242  0.131453  0.109216  0.165204  0.167073  0.176236

             V16        V3        V4  ...       V11       V15       V29  \
target  0.545793  0.501622  0.478683  ... -0.234551  0.221290  0.190109
V0      0.388852  0.401150  0.697430  ... -0.333101  0.334135  0.334603
V31     0.642309  0.420134  0.521226  ... -0.131425  0.110674  0.121833
V1      0.396122  0.363749  0.651615  ... -0.221910  0.230395  0.240964
V8      0.642156  0.400915  0.455801  ... -0.161792  0.054701  0.050211
V27     0.620981  0.402468  0.424260  ... -0.190952  0.007156  0.006048
V2      0.783643  0.417190  0.062134  ... -0.271868 -0.206499 -0.255559
V16     1.000000  0.388886  0.009749  ... -0.088716 -0.280952 -0.327558
V3      0.388886  1.000000  0.294049  ... -0.126924  0.145291  0.128079
V4      0.009749  0.294049  1.000000  ... -0.164113  0.641180  0.692626
V12    -0.024541  0.286500  0.897807  ... -0.232228  0.703861  0.732617
V10     0.473009  0.295181  0.123829  ...  0.049969 -0.014449 -0.060440
V36     0.469130  0.299063  0.099359  ... -0.017805 -0.012844 -0.051097
V37    -0.431507 -0.219751  0.040396  ...  0.455998  0.234751  0.273926
V24     0.064523 -0.237022 -0.558334  ...  0.170969 -0.687353 -0.677833
V5     -0.045495 -0.230466 -0.248061  ...  0.797583 -0.250027 -0.233233
V6      0.760362  0.181135 -0.204780  ... -0.170545 -0.443436 -0.486682
V20     0.239572  0.270647  0.257815  ... -0.138684  0.050867  0.035022
V11    -0.088716 -0.126924 -0.164113  ...  1.000000 -0.123004 -0.120982
V15    -0.280952  0.145291  0.641180  ... -0.123004  1.000000  0.947360
V29    -0.327558  0.128079  0.692626  ... -0.120982  0.947360  1.000000
V7      0.651907  0.132564 -0.150577  ... -0.097623 -0.335054 -0.360490
V19    -0.019645 -0.265940 -0.237529  ... -0.094150 -0.215364 -0.212691
V18     0.066147  0.014697  0.135792  ... -0.153625  0.109030  0.098474
V13     0.074214 -0.019453  0.061801  ... -0.436341  0.047845  0.024514
V17     0.172978  0.067720  0.060753  ...  0.192222 -0.004555 -0.006498
V22    -0.091204 -0.305218  0.021174  ...  0.079577  0.069993  0.072070
V30     0.217428  0.055660 -0.053976  ... -0.102750 -0.147541 -0.161966

              V7       V19       V18       V13       V17       V22       V30
target  0.185321 -0.180111  0.149741  0.149199  0.126262 -0.112743  0.101378
V0      0.075732 -0.144295  0.132143  0.173861  0.055024 -0.076698  0.099242
V31     0.277283 -0.183185  0.094678  0.071517  0.115056 -0.106450  0.131453
V1      0.082766 -0.146559  0.093688  0.134595  0.081446 -0.072848  0.109216
V8      0.278231 -0.170237  0.079592  0.105380  0.102544 -0.078333  0.165204
V27     0.290620 -0.228613  0.091660  0.126831  0.036520 -0.111196  0.167073
V2      0.378984 -0.179416  0.114929  0.180477 -0.050935 -0.241206  0.176236
V16     0.651907 -0.019645  0.066147  0.074214  0.172978 -0.091204  0.217428
V3      0.132564 -0.265940  0.014697 -0.019453  0.067720 -0.305218  0.055660
V4     -0.150577 -0.237529  0.135792  0.061801  0.060753  0.021174 -0.053976
V12    -0.157087 -0.174034  0.125965  0.102293  0.012429 -0.004863 -0.054432
V10     0.242818  0.089046  0.038237 -0.100776  0.258885 -0.132951  0.027257
V36     0.268044  0.099034  0.066478 -0.068582  0.298962 -0.136943  0.056802
V37    -0.284305  0.025241 -0.097699 -0.344661  0.052673  0.110455 -0.176127
V24     0.076407  0.287262 -0.221117 -0.073906  0.094367  0.081279  0.079363
V5      0.118541  0.247903 -0.191786 -0.408978  0.342555  0.143785  0.020252
V6      0.904614  0.292661  0.061109  0.088866  0.094702 -0.102842  0.201834
V20     0.064205  0.029483  0.050529  0.004600  0.061369 -0.092706  0.035036
V11    -0.097623 -0.094150 -0.153625 -0.436341  0.192222  0.079577 -0.102750
V15    -0.335054 -0.215364  0.109030  0.047845 -0.004555  0.069993 -0.147541
V29    -0.360490 -0.212691  0.098474  0.024514 -0.006498  0.072070 -0.161966
V7      1.000000  0.269472  0.032519  0.059724  0.178034  0.058178  0.196347
V19     0.269472  1.000000 -0.034215 -0.106162  0.250114  0.075582  0.120766
V18     0.032519 -0.034215  1.000000  0.242008 -0.073678  0.016819  0.133708
V13     0.059724 -0.106162  0.242008  1.000000 -0.108020  0.348432 -0.097178
V17     0.178034  0.250114 -0.073678 -0.108020  1.000000  0.363785  0.057480
V22     0.058178  0.075582  0.016819  0.348432  0.363785  1.000000 -0.054570
V30     0.196347  0.120766  0.133708 -0.097178  0.057480 -0.054570  1.000000

[28 rows x 28 columns]

2.2.1 Initial Correlation Screening

features_corr = numerical_corr.sort_values(ascending=False).reset_index()
features_corr.columns = ['features_and_target', 'corr']
features_corr_select = features_corr[features_corr['corr']>0.3]  # keep features whose correlation exceeds 0.3
print(features_corr_select)
select_features = [col for col in features_corr_select['features_and_target'] if col not in ['target']]
new_train_data_corr_select = train_data_scaler[select_features+['target']]
new_test_data_corr_select = test_data_scaler[select_features]


   features_and_target      corr
0               target  1.000000
1                   V0  0.712403
2                  V31  0.711636
3                   V1  0.682909
4                   V8  0.679469
5                  V27  0.657398
6                   V2  0.585850
7                  V16  0.545793
8                   V3  0.501622
9                   V4  0.478683
10                 V12  0.460300
11                 V10  0.448682
12                 V36  0.425991
13                 V37  0.376443
14                 V24  0.305526

2.2.2 Multicollinearity Analysis

!pip install statsmodels -i https://pypi.tuna.tsinghua.edu.cn/simple


Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: statsmodels in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (0.13.5)
Requirement already satisfied: scipy>=1.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from statsmodels) (1.6.3)
Requirement already satisfied: pandas>=0.25 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from statsmodels) (1.1.5)
Requirement already satisfied: packaging>=21.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from statsmodels) (21.3)
Requirement already satisfied: numpy>=1.17 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from statsmodels) (1.19.5)
Requirement already satisfied: patsy>=0.5.2 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from statsmodels) (0.5.3)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from packaging>=21.3->statsmodels) (3.0.9)
Requirement already satisfied: pytz>=2017.2 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pandas>=0.25->statsmodels) (2019.3)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pandas>=0.25->statsmodels) (2.8.2)
Requirement already satisfied: six in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from patsy>=0.5.2->statsmodels) (1.16.0)

[notice] A new release of pip available: 22.1.2 -> 23.0.1
[notice] To update, run: pip install --upgrade pip


from statsmodels.stats.outliers_influence import variance_inflation_factor  # variance inflation factor for multicollinearity

# Multicollinearity
new_numerical=['V0', 'V2', 'V3', 'V4', 'V5', 'V6', 'V10','V11', 'V13', 'V15', 'V16', 'V18', 'V19', 'V20', 'V22','V24','V30', 'V31', 'V37']
X=np.matrix(train_data_scaler[new_numerical])
VIF_list=[variance_inflation_factor(X, i) for i in range(X.shape[1])]
VIF_list


[216.73387180903222, 114.38118723828812, 27.863778129686356, 201.96436579080174, 78.93722825798903, 151.06983667656212, 14.519604941508451, 82.69750284665385, 28.479378440614585, 27.759176471505945, 526.6483470743831, 23.50166642638334, 19.920315849901424, 24.640481765008683, 11.816055964845381, 4.958208708452915, 37.09877416736591, 298.26442986612767, 47.854002539887034]
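To read these numbers, it helps to recall how the variance inflation factor is defined. For feature j, let R_j^2 be the coefficient of determination obtained by regressing feature j on all the other listed features. Then:

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
\qquad\Longleftrightarrow\qquad
R_j^2 = 1 - \frac{1}{\mathrm{VIF}_j}
```

Plugging in the output above, the largest value (about 526.65, the entry for 'V16') corresponds to R_j^2 of roughly 0.998, and the smallest (about 4.96, the entry for 'V24') to roughly 0.80. One caveat, stated as my understanding rather than something the article confirms: statsmodels' `variance_inflation_factor` does not add an intercept column itself, so with min-max scaled, uncentered features passed without a constant, the R_j^2 it uses is an uncentered one, and the resulting values tend to be larger than a VIF computed with an intercept.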

2.2.3 PCA 处理降维

from sklearn.decomposition import PCA   # principal component analysis

# Reduce dimensionality with PCA, keeping 90% of the variance
pca = PCA(n_components=0.9)
new_train_pca_90 = pca.fit_transform(train_data_scaler.iloc[:, 0:-1])
new_test_pca_90 = pca.transform(test_data_scaler)
new_train_pca_90 = pd.DataFrame(new_train_pca_90)
new_test_pca_90 = pd.DataFrame(new_test_pca_90)
new_train_pca_90['target'] = train_data_scaler['target']
new_train_pca_90.describe()





train_data_scaler.describe()





8 rows × 39 columns


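# Reduce dimensionality with PCA, keeping 95% of the variance.
# The variable names carry a `_16` suffix, but n_components=0.95 does not fix the
# number of components: the describe() output below reports 22 columns,
# i.e. 21 components plus `target`.
pca = PCA(n_components=0.95)
new_train_pca_16 = pca.fit_transform(train_data_scaler.iloc[:, 0:-1])
new_test_pca_16 = pca.transform(test_data_scaler)
new_train_pca_16 = pd.DataFrame(new_train_pca_16)
new_test_pca_16 = pd.DataFrame(new_test_pca_16)
new_train_pca_16['target'] = train_data_scaler['target']
new_train_pca_16.describe()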





8 rows × 22 columns
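
When `n_components` is a float between 0 and 1, scikit-learn keeps the smallest number of components whose cumulative explained-variance ratio exceeds that value, so the resulting width depends on the data. The sketch below illustrates this on synthetic data; `X_demo` and `scales` are assumptions made up for the example and are unrelated to the steam dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data whose columns have strongly decaying variances
rng = np.random.default_rng(0)
scales = np.array([5, 4, 3, 2, 1, 0.5, 0.4, 0.3, 0.2, 0.1])
X_demo = rng.normal(size=(1000, 10)) * scales

# Fit all components and find where the cumulative ratio first exceeds 0.95
full = PCA(svd_solver="full").fit(X_demo)
cumulative = np.cumsum(full.explained_variance_ratio_)
k_manual = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Let PCA choose the count from the same threshold
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_demo)

print("components chosen manually:", k_manual)
print("components chosen by PCA(n_components=0.95):", pca_95.n_components_)
```

To pin the dimensionality to a fixed value instead, pass an integer such as `PCA(n_components=16)`; checking `pca.n_components_` after fitting shows which case applies.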

3. Model training

3.1 Regression and related models

## Import the required libraries
from sklearn.linear_model import LinearRegression   # linear regression
from sklearn.neighbors import KNeighborsRegressor   # k-nearest neighbors regression
from sklearn.tree import DecisionTreeRegressor      # decision tree regression
from sklearn.ensemble import RandomForestRegressor  # random forest regression
from sklearn.svm import SVR                         # support vector regression
import lightgbm as lgb                              # LightGBM
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.model_selection import train_test_split  # data splitting
from sklearn.metrics import mean_squared_error         # evaluation metric

from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

## Split the data into a training set and an offline validation set

# Use the PCA-reduced frames (the `new_*_pca_16` variables built with n_components=0.95)
new_train_pca_16 = new_train_pca_16.fillna(0)
train = new_train_pca_16[new_test_pca_16.columns]
target = new_train_pca_16['target']

# Train on 80% of the rows and hold out 20% for validation
train_data, test_data, train_target, test_target = train_test_split(train, target, test_size=0.2, random_state=0)

3.1.1 Multiple linear regression

clf = LinearRegression()
clf.fit(train_data, train_target)
score = mean_squared_error(test_target, clf.predict(test_data))
print("LinearRegression:   ", score)

train_score = []
test_score = []

# Fit on increasing amounts of data to see how well the model learns
for i in range(10, len(train_data)+1, 10):
    lin_reg = LinearRegression()
    lin_reg.fit(train_data[:i], train_target[:i])
    # Two views of the fit: the MSE on the rows used for training (how well the
    # model fits them) and the MSE on the held-out validation set.
    # lin_reg.predict(train_data[:i]) covers all rows the model was trained on.
    y_train_predict = lin_reg.predict(train_data[:i])
    train_score.append(mean_squared_error(train_target[:i], y_train_predict))

    y_test_predict = lin_reg.predict(test_data)
    test_score.append(mean_squared_error(test_target, y_test_predict))

# The raw MSE values are plotted; np.sqrt(train_score) would give RMSE instead
plt.plot([i for i in range(1, len(train_score)+1)], train_score, label='train')
plt.plot([i for i in range(1, len(test_score)+1)], test_score, label='test')

# plt.legend() shows the legend built from the `label` arguments
plt.legend()
plt.show()


LinearRegression:    0.2642337917628173


Define a function that plots a model's learning curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    print(train_scores_mean)
    print(test_scores_mean)

    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt


def plot_learning_curve_old(algo, X_train, X_test, y_train, y_test):
    """Plot a learning curve from an estimator instance, X_train, X_test, y_train and y_test.

    The estimator must be passed already instantiated with its hyperparameters,
    e.g. PolynomialRegression(degree=2).
    """
    train_score = []
    test_score = []
    for i in range(10, len(X_train)+1, 10):
        algo.fit(X_train[:i], y_train[:i])

        y_train_predict = algo.predict(X_train[:i])
        train_score.append(mean_squared_error(y_train[:i], y_train_predict))

        y_test_predict = algo.predict(X_test)
        test_score.append(mean_squared_error(y_test, y_test_predict))

    plt.plot([i for i in range(1, len(train_score)+1)],
             train_score, label="train")
    plt.plot([i for i in range(1, len(test_score)+1)],
             test_score, label="test")

    plt.legend()
    plt.show()


# plot_learning_curve_old(LinearRegression(), train_data, test_data, train_target, test_target)


# Learning curve for the linear regression model
X = train_data.values
y = train_target.values

# Figure 1
title = r"LinearRegression"
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
estimator = LinearRegression()    # build the model
plot_learning_curve(estimator, title, X, y, ylim=(0.5, 0.8), cv=cv, n_jobs=1)


[0.70183463 0.66761103 0.66101945 0.65732898 0.65360375]
[0.57364886 0.61882339 0.62809368 0.63012866 0.63158596]
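
Because `plot_learning_curve` calls `learning_curve` without a `scoring` argument, the two arrays printed above are the estimator's default score, which for scikit-learn regressors is R² (higher is better), not the MSE used for the hold-out results. A minimal sketch of the same curve expressed as MSE follows; the synthetic `X_demo` and `y_demo` are assumptions that stand in for `train_data.values` and `train_target.values`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

# Synthetic linear data with noise (stand-in for the real features and target)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=400)

cv = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X_demo, y_demo, cv=cv,
    train_sizes=np.linspace(0.1, 1.0, 5),
    scoring="neg_mean_squared_error",   # scikit-learn reports MSE negated
)

# Flip the sign to recover ordinary MSE (lower is better)
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)
print(sizes)
print(train_mse)
print(val_mse)
```

With this scoring the curves can be compared directly with the competition metric, at the cost of changing the y-axis limits passed as `ylim`.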






3.1.2 K-nearest neighbors regression

for i in range(3, 10):
    clf = KNeighborsRegressor(n_neighbors=i)  # use the i nearest neighbors, i = 3..9
    clf.fit(train_data, train_target)
    score = mean_squared_error(test_target, clf.predict(test_data))
    print("KNeighborsRegressor:   ", score)


KNeighborsRegressor:    0.27619208861976163
KNeighborsRegressor:    0.2597627823313149
KNeighborsRegressor:    0.2628212724567474
KNeighborsRegressor:    0.26670982271241833
KNeighborsRegressor:    0.2659603905091448
KNeighborsRegressor:    0.26353694644788067
KNeighborsRegressor:    0.2673470579477979
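
The loop above compares k = 3..9 on a single hold-out split, and the scores differ only in the second or third decimal place. One way to make the choice less dependent on that particular split is cross-validation. The sketch below uses `GridSearchCV` on synthetic data; `X_demo`, `y_demo` and the 5-fold setup are assumptions for illustration, not the article's configuration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor

# Synthetic nonlinear regression problem (stand-in for train_data / train_target)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=300)

search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": list(range(3, 10))},  # same range as the loop above
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]
best_cv_mse = -search.best_score_   # undo scikit-learn's sign convention
print("best k:", best_k)
print("cross-validated MSE:", best_cv_mse)
```

On the real data, replacing `X_demo` and `y_demo` with `train_data` and `train_target` would give a per-k MSE averaged over folds in `search.cv_results_`.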


# plot_learning_curve_old(KNeighborsRegressor(n_neighbors=5) , train_data, test_data, train_target, test_target)


# Plot the learning curve for k-nearest neighbors regression
X = train_data.values
y = train_target.values

# K-nearest neighbors regression
title = r"KNeighborsRegressor"
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

estimator = KNeighborsRegressor(n_neighbors=8)    # build the model
plot_learning_curve(estimator, title, X, y, ylim=(0.3, 0.9), cv=cv, n_jobs=1)


[0.61581146 0.68763995 0.71414969 0.73084172 0.73976273]
[0.50369207 0.58753672 0.61969929 0.64062459 0.6560054 ]






3.1.3 Decision tree regression

clf = DecisionTreeRegressor()
clf.fit(train_data, train_target)
score = mean_squared_error(test_target, clf.predict(test_data))
print("DecisionTreeRegressor:   ", score)


DecisionTreeRegressor:    0.6405298823529413


# plot_learning_curve_old(DecisionTreeRegressor(), train_data, test_data, train_target, test_target)


X = train_data.values
y = train_target.values

# Decision tree regression
title = r"DecisionTreeRegressor"
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

estimator = DecisionTreeRegressor()    # build the model
plot_learning_curve(estimator, title, X, y, ylim=(0.1, 1.3), cv=cv, n_jobs=1)


[1. 1. 1. 1. 1.]
[0.11833987 0.22982731 0.2797608  0.30950084 0.32628853]
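
A training score of exactly 1 next to a cross-validation score near 0.3 is the typical signature of an unconstrained tree memorizing the training rows. The sketch below contrasts an unconstrained tree with one limited by `max_depth` and `min_samples_leaf`; the synthetic data and the specific limits (5 and 10) are assumptions chosen only to illustrate the effect, not tuned values for this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy linear target (stand-in for the PCA features and steam target)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 8))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=1.0, size=600)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Unconstrained tree: grows until every training row is fitted
deep = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
# Constrained tree: limited depth and a minimum number of rows per leaf
pruned = DecisionTreeRegressor(max_depth=5, min_samples_leaf=10, random_state=0).fit(X_tr, y_tr)

for name, model in [("unconstrained", deep), ("max_depth=5, min_samples_leaf=10", pruned)]:
    # score() returns R^2, the same quantity shown in the learning curve above
    print(name, "train R2:", model.score(X_tr, y_tr), "validation R2:", model.score(X_val, y_val))
```

Whether the constrained tree validates better depends on the data, so on the steam features these limits would normally be searched rather than fixed by hand.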






3.1.4 Random forest regression

clf = RandomForestRegressor(n_estimators=200)  # a forest of 200 trees
clf.fit(train_data, train_target)
score = mean_squared_error(test_target, clf.predict(test_data))
print("RandomForestRegressor:   ", score)
# plot_learning_curve_old(RandomForestRegressor(n_estimators=200), train_data, test_data, train_target, test_target)


RandomForestRegressor:    0.24087959640588236
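
Besides a hold-out split, a random forest can estimate its own generalization from the rows each tree did not see during bootstrapping (the out-of-bag rows). The following sketch shows the `oob_score=True` option on synthetic data; `X_demo` and `y_demo` are assumptions, and the reported `oob_score_` is an R² value rather than an MSE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic noisy linear target (stand-in for train_data / train_target)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0, 1.5]) + rng.normal(scale=0.5, size=500)

forest = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_demo, y_demo)

print("out-of-bag R2:", forest.oob_score_)        # estimated from rows left out of each bootstrap
print("training R2:", forest.score(X_demo, y_demo))  # optimistic, since the rows were seen in training
```

The gap between the two numbers mirrors the gap between the training and cross-validation curves plotted for the forest.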


X = train_data.values
y = train_target.values

# Random forest
title = r"RandomForestRegressor"
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

estimator = RandomForestRegressor(n_estimators=200)    # build the model
plot_learning_curve(estimator, title, X, y, ylim=(0.4, 1.0), cv=cv, n_jobs=1)


[0.93619796 0.94798334 0.95197393 0.95415054 0.95570763]
[0.53953995 0.61531165 0.64366926 0.65941678 0.67319725]






3.1.5 Gradient Boosting

from sklearn.ensemble import GradientBoostingRegressor

# Removed parameters: presort=True, random_state=10, subsample=0.8, verbose=0.
# min_impurity_split=None (its default) is left out as well, since scikit-learn 1.0 no longer accepts it.
myGBR = GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
                                  learning_rate=0.03, loss='huber', max_depth=14,
                                  max_features='sqrt', max_leaf_nodes=None,
                                  min_impurity_decrease=0.0,
                                  min_samples_leaf=10, min_samples_split=40,
                                  min_weight_fraction_leaf=0.0, n_estimators=10,
                                  warm_start=False)

myGBR.fit(train_data, train_target)
# Evaluate the fitted gradient boosting model (myGBR), not the earlier `clf`
score = mean_squared_error(test_target, myGBR.predict(test_data))
print("GradientBoostingRegressor: ", score)

# n_estimators is kept small for a quick demonstration; set it as needed in practice
myGBR = GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
                                  learning_rate=0.03, loss='huber', max_depth=14,
                                  max_features='sqrt', max_leaf_nodes=None,
                                  min_impurity_decrease=0.0,
                                  min_samples_leaf=10, min_samples_split=40,
                                  min_weight_fraction_leaf=0.0, n_estimators=10,
                                  warm_start=False)

# plot_learning_curve_old(myGBR, train_data, test_data, train_target, test_target)


GradientBoostingRegressor:    0.906640574789251

This printed value is identical to the LightGBM score in section 3.1.6, which indicates that it was computed from `clf` rather than from `myGBR`. Rerunning the cell above gives the gradient boosting model's own MSE.


X = train_data.values
y = train_target.values

# Gradient boosting
title = r"GradientBoostingRegressor"
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

# min_impurity_split=None (its default) is left out, since scikit-learn 1.0 no longer accepts it
estimator = GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
                                      learning_rate=0.03, loss='huber', max_depth=14,
                                      max_features='sqrt', max_leaf_nodes=None,
                                      min_impurity_decrease=0.0,
                                      min_samples_leaf=10, min_samples_split=40,
                                      min_weight_fraction_leaf=0.0, n_estimators=10,
                                      warm_start=False)    # build the model

plot_learning_curve(estimator, title, X, y, ylim=(0.4, 1.0), cv=cv, n_jobs=1)
# n_estimators is kept small for a quick demonstration; set it as needed in practice
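
The comments above say that `n_estimators` should be set as needed in practice. One hedged way to pick it is to fit a generous number of trees once and use `staged_predict` to read the validation error after each boosting round. The data, `learning_rate=0.03`, `max_depth=3` and the 300-round budget in this sketch are illustrative assumptions, not the article's settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic nonlinear target (stand-in for the PCA features and steam target)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(500, 4))
y_demo = np.sin(2 * X_demo[:, 0]) + X_demo[:, 1] ** 2 + rng.normal(scale=0.3, size=500)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=300, learning_rate=0.03,
                                max_depth=3, random_state=0)
gbr.fit(X_tr, y_tr)

# staged_predict yields the ensemble's prediction after 1, 2, ..., n_estimators rounds
val_mse = [mean_squared_error(y_val, pred) for pred in gbr.staged_predict(X_val)]
best_n = int(np.argmin(val_mse)) + 1   # rounds are 1-based

print("validation MSE after 10 rounds:", val_mse[9])
print("best number of rounds:", best_n, "with validation MSE:", val_mse[best_n - 1])
```

Comparing the error after 10 rounds with the error at `best_n` shows how much a small `n_estimators` with a low learning rate can leave on the table.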

3.1.6 LightGBM regression

# LightGBM regression model
clf = lgb.LGBMRegressor(
    learning_rate=0.01,
    max_depth=-1,
    n_estimators=10,
    boosting_type='gbdt',
    random_state=2019,
    objective='regression',
)
# n_estimators is kept small for a quick demonstration; set it as needed in practice

# Train the model
# (the `verbose` argument of fit() was removed in LightGBM 4.0; omit it on newer versions)
clf.fit(
    X=train_data, y=train_target,
    eval_metric='MSE',
    verbose=50
)

score = mean_squared_error(test_target, clf.predict(test_data))
print("lightGbm: ", score)


lightGbm:    0.906640574789251
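
This MSE is much worse than the linear, KNN and random forest results, which is plausibly a consequence of the demonstration settings (`learning_rate=0.01`, `n_estimators=10`) rather than of LightGBM itself. With squared-error boosting, each round can remove at most a `learning_rate` share of the current training residual, so after n rounds at least (1 - learning_rate)^n of it remains. The arithmetic below is an idealized back-of-the-envelope bound under that assumption, not a simulation of LightGBM's training; the helper name is hypothetical.

```python
learning_rate = 0.01

def best_case_residual_fraction(n_trees, lr=learning_rate):
    """Fraction of the initial residual left after n_trees rounds if every
    tree fitted the current residual exactly (an idealized lower bound)."""
    return (1.0 - lr) ** n_trees

for n_trees in (10, 100, 1000):
    frac = best_case_residual_fraction(n_trees)
    # frac ** 2 is the corresponding share of the initial squared error
    print(n_trees, round(frac, 4), round(frac ** 2, 4))
```

Under this bound, 10 rounds leave roughly 90% of the residual in place, so the model stays close to predicting the mean; many more rounds or a larger learning rate are needed before the comparison with the other models becomes meaningful.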


X = train_data.values
y = train_target.values

# LightGBM
title = r"LGBMRegressor"
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

estimator = lgb.LGBMRegressor(
    learning_rate=0.01,
    max_depth=-1,
    n_estimators=10,
    boosting_type='gbdt',
    random_state=2019,
    objective='regression'
)    # build the model

plot_learning_curve(estimator, title, X, y, ylim=(0.4, 1.0), cv=cv, n_jobs=1)
# n_estimators is kept small for a quick demonstration; set it as needed in practice

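4. Summary

Part 1 of this industrial steam volume prediction series covered three topics. Exploratory data analysis examined the correlations between variables and identified the key ones. Feature engineering refined the data through outlier handling, normalization, and dimensionality reduction. Regression model training then applied mainstream ML models, including decision trees, random forests, and LightGBM. The next part focuses on model validation, feature optimization, and model ensembling.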


Original project link: https://www.heywhale.com/home/column/64141d6b1c8c8b518ba97dcc


Reference link: https://tianchi.aliyun.com/course/278/3427


Published on 2023-03-30
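Machine Learning in Practice Series [1]: Industrial Steam Volume Prediction (latest version, Part 1), covering data exploration, feature engineering, and more | Data Mining | 汀丶 | InfoQ Writing Community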