
Machine Learning in Practice Series [1]: Industrial Steam Volume Prediction (Latest Version, Part 2), covering feature optimization, model fusion, and more

Author: 汀丶
  • 2023-03-31
    Zhejiang
  • Word count: 40,634

    Estimated reading time: about 133 minutes

Industrial Steam Volume Prediction (Latest Version, Part 2)

The source code is linked at the end of the article.

5. Model Validation

5.1 Model Evaluation Concepts and Regularization

5.1.1 Overfitting and Underfitting

### Generate and plot the dataset
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(666)
x = np.random.uniform(-3.0, 3.0, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x**2 + x + 2 + np.random.normal(0, 1, size=100)

plt.scatter(x, y)
plt.show()



Fit the data with linear regression


from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.score(X, y)  # output: 0.4953707811865009

0.4953707811865009


The score is 0.495 (note that LinearRegression.score returns R², not classification accuracy), which is quite low: a straight line fits this data poorly.


### Use mean squared error to judge the fit
from sklearn.metrics import mean_squared_error

y_predict = lin_reg.predict(X)
mean_squared_error(y, y_predict)  # output: 3.0750025765636577

3.0750025765636577


### Plot the fitted line
y_predict = lin_reg.predict(X)
plt.scatter(x, y)
plt.plot(np.sort(x), y_predict[np.argsort(x)], color='r')
plt.show()


5.1.2 Regression Evaluation Metrics and How to Call Them
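The MSE used below is only one of the regression metrics sklearn exposes; as a quick reference (an illustrative sketch of my own, not part of the original notebook), the common ones are all called the same way:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # toy ground truth
y_hat = np.array([2.5, 0.0, 2.0, 8.0])     # toy predictions

print(mean_absolute_error(y_true, y_hat))          # MAE = 0.5
print(mean_squared_error(y_true, y_hat))           # MSE = 0.375
print(np.sqrt(mean_squared_error(y_true, y_hat)))  # RMSE ~ 0.612
print(r2_score(y_true, y_hat))                     # R^2 ~ 0.949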

### Fit with polynomial regression
# * Wrap it in a Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

def PolynomialRegression(degree):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scaler', StandardScaler()),
        ('lin_reg', LinearRegression())
    ])


  • Fit the data with the Pipeline: degree = 2


poly2_reg = PolynomialRegression(degree=2)
poly2_reg.fit(X, y)
y2_predict = poly2_reg.predict(X)

# compare the true values and the predictions with MSE
mean_squared_error(y, y2_predict)  # output: 1.0987392142417856

1.0987392142417856


  • Plot the fitted curve


plt.scatter(x, y)
plt.plot(np.sort(x), y2_predict[np.argsort(x)], color='r')
plt.show()



  • Increase degree to 10


poly10_reg = PolynomialRegression(degree=10)
poly10_reg.fit(X, y)
y10_predict = poly10_reg.predict(X)
mean_squared_error(y, y10_predict)  # output: 1.0508466763764164

plt.scatter(x, y)
plt.plot(np.sort(x), y10_predict[np.argsort(x)], color='r')
plt.show()



  • Increase degree to 100


poly100_reg = PolynomialRegression(degree=100)
poly100_reg.fit(X, y)
y100_predict = poly100_reg.predict(X)
mean_squared_error(y, y100_predict)  # output: 0.6874357783433694

plt.scatter(x, y)
plt.plot(np.sort(x), y100_predict[np.argsort(x)], color='r')
plt.show()



  • Analysis

  • degree=2: the MSE is 1.0987392142417856;

  • degree=10: the MSE is 1.0508466763764164;

  • degree=100: the MSE is 0.6874357783433694;

  • The larger the degree, the more closely the curve fits the training samples. Since the sample points are fixed, we can always find a curve that passes through every one of them, driving the overall training MSE all the way to 0; but fitting the training samples ever more tightly is overfitting, not a better model (see the sketch after this list);

  • The red curve is not the computed fitted curve itself; it is just the predicted y values at the existing sample points connected in order, and in regions without sample points the connected segments differ from the true fitted curve;
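The training-set MSE by itself cannot distinguish a good fit from overfitting. A minimal check (a sketch of my own, reusing the X, y, and PolynomialRegression defined above; the split and seed are arbitrary) is to hold out a test set and compare the two errors:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
for degree in [2, 10, 100]:
    reg = PolynomialRegression(degree=degree)
    reg.fit(X_train, y_train)
    print(degree,
          mean_squared_error(y_train, reg.predict(X_train)),  # keeps shrinking with degree
          mean_squared_error(y_test, reg.predict(X_test)))    # deteriorates once the model overfits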

5.1.3 Cross-Validation

  • Cross-validation iterators


K-fold cross-validation: KFold splits all samples into k groups of (if possible) equal size, called folds (if k = n, this is equivalent to the Leave-One-Out strategy). The prediction function is trained on k - 1 folds, and the remaining fold is used for testing.


Repeated K-fold: RepeatedKFold repeats K-fold n times, producing a different split on each repetition; use it when you want to run KFold n times (a minimal sketch follows after this list).


Leave-one-out cross-validation: LeaveOneOut (or LOO) is a simple cross-validation. Each training set is created from all samples except one, and the test set is the single sample left out. Thus, for n samples we have n different training sets and n different test sets. This procedure does not waste much data, since only one sample is removed from each training set.


Leave-P-out cross-validation: LeavePOut is very similar to LeaveOneOut; it creates all possible training/test sets by removing p samples from the full set. For n samples, this yields $\binom{n}{p}$ train-test pairs. Unlike LeaveOneOut and KFold, the test sets overlap when p > 1.


User-defined splits: the ShuffleSplit iterator generates a user-specified number of independent train/test splits. Samples are first shuffled and then divided into a pair of train and test sets.


Reproducible randomness: setting an explicit random_state makes the pseudo-random generator's results repeatable.
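Of the iterators above, RepeatedKFold is the only one the combined demo script further below does not exercise, so here is a minimal sketch (the toy data and parameters are my own choices):

import numpy as np
from sklearn.model_selection import RepeatedKFold

X_toy = np.arange(8).reshape(4, 2)
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=0)
for train_index, test_index in rkf.split(X_toy):
    print("train:", train_index, "test:", test_index)  # 2 folds x 2 repeats = 4 splits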


  • Stratified cross-validation iterators based on class labels


How to handle class imbalance? Use the stratified sampling in StratifiedKFold and StratifiedShuffleSplit. Some classification problems show a large imbalance in the target class distribution: for instance, there may be several times more negative samples than positive ones. In such cases it is recommended to use the stratified sampling implemented in StratifiedKFold and StratifiedShuffleSplit, which ensures that relative class frequencies are roughly preserved in every training and validation fold.


StratifiedKFold is a variant of k-fold that returns stratified folds: in each fold, the class proportions are roughly the same as in the full dataset.


StratifiedShuffleSplit is a variant of ShuffleSplit that returns stratified splits, i.e., each split preserves the class proportions of the full dataset.
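The combined demo script below only prints fold shapes; to actually see the preserved class ratios, here is a quick check of my own on an imbalanced toy label vector:

import numpy as np
from sklearn.model_selection import StratifiedKFold

y_imb = np.array([0] * 45 + [1] * 5)  # 90% / 10% class imbalance
X_imb = np.zeros((50, 1))
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_imb, y_imb):
    print(np.bincount(y_imb[train_idx]), np.bincount(y_imb[test_idx]))  # every fold keeps the 9:1 ratio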


  • Cross-validation iterators for grouped data


How to further test a model's generalization ability? Hold out a specific set of groups that belongs to neither the training nor the test set. Sometimes we want to know whether a model trained on a particular set of groups generalizes well to unseen groups. To measure this, we must ensure that all samples in the validation fold come from groups that are entirely unrepresented in the paired training fold.


GroupKFold is a variant of k-fold that ensures the same group is never represented in both the test and training sets. For example, if the data were collected from different subjects, with several samples per subject, and the model is flexible enough to learn highly subject-specific features, it may fail to generalize to new subjects. GroupKFold can detect this kind of overfitting.


LeaveOneGroupOut is a cross-validation scheme that holds out samples according to a third-party-provided array of integer groups. This group information can encode arbitrary domain-specific, predefined cross-validation folds.


Each training set then consists of all samples except those belonging to one specific group.


LeavePGroupsOut is similar to LeaveOneGroupOut, but removes the samples associated with P groups for each training/test set.


The GroupShuffleSplit iterator is a combination of ShuffleSplit and LeavePGroupsOut: it generates a sequence of random partitions in which a subset of groups is held out for each split.


  • Time series splitting


TimeSeriesSplit is a variant of k-fold that returns the first k folds as the training set and the (k+1)-th fold as the test set. Note that, unlike standard cross-validation methods, successive training sets are supersets of those that come before them. It also adds all surplus data to the first training partition, which is always used to train the model.


from sklearn.model_selection import train_test_split, cross_val_score, cross_validate  # cross-validation helpers
from sklearn.model_selection import KFold, LeaveOneOut, LeavePOut, ShuffleSplit  # subset-splitting methods
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit  # stratified splits
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut, LeavePGroupsOut, GroupShuffleSplit  # grouped splits
from sklearn.model_selection import TimeSeriesSplit  # time series splits
from sklearn import datasets       # built-in datasets
from sklearn import svm            # SVM algorithm
from sklearn import preprocessing  # preprocessing module
from sklearn.metrics import recall_score  # model metrics

iris = datasets.load_iris()  # load the dataset
print('Sample set size:', iris.data.shape, iris.target.shape)

# =================================== Split the data and train a model ==========================
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)  # test_size is the fraction held out for testing
print('Training set size:', X_train.shape, y_train.shape)
print('Test set size:', X_test.shape, y_test.shape)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)  # train on the training set
print('Accuracy:', clf.score(X_test, y_test))  # accuracy on the test set

# If normalization is involved, the scaler fitted on the training set must also be applied to the test set.
scaler = preprocessing.StandardScaler().fit(X_train)  # fit the scaler on the training set and use it on both sets
X_train_transformed = scaler.transform(X_train)
clf = svm.SVC(kernel='linear', C=1).fit(X_train_transformed, y_train)
X_test_transformed = scaler.transform(X_test)
print(clf.score(X_test_transformed, y_test))  # accuracy on the test set

# =================================== Evaluate the model directly with cross-validation ==========================
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)  # cv is the number of folds
print(scores)  # per-fold accuracy
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))  # mean and confidence interval

# =================================== Multiple metrics ======================================
scoring = ['precision_macro', 'recall_macro']  # macro-averaged precision and recall
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring, cv=5, return_train_score=True)
sorted(scores.keys())
print('Test results:', scores)  # scores is a dict that also holds train scores, fit times and score times

# ================================== KFold, LeaveOneOut, LeavePOut, ShuffleSplit ==========================================
# k-fold split
kf = KFold(n_splits=2)
for train, test in kf.split(iris.data):
    print("KFold split: %s %s" % (train.shape, test.shape))
    break

# leave-one-out split
loo = LeaveOneOut()
for train, test in loo.split(iris.data):
    print("LeaveOneOut split: %s %s" % (train.shape, test.shape))
    break

# leave-p-out split
lpo = LeavePOut(p=2)
for train, test in lpo.split(iris.data):
    print("LeavePOut split: %s %s" % (train.shape, test.shape))
    break

# shuffled random splits
ss = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
for train_index, test_index in ss.split(iris.data):
    print("ShuffleSplit split: %s %s" % (train_index.shape, test_index.shape))
    break

# ================================== StratifiedKFold, StratifiedShuffleSplit ==========================================
skf = StratifiedKFold(n_splits=3)  # class proportions in each fold roughly match the full dataset
for train, test in skf.split(iris.data, iris.target):
    print("StratifiedKFold split: %s %s" % (train.shape, test.shape))
    break

sss = StratifiedShuffleSplit(n_splits=3)  # each split keeps the class proportions of the full dataset
for train, test in sss.split(iris.data, iris.target):
    print("StratifiedShuffleSplit split: %s %s" % (train.shape, test.shape))
    break

# ================================== GroupKFold, LeaveOneGroupOut, LeavePGroupsOut, GroupShuffleSplit ==========================================
X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

# grouped k-fold
gkf = GroupKFold(n_splits=3)  # train and test sets draw from different groups
for train, test in gkf.split(X, y, groups=groups):
    print("GroupKFold split: %s %s" % (train, test))

# leave one group out
logo = LeaveOneGroupOut()
for train, test in logo.split(X, y, groups=groups):
    print("LeaveOneGroupOut split: %s %s" % (train, test))

# leave P groups out
lpgo = LeavePGroupsOut(n_groups=2)
for train, test in lpgo.split(X, y, groups=groups):
    print("LeavePGroupsOut split: %s %s" % (train, test))

# random group splits
gss = GroupShuffleSplit(n_splits=4, test_size=0.5, random_state=0)
for train, test in gss.split(X, y, groups=groups):
    print("GroupShuffleSplit split: %s %s" % (train, test))

# ================================== Time series split ==========================================
tscv = TimeSeriesSplit(n_splits=3)
for train, test in tscv.split(iris.data):
    print("TimeSeriesSplit split: %s %s" % (train, test))


Sample set size: (150, 4) (150,)
Training set size: (90, 4) (90,)
Test set size: (60, 4) (60,)
Accuracy: 0.9666666666666667
0.9333333333333333
[0.96666667 1.         0.96666667 0.96666667 1.        ]
Accuracy: 0.98 (+/- 0.03)
Test results: {'fit_time': array([0.000494  , 0.0005343 , 0.00048256, 0.00053048, 0.00047898]), 'score_time': array([0.00132895, 0.00126219, 0.00118518, 0.00140405, 0.00118995]), 'test_precision_macro': array([0.96969697, 1.        , 0.96969697, 0.96969697, 1.        ]), 'train_precision_macro': array([0.97674419, 0.97674419, 0.99186992, 0.98412698, 0.98333333]), 'test_recall_macro': array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ]), 'train_recall_macro': array([0.975     , 0.975     , 0.99166667, 0.98333333, 0.98333333])}
KFold split: (75,) (75,)
LeaveOneOut split: (149,) (1,)
LeavePOut split: (148,) (2,)
ShuffleSplit split: (112,) (38,)
StratifiedKFold split: (100,) (50,)
StratifiedShuffleSplit split: (135,) (15,)
GroupKFold split: [0 1 2 3 4 5] [6 7 8 9]
GroupKFold split: [0 1 2 6 7 8 9] [3 4 5]
GroupKFold split: [3 4 5 6 7 8 9] [0 1 2]
LeaveOneGroupOut split: [3 4 5 6 7 8 9] [0 1 2]
LeaveOneGroupOut split: [0 1 2 6 7 8 9] [3 4 5]
LeaveOneGroupOut split: [0 1 2 3 4 5] [6 7 8 9]
LeavePGroupsOut split: [6 7 8 9] [0 1 2 3 4 5]
LeavePGroupsOut split: [3 4 5] [0 1 2 6 7 8 9]
LeavePGroupsOut split: [0 1 2] [3 4 5 6 7 8 9]
GroupShuffleSplit split: [0 1 2] [3 4 5 6 7 8 9]
GroupShuffleSplit split: [3 4 5] [0 1 2 6 7 8 9]
GroupShuffleSplit split: [3 4 5] [0 1 2 6 7 8 9]
GroupShuffleSplit split: [3 4 5] [0 1 2 6 7 8 9]
TimeSeriesSplit split: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38] [39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75]
TimeSeriesSplit split: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75] [ 76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112]
TimeSeriesSplit split: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112] [113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149]

5.2 Grid Search

Grid Search is a hyperparameter-tuning technique based on exhaustive search: loop over every candidate parameter combination, try each one, and take the best-performing parameters as the final result. The principle is like finding the maximum value in an array.

5.2.1 A Simple Grid Search

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)
print("Size of training set:{} size of testing set:{}".format(X_train.shape[0], X_test.shape[0]))

#### grid search start
best_score = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C)  # train once for each parameter combination
        svm.fit(X_train, y_train)
        score = svm.score(X_test, y_test)
        if score > best_score:  # keep the best-performing parameters
            best_score = score
            best_parameters = {'gamma': gamma, 'C': C}
#### grid search end

print("Best score:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))


Size of training set:112 size of testing set:38
Best score:0.97
Best parameters:{'gamma': 0.001, 'C': 100}

5.2.2 Grid Search with Cross-Validation

X_trainval, X_test, y_trainval, y_test = train_test_split(iris.data, iris.target, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, random_state=1)
print("Size of training set:{} size of validation set:{} size of testing set:{}".format(X_train.shape[0], X_val.shape[0], X_test.shape[0]))

best_score = 0.0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C)
        svm.fit(X_train, y_train)
        score = svm.score(X_val, y_val)
        if score > best_score:
            best_score = score
            best_parameters = {'gamma': gamma, 'C': C}

svm = SVC(**best_parameters)    # rebuild the model with the best parameters
svm.fit(X_trainval, y_trainval)  # train on train + validation sets; more data usually helps
test_score = svm.score(X_test, y_test)  # final evaluation
print("Best score on validation set:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
print("Best score on test set:{:.2f}".format(test_score))


Size of training set:84 size of validation set:28 size of testing set:38
Best score on validation set:0.96
Best parameters:{'gamma': 0.001, 'C': 10}
Best score on test set:0.92


from sklearn.model_selection import cross_val_score

best_score = 0.0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C)
        scores = cross_val_score(svm, X_trainval, y_trainval, cv=5)  # 5-fold cross-validation
        score = scores.mean()  # average over folds
        if score > best_score:
            best_score = score
            best_parameters = {"gamma": gamma, "C": C}

svm = SVC(**best_parameters)
svm.fit(X_trainval, y_trainval)
test_score = svm.score(X_test, y_test)
print("Best score on validation set:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
print("Score on testing set:{:.2f}".format(test_score))


Best score on validation set:0.97
Best parameters:{'gamma': 0.1, 'C': 10}
Score on testing set:0.97


Cross-validation is often combined with grid search as a way to evaluate parameters; this approach is called grid search with cross-validation. sklearn therefore provides the GridSearchCV class, which implements fit, predict, score, etc., and is used like an estimator. During fit, it (1) searches for the best parameters and (2) refits an estimator instantiated with those best parameters.


from sklearn.model_selection import GridSearchCV

# list the parameters to tune along with their candidate values
param_grid = {"gamma": [0.001, 0.01, 0.1, 1, 10, 100],
              "C": [0.001, 0.01, 0.1, 1, 10, 100]}
print("Parameters:{}".format(param_grid))

grid_search = GridSearchCV(SVC(), param_grid, cv=5)  # instantiate a GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=10)
grid_search.fit(X_train, y_train)  # finds the best parameters, then refits a new SVC with them
print("Test set score:{:.2f}".format(grid_search.score(X_test, y_test)))
print("Best parameters:{}".format(grid_search.best_params_))
print("Best score on train set:{:.2f}".format(grid_search.best_score_))


Parameters:{'gamma': [0.001, 0.01, 0.1, 1, 10, 100], 'C': [0.001, 0.01, 0.1, 1, 10, 100]}
Test set score:0.97
Best parameters:{'C': 10, 'gamma': 0.1}
Best score on train set:0.98

5.2.3 Learning Curves

import numpy as np
import matplotlib.pyplot as plt

from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit


def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")

    plt.legend(loc="best")
    return plt


digits = load_digits()
X, y = digits.data, digits.target

title = "Learning Curves (Naive Bayes)"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

estimator = GaussianNB()
plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=4)


<module 'matplotlib.pyplot' from '/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/pyplot.py'>



title = "Learning Curves (SVM, RBF kernel, $\gamma=0.001$)"# SVC is more expensive so we do a lower number of CV iterations:cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)estimator = SVC(gamma=0.001)plot_learning_curve(estimator, title, X, y, (0.7, 1.01), cv=cv, n_jobs=4)
复制代码


<module 'matplotlib.pyplot' from '/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/pyplot.py'>


5.2.4 Validation Curves

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve

digits = load_digits()
X, y = digits.data, digits.target

param_range = np.logspace(-6, -1, 5)
train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range,
    cv=10, scoring="accuracy", n_jobs=1)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.title("Validation Curve with SVM")
plt.xlabel("$\gamma$")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
plt.semilogx(param_range, train_scores_mean, label="Training score", color="r")
plt.fill_between(param_range, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.2, color="r")
plt.semilogx(param_range, test_scores_mean, label="Cross-validation score", color="g")
plt.fill_between(param_range, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.2, color="g")
plt.legend(loc="best")
plt.show()


5.3 Model Validation for the Industrial Steam Competition

5.3.1 Model Overfitting and Underfitting

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

import warnings
warnings.filterwarnings("ignore")

from sklearn.linear_model import LinearRegression    # linear regression
from sklearn.neighbors import KNeighborsRegressor    # k-nearest-neighbor regression
from sklearn.tree import DecisionTreeRegressor       # decision tree regression
from sklearn.ensemble import RandomForestRegressor   # random forest regression
from sklearn.svm import SVR                          # support vector regression
import lightgbm as lgb                               # LightGBM model

from sklearn.model_selection import train_test_split  # data splitting
from sklearn.metrics import mean_squared_error        # evaluation metric

from sklearn.linear_model import SGDRegressor


# Download the datasets
!wget http://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/Industrial_Steam_Forecast/zhengqi_test.txt
!wget http://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/Industrial_Steam_Forecast/zhengqi_train.txt


--2023-03-24 22:17:50--  http://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/Industrial_Steam_Forecast/zhengqi_test.txt
Resolving tianchi-media.oss-cn-beijing.aliyuncs.com (tianchi-media.oss-cn-beijing.aliyuncs.com)... 49.7.22.39
Connecting to tianchi-media.oss-cn-beijing.aliyuncs.com (tianchi-media.oss-cn-beijing.aliyuncs.com)|49.7.22.39|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 466959 (456K) [text/plain]
Saving to: 'zhengqi_test.txt.2'

zhengqi_test.txt.2  100%[===================>] 456.01K  --.-KB/s    in 0.03s

2023-03-24 22:17:51 (13.2 MB/s) - 'zhengqi_test.txt.2' saved [466959/466959]

--2023-03-24 22:17:51--  http://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/Industrial_Steam_Forecast/zhengqi_train.txt
Resolving tianchi-media.oss-cn-beijing.aliyuncs.com (tianchi-media.oss-cn-beijing.aliyuncs.com)... 49.7.22.39
Connecting to tianchi-media.oss-cn-beijing.aliyuncs.com (tianchi-media.oss-cn-beijing.aliyuncs.com)|49.7.22.39|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 714370 (698K) [text/plain]
Saving to: 'zhengqi_train.txt.2'

zhengqi_train.txt.2 100%[===================>] 697.63K  --.-KB/s    in 0.04s

2023-03-24 22:17:51 (17.8 MB/s) - 'zhengqi_train.txt.2' saved [714370/714370]


train_data_file = "./zhengqi_train.txt"test_data_file =  "./zhengqi_test.txt"
train_data = pd.read_csv(train_data_file, sep='\t', encoding='utf-8')test_data = pd.read_csv(test_data_file, sep='\t', encoding='utf-8')
复制代码


from sklearn import preprocessing

features_columns = [col for col in train_data.columns if col not in ['target']]

min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler = min_max_scaler.fit(train_data[features_columns])

train_data_scaler = min_max_scaler.transform(train_data[features_columns])
test_data_scaler = min_max_scaler.transform(test_data[features_columns])

train_data_scaler = pd.DataFrame(train_data_scaler)
train_data_scaler.columns = features_columns

test_data_scaler = pd.DataFrame(test_data_scaler)
test_data_scaler.columns = features_columns

train_data_scaler['target'] = train_data['target']


from sklearn.decomposition import PCA  # principal component analysis

# Reduce dimensionality with PCA, keeping 16 components
pca = PCA(n_components=16)
new_train_pca_16 = pca.fit_transform(train_data_scaler.iloc[:, 0:-1])
new_test_pca_16 = pca.transform(test_data_scaler)
new_train_pca_16 = pd.DataFrame(new_train_pca_16)
new_test_pca_16 = pd.DataFrame(new_test_pca_16)
new_train_pca_16['target'] = train_data_scaler['target']


# Use the 16-dimensional PCA features
new_train_pca_16 = new_train_pca_16.fillna(0)
train = new_train_pca_16[new_test_pca_16.columns]
target = new_train_pca_16['target']

# split the data: 80% train, 20% validation
train_data, test_data, train_target, test_target = train_test_split(train, target, test_size=0.2, random_state=0)


#### Underfitting
clf = SGDRegressor(max_iter=500, tol=1e-2)
clf.fit(train_data, train_target)
score_train = mean_squared_error(train_target, clf.predict(train_data))
score_test = mean_squared_error(test_target, clf.predict(test_data))
print("SGDRegressor train MSE:   ", score_train)
print("SGDRegressor test MSE:   ", score_test)


SGDRegressor train MSE:    0.15125847407064866
SGDRegressor test MSE:    0.15565698772176442


### Overfitting
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(5)
train_data_poly = poly.fit_transform(train_data)
test_data_poly = poly.transform(test_data)
clf = SGDRegressor(max_iter=1000, tol=1e-3)
clf.fit(train_data_poly, train_target)
score_train = mean_squared_error(train_target, clf.predict(train_data_poly))
score_test = mean_squared_error(test_target, clf.predict(test_data_poly))
print("SGDRegressor train MSE:   ", score_train)
print("SGDRegressor test MSE:   ", score_test)


SGDRegressor train MSE:    0.13230725829556678
SGDRegressor test MSE:    0.14475818228220433


### Normal fit
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(3)
train_data_poly = poly.fit_transform(train_data)
test_data_poly = poly.transform(test_data)
clf = SGDRegressor(max_iter=1000, tol=1e-3)
clf.fit(train_data_poly, train_target)
score_train = mean_squared_error(train_target, clf.predict(train_data_poly))
score_test = mean_squared_error(test_target, clf.predict(test_data_poly))
print("SGDRegressor train MSE:   ", score_train)
print("SGDRegressor test MSE:   ", score_test)


SGDRegressor train MSE:    0.13399656558429307
SGDRegressor test MSE:    0.14255473176638828

5.3.2 Model Regularization

L2-norm regularization
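As a reference for the three variants in this subsection (this restates the standard penalized objective from the scikit-learn documentation; it is not code from the original article), SGDRegressor minimizes

$$E(w, b) = \frac{1}{n}\sum_{i=1}^{n} L\big(y_i, f(x_i)\big) + \alpha R(w),$$

where $R(w) = \tfrac{1}{2}\lVert w \rVert_2^2$ for penalty='l2', $R(w) = \lVert w \rVert_1$ for penalty='l1', and the weighted mix $R(w) = \tfrac{1-\rho}{2}\lVert w \rVert_2^2 + \rho \lVert w \rVert_1$ (with $\rho$ = l1_ratio) for penalty='elasticnet'; $\alpha$ is the alpha argument passed below.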


poly = PolynomialFeatures(3)
train_data_poly = poly.fit_transform(train_data)
test_data_poly = poly.transform(test_data)
clf = SGDRegressor(max_iter=1000, tol=1e-3, penalty='L2', alpha=0.0001)
clf.fit(train_data_poly, train_target)
score_train = mean_squared_error(train_target, clf.predict(train_data_poly))
score_test = mean_squared_error(test_target, clf.predict(test_data_poly))
print("SGDRegressor train MSE:   ", score_train)
print("SGDRegressor test MSE:   ", score_test)


SGDRegressor train MSE:    0.1344679787727263
SGDRegressor test MSE:    0.14283084627234435


L1-norm regularization


poly = PolynomialFeatures(3)
train_data_poly = poly.fit_transform(train_data)
test_data_poly = poly.transform(test_data)
clf = SGDRegressor(max_iter=1000, tol=1e-3, penalty='L1', alpha=0.00001)
clf.fit(train_data_poly, train_target)
score_train = mean_squared_error(train_target, clf.predict(train_data_poly))
score_test = mean_squared_error(test_target, clf.predict(test_data_poly))
print("SGDRegressor train MSE:   ", score_train)
print("SGDRegressor test MSE:   ", score_test)


SGDRegressor train MSE:    0.13516056789895906
SGDRegressor test MSE:    0.14330444056183564


ElasticNet: weighted L1 and L2 regularization


poly = PolynomialFeatures(3)
train_data_poly = poly.fit_transform(train_data)
test_data_poly = poly.transform(test_data)
clf = SGDRegressor(max_iter=1000, tol=1e-3, penalty='elasticnet', l1_ratio=0.9, alpha=0.00001)
clf.fit(train_data_poly, train_target)
score_train = mean_squared_error(train_target, clf.predict(train_data_poly))
score_test = mean_squared_error(test_target, clf.predict(test_data_poly))
print("SGDRegressor train MSE:   ", score_train)
print("SGDRegressor test MSE:   ", score_test)


SGDRegressor train MSE:    0.13409834594770004
SGDRegressor test MSE:    0.14238154901534278

5.3.3 Model Cross-Validation

Simple cross-validation (hold-out method)


# Simple cross-validation
from sklearn.model_selection import train_test_split
# split the data: 80% train, 20% validation
train_data, test_data, train_target, test_target = train_test_split(train, target, test_size=0.2, random_state=0)

clf = SGDRegressor(max_iter=1000, tol=1e-3)
clf.fit(train_data, train_target)
score_train = mean_squared_error(train_target, clf.predict(train_data))
score_test = mean_squared_error(test_target, clf.predict(test_data))
print("SGDRegressor train MSE:   ", score_train)
print("SGDRegressor test MSE:   ", score_test)


SGDRegressor train MSE:    0.14143759510386256
SGDRegressor test MSE:    0.14691862910491496


K-fold cross-validation (K-fold CV)


# 5-fold cross-validation
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)
for k, (train_index, test_index) in enumerate(kf.split(train)):
    train_data, test_data, train_target, test_target = train.values[train_index], train.values[test_index], target[train_index], target[test_index]
    clf = SGDRegressor(max_iter=1000, tol=1e-3)
    clf.fit(train_data, train_target)
    score_train = mean_squared_error(train_target, clf.predict(train_data))
    score_test = mean_squared_error(test_target, clf.predict(test_data))
    print(k, " fold", "SGDRegressor train MSE:   ", score_train)
    print(k, " fold", "SGDRegressor test MSE:   ", score_test, '\n')


0  fold SGDRegressor train MSE:    0.14989313756469505
0  fold SGDRegressor test MSE:    0.10630068590577227

1  fold SGDRegressor train MSE:    0.1335269045335198
1  fold SGDRegressor test MSE:    0.18239988520454367

2  fold SGDRegressor train MSE:    0.14713477627139634
2  fold SGDRegressor test MSE:    0.13314646232843022

3  fold SGDRegressor train MSE:    0.14067731027537836
3  fold SGDRegressor test MSE:    0.16311142798019898

4  fold SGDRegressor train MSE:    0.13809527090941803
4  fold SGDRegressor test MSE:    0.16535259610698216


Leave-one-out (LOO CV)


from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
num = 100
for k, (train_index, test_index) in enumerate(loo.split(train)):
    train_data, test_data, train_target, test_target = train.values[train_index], train.values[test_index], target[train_index], target[test_index]
    clf = SGDRegressor(max_iter=1000, tol=1e-3)
    clf.fit(train_data, train_target)
    score_train = mean_squared_error(train_target, clf.predict(train_data))
    score_test = mean_squared_error(test_target, clf.predict(test_data))
    print(k, " sample", "SGDRegressor train MSE:   ", score_train)
    print(k, " sample", "SGDRegressor test MSE:   ", score_test, '\n')
    if k >= 9:
        break


0  sample SGDRegressor train MSE:    0.14167336296809338
0  sample SGDRegressor test MSE:    0.013368856967176993

1  sample SGDRegressor train MSE:    0.14158431010604786
1  sample SGDRegressor test MSE:    0.12481451551630947

2  sample SGDRegressor train MSE:    0.14150252555121376
2  sample SGDRegressor test MSE:    0.03855470133268372

3  sample SGDRegressor train MSE:    0.14164982490586497
3  sample SGDRegressor test MSE:    0.004218299742968551

4  sample SGDRegressor train MSE:    0.1415724024144491
4  sample SGDRegressor test MSE:    0.012171393307787685

5  sample SGDRegressor train MSE:    0.14164330849085816
5  sample SGDRegressor test MSE:    0.13457429896691775

6  sample SGDRegressor train MSE:    0.14162839258823134
6  sample SGDRegressor test MSE:    0.022584321520003964

7  sample SGDRegressor train MSE:    0.14156535630118358
7  sample SGDRegressor test MSE:    0.0007881735114026308

8  sample SGDRegressor train MSE:    0.14161403732956687
8  sample SGDRegressor test MSE:    0.09236755222443295

9  sample SGDRegressor train MSE:    0.1416518678123776
9  sample SGDRegressor test MSE:    0.049938663947863705


Leave-P-out (LPO CV)


from sklearn.model_selection import LeavePOut

lpo = LeavePOut(p=10)
num = 100
for k, (train_index, test_index) in enumerate(lpo.split(train)):
    train_data, test_data, train_target, test_target = train.values[train_index], train.values[test_index], target[train_index], target[test_index]
    clf = SGDRegressor(max_iter=1000, tol=1e-3)
    clf.fit(train_data, train_target)
    score_train = mean_squared_error(train_target, clf.predict(train_data))
    score_test = mean_squared_error(test_target, clf.predict(test_data))
    print(k, " 10 samples", "SGDRegressor train MSE:   ", score_train)
    print(k, " 10 samples", "SGDRegressor test MSE:   ", score_test, '\n')
    if k >= 9:
        break


0  10 samples SGDRegressor train MSE:    0.14188547241073846
0  10 samples SGDRegressor test MSE:    0.04919852578302554

1  10 samples SGDRegressor train MSE:    0.1419628899970283
1  10 samples SGDRegressor test MSE:    0.0452239727984194

2  10 samples SGDRegressor train MSE:    0.14213271221606072
2  10 samples SGDRegressor test MSE:    0.04699670484045908

3  10 samples SGDRegressor train MSE:    0.14197467153253543
3  10 samples SGDRegressor test MSE:    0.054453728030175695

4  10 samples SGDRegressor train MSE:    0.14187879341894122
4  10 samples SGDRegressor test MSE:    0.06924591926518929

5  10 samples SGDRegressor train MSE:    0.14201820586737332
5  10 samples SGDRegressor test MSE:    0.04544729649569867

6  10 samples SGDRegressor train MSE:    0.1420321877668132
6  10 samples SGDRegressor test MSE:    0.04932459950875607

7  10 samples SGDRegressor train MSE:    0.1419166425781182
7  10 samples SGDRegressor test MSE:    0.05328512633699939

8  10 samples SGDRegressor train MSE:    0.1413933355339114
8  10 samples SGDRegressor test MSE:    0.04634695705557035

9  10 samples SGDRegressor train MSE:    0.14188082336683486
9  10 samples SGDRegressor test MSE:    0.045133396081342994

5.3.4 Hyperparameter Space and Tuning

Exhaustive grid search


from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
# split the data: 80% train, 20% validation
train_data, test_data, train_target, test_target = train_test_split(train, target, test_size=0.2, random_state=0)

randomForestRegressor = RandomForestRegressor()
parameters = {
    'n_estimators': [50, 100, 200],
    'max_depth': [1, 2, 3]
}

clf = GridSearchCV(randomForestRegressor, parameters, cv=5)
clf.fit(train_data, train_target)

score_test = mean_squared_error(test_target, clf.predict(test_data))

print("RandomForestRegressor GridSearchCV test MSE:   ", score_test)
sorted(clf.cv_results_.keys())


RandomForestRegressor GridSearchCV test MSE:    0.2595696984416692

['mean_fit_time', 'mean_score_time', 'mean_test_score', 'param_max_depth', 'param_n_estimators', 'params', 'rank_test_score', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'std_fit_time', 'std_score_time', 'std_test_score']


Randomized parameter search


from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
# split the data: 80% train, 20% validation
train_data, test_data, train_target, test_target = train_test_split(train, target, test_size=0.2, random_state=0)

randomForestRegressor = RandomForestRegressor()
parameters = {
    'n_estimators': [10, 50],
    'max_depth': [1, 2, 5]
}

clf = RandomizedSearchCV(randomForestRegressor, parameters, cv=5)
clf.fit(train_data, train_target)

score_test = mean_squared_error(test_target, clf.predict(test_data))

print("RandomForestRegressor RandomizedSearchCV test MSE:   ", score_test)
sorted(clf.cv_results_.keys())


RandomForestRegressor RandomizedSearchCV test MSE:    0.1952974248358807

['mean_fit_time', 'mean_score_time', 'mean_test_score', 'param_max_depth', 'param_n_estimators', 'params', 'rank_test_score', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'std_fit_time', 'std_score_time', 'std_test_score']


LightGBM tuning


!pip install lightgbm


Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: lightgbm in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (3.1.1)
Requirement already satisfied: wheel in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.33.6)
Requirement already satisfied: numpy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.19.5)
Requirement already satisfied: scikit-learn!=0.22.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.22.1)
Requirement already satisfied: scipy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.3.0)
Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn!=0.22.0->lightgbm) (0.14.1)

[notice] A new release of pip available: 22.1.2 -> 23.0.1
[notice] To update, run: pip install --upgrade pip


clf = lgb.LGBMRegressor(num_leaves=21)  # the default is num_leaves=31

parameters = {
    'learning_rate': [0.01, 0.1],
    'n_estimators': [20, 40]
}

clf = GridSearchCV(clf, parameters, cv=5)
clf.fit(train_data, train_target)

print('Best parameters found by grid search are:', clf.best_params_)
score_test = mean_squared_error(test_target, clf.predict(test_data))
print("LGBMRegressor GridSearchCV test MSE:   ", score_test)


LightGBM offline validation


train_data2 = pd.read_csv('./zhengqi_train.txt', sep='\t')
test_data2 = pd.read_csv('./zhengqi_test.txt', sep='\t')

train_data2_f = train_data2[test_data2.columns].values
train_data2_target = train_data2['target'].values


Error message raised by the old sklearn API: TypeError: __init__() got an unexpected keyword argument 'n_folds'. The code below uses the updated KFold parameters.


# LightGBM model
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import lightgbm as lgb
import numpy as np

# 5-fold cross-validation
Folds = 5
kf = KFold(n_splits=Folds, random_state=100, shuffle=True)  # note the updated parameters

# Importing KFold from the wrong module:
# from sklearn.cross_validation import KFold is obsolete; use from sklearn.model_selection import KFold (see the sklearn docs).
# Using outdated parameters:
# kf = KFold(titanic.shape[0], n_folds=3, random_state=1) no longer works: n_folds was renamed n_splits and the first
# positional argument was removed, so the call becomes kf = KFold(n_splits=3, shuffle=False, random_state=1);
# otherwise it raises TypeError: __init__() got multiple values for argument 'n_splits'.
# Likewise, for train, test in kf: becomes for train, test in kf.split(titanic[predictions]):, which performs the
# fold split on predictions.

# Record train and test MSE
MSE_DICT = {
    'train_mse': [],
    'test_mse': []
}

# Offline training and prediction
for i, (train_index, test_index) in enumerate(kf.split(train_data2_f)):
    # LightGBM tree model
    lgb_reg = lgb.LGBMRegressor(
        learning_rate=0.01,
        max_depth=-1,
        n_estimators=100,
        boosting_type='gbdt',
        random_state=100,
        objective='regression',
    )
    # split into training and validation folds
    X_train_KFold, X_test_KFold = train_data2_f[train_index], train_data2_f[test_index]
    y_train_KFold, y_test_KFold = train_data2_target[train_index], train_data2_target[test_index]
    # train the model
    lgb_reg.fit(
        X=X_train_KFold, y=y_train_KFold,
        eval_set=[(X_train_KFold, y_train_KFold), (X_test_KFold, y_test_KFold)],
        eval_names=['Train', 'Test'],
        early_stopping_rounds=10,
        eval_metric='MSE',
        verbose=50
    )

    # predict on the training and validation folds
    y_train_KFold_predict = lgb_reg.predict(X_train_KFold, num_iteration=lgb_reg.best_iteration_)
    y_test_KFold_predict = lgb_reg.predict(X_test_KFold, num_iteration=lgb_reg.best_iteration_)

    print('Fold {}: train MSE and validation MSE'.format(i))
    train_mse = mean_squared_error(y_train_KFold_predict, y_train_KFold)
    print('------\n', 'train MSE\n', train_mse, '\n------')
    test_mse = mean_squared_error(y_test_KFold_predict, y_test_KFold)
    print('------\n', 'validation MSE\n', test_mse, '\n------\n')

    MSE_DICT['train_mse'].append(train_mse)
    MSE_DICT['test_mse'].append(test_mse)

print('------\n', 'train MSE\n', MSE_DICT['train_mse'], '\n', np.mean(MSE_DICT['train_mse']), '\n------')
print('------\n', 'test MSE\n', MSE_DICT['test_mse'], '\n', np.mean(MSE_DICT['test_mse']), '\n------')

5.3.5 Learning Curves and Validation Curves

### Learning curve
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import learning_curve


def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")

    plt.legend(loc="best")
    return plt


X = train_data2[test_data2.columns].values
y = train_data2['target'].values

title = "LinearRegression"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
# cv = model_selection.ShuffleSplit(X.shape[0], n_splits=100,
#                                   test_size=0.2, random_state=0)
cv = model_selection.ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

estimator = SGDRegressor()
plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=-1)


Automatically created module for IPython interactive environment




<module 'matplotlib.pyplot' from '/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/pyplot.py'>



Note: the deprecated call shown commented out above, ShuffleSplit(X.shape[0], n_splits=100, ...), raises TypeError: __init__() got multiple values for argument 'n_splits' on current sklearn versions.


### Validation curve
print(__doc__)

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import validation_curve

X = train_data2[test_data2.columns].values
y = train_data2['target'].values
# max_iter=1000, tol=1e-3, penalty='L1', alpha=0.00001

param_range = [0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001]
train_scores, test_scores = validation_curve(
    SGDRegressor(max_iter=1000, tol=1e-3, penalty='L1'),
    X, y, param_name="alpha", param_range=param_range,
    cv=10, scoring='r2', n_jobs=1)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.title("Validation Curve with SGDRegressor")
plt.xlabel("alpha")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
plt.semilogx(param_range, train_scores_mean, label="Training score", color="r")
plt.fill_between(param_range, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.2, color="r")
plt.semilogx(param_range, test_scores_mean, label="Cross-validation score", color="g")
plt.fill_between(param_range, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.2, color="g")
plt.legend(loc="best")
plt.show()


Automatically created module for IPython interactive environment


6. Feature Optimization

6.1 Define Feature-Construction Methods and Build Features

# Load the data
import pandas as pd

train_data_file = "./zhengqi_train.txt"
test_data_file = "./zhengqi_test.txt"

train_data = pd.read_csv(train_data_file, sep='\t', encoding='utf-8')
test_data = pd.read_csv(test_data_file, sep='\t', encoding='utf-8')

epsilon = 1e-5

# Pairwise cross features; you can define your own, e.g. x*x/y, log(x)/y, and so on
func_dict = {
    'add': lambda x, y: x + y,
    'mins': lambda x, y: x - y,
    'div': lambda x, y: x / (y + epsilon),
    'multi': lambda x, y: x * y
}

### Feature-construction function
def auto_features_make(train_data, test_data, func_dict, col_list):
    train_data, test_data = train_data.copy(), test_data.copy()
    for col_i in col_list:
        for col_j in col_list:
            for func_name, func in func_dict.items():
                for data in [train_data, test_data]:
                    func_features = func(data[col_i], data[col_j])
                    col_func_features = '-'.join([col_i, func_name, col_j])
                    data[col_func_features] = func_features
    return train_data, test_data

### Construct features for the training and test sets
train_data2, test_data2 = auto_features_make(train_data, test_data, func_dict, col_list=test_data.columns)

from sklearn.decomposition import PCA  # principal component analysis

# Reduce dimensionality with PCA, keeping 500 components
pca = PCA(n_components=500)
train_data2_pca = pca.fit_transform(train_data2.iloc[:, 0:-1])
test_data2_pca = pca.transform(test_data2)
train_data2_pca = pd.DataFrame(train_data2_pca)
test_data2_pca = pd.DataFrame(test_data2_pca)
train_data2_pca['target'] = train_data2['target']

X_train2 = train_data2[test_data2.columns].values
y_train = train_data2['target']
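A quick size check on this construction (my own arithmetic, assuming the usual 38 raw features V0-V37 of the steam data): the nested loops apply 4 operations to every ordered feature pair, self-pairs included, so each dataset gains 38 × 38 × 4 = 5776 derived columns. That combinatorial blow-up is why the PCA step in the block above compresses the result to 500 components before modeling.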

6.2 Train and Evaluate on the Constructed Features with LightGBM

# Offline validation
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import lightgbm as lgb
import numpy as np

# 5-fold cross-validation; note the parameter changes across sklearn versions
Folds = 5
kf = KFold(n_splits=Folds, shuffle=True, random_state=2019)

# API changes in newer sklearn versions:
# 1. n_folds was renamed to n_splits
# 2. the train_data.shape[0] positional argument was removed, so the call becomes kf = KFold(n_splits=3, random_state=1)
# Wherever kf is then used, the old
#     for i, (train_index, test_index) in enumerate(kf):
# becomes
#     for i, (train_index, test_index) in enumerate(kf.split(train)):
# where train is the training data

# Record train and test MSE
MSE_DICT = {
    'train_mse': [],
    'test_mse': []
}

# Offline training and prediction
for i, (train_index, test_index) in enumerate(kf.split(X_train2)):
    # LightGBM tree model
    lgb_reg = lgb.LGBMRegressor(
        learning_rate=0.01,
        max_depth=-1,
        n_estimators=100,      # adjust as needed
        boosting_type='gbdt',
        random_state=2019,
        objective='regression',
    )
    # split into training and validation folds
    X_train_KFold, X_test_KFold = X_train2[train_index], X_train2[test_index]
    y_train_KFold, y_test_KFold = y_train[train_index], y_train[test_index]
    # train the model
    lgb_reg.fit(
        X=X_train_KFold, y=y_train_KFold,
        eval_set=[(X_train_KFold, y_train_KFold), (X_test_KFold, y_test_KFold)],
        eval_names=['Train', 'Test'],
        early_stopping_rounds=10,   # adjust as needed
        eval_metric='MSE',
        verbose=50
    )

    # predict on the training and validation folds
    y_train_KFold_predict = lgb_reg.predict(X_train_KFold, num_iteration=lgb_reg.best_iteration_)
    y_test_KFold_predict = lgb_reg.predict(X_test_KFold, num_iteration=lgb_reg.best_iteration_)

    print('Fold {}: train MSE and validation MSE'.format(i))
    train_mse = mean_squared_error(y_train_KFold_predict, y_train_KFold)
    print('------\n', 'train MSE\n', train_mse, '\n------')
    test_mse = mean_squared_error(y_test_KFold_predict, y_test_KFold)
    print('------\n', 'validation MSE\n', test_mse, '\n------\n')

    MSE_DICT['train_mse'].append(train_mse)
    MSE_DICT['test_mse'].append(test_mse)

print('------\n', 'train MSE\n', MSE_DICT['train_mse'], '\n', np.mean(MSE_DICT['train_mse']), '\n------')
print('------\n', 'test MSE\n', MSE_DICT['test_mse'], '\n', np.mean(MSE_DICT['test_mse']), '\n------')


Training until validation scores don't improve for 10 rounds


---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
/tmp/ipykernel_5027/2900171053.py in <module>
     49         early_stopping_rounds=10,  # adjust as needed
     50         eval_metric='MSE',
---> 51         verbose=50
     52     )
     53

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/lightgbm/sklearn.py in fit(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_init_score, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks, init_model)
    777                     verbose=verbose, feature_name=feature_name,
    778                     categorical_feature=categorical_feature,
--> 779                     callbacks=callbacks, init_model=init_model)
    780         return self
    781

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/lightgbm/sklearn.py in fit(self, X, y, sample_weight, init_score, group, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_group, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks, init_model)
    615                               evals_result=evals_result, fobj=self._fobj, feval=eval_metrics_callable,
    616                               verbose_eval=verbose, feature_name=feature_name,
--> 617                               callbacks=callbacks, init_model=init_model)
    618
    619         if evals_result:

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/lightgbm/engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
    250                                     evaluation_result_list=None))
    251
--> 252         booster.update(fobj=fobj)
    253
    254         evaluation_result_list = []

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/lightgbm/basic.py in update(self, train_set, fobj)
   2458             _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
   2459                 self.handle,
-> 2460                 ctypes.byref(is_finished)))
   2461             self.__is_predicted_cur_iter = [False for _ in range_(self.__num_dataset)]
   2462             return is_finished.value == 1

KeyboardInterrupt:

7. Model Fusion

Below, we re-run the key steps from the previous chapter.


# Imports
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
import seaborn as sns

# modelling
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score, cross_val_predict, KFold
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.svm import LinearSVR, SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from xgboost import XGBRegressor
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler, StandardScaler


# load_dataset
with open("./zhengqi_train.txt") as fr:
    data_train = pd.read_table(fr, sep="\t")
with open("./zhengqi_test.txt") as fr_test:
    data_test = pd.read_table(fr_test, sep="\t")


# Merge the train and test sets
data_train["oringin"] = "train"
data_test["oringin"] = "test"
data_all = pd.concat([data_train, data_test], axis=0, ignore_index=True)


# Drop the correlated features
data_all.drop(["V5", "V9", "V11", "V17", "V22", "V28"], axis=1, inplace=True)


# Min-max normalization
# normalise numeric columns
cols_numeric = list(data_all.columns)
cols_numeric.remove("oringin")

def scale_minmax(col):
    return (col - col.min()) / (col.max() - col.min())

scale_cols = [col for col in cols_numeric if col != 'target']
data_all[scale_cols] = data_all[scale_cols].apply(scale_minmax, axis=0)


# Check effect of Box-Cox transforms on distributions of continuous variables
# Exploratory plots of each feature against the target (kept commented out)
# fcols = 6
# frows = len(cols_numeric) - 1
# plt.figure(figsize=(4*fcols, 4*frows))
# i = 0
# for var in cols_numeric:
#     if var != 'target':
#         dat = data_all[[var, 'target']].dropna()
#         i += 1
#         plt.subplot(frows, fcols, i)
#         sns.distplot(dat[var], fit=stats.norm)
#         plt.title(var + ' Original')
#         plt.xlabel('')
#         i += 1
#         plt.subplot(frows, fcols, i)
#         _ = stats.probplot(dat[var], plot=plt)
#         plt.title('skew=' + '{:.4f}'.format(stats.skew(dat[var])))
#         plt.xlabel('')
#         plt.ylabel('')
#         i += 1
#         plt.subplot(frows, fcols, i)
#         plt.plot(dat[var], dat['target'], '.', alpha=0.5)
#         plt.title('corr=' + '{:.2f}'.format(np.corrcoef(dat[var], dat['target'])[0][1]))
#         i += 1
#         plt.subplot(frows, fcols, i)
#         trans_var, lambda_var = stats.boxcox(dat[var].dropna() + 1)
#         trans_var = scale_minmax(trans_var)
#         sns.distplot(trans_var, fit=stats.norm)
#         plt.title(var + ' Transformed')
#         plt.xlabel('')
#         i += 1
#         plt.subplot(frows, fcols, i)
#         _ = stats.probplot(trans_var, plot=plt)
#         plt.title('skew=' + '{:.4f}'.format(stats.skew(trans_var)))
#         plt.xlabel('')
#         plt.ylabel('')
#         i += 1
#         plt.subplot(frows, fcols, i)
#         plt.plot(trans_var, dat['target'], '.', alpha=0.5)
#         plt.title('corr=' + '{:.2f}'.format(np.corrcoef(trans_var, dat['target'])[0][1]))


Apply a Box-Cox transform to the features to make them closer to normal


The Box-Cox transform, proposed by Box and Cox in 1964, is a family of generalized power transforms widely used in statistical modeling when a continuous response variable does not follow a normal distribution. Applying it can, to some extent, reduce unobservable error and the correlation with the predictors. Its key feature is that it introduces a parameter, estimates that parameter from the data itself, and thereby determines which transform to apply. Box-Cox can markedly improve the normality, symmetry, and homoscedasticity of the data, and works well on many real datasets.
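For reference, the one-parameter Box-Cox transform of a positive value $y$ is (the textbook definition, not taken from the original article)

$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\ \ln y, & \lambda = 0, \end{cases}$$

and scipy.stats.boxcox estimates $\lambda$ by maximum likelihood; the +1 offset in the code below keeps the min-max-scaled inputs strictly positive.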


cols_transform = data_all.columns[0:-2]
for col in cols_transform:
    # transform column
    data_all.loc[:, col], _ = stats.boxcox(data_all.loc[:, col] + 1)


# Summary statistics and quantile plot of the transformed target (against a normal distribution)
print(data_all.target.describe())

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
sns.distplot(data_all.target.dropna(), fit=stats.norm)
plt.subplot(1, 2, 2)
_ = stats.probplot(data_all.target.dropna(), plot=plt)


# Power-transform the target to make it closer to normal, and plot it
# (despite the borrowed comment about a log transform, np.power(1.5, sp) is an exponential transform)
sp = data_train.target
data_train.target1 = np.power(1.5, sp)
print(data_train.target1.describe())

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
sns.distplot(data_train.target1.dropna(), fit=stats.norm)
plt.subplot(1, 2, 2)
_ = stats.probplot(data_train.target1.dropna(), plot=plt)


# Get the training and test data
# function to get training samples
def get_training_data():
    # extract training samples
    from sklearn.model_selection import train_test_split
    df_train = data_all[data_all["oringin"] == "train"]
    df_train["label"] = data_train.target1
    # split the target and features
    y = df_train.target
    X = df_train.drop(["oringin", "target", "label"], axis=1)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=100)
    return X_train, X_valid, y_train, y_valid

# extract test data (without the target)
def get_test_data():
    df_test = data_all[data_all["oringin"] == "test"].reset_index(drop=True)
    return df_test.drop(["oringin", "target"], axis=1)


# Scoring functions
from sklearn.metrics import make_scorer

# metric for evaluation
def rmse(y_true, y_pred):
    diff = y_pred - y_true
    sum_sq = sum(diff**2)
    n = len(y_pred)
    return np.sqrt(sum_sq / n)

def mse(y_ture, y_pred):
    return mean_squared_error(y_ture, y_pred)

# scorer to be used in sklearn model fitting
rmse_scorer = make_scorer(rmse, greater_is_better=False)
mse_scorer = make_scorer(mse, greater_is_better=False)
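A note on the sign convention: with greater_is_better=False, make_scorer negates the metric before handing it to sklearn, so cross-validation scores produced with rmse_scorer or mse_scorer come back negative. The same applies to the "neg_mean_squared_error" scoring string used in train_model below, which is why that function wraps mean_test_score in abs().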


# Detect outliers and plot them
# function to detect outliers based on the predictions of a model
def find_outliers(model, X, y, sigma=3):
    # predict y values using model
    try:
        y_pred = pd.Series(model.predict(X), index=y.index)
    # if predicting fails, try fitting the model first
    except:
        model.fit(X, y)
        y_pred = pd.Series(model.predict(X), index=y.index)

    # calculate residuals between the model prediction and true y values
    resid = y - y_pred
    mean_resid = resid.mean()
    std_resid = resid.std()

    # calculate z statistic, define outliers to be where |z|>sigma
    z = (resid - mean_resid) / std_resid
    outliers = z[abs(z) > sigma].index

    # print and plot the results
    print('R2=', model.score(X, y))
    print('rmse=', rmse(y, y_pred))
    print("mse=", mean_squared_error(y, y_pred))
    print('---------------------------------------')

    print('mean of residuals:', mean_resid)
    print('std of residuals:', std_resid)
    print('---------------------------------------')

    print(len(outliers), 'outliers:')
    print(outliers.tolist())

    plt.figure(figsize=(15, 5))
    ax_131 = plt.subplot(1, 3, 1)
    plt.plot(y, y_pred, '.')
    plt.plot(y.loc[outliers], y_pred.loc[outliers], 'ro')
    plt.legend(['Accepted', 'Outlier'])
    plt.xlabel('y')
    plt.ylabel('y_pred')

    ax_132 = plt.subplot(1, 3, 2)
    plt.plot(y, y - y_pred, '.')
    plt.plot(y.loc[outliers], y.loc[outliers] - y_pred.loc[outliers], 'ro')
    plt.legend(['Accepted', 'Outlier'])
    plt.xlabel('y')
    plt.ylabel('y - y_pred')

    ax_133 = plt.subplot(1, 3, 3)
    z.plot.hist(bins=50, ax=ax_133)
    z.loc[outliers].plot.hist(color='r', bins=50, ax=ax_133)
    plt.legend(['Accepted', 'Outlier'])
    plt.xlabel('z')

    plt.savefig('outliers.png')
    return outliers


# get training data
from sklearn.linear_model import Ridge
X_train, X_valid, y_train, y_valid = get_training_data()
test = get_test_data()

# find and remove outliers using a Ridge model
outliers = find_outliers(Ridge(), X_train, y_train)

# permanently remove these outliers from the data
# df_train = data_all[data_all["oringin"]=="train"]
# df_train["label"]=data_train.target1
# df_train=df_train.drop(outliers)
X_outliers = X_train.loc[outliers]
y_outliers = y_train.loc[outliers]
X_t = X_train.drop(outliers)
y_t = y_train.drop(outliers)


# Train models on the data with outliers removed
def get_trainning_data_omitoutliers():
    y1 = y_t.copy()
    X1 = X_t.copy()
    return X1, y1


# Train a model with grid search
from sklearn.preprocessing import StandardScaler

def train_model(model, param_grid=[], X=[], y=[],
                splits=5, repeats=5):

    # get unmodified training data, unless data to use already specified
    if len(y) == 0:
        X, y = get_trainning_data_omitoutliers()
        # poly_trans=PolynomialFeatures(degree=2)
        # X=poly_trans.fit_transform(X)
        # X=MinMaxScaler().fit_transform(X)

    # create cross-validation method
    rkfold = RepeatedKFold(n_splits=splits, n_repeats=repeats)

    # perform a grid search if param_grid given
    if len(param_grid) > 0:
        # setup grid search parameters
        gsearch = GridSearchCV(model, param_grid, cv=rkfold,
                               scoring="neg_mean_squared_error",
                               verbose=1, return_train_score=True)

        # search the grid
        gsearch.fit(X, y)

        # extract best model from the grid
        model = gsearch.best_estimator_
        best_idx = gsearch.best_index_

        # get cv-scores for best model
        grid_results = pd.DataFrame(gsearch.cv_results_)
        cv_mean = abs(grid_results.loc[best_idx, 'mean_test_score'])
        cv_std = grid_results.loc[best_idx, 'std_test_score']

    # no grid search, just cross-val score for given model
    else:
        grid_results = []
        cv_results = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=rkfold)
        cv_mean = abs(np.mean(cv_results))
        cv_std = np.std(cv_results)

    # combine mean and std cv-score in to a pandas series
    cv_score = pd.Series({'mean': cv_mean, 'std': cv_std})

    # predict y using the fitted model
    y_pred = model.predict(X)

    # print stats on model performance
    print('----------------------')
    print(model)
    print('----------------------')
    print('score=', model.score(X, y))
    print('rmse=', rmse(y, y_pred))
    print('mse=', mse(y, y_pred))
    print('cross_val: mean=', cv_mean, ', std=', cv_std)

    # residual plots
    y_pred = pd.Series(y_pred, index=y.index)
    resid = y - y_pred
    mean_resid = resid.mean()
    std_resid = resid.std()
    z = (resid - mean_resid) / std_resid
    n_outliers = sum(abs(z) > 3)

    plt.figure(figsize=(15, 5))
    ax_131 = plt.subplot(1, 3, 1)
    plt.plot(y, y_pred, '.')
    plt.xlabel('y')
    plt.ylabel('y_pred')
    plt.title('corr = {:.3f}'.format(np.corrcoef(y, y_pred)[0][1]))

    ax_132 = plt.subplot(1, 3, 2)
    plt.plot(y, y - y_pred, '.')
    plt.xlabel('y')
    plt.ylabel('y - y_pred')
    plt.title('std resid = {:.3f}'.format(std_resid))

    ax_133 = plt.subplot(1, 3, 3)
    z.plot.hist(bins=50, ax=ax_133)
    plt.xlabel('z')
    plt.title('{:.0f} samples with z>3'.format(n_outliers))

    return model, cv_score, grid_results


# places to store optimal models and scores
opt_models = dict()
score_models = pd.DataFrame(columns=['mean', 'std'])

# no. k-fold splits
splits = 5
# no. k-fold iterations
repeats = 5

7.1 Single-Model Performance

7.1.1 Ridge Regression


model = 'Ridge'

opt_models[model] = Ridge()
alph_range = np.arange(0.25, 6, 0.25)
param_grid = {'alpha': alph_range}

opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                        splits=splits, repeats=repeats)

cv_score.name = model
score_models = score_models.append(cv_score)

plt.figure()
plt.errorbar(alph_range, abs(grid_results['mean_test_score']),
             abs(grid_results['std_test_score']) / np.sqrt(splits * repeats))
plt.xlabel('alpha')
plt.ylabel('score')

7.1.2 Lasso Regression

model = 'Lasso'

opt_models[model] = Lasso()
alph_range = np.arange(1e-4, 1e-3, 4e-5)
param_grid = {'alpha': alph_range}

opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                        splits=splits, repeats=repeats)

cv_score.name = model
score_models = score_models.append(cv_score)

plt.figure()
plt.errorbar(alph_range, abs(grid_results['mean_test_score']),
             abs(grid_results['std_test_score']) / np.sqrt(splits * repeats))
plt.xlabel('alpha')
plt.ylabel('score')

7.1.3 ElasticNet Regression

model = 'ElasticNet'
opt_models[model] = ElasticNet()

param_grid = {'alpha': np.arange(1e-4, 1e-3, 1e-4),
              'l1_ratio': np.arange(0.1, 1.0, 0.1),
              'max_iter': [100000]}

opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                        splits=splits, repeats=1)

cv_score.name = model
score_models = score_models.append(cv_score)

7.1.4 SVR Regression

model = 'LinearSVR'
opt_models[model] = LinearSVR()

crange = np.arange(0.1, 1.0, 0.1)
param_grid = {'C': crange,
              'max_iter': [1000]}

opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                        splits=splits, repeats=repeats)

cv_score.name = model
score_models = score_models.append(cv_score)

plt.figure()
plt.errorbar(crange, abs(grid_results['mean_test_score']),
             abs(grid_results['std_test_score']) / np.sqrt(splits * repeats))
plt.xlabel('C')
plt.ylabel('score')

7.1.5 KNN (K-Nearest Neighbors)

model = 'KNeighbors'
opt_models[model] = KNeighborsRegressor()

param_grid = {'n_neighbors': np.arange(3, 11, 1)}

opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                        splits=splits, repeats=1)

cv_score.name = model
score_models = score_models.append(cv_score)

plt.figure()
plt.errorbar(np.arange(3, 11, 1), abs(grid_results['mean_test_score']),
             abs(grid_results['std_test_score']) / np.sqrt(splits * 1))
plt.xlabel('n_neighbors')
plt.ylabel('score')

7.1.6 GBDT Model

model = 'GradientBoosting'
opt_models[model] = GradientBoostingRegressor()

param_grid = {'n_estimators': [150, 250, 350],
              'max_depth': [1, 2, 3],
              'min_samples_split': [5, 6, 7]}

opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                        splits=splits, repeats=1)

cv_score.name = model
score_models = score_models.append(cv_score)

7.1.7 XGB Model

model = 'XGB'
opt_models[model] = XGBRegressor()

param_grid = {'n_estimators': [100, 200, 300, 400, 500],
              'max_depth': [1, 2, 3]}

opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                        splits=splits, repeats=1)

cv_score.name = model
score_models = score_models.append(cv_score)

7.1.8 Random Forest Model

model = 'RandomForest'
opt_models[model] = RandomForestRegressor()

param_grid = {'n_estimators': [100, 150, 200],
              'max_features': [8, 12, 16, 20, 24],
              'min_samples_split': [2, 4, 6]}

opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                        splits=5, repeats=1)

cv_score.name = model
score_models = score_models.append(cv_score)
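With all eight single models tuned, the cross-validation scores accumulated in score_models can be compared side by side. A short sketch (score_models holds one row per model, with 'mean' and 'std' columns):

# rank the tuned models by mean CV error (lower is better)
print(score_models.sort_values(by='mean'))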

7.2 Model Prediction: Multi-Model Bagging

def model_predict(test_data, test_y=[], stack=False):
    #poly_trans = PolynomialFeatures(degree=2)
    #test_data1 = poly_trans.fit_transform(test_data)
    #test_data = MinMaxScaler().fit_transform(test_data)
    # average the predictions of all tuned models except LinearSVR and KNeighbors
    i = 0
    y_predict_total = np.zeros((test_data.shape[0],))
    for model in opt_models.keys():
        if model != "LinearSVR" and model != "KNeighbors":
            y_predict = opt_models[model].predict(test_data)
            y_predict_total += y_predict
            i += 1
            if len(test_y) > 0:
                print("{}_mse:".format(model), mean_squared_error(y_predict, test_y))
    y_predict_mean = np.round(y_predict_total / i, 3)
    if len(test_y) > 0:
        print("mean_mse:", mean_squared_error(y_predict_mean, test_y))
    else:
        y_predict_mean = pd.Series(y_predict_mean)
        return y_predict_mean


# Bagging prediction on the validation set
model_predict(X_valid, y_valid)
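The same averaging can then be applied to the unlabeled competition test set. A sketch, under the assumption that the test features live in a variable named test_data (a hypothetical name) with the same columns as the training features:

# hypothetical: test_data holds the unlabeled test features
y_test_pred = model_predict(test_data)  # returns a pd.Series when no labels are passed
y_test_pred.to_csv('bagging_result.txt', index=False, header=False)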

7.3 Model Fusion: Stacking

7.3.1 A Simple Stacking Example

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import itertools
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

## mlxtend is installed via: pip install mlxtend
from mlxtend.classifier import EnsembleVoteClassifier
from mlxtend.data import iris_data
from mlxtend.plotting import plot_decision_regions
%matplotlib inline

# Initializing Classifiers
clf1 = LogisticRegression(random_state=0)
clf2 = RandomForestClassifier(random_state=0)
clf3 = SVC(random_state=0, probability=True)
eclf = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3], weights=[2, 1, 1], voting='soft')

# Loading some example data
X, y = iris_data()
X = X[:, [0, 2]]

# Plotting Decision Regions
gs = gridspec.GridSpec(2, 2)
fig = plt.figure(figsize=(10, 8))

for clf, lab, grd in zip([clf1, clf2, clf3, eclf],
                         ['Logistic Regression', 'Random Forest', 'RBF kernel SVM', 'Ensemble'],
                         itertools.product([0, 1], repeat=2)):
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y, clf=clf, legend=2)
    plt.title(lab)

plt.show()


7.3.2 Multi-Model Stacking for Industrial Steam Prediction

from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from scipy import sparse
import xgboost
import lightgbm

from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def stacking_reg(clf, train_x, train_y, test_x, clf_name, kf, label_split=None):
    # out-of-fold predictions for the training set, fold-averaged predictions for the test set
    train = np.zeros((train_x.shape[0], 1))
    test = np.zeros((test_x.shape[0], 1))
    test_pre = np.empty((folds, test_x.shape[0], 1))
    cv_scores = []
    for i, (train_index, test_index) in enumerate(kf.split(train_x, label_split)):
        tr_x = train_x[train_index]
        tr_y = train_y[train_index]
        te_x = train_x[test_index]
        te_y = train_y[test_index]
        if clf_name in ["rf", "ada", "gb", "et", "lr", "lsvc", "knn"]:
            clf.fit(tr_x, tr_y)
            pre = clf.predict(te_x).reshape(-1, 1)
            train[test_index] = pre
            test_pre[i, :] = clf.predict(test_x).reshape(-1, 1)
            cv_scores.append(mean_squared_error(te_y, pre))
        elif clf_name in ["xgb"]:
            train_matrix = clf.DMatrix(tr_x, label=tr_y, missing=-1)
            test_matrix = clf.DMatrix(te_x, label=te_y, missing=-1)
            z = clf.DMatrix(test_x, missing=-1)
            params = {'booster': 'gbtree',
                      'eval_metric': 'rmse',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.03,
                      'tree_method': 'exact',
                      'seed': 2017,
                      'nthread': 12
                      }
            num_round = 1000             # adjust as needed
            early_stopping_rounds = 10   # adjust as needed
            watchlist = [(train_matrix, 'train'),
                         (test_matrix, 'eval')
                         ]
            if test_matrix:
                model = clf.train(params, train_matrix, num_boost_round=num_round, evals=watchlist,
                                  early_stopping_rounds=early_stopping_rounds
                                  )
                pre = model.predict(test_matrix, ntree_limit=model.best_ntree_limit).reshape(-1, 1)
                train[test_index] = pre
                test_pre[i, :] = model.predict(z, ntree_limit=model.best_ntree_limit).reshape(-1, 1)
                cv_scores.append(mean_squared_error(te_y, pre))
        elif clf_name in ["lgb"]:
            train_matrix = clf.Dataset(tr_x, label=tr_y)
            test_matrix = clf.Dataset(te_x, label=te_y)
            #z = clf.Dataset(test_x, label=te_y)
            #z = test_x
            params = {'boosting_type': 'gbdt',
                      'objective': 'regression_l2',
                      'metric': 'mse',
                      'min_child_weight': 1.5,
                      'num_leaves': 2 ** 5,
                      'lambda_l2': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'learning_rate': 0.03,
                      'tree_method': 'exact',
                      'seed': 2017,
                      'nthread': 12,
                      'silent': True,
                      }
            num_round = 10000
            early_stopping_rounds = 100
            if test_matrix:
                model = clf.train(params, train_matrix, num_round, valid_sets=test_matrix,
                                  early_stopping_rounds=early_stopping_rounds
                                  )
                pre = model.predict(te_x, num_iteration=model.best_iteration).reshape(-1, 1)
                train[test_index] = pre
                test_pre[i, :] = model.predict(test_x, num_iteration=model.best_iteration).reshape(-1, 1)
                cv_scores.append(mean_squared_error(te_y, pre))
        else:
            raise IOError("Please add new clf.")
        print("%s now score is:" % clf_name, cv_scores)
    test[:] = test_pre.mean(axis=0)
    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    return train.reshape(-1, 1), test.reshape(-1, 1)

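To make the data flow concrete: each call to stacking_reg returns the out-of-fold predictions on the training set (one column of meta-features for the second-level learner) and the test-set predictions averaged over the folds. A minimal sketch of a direct call, assuming x_train, y_train, x_valid, folds, and kf are defined as they are later in this section:

# out-of-fold meta-feature for train, fold-averaged meta-feature for test
rf = RandomForestRegressor(n_estimators=100, random_state=2017, n_jobs=-1)
train_meta, test_meta = stacking_reg(rf, x_train, y_train, x_valid, "rf", kf)
print(train_meta.shape, test_meta.shape)  # (n_train, 1) and (n_test, 1)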

Stacking base learners


def rf_reg(x_train, y_train, x_valid, kf, label_split=None):
    randomforest = RandomForestRegressor(n_estimators=100, max_depth=20, n_jobs=-1,
                                         random_state=2017, max_features="auto", verbose=1)
    rf_train, rf_test = stacking_reg(randomforest, x_train, y_train, x_valid, "rf", kf, label_split=label_split)
    return rf_train, rf_test, "rf_reg"

def ada_reg(x_train, y_train, x_valid, kf, label_split=None):
    adaboost = AdaBoostRegressor(n_estimators=30, random_state=2017, learning_rate=0.01)
    ada_train, ada_test = stacking_reg(adaboost, x_train, y_train, x_valid, "ada", kf, label_split=label_split)
    return ada_train, ada_test, "ada_reg"

def gb_reg(x_train, y_train, x_valid, kf, label_split=None):
    gbdt = GradientBoostingRegressor(learning_rate=0.04, n_estimators=100, subsample=0.8,
                                     random_state=2017, max_depth=5, verbose=1)
    gbdt_train, gbdt_test = stacking_reg(gbdt, x_train, y_train, x_valid, "gb", kf, label_split=label_split)
    return gbdt_train, gbdt_test, "gb_reg"

def et_reg(x_train, y_train, x_valid, kf, label_split=None):
    extratree = ExtraTreesRegressor(n_estimators=100, max_depth=35, max_features="auto",
                                    n_jobs=-1, random_state=2017, verbose=1)
    et_train, et_test = stacking_reg(extratree, x_train, y_train, x_valid, "et", kf, label_split=label_split)
    return et_train, et_test, "et_reg"

def lr_reg(x_train, y_train, x_valid, kf, label_split=None):
    lr = LinearRegression(n_jobs=-1)
    lr_train, lr_test = stacking_reg(lr, x_train, y_train, x_valid, "lr", kf, label_split=label_split)
    return lr_train, lr_test, "lr_reg"

def xgb_reg(x_train, y_train, x_valid, kf, label_split=None):
    xgb_train, xgb_test = stacking_reg(xgboost, x_train, y_train, x_valid, "xgb", kf, label_split=label_split)
    return xgb_train, xgb_test, "xgb_reg"

def lgb_reg(x_train, y_train, x_valid, kf, label_split=None):
    lgb_train, lgb_test = stacking_reg(lightgbm, x_train, y_train, x_valid, "lgb", kf, label_split=label_split)
    return lgb_train, lgb_test, "lgb_reg"

Stacking prediction

def stacking_pred(x_train, y_train, x_valid, kf, clf_list,
                  label_split=None, clf_fin="lgb", if_concat_origin=True):
    # first level: collect out-of-fold meta-features from every base learner
    column_list = []
    train_data_list = []
    test_data_list = []
    for clf in clf_list:
        train_data, test_data, clf_name = clf(x_train, y_train, x_valid, kf, label_split=label_split)
        train_data_list.append(train_data)
        test_data_list.append(test_data)
        column_list.append("clf_%s" % (clf_name))
    train = np.concatenate(train_data_list, axis=1)
    test = np.concatenate(test_data_list, axis=1)

    # optionally append the original features to the meta-features
    if if_concat_origin:
        train = np.concatenate([x_train, train], axis=1)
        test = np.concatenate([x_valid, test], axis=1)
    print(x_train.shape)
    print(train.shape)
    print(clf_name)
    print(clf_name in ["lgb"])

    # second level: fit the final learner on the stacked features
    if clf_fin in ["rf", "ada", "gb", "et", "lr", "lsvc", "knn"]:
        if clf_fin in ["rf"]:
            clf = RandomForestRegressor(n_estimators=100, max_depth=20, n_jobs=-1,
                                        random_state=2017, max_features="auto", verbose=1)
        elif clf_fin in ["ada"]:
            clf = AdaBoostRegressor(n_estimators=30, random_state=2017, learning_rate=0.01)
        elif clf_fin in ["gb"]:
            clf = GradientBoostingRegressor(learning_rate=0.04, n_estimators=100, subsample=0.8,
                                            random_state=2017, max_depth=5, verbose=1)
        elif clf_fin in ["et"]:
            clf = ExtraTreesRegressor(n_estimators=100, max_depth=35, max_features="auto",
                                      n_jobs=-1, random_state=2017, verbose=1)
        elif clf_fin in ["lr"]:
            clf = LinearRegression(n_jobs=-1)
        clf.fit(train, y_train)
        pre = clf.predict(test).reshape(-1, 1)
        return pre
    elif clf_fin in ["xgb"]:
        clf = xgboost
        train_matrix = clf.DMatrix(train, label=y_train, missing=-1)
        test_matrix = clf.DMatrix(train, label=y_train, missing=-1)
        params = {'booster': 'gbtree',
                  'eval_metric': 'rmse',
                  'gamma': 1,
                  'min_child_weight': 1.5,
                  'max_depth': 5,
                  'lambda': 10,
                  'subsample': 0.7,
                  'colsample_bytree': 0.7,
                  'colsample_bylevel': 0.7,
                  'eta': 0.03,
                  'tree_method': 'exact',
                  'seed': 2017,
                  'nthread': 12
                  }
        num_round = 1000
        early_stopping_rounds = 10
        watchlist = [(train_matrix, 'train'),
                     (test_matrix, 'eval')
                     ]
        model = clf.train(params, train_matrix, num_boost_round=num_round, evals=watchlist,
                          early_stopping_rounds=early_stopping_rounds
                          )
        pre = model.predict(clf.DMatrix(test), ntree_limit=model.best_ntree_limit).reshape(-1, 1)
        return pre
    elif clf_fin in ["lgb"]:
        print(clf_name)
        clf = lightgbm
        train_matrix = clf.Dataset(train, label=y_train)
        test_matrix = clf.Dataset(train, label=y_train)
        params = {'boosting_type': 'gbdt',
                  'objective': 'regression_l2',
                  'metric': 'mse',
                  'min_child_weight': 1.5,
                  'num_leaves': 2 ** 5,
                  'lambda_l2': 10,
                  'subsample': 0.7,
                  'colsample_bytree': 0.7,
                  'colsample_bylevel': 0.7,
                  'learning_rate': 0.03,
                  'tree_method': 'exact',
                  'seed': 2017,
                  'nthread': 12,
                  'silent': True,
                  }
        num_round = 1000
        early_stopping_rounds = 10
        model = clf.train(params, train_matrix, num_round, valid_sets=test_matrix,
                          early_stopping_rounds=early_stopping_rounds
                          )
        print('pred')
        pre = model.predict(test, num_iteration=model.best_iteration).reshape(-1, 1)
        print(pre)
        return pre


# load dataset
with open("./zhengqi_train.txt") as fr:
    data_train = pd.read_table(fr, sep="\t")
with open("./zhengqi_test.txt") as fr_test:
    data_test = pd.read_table(fr_test, sep="\t")


### K-fold cross-validation
from sklearn.model_selection import StratifiedKFold, KFold

folds = 5   # must match n_splits below: stacking_reg sizes its test buffer with it
seed = 1
kf = KFold(n_splits=5, shuffle=True, random_state=0)


### training and test set features/labels
x_train = data_train[data_test.columns].values
x_valid = data_test[data_test.columns].values
y_train = data_train['target'].values


### fuse lr_reg and lgb_reg for the final prediction
clf_list = [lr_reg, lgb_reg]
#clf_list = [lr_reg, rf_reg]

## note: stacking can overfit easily
pred = stacking_pred(x_train, y_train, x_valid, kf, clf_list,
                     label_split=None, clf_fin="lgb", if_concat_origin=True)
print(pred)
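stacking_pred returns the second-level predictions as an (n_test, 1) array. A short sketch for writing them out one value per line, which is the format the competition expects (the file name is illustrative):

# save the stacked predictions, one value per line
pd.DataFrame(pred).to_csv('stacking_result.txt', index=False, header=False)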

8. Summary

This project walked through exploratory data analysis (examining correlations between variables and identifying the key ones); feature engineering to refine the data (outlier handling, normalization, and dimensionality reduction); regression model training with mainstream ML models such as decision trees, random forests, and LightGBM; model validation, covering the relevant evaluation metrics and cross-validation; feature optimization with LightGBM; and, finally, model fusion based on stacking.


Original project link: https://www.heywhale.com/home/column/64141d6b1c8c8b518ba97dcc


Reference link: https://tianchi.aliyun.com/course/278/3427


The full local source code is available at: https://download.csdn.net/download/sinat_39620217/87630189


I have recently been planning to put together systematic project courses covering ML, DRL, NLP, and related fields, to help newcomers get up to speed quickly. Note: some projects are classic community projects included so everyone can learn from them quickly; hands-on components (competitions, papers, real-world applications, etc.) will be added over time.


  • For machine learning, the plan is: introductory ML algorithms ---> simple hands-on projects ---> data modeling competitions ---> solving real-world application problems. One route to help you learn and get practical quickly.

  • For deep reinforcement learning, the plan is: single-agent algorithm basics (mainly gym environments) ----> mainstream multi-agent algorithms (mainly gym environments) ----> single- and multi-agent projects in practice (paper reproduction with a business slant, e.g., UAV scheduling optimization, power resource dispatching, and similar applications)

  • For natural language processing: beyond individual algorithms, the focus is on knowledge graph construction: information extraction (including intelligent annotation) ---> knowledge fusion ----> knowledge reasoning ----> graph applications


What you can expect after mastering the above:


  1. For ML, you should be able to do very well in mathematical modeling competitions (winning an award is a reasonable baseline; a top placement still takes real work).

  2. You will be able to solve some real-world optimization and scheduling problems, rather than just playing with game demos in gym environments. (Going deeper will require your own research; the difficulty is considerable.)

  3. You will master the algorithms in every key stage of end-to-end knowledge graph construction, including graph database fundamentals.


These three fields are tightly coupled, and they will later be combined through projects such as a full search/recommendation system, in which all of these algorithms come together. For example, knowledge graphs draw on graph algorithms, NLP, and ML, while a search/recommendation system involves not only recall, coarse ranking, fine ranking, re-ranking, and blending, but also reinforcement learning and knowledge graphs. Ambitious plans, to be realized step by step.

