Random Forest - Classification Forest

Author: 烧灯续昼2002

2022-11-06
Shandong

RandomForestClassifier(
    n_estimators='warn', criterion='gini', max_depth=None,
    min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
    max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0,
    min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None,
    random_state=None, verbose=0, warm_start=False, class_weight=None,
)


Basic implementation

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import numpy as np
import matplotlib.pyplot as plt


Building the forest

# Prepare the data
wine = load_wine()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(wine.data, wine.target, test_size=0.3)

# Instantiate the models
clf = DecisionTreeClassifier(random_state=0)
rfc = RandomForestClassifier(random_state=0)

# Train the instantiated models on the training set; the interface used is fit
clf = clf.fit(Xtrain, Ytrain)
rfc = rfc.fit(Xtrain, Ytrain)

# Pass the test set to the trained models through another interface to get the result we want (score on Xtest, Ytest)
score_c = clf.score(Xtest, Ytest)
score_r = rfc.score(Xtest, Ytest)
print("Single Tree:{}".format(score_c), "Random Forest:{}.".format(score_r))
---
Single Tree:0.9259259259259259 Random Forest:0.9629629629629629.


Plotting random forest against decision tree under one round of cross-validation

rfc = RandomForestClassifier(n_estimators=25)
rfc_s = cross_val_score(rfc, wine.data, wine.target, cv=10)

clf = DecisionTreeClassifier()
clf_s = cross_val_score(clf, wine.data, wine.target, cv=10)

plt.plot(range(1, 11), rfc_s, label="RandomForest")
plt.plot(range(1, 11), clf_s, label="DecisionTree")
plt.legend()
plt.show()
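
The per-fold curves can be hard to compare by eye, so a numeric summary may help. The following is a minimal sketch that averages the ten fold scores for each model; random_state=0 is an assumption added here for reproducibility and is not part of the original comparison.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()

# random_state=0 is an arbitrary choice made for this sketch
rfc = RandomForestClassifier(n_estimators=25, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

# One accuracy value per fold, ten folds each
rfc_scores = cross_val_score(rfc, wine.data, wine.target, cv=10)
clf_scores = cross_val_score(clf, wine.data, wine.target, cv=10)

print("RandomForest mean accuracy: {:.3f}".format(rfc_scores.mean()))
print("DecisionTree mean accuracy: {:.3f}".format(clf_scores.mean()))
```

The exact numbers depend on the scikit-learn version and the seed, so they are not quoted here.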


Important parameters

n_estimators

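This is the number of trees in the forest; the default is 10 in the scikit-learn version used here. In general, the larger n_estimators is, the better the model tends to perform. Every model has a ceiling, however: once n_estimators reaches a certain size, the accuracy of the random forest usually stops rising and starts to fluctuate. A larger n_estimators also needs more computation and memory, and training takes longer. So we want to find a balance between training cost and model performance.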

superpa = []
for i in range(100):
    rfc = RandomForestClassifier(n_estimators=i+1, n_jobs=-1)
    rfc_s = cross_val_score(rfc, wine.data, wine.target, cv=10).mean()
    superpa.append(rfc_s)

print(max(superpa), superpa.index(max(superpa))+1)
plt.figure(figsize=[10, 5])
plt.plot(range(1, 101), superpa)
plt.show()
---
0.9888888888888889 32


The default of n_estimators was 10 in scikit-learn releases before 0.22; from 0.22 onward it is 100.
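
Because the default differs between releases, it can be safer to pass n_estimators explicitly. The sketch below is a rough illustration: it reads the value stored by the installed version through get_params and then builds a forest with an explicit value. The value 100 is simply the newer default, used here as an example.

```python
from sklearn.ensemble import RandomForestClassifier

# Value stored by the installed version (100 on scikit-learn 0.22 or later)
default_n = RandomForestClassifier().get_params()["n_estimators"]
print("default n_estimators:", default_n)

# Passing the value explicitly makes the model independent of the default
rfc = RandomForestClassifier(n_estimators=100, random_state=0)
print("explicit n_estimators:", rfc.get_params()["n_estimators"])
```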

random_state

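Random forests also have a random_state, used much as in classification trees. The difference is that in a classification tree one random_state controls the generation of a single tree, while in a random forest random_state controls the pattern by which the whole forest is generated.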

rfc = RandomForestClassifier(n_estimators=10, random_state=0)
rfc = rfc.fit(Xtrain, Ytrain)


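Note that the random_state here belongs to the random forest, not to each decision tree. The forest's random_state determines the random_state of every tree: fixing the former fixes the latter. So once the forest's random_state is fixed, each tree's random_state is fixed too, though not necessarily to 0.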

# One of the important attributes of a random forest: estimators_. It returns a list whose elements are the individual trees in the forest (more on this later)
rfc.estimators_[0]
---
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False,
            random_state=209652396, splitter='best')


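A side note: DataFrame, Series and ndarray are built mainly for numeric and string data. If you build a Series from rfc.estimators_, its dtype is object and the printed elements look like strings, but each element is still the original DecisionTreeClassifier object. Its methods and attributes are reached by taking the element out first (for example by index), not through the Series itself.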

# Run this several times and the values stay the same, because the forest's random_state is fixed
for i in range(len(rfc.estimators_)):
    print(rfc.estimators_[i].random_state)
---
209652396
398764591
924231285
1478610112
441365315
1537364731
192771779
1491434855
1819583497
530702035
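
The relationship between the forest's random_state and the per-tree values can also be checked programmatically. This is a minimal sketch, assuming the wine data set used throughout; tree_seeds is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

wine = load_wine()

def tree_seeds(forest_seed):
    """Fit a small forest and return the random_state of each of its trees."""
    rfc = RandomForestClassifier(n_estimators=10, random_state=forest_seed)
    rfc.fit(wine.data, wine.target)
    return [tree.random_state for tree in rfc.estimators_]

seeds_a = tree_seeds(0)
seeds_b = tree_seeds(0)  # same forest seed as seeds_a
seeds_c = tree_seeds(1)  # different forest seed

print(seeds_a == seeds_b)  # compares two runs with the same forest seed
print(seeds_a == seeds_c)  # compares runs with different forest seeds
print(len(set(seeds_a)))   # number of distinct per-tree seeds
```

Fitting twice with the same forest seed should give the same list of per-tree seeds, while a different forest seed should give a different list.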


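The different random_state values across the trees reflect the randomness obtained by "randomly choosing features for splitting". It can also be shown that the greater this randomness, the better bagging generally performs. When ensembling with bagging, the base classifiers should be independent of one another and different from each other.

This approach is quite limited, however. When we need thousands of trees, the data may not offer enough features to build that many trees that are as different as possible. So besides random_state, we need other sources of randomness.

It appears that a decision tree's random_state does not affect feature selection at nodes other than the root; I will verify this later.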

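bootstrap

Default True; accepts a boolean. It controls whether bootstrap samples are drawn when building the forest. It is generally used when n and n_estimators are large enough (as explained earlier). It is rarely changed; just keep the default True.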
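
To make the idea of a bootstrap sample concrete, the sketch below draws one with plain NumPy and measures how many distinct rows end up in it. It illustrates sampling with replacement in general and is not scikit-learn's internal code; the sample size n and the seed are hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
n = 10000                       # hypothetical training-set size

# One bootstrap sample: draw n row indices with replacement
indices = rng.integers(0, n, size=n)

# Rows drawn at least once are "in the bag"; the rest are out-of-bag
in_bag = np.unique(indices)
in_bag_fraction = in_bag.size / n
oob_fraction = 1 - in_bag_fraction

print("in-bag fraction: {:.3f}".format(in_bag_fraction))
print("out-of-bag fraction: {:.3f}".format(oob_fraction))
print("theoretical limit 1 - 1/e: {:.3f}".format(1 - np.exp(-1)))
```

For large n the in-bag fraction approaches 1 - 1/e, so roughly a third of the rows are left out of each tree's sample. Those left-out rows are the out-of-bag data.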

oob_score

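Default False. It controls whether out-of-bag samples are used to test the model's accuracy when building the forest. Set it to True, and after the model is instantiated and trained, oob_score_ gives the accuracy measured on the out-of-bag data. With this approach there is no need to split the data into training and test sets; the out-of-bag data is used directly.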

rfc = RandomForestClassifier(n_estimators=25, oob_score=True)
rfc = rfc.fit(wine.data, wine.target)

rfc.oob_score_
---
0.9831460674157303
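
One way to build some trust in the out-of-bag estimate is to compare it with a conventional held-out test score. The sketch below does this on the wine data; the split, the seed and n_estimators=100 are assumptions made for the illustration.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

wine = load_wine()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=0
)

# oob_score=True relies on bootstrap=True, which is the default
rfc = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rfc.fit(Xtrain, Ytrain)

print("OOB accuracy:  {:.3f}".format(rfc.oob_score_))
print("test accuracy: {:.3f}".format(rfc.score(Xtest, Ytest)))
```

The two numbers will usually be close but not identical, since they are measured on different samples.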


Important attributes

So far we have used .estimators_ and .oob_score_. As an ensemble of tree models, a random forest also has the .feature_importances_ attribute.

rfc = RandomForestClassifier(n_estimators=25)
rfc = rfc.fit(Xtrain, Ytrain)
rfc.feature_importances_  # importance of each feature in the random forest
---
array([0.09518147, 0.02570141, 0.02701417, 0.03267217, 0.05957439,
       0.10093973, 0.12028216, 0.0205655 , 0.01375265, 0.17381093,
       0.08947711, 0.11037468, 0.13065362])
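
The raw array is easier to read once each value is paired with its feature name. The following sketch assumes the wine data set and uses its feature_names attribute; random_state=0 is added here only for reproducibility.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

wine = load_wine()
rfc = RandomForestClassifier(n_estimators=25, random_state=0)
rfc.fit(wine.data, wine.target)

# Pair each importance with its feature name and sort in descending order
ranked = sorted(
    zip(wine.feature_names, rfc.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print("{:<30s} {:.4f}".format(name, importance))

# The importances are normalized, so they add up to 1
print("sum of importances:", rfc.feature_importances_.sum())
```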


Important interfaces

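As with decision trees, there are apply, fit, predict and score. Another interface worth knowing is predict_proba, which returns, for each test sample, the probability of being assigned to each class label: as many probabilities as there are labels. The returned object has shape $\text{number of test samples} \times \text{number of labels}$.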

Xtest.shape, rfc.apply(Xtest).shape
---
((54, 13), (54, 25))

rfc.apply(Xtest)
# Index of the leaf node each sample lands in, for every tree in the forest
# Each row is a sample, each column is a tree
---
array([[ 8,  5,  3, ...,  3, 13,  2],
       [10,  1,  6, ...,  6,  4, 10],
       [ 3,  1,  6, ...,  6,  4, 11],
       ...,
       [ 8,  4, 13, ...,  6,  4,  8],
       [10,  1,  2, ...,  6,  4,  8],
       [10,  1, 12, ...,  6,  4, 10]], dtype=int64)

rfc.predict(Xtest)  # predicted labels
---
array([2, 1, 1, 2, 2, 0, 1, 2, 2, 0, 0, 2, 0, 0, 0, 1, 0, 1, 0, 0, 1, 2,
       1, 1, 2, 1, 0, 0, 2, 1, 1, 1, 2, 2, 1, 0, 1, 1, 2, 1, 1, 1, 1, 0,
       1, 2, 0, 0, 2, 0, 1, 1, 1, 1])

rfc.predict_proba(Xtest).shape
---
(54, 3)

rfc.predict_proba(Xtest)
# Probability of each sample being assigned to each label
# Only the first five rows are shown here
# Each row is a sample; each column is the probability of one label for that sample
# These match the first five results of rfc.predict(Xtest) above (2, 1, 1, 2, 2): the label with the highest probability is the prediction
---
array([[0.  , 0.  , 1.  ],
       [0.  , 1.  , 0.  ],
       [0.  , 0.84, 0.16],
       [0.24, 0.2 , 0.56],
       [0.04, 0.04, 0.92],
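
The relationships between these interfaces can be checked in a few lines. This is a minimal sketch under assumed settings (a fixed split and random_state=0); it confirms the shapes returned by apply and predict_proba and that predict picks the label with the highest probability.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

wine = load_wine()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=0
)
rfc = RandomForestClassifier(n_estimators=25, random_state=0).fit(Xtrain, Ytrain)

proba = rfc.predict_proba(Xtest)
leaves = rfc.apply(Xtest)

print(proba.shape)   # (number of test samples, number of classes)
print(leaves.shape)  # (number of test samples, number of trees)

# Each row of probabilities sums to 1
print(np.allclose(proba.sum(axis=1), 1.0))

# The predicted label is the class with the highest probability
labels_from_proba = rfc.classes_[np.argmax(proba, axis=1)]
print(np.array_equal(labels_from_proba, rfc.predict(Xtest)))
```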


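Another necessary condition for using bagging

As mentioned earlier, bagging requires the base estimators to be as independent as possible. There is a second necessary condition: each base classifier must be more accurate than a random classifier, which for binary classification means an accuracy above 50%.

Plot the error rate of a random forest against that of a decision tree when the base classifier's error rate is $\epsilon$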

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import comb

x = np.linspace(0, 1, 20)
y = []
for epsilon in x:
    E = np.array([comb(25, i) * (epsilon**i) * ((1 - epsilon)**(25 - i))
                  for i in range(13, 26)]).sum()
    y.append(E)
plt.plot(x, y, 'o-', label='when estimators are different')
# i.e. a random forest whose 25 trees are all different
plt.plot(x, x, '--', color='red', label='if all estimators are same')
# i.e. a random forest whose 25 trees are all the same
plt.xlabel("individual estimator's error")
plt.ylabel("RandomForest's error")
plt.legend()
plt.show()


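The plot shows that when the base classifier's error rate is below 0.5, that is, its accuracy is above 0.5, the ensemble performs better than a single base classifier. Conversely, when the base classifier's error rate is above 0.5, the bagging ensemble breaks down. So before using a random forest, always check that the classification trees making it up each reach at least 50% prediction accuracy.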
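
The same calculation can be evaluated at a few individual error rates without plotting. The sketch below rests on idealized assumptions: 25 independent trees, binary classification, and majority voting, so the forest is wrong whenever at least 13 trees are wrong. ensemble_error is a hypothetical helper name introduced for this illustration.

```python
from math import comb

def ensemble_error(epsilon, n_trees=25):
    """Probability that a majority of n_trees independent trees are wrong."""
    majority = n_trees // 2 + 1  # 13 of 25
    return sum(
        comb(n_trees, i) * epsilon**i * (1 - epsilon) ** (n_trees - i)
        for i in range(majority, n_trees + 1)
    )

# Compare the error of one tree with the error of the majority vote
for eps in (0.2, 0.4, 0.5, 0.6, 0.8):
    print("tree error {:.1f} -> forest error {:.6f}".format(eps, ensemble_error(eps)))
```

At an individual error of exactly 0.5 the forest error is also 0.5, which is the crossover point between the two regimes. Real trees in a forest are correlated, so the gain in practice is smaller than this idealized calculation suggests.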

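Regression forests are essentially the same as classification forests. They will be covered in later examples rather than in a separate section.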

Video author: 菜菜TsaiTsai. Link: 【技术干货】菜菜的机器学习sklearn【全85集】Python进阶_哔哩哔哩_bilibili


Original link: http://xie.infoq.cn/article/3b06cc4a14dddefdd04b33d7d
