Kubernetes Node Failover: A Source Code Analysis

August 13, 2022
Guangdong

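How Kubernetes handles Pod migration after a kubelet goes down has changed from version to version, so articles online disagree about how to shorten the failover time, and some of the advice that turns up in searches no longer applies.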

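Early on I found articles pointing to one key parameter, pod-eviction-timeout, the time to wait before evicting Pods. Changing it had no effect, and reading the source showed that the parameter was never used, so I suspected it had been deprecated. After going through more material, it turned out that different versions use different eviction logic.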

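Versions before 1.13: the taint manager feature is not enabled by default. Without it, Pod migration is governed by these four parameters:
  node-status-update-frequency: how often the kubelet reports node status, default 10s
  node-monitor-period: how often the node controller checks node status, default 5s
  node-monitor-grace-period: how long the node controller waits without a heartbeat before marking the Node NotReady, default 40s
  pod-eviction-timeout: how long the node controller then waits before it starts evicting Pods, default 5m
Versions from 1.13 up to, but not including, 1.18: the taint manager feature is enabled by default, and Pods are evicted through the taint manager.
Versions 1.18 and later: the taint manager is always enabled, so the old code path no longer matters.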

Introduction to the Taint Mechanism

From the official documentation:

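Node affinity is a property of Pods that attracts them to a set of nodes (either as a preference or as a hard requirement).
Taints are the opposite: they allow a node to repel a set of Pods.
Tolerations are applied to Pods. Tolerations allow the scheduler to schedule Pods onto nodes with matching taints. Tolerations allow scheduling but do not guarantee it: the scheduler also evaluates other parameters as part of its function.
Taints and tolerations work together to keep Pods from being scheduled onto inappropriate nodes. One or more taints can be applied to a node, which means the node will not accept any Pod that does not tolerate those taints.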

Put simply, every kind of Pod eviction can be expressed through taints and tolerations, including eviction caused by a kubelet failure.
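
To make the matching rule concrete, here is a minimal sketch of how a toleration can be checked against a taint. The Taint and Toleration types are simplified stand-ins defined here for illustration, not the Kubernetes API types, and the helper names tolerates and toleratesAll are hypothetical.

```go
package main

import "fmt"

// Taint and Toleration are simplified stand-ins for the Kubernetes API types.
type Taint struct {
	Key    string
	Value  string
	Effect string
}

type Toleration struct {
	Key      string
	Operator string // "Exists" or "Equal"; empty is treated as "Equal"
	Value    string
	Effect   string
}

// tolerates reports whether a single toleration matches a single taint.
func tolerates(tol Toleration, taint Taint) bool {
	// An empty effect on the toleration matches every effect.
	if tol.Effect != "" && tol.Effect != taint.Effect {
		return false
	}
	// An empty key on the toleration matches every key.
	if tol.Key != "" && tol.Key != taint.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal":
		return tol.Value == taint.Value
	default:
		return false
	}
}

// toleratesAll reports whether every taint is matched by at least one toleration.
func toleratesAll(tols []Toleration, taints []Taint) bool {
	for _, taint := range taints {
		matched := false
		for _, tol := range tols {
			if tolerates(tol, taint) {
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	return true
}

func main() {
	notReady := Taint{Key: "node.kubernetes.io/not-ready", Effect: "NoExecute"}
	tols := []Toleration{{Key: "node.kubernetes.io/not-ready", Operator: "Exists", Effect: "NoExecute"}}
	fmt.Println(toleratesAll(tols, []Taint{notReady})) // true
	fmt.Println(toleratesAll(nil, []Taint{notReady}))  // false
}
```

A Pod with no matching toleration for a NoExecute taint is the one that gets evicted.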

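Source Code Analysis

1. Marking the node as NotReady

The node controller periodically checks the status of every node. If a node's last heartbeat is older than node-monitor-grace-period, the controller treats the node as unreachable and adds a taint to it.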

# node_lifecycle_controller.go
monitorNodeHealth()
  // 1. List all nodes
  --> nodes, err := nc.nodeLister.List(labels.Everything())
  // 2. Decide from the heartbeat time whether the node has become NotReady
  --> gracePeriod, observedReadyCondition, currentReadyCondition, err = nc.tryUpdateNodeHealth(node)
  // 3. Taint the node
  --> nc.processTaintBaseEviction(node, &observedReadyCondition)


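2. Watching Node updates and triggering eviction

Once the Node has been tainted, the nodeInformer registered in NodeLifecycleController observes the update and pushes the node object into tc.nodeUpdateChannels.

tc is the NoExecuteTaintManager, the taint manager object. It reads from tc.nodeUpdateChannels, passes the node to tc.handleNodeUpdate, looks up all Pods on that node, and calls tc.processPodOnNode on each of them.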

processPodOnNode creates a TimedWorker, an object that runs a given function when its timer expires. Here the function is deletePodHandler, which evicts the Pod.

How long is the TimedWorker's timer? The taint manager computes minTolerationTime, the minimum toleration time, from the toleration times found in pod.Spec.Tolerations.

So when is that toleration time written into the Pod?

3. Default toleration time

Running kubectl describe pod xxx shows that the Pod already carries tolerations for the taints node.kubernetes.io/not-ready:NoExecute and node.kubernetes.io/unreachable:NoExecute, each with a toleration time of 300s.

Tolerations:                 node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s


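The documentation shows that these default eviction tolerations are set by the API server:

Kubernetes automatically adds tolerations for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds=300, unless the user or a controller sets those tolerations explicitly.
These automatically added tolerations mean that a Pod stays bound to its node for 5 minutes after one of these problems is detected.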

Excerpt from the kube-apiserver flags

--default-not-ready-toleration-seconds int Default: 300
Indicates the tolerationSeconds of the toleration for notReady:NoExecute that is added by default to every pod that does not already have such a toleration.
--default-unreachable-toleration-seconds int Default: 300
Indicates the tolerationSeconds of the toleration for unreachable:NoExecute that is added by default to every pod that does not already have such a toleration.

4. How the API server sets the default tolerations

plugin/pkg/admission/defaulttolerationseconds/admission.go:43

var (
	defaultNotReadyTolerationSeconds = flag.Int64("default-not-ready-toleration-seconds", 300,
		"Indicates the tolerationSeconds of the toleration for notReady:NoExecute"+
			" that is added by default to every pod that does not already have such a toleration.")

	defaultUnreachableTolerationSeconds = flag.Int64("default-unreachable-toleration-seconds", 300,
		"Indicates the tolerationSeconds of the toleration for unreachable:NoExecute"+
			" that is added by default to every pod that does not already have such a toleration.")

	notReadyToleration = api.Toleration{
		Key:               v1.TaintNodeNotReady,
		Operator:          api.TolerationOpExists,
		Effect:            api.TaintEffectNoExecute,
		TolerationSeconds: defaultNotReadyTolerationSeconds,
	}

	unreachableToleration = api.Toleration{
		Key:               v1.TaintNodeUnreachable,
		Operator:          api.TolerationOpExists,
		Effect:            api.TaintEffectNoExecute,
		TolerationSeconds: defaultUnreachableTolerationSeconds,
	}
)

// Admit makes an admission decision based on the request attributes
func (p *Plugin) Admit(ctx context.Context, attributes admission.Attributes, o admission.ObjectInterfaces) (err error) {
	......
	if !toleratesNodeNotReady {
		pod.Spec.Tolerations = append(pod.Spec.Tolerations, notReadyToleration)
	}

	if !toleratesNodeUnreachable {
		pod.Spec.Tolerations = append(pod.Spec.Tolerations, unreachableToleration)
	}
	......
}


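This logic appears to live in the API server's admission stage: the admission plugin adds these two tolerations to every Pod that does not already have them.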
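
The defaulting behaviour can be sketched without the Kubernetes packages. The version below is a simplified model, not the plugin's code: the Toleration type, the constants, and the function withDefaultTolerations are defined here for illustration, and the rule for deciding that a Pod "already tolerates" a taint (same key or empty key, with a NoExecute or empty effect) is an assumption of this sketch.

```go
package main

import "fmt"

const (
	taintNodeNotReady    = "node.kubernetes.io/not-ready"
	taintNodeUnreachable = "node.kubernetes.io/unreachable"
	effectNoExecute      = "NoExecute"
)

// Toleration is a simplified stand-in for the Kubernetes API type.
type Toleration struct {
	Key               string
	Operator          string
	Effect            string
	TolerationSeconds *int64
}

// withDefaultTolerations returns a copy of tols with the not-ready and
// unreachable tolerations appended when the Pod does not already have them.
func withDefaultTolerations(tols []Toleration, notReadySecs, unreachableSecs int64) []Toleration {
	toleratesNotReady, toleratesUnreachable := false, false
	for _, t := range tols {
		noExecuteOrAny := t.Effect == effectNoExecute || t.Effect == ""
		if (t.Key == taintNodeNotReady || t.Key == "") && noExecuteOrAny {
			toleratesNotReady = true
		}
		if (t.Key == taintNodeUnreachable || t.Key == "") && noExecuteOrAny {
			toleratesUnreachable = true
		}
	}

	out := append([]Toleration(nil), tols...)
	if !toleratesNotReady {
		out = append(out, Toleration{Key: taintNodeNotReady, Operator: "Exists",
			Effect: effectNoExecute, TolerationSeconds: &notReadySecs})
	}
	if !toleratesUnreachable {
		out = append(out, Toleration{Key: taintNodeUnreachable, Operator: "Exists",
			Effect: effectNoExecute, TolerationSeconds: &unreachableSecs})
	}
	return out
}

func main() {
	custom := int64(30)
	pod := []Toleration{{Key: taintNodeNotReady, Operator: "Exists",
		Effect: effectNoExecute, TolerationSeconds: &custom}}

	// The explicit 30s toleration is kept; only the missing one is defaulted.
	for _, t := range withDefaultTolerations(pod, 300, 300) {
		fmt.Printf("%s for %ds\n", t.Key, *t.TolerationSeconds)
	}
}
```

This also shows the practical lever for a single workload: a Pod that declares its own, shorter toleration keeps it, so its failover time can be reduced without touching the cluster-wide flags.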

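Summary

After going through the documentation and the source, I finally understand how Pod eviction after a node failure is implemented. The path is hard to trace: it runs from the kubelet heartbeat, to the node controller in controller-manager watching node health, to the default tolerations the API server adds to Pods, and it also involves the scheduler no longer placing Pods on the failed node. Nearly every control plane component takes part.

The code relies heavily on channels and queues, so the components are thoroughly decoupled, but that also makes the source harder to read. The community keeps optimizing and refactoring the code with each release, and working out how k8s implements this is interesting and challenging work.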