An earlier article described how, when a Kubernetes node becomes unhealthy, the controller-manager actively evicts pods so they can be rescheduled onto healthy nodes. That mechanism essentially rests on two taints, node.kubernetes.io/not-ready:NoExecute and node.kubernetes.io/unreachable:NoExecute: once either is applied, the pods on the node are gradually moved away. So when exactly do these two taints get applied?
When the kubelet fails to report its status in time and conditions such as Ready, MemoryPressure and DiskPressure flip to Unknown, the node is tainted with node.kubernetes.io/unreachable:NoExecute.
When the kubelet itself reports a NotReady condition, the node is tainted with node.kubernetes.io/not-ready:NoExecute.
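On the controller side this boils down to a mapping from the node's Ready condition to a NoExecute taint. The function below is a hand-written sketch of that idea rather than the actual node lifecycle controller code; only the taint key constants are the real ones from k8s.io/api/core/v1.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// taintForReadyCondition sketches how the node lifecycle controller picks a
// NoExecute taint from the node's Ready condition. Simplified illustration,
// not the real controller implementation.
func taintForReadyCondition(status v1.ConditionStatus) *v1.Taint {
	switch status {
	case v1.ConditionFalse:
		// The kubelet itself reported NotReady.
		return &v1.Taint{Key: v1.TaintNodeNotReady, Effect: v1.TaintEffectNoExecute}
	case v1.ConditionUnknown:
		// The kubelet stopped reporting and the node-monitor grace period expired.
		return &v1.Taint{Key: v1.TaintNodeUnreachable, Effect: v1.TaintEffectNoExecute}
	default:
		// Ready == True: no NoExecute taint.
		return nil
	}
}
```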
The call below, which wires up the kubelet's Ready condition setter, gives a rough idea of when the kubelet reports NotReady of its own accord:
```go
nodestatus.ReadyCondition(kl.clock.Now, kl.runtimeState.runtimeErrors,
	kl.runtimeState.networkErrors, kl.runtimeState.storageErrors,
	validateHostFunc, kl.containerManager.Status,
	kl.shutdownManager.ShutdownStatus, kl.recordNodeStatusEvent),
```
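A minimal sketch of the idea inside that setter, assuming the kubelet wiring above (the real nodestatus.ReadyCondition also consults host validation, the container manager and the shutdown manager, and records events): any runtime, network or storage error flips Ready to False, which is what ultimately earns the node the not-ready taint.

```go
package sketch

import (
	"strings"

	v1 "k8s.io/api/core/v1"
)

// readyCondition is a hypothetical, stripped-down version of the kubelet's
// Ready condition setter: the node is NotReady as soon as any error source is non-empty.
func readyCondition(runtimeErr, networkErr, storageErr error) v1.NodeCondition {
	var problems []string
	for _, err := range []error{runtimeErr, networkErr, storageErr} {
		if err != nil {
			problems = append(problems, err.Error())
		}
	}
	if len(problems) > 0 {
		return v1.NodeCondition{
			Type:    v1.NodeReady,
			Status:  v1.ConditionFalse, // the controller reacts with node.kubernetes.io/not-ready:NoExecute
			Reason:  "KubeletNotReady",
			Message: strings.Join(problems, ","),
		}
	}
	return v1.NodeCondition{Type: v1.NodeReady, Status: v1.ConditionTrue, Reason: "KubeletReady"}
}
```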
When the kubelet detects resource pressure on its own node, it actively evicts pods to bring the node back out of the pressure state. The relevant parameters are:
eviction-hard: hard eviction thresholds, e.g. the defaults memory.available<100Mi,nodefs.available<10%,imagefs.available<15%,nodefs.inodesFree<5% (see the sketch after this list)
eviction-soft: soft eviction thresholds
eviction-soft-grace-period: how long a soft threshold must stay exceeded before eviction starts
eviction-max-pod-grace-period: for soft evictions, the maximum graceful-termination time for an evicted pod; the smaller of this value and the pod's own grace period is used. When a hard threshold is hit, pods are evicted immediately with no grace period.
housekeeping-interval: how often the pressure check runs; during eviction one pod is evicted per interval, so this also bounds how fast pods are evicted (default 10s)
eviction-pressure-transition-period: the minimum time a pressure condition is held before it may transition back, to prevent flapping
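To make the threshold syntax concrete, the sketch below shows what memory.available<100Mi evaluates to. It is a hypothetical helper built on resource.Quantity from apimachinery, not the kubelet's own eviction code:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// exceedsHardThreshold reports whether an observed "available" value has
// dropped below the configured hard-eviction threshold. Illustration only.
func exceedsHardThreshold(available, threshold resource.Quantity) bool {
	// The signal fires when available < threshold, e.g. memory.available < 100Mi.
	return available.Cmp(threshold) < 0
}

func main() {
	available := resource.MustParse("80Mi")
	threshold := resource.MustParse("100Mi") // from eviction-hard: memory.available<100Mi
	fmt.Println(exceedsHardThreshold(available, threshold)) // true -> hard eviction, no grace period
}
```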
From this flow we can observe the following:
When a pressure condition occurs, for example disk pressure, the node reports the DiskPressure condition and the controller adds the corresponding node.kubernetes.io/disk-pressure:NoSchedule taint, so new pods are no longer scheduled onto the node. Since the taint's effect is NoSchedule rather than NoExecute, the controller does not evict the pods already running there.
Under pressure the kubelet also enables admission rejection and stops accepting new pods. This is why, right around the pressure threshold, we occasionally see a burst of pods in the Evicted state: the controller's pressure taint always lags behind the kubelet's admission rejection, so when resources are tight and new pods can only be placed on this node, they keep landing on it and being rejected until the taint is finally applied.
The eviction manager also enforces local-storage limits: pods that exceed them are evicted immediately with no grace period, for example when an emptyDir volume grows beyond its size limit.
During the soft-eviction grace period only the pressure state is updated; no action is taken.
Once the grace period has passed, node-level resources are reclaimed first: dead containers and unused images are deleted.
If that is still not enough, pods are finally evicted, one per housekeeping interval. Within the soft-eviction range they get a graceful termination period; once a hard threshold is reached they are evicted immediately.
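Putting the last three points together, one housekeeping pass looks roughly like the sketch below. All names here are illustrative stand-ins, not the real synchronize() from pkg/kubelet/eviction:

```go
package sketch

// evictionPass models the order of operations in one kubelet housekeeping pass.
type evictionPass struct {
	thresholdsMet            func() []string // signals over threshold (soft grace periods already applied)
	reclaimNodeLevel         func() bool     // delete dead containers / unused images; true if enough was freed
	rankPods                 func() []string // pods ordered by exceeds-request, priority, usage
	evict                    func(pod string, gracePeriodSeconds int64) bool
	hardThresholdMet         func() bool
	maxPodGracePeriodSeconds int64
}

func (p evictionPass) run() {
	if len(p.thresholdsMet()) == 0 {
		return // no pressure: nothing to do this interval
	}
	if p.reclaimNodeLevel() {
		return // node-level cleanup resolved the pressure
	}
	grace := p.maxPodGracePeriodSeconds // soft thresholds allow graceful termination
	if p.hardThresholdMet() {
		grace = 0 // hard thresholds evict immediately
	}
	for _, pod := range p.rankPods() {
		if p.evict(pod, grace) {
			return // at most one pod per housekeeping-interval
		}
	}
}
```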
Evicting pods raises an ordering problem:
When several pressure conditions occur at once, they have to be prioritized, and memory pressure is handled first.
Under a given pressure condition the pods themselves must also be ranked for eviction. Looking at the ranking functions below, the first sort key is whether the pod's usage exceeds its requests. A Guaranteed pod can never exceed its requests, so Guaranteed pods naturally end up at the back; the remaining pods are then ordered by priority and finally by how far their usage exceeds their requests. In other words, after the exceeds-request split, priority is the next deciding factor.
```go
// rankMemoryPressure orders the input pods for eviction in response to memory pressure.
// It ranks by whether or not the pod's usage exceeds its requests, then by priority, and
// finally by memory usage above requests.
func rankMemoryPressure(pods []*v1.Pod, stats statsFunc) {
	orderedBy(exceedMemoryRequests(stats), priority, memory(stats)).Sort(pods)
}

// rankPIDPressure orders the input pods by priority in response to PID pressure.
func rankPIDPressure(pods []*v1.Pod, stats statsFunc) {
	orderedBy(priority, process(stats)).Sort(pods)
}

// rankDiskPressureFunc returns a rankFunc that measures the specified fs stats.
func rankDiskPressureFunc(fsStatsToMeasure []fsStatsType, diskResource v1.ResourceName) rankFunc {
	return func(pods []*v1.Pod, stats statsFunc) {
		orderedBy(exceedDiskRequests(stats, fsStatsToMeasure, diskResource), priority, disk(stats, fsStatsToMeasure, diskResource)).Sort(pods)
	}
}
```
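The orderedBy helper chains comparators, which is exactly what produces the exceeds-request, then priority, then overage ordering. A self-contained toy version of the same ranking, with made-up types and purely for illustration, could look like this:

```go
package sketch

import "sort"

// podStat is a toy stand-in for the per-pod data the eviction manager uses.
type podStat struct {
	name           string
	exceedsRequest bool  // is memory usage above the pod's request?
	priority       int32 // pod .spec.priority
	usageOverReq   int64 // bytes of usage above the request
}

// rankForMemoryPressure sorts pods so that the best eviction candidates come
// first: pods over their request, then lower priority, then larger overage.
func rankForMemoryPressure(pods []podStat) {
	sort.SliceStable(pods, func(i, j int) bool {
		a, b := pods[i], pods[j]
		if a.exceedsRequest != b.exceedsRequest {
			return a.exceedsRequest // exceeding the request ranks earlier (evicted first)
		}
		if a.priority != b.priority {
			return a.priority < b.priority // lower priority is evicted first
		}
		return a.usageOverReq > b.usageOverReq // largest overage is evicted first
	})
}
```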
There is also a special class of pods that kubelet eviction leaves alone: system-level pods that must keep running on the node. The check is as follows:
```go
// IsCriticalPod returns true if pod's priority is greater than or equal to SystemCriticalPriority.
func IsCriticalPod(pod *v1.Pod) bool {
	if IsStaticPod(pod) {
		return true
	}
	if IsMirrorPod(pod) {
		return true
	}
	if pod.Spec.Priority != nil && IsCriticalPodBasedOnPriority(*pod.Spec.Priority) {
		return true
	}
	return false
}
```
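The priority branch relies on IsCriticalPodBasedOnPriority, which compares the pod's priority against SystemCriticalPriority. The constants below mirror pkg/apis/scheduling in the Kubernetes tree (quoted from memory, so treat the exact numbers as approximate):

```go
package sketch

// Values mirroring pkg/apis/scheduling.
const (
	HighestUserDefinablePriority = int32(1000000000)
	SystemCriticalPriority       = 2 * HighestUserDefinablePriority // 2000000000
)

// isCriticalPodBasedOnPriority mirrors the kubelet helper: the built-in
// system-cluster-critical (2000000000) and system-node-critical (2000001000)
// priority classes pass this check; user-defined classes cannot.
func isCriticalPodBasedOnPriority(priority int32) bool {
	return priority >= SystemCriticalPriority
}
```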
For such pods to be guaranteed a spot on the node, every step of the pipeline has to give them a green light:
When resources are scarce they need the ability to preempt lower-priority pods; this is handled at the scheduler level.
At scheduling time they must also tolerate the NoSchedule pressure taints; for daemonsets, the DaemonSet controller adds these tolerations automatically.
During kubelet pressure eviction these pods are skipped, and they are likewise skipped when local storage usage exceeds its limit.
The admission-rejection protection must also admit this type of pod:
```go
func (m *managerImpl) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
	m.RLock()
	defer m.RUnlock()
	if len(m.nodeConditions) == 0 {
		return lifecycle.PodAdmitResult{Admit: true}
	}
	// Admit Critical pods even under resource pressure since they are required for system stability.
	// https://github.com/kubernetes/kubernetes/issues/40573 has more details.
	if kubelettypes.IsCriticalPod(attrs.Pod) {
		return lifecycle.PodAdmitResult{Admit: true}
	}
	......
}
```
Even after being admitted, the node's resources may no longer be enough for the critical pod, so preemption is still needed:
```go
// HandleAdmissionFailure gracefully handles admission rejection, and, in some cases,
// to allow admission of the pod despite its previous failure.
func (c *CriticalPodAdmissionHandler) HandleAdmissionFailure(admitPod *v1.Pod, failureReasons []lifecycle.PredicateFailureReason) ([]lifecycle.PredicateFailureReason, error) {
	if !kubetypes.IsCriticalPod(admitPod) {
		return failureReasons, nil
	}
	// InsufficientResourceError is not a reason to reject a critical pod.
	// Instead of rejecting, we free up resources to admit it, if no other reasons for rejection exist.
	nonResourceReasons := []lifecycle.PredicateFailureReason{}
	resourceReasons := []*admissionRequirement{}
	for _, reason := range failureReasons {
		if r, ok := reason.(*lifecycle.InsufficientResourceError); ok {
			resourceReasons = append(resourceReasons, &admissionRequirement{
				resourceName: r.ResourceName,
				quantity:     r.GetInsufficientAmount(),
			})
		} else {
			nonResourceReasons = append(nonResourceReasons, reason)
		}
	}
	if len(nonResourceReasons) > 0 {
		// Return only reasons that are not resource related, since critical pods cannot fail admission for resource reasons.
		return nonResourceReasons, nil
	}
	err := c.evictPodsToFreeRequests(admitPod, admissionRequirementList(resourceReasons))
	// if no error is returned, preemption succeeded and the pod is safe to admit.
	return nil, err
}
```
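evictPodsToFreeRequests then has to decide which running pods to sacrifice so that the freed requests cover the shortfall. The real selection logic in pkg/kubelet/preemption weighs QoS classes and tries to minimize impact; the greedy version below is only a much-simplified illustration of the idea:

```go
package sketch

// podRequest is a toy view of a running pod's request for the contended resource.
type podRequest struct {
	name     string
	critical bool  // critical pods are never preempted
	request  int64 // amount of the contended resource this pod requests
}

// pickVictims greedily selects non-critical pods until their summed requests
// cover the shortfall, returning nil if that is impossible. Illustration only,
// not the real preemption algorithm.
func pickVictims(running []podRequest, shortfall int64) []string {
	var victims []string
	for _, p := range running {
		if shortfall <= 0 {
			break
		}
		if p.critical {
			continue
		}
		victims = append(victims, p.name)
		shortfall -= p.request
	}
	if shortfall > 0 {
		return nil // even evicting every non-critical pod would not free enough
	}
	return victims
}
```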