Exploring the Kubernetes probe parameters periodSeconds and timeoutSeconds

2024-07-11

Origin of the question

Kubernetes probe configuration has two parameters that are easy to confuse, periodSeconds and timeoutSeconds. They are configured as follows:

apiVersion: v1
kind: Pod
metadata:
  name: darwin-app
spec:
  containers:
  - name: darwin-container
    image: darwin-image
    livenessProbe:
      httpGet:
        path: /darwin-path
        port: 8080
      initialDelaySeconds: 60
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3

The official documentation explains the two parameters as follows:

periodSeconds: How often (in seconds) to perform the probe. Default to 10 seconds. The minimum value is 1.
timeoutSeconds: Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1.

In other words, periodSeconds is the interval at which the probe is executed, while timeoutSeconds is the timeout applied to each execution of the probe.
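
As a small illustration of the documented defaults and minimums, the sketch below applies them to a hypothetical probeTiming struct. The struct, the function names, and the choice to treat zero as "unset" are assumptions made for this example; this is not the Kubernetes API code.

```go
package main

import (
	"errors"
	"fmt"
)

// probeTiming is a hypothetical struct for this sketch, not the Kubernetes v1.Probe type.
type probeTiming struct {
	PeriodSeconds  int
	TimeoutSeconds int
}

// withDefaults fills in the documented defaults (10 for the period, 1 for the timeout).
// Treating zero as "unset" is an assumption of this sketch.
func withDefaults(p probeTiming) probeTiming {
	if p.PeriodSeconds == 0 {
		p.PeriodSeconds = 10
	}
	if p.TimeoutSeconds == 0 {
		p.TimeoutSeconds = 1
	}
	return p
}

// validate enforces the documented minimum value of 1 for both fields.
func validate(p probeTiming) error {
	if p.PeriodSeconds < 1 {
		return errors.New("periodSeconds must be at least 1")
	}
	if p.TimeoutSeconds < 1 {
		return errors.New("timeoutSeconds must be at least 1")
	}
	return nil
}

func main() {
	p := withDefaults(probeTiming{TimeoutSeconds: 5})
	fmt.Println(p.PeriodSeconds, p.TimeoutSeconds, validate(p)) // 10 5 <nil>
}
```

Nothing in these rules relates the two values to each other, which is why a configuration with timeoutSeconds larger than periodSeconds is accepted.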

There are quite a few discussions of these two parameters online (listed below). One question they raise: what happens if timeoutSeconds > periodSeconds?

What is the role of timeoutSeconds in kubernetes liveness/readiness probes?
Kubernetes Health Check: timeoutSeconds exceeds periodSeconds
Does periodSeconds in Kubernetes probe configuration count from the last probe time or the last response/failure time?

The third post above describes the timeoutSeconds > periodSeconds case as follows: if the probe times out, the probe interval becomes equal to timeoutSeconds. Is that claim correct?

If you had the opposite (timeoutSeconds=10, periodSeconds=5), then the probes would look as follows:

0s: liveness probe initiated
10s: liveness probe times out
10s: liveness probe initiated again

Digging into the source code

Since opinions online differ, let's settle the question by reading the source code.

The Kubernetes probe mechanism is executed by the kubelet, and it currently supports four probe types: exec, grpc, httpGet, and tcpSocket.

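The probe logic is not complicated. Taking the v1.30.2 code as an example, the entry function is shown below. It starts a ticker with a period of w.spec.PeriodSeconds (the periodSeconds defined on the probe) and executes the probe periodically.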

// run periodically probes the container.
func (w *worker) run() {
	ctx := context.Background()
	probeTickerPeriod := time.Duration(w.spec.PeriodSeconds) * time.Second
	...
	probeTicker := time.NewTicker(probeTickerPeriod)
	...
probeLoop:
	for w.doProbe(ctx) {
		// Wait for next probe tick.
		select {
		case <-w.stopCh:
			break probeLoop
		case <-probeTicker.C:
		case <-w.manualTriggerCh:
			// continue
		}
	}
}
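
The shape of this loop (probe first, then block until the next tick or a stop signal) can be tried in isolation. The following is a minimal standalone sketch of the same pattern; runLoop, countProbes, and the millisecond parameter are hypothetical names chosen for the example, and the manual trigger channel is left out.

```go
package main

import (
	"fmt"
	"time"
)

// runLoop mirrors the shape of the excerpt: it probes first, then waits for
// either a stop signal or the next tick. The loop ends when probe returns false.
func runLoop(periodMillis int, stop <-chan struct{}, probe func() bool) {
	ticker := time.NewTicker(time.Duration(periodMillis) * time.Millisecond)
	defer ticker.Stop()
probeLoop:
	for probe() {
		select {
		case <-stop:
			break probeLoop
		case <-ticker.C:
			// next iteration runs the probe again
		}
	}
}

// countProbes runs the loop until the probe has been called n times
// and returns how many calls were made.
func countProbes(periodMillis, n int) int {
	calls := 0
	// A nil stop channel never becomes ready, so only the ticker drives the loop.
	runLoop(periodMillis, nil, func() bool {
		calls++
		return calls < n // returning false ends the loop
	})
	return calls
}

func main() {
	fmt.Println(countProbes(10, 3)) // 3
}
```

Because the probe call sits in the loop condition, the ticker is only consulted after a probe has finished; a probe that runs longer than the period simply delays the next read from the ticker channel.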

Now that we know what periodSeconds is used for, the next step is to find timeoutSeconds.

Start with the doProbe function, which calls w.probeManager.prober.probe:

// doProbe probes the container once and records the result.
// Returns whether the worker should continue.
func (w *worker) doProbe(ctx context.Context) (keepGoing bool) {
	...
	// Note, exec probe does NOT have access to pod environment variables or downward API
	result, err := w.probeManager.prober.probe(ctx, w.probeType, w.pod, status, w.container, w.containerID)
	if err != nil {
		// Prober error, throw away the result.
		return true
	}
	...
}

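The probe function below executes one specific probe. Note that it calls pb.runProbeWithRetries with maxProbeRetries, whose value is 3, which means the probe command can be executed up to 3 times within one period (periodSeconds):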

// probe probes the container.
func (pb *prober) probe(ctx context.Context, probeType probeType, pod *v1.Pod, status v1.PodStatus, container v1.Container, containerID kubecontainer.ContainerID) (results.Result, error) {
	var probeSpec *v1.Probe
	switch probeType {
	case readiness:
		probeSpec = container.ReadinessProbe
	case liveness:
		probeSpec = container.LivenessProbe
	case startup:
		probeSpec = container.StartupProbe
	default:
		return results.Failure, fmt.Errorf("unknown probe type: %q", probeType)
	}
	...
	result, output, err := pb.runProbeWithRetries(ctx, probeType, probeSpec, pod, status, container, containerID, maxProbeRetries)
	...
}

The comment on runProbeWithRetries says that the probe may be executed several times, until it succeeds or all attempts have failed:

// runProbeWithRetries tries to probe the container in a finite loop, it returns the last result
// if it never succeeds.
func (pb *prober) runProbeWithRetries(ctx context.Context, probeType probeType, p *v1.Probe, pod *v1.Pod, status v1.PodStatus, container v1.Container, containerID kubecontainer.ContainerID, retries int) (probe.Result, string, error) {
	...
	for i := 0; i < retries; i++ {
		result, output, err = pb.runProbe(ctx, probeType, p, pod, status, container, containerID)
		...
	}
	...
}
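
A bounded retry loop of this kind can be sketched on its own. The version below is only an illustration: withRetries and flaky are hypothetical helpers, and the stopping condition (return as soon as an attempt comes back without an error) is an assumption, since the loop body is elided in the excerpt above.

```go
package main

import (
	"errors"
	"fmt"
)

// withRetries calls attempt up to retries times. It returns the first result
// that comes back without an error, or the last output and error otherwise.
// Stopping on the first error-free attempt is an assumption of this sketch.
func withRetries(retries int, attempt func() (string, error)) (string, error) {
	var out string
	var err error
	for i := 0; i < retries; i++ {
		out, err = attempt()
		if err == nil {
			return out, nil
		}
	}
	return out, err
}

// flaky returns an attempt function that fails the given number of times
// before succeeding, together with a pointer to its call counter.
func flaky(failures int) (func() (string, error), *int) {
	calls := 0
	return func() (string, error) {
		calls++
		if calls <= failures {
			return "", errors.New("transient error")
		}
		return "ok", nil
	}, &calls
}

func main() {
	attempt, calls := flaky(2)
	out, err := withRetries(3, attempt)
	fmt.Println(out, err, *calls) // ok <nil> 3
}
```

With a retry budget of 3, two failing attempts followed by a successful one use the whole budget; a fourth attempt is never made, whatever the outcome.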

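In the runProbe function we finally find the field that corresponds to timeoutSeconds, p.TimeoutSeconds. It is passed as the timeout to each kind of probe command; for an httpGet probe, for example, it becomes the request timeout of the HTTP client: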

func (pb *prober) runProbe(ctx context.Context, probeType probeType, p *v1.Probe, pod *v1.Pod, status v1.PodStatus, container v1.Container, containerID kubecontainer.ContainerID) (probe.Result, string, error) {
	timeout := time.Duration(p.TimeoutSeconds) * time.Second
	if p.Exec != nil {
		command := kubecontainer.ExpandContainerCommandOnlyStatic(p.Exec.Command, container.Env)
		return pb.exec.Probe(pb.newExecInContainer(ctx, container, containerID, command, timeout))
	}
	if p.HTTPGet != nil {
		req, err := httpprobe.NewRequestForHTTPGetAction(p.HTTPGet, &container, status.PodIP, "probe")
		...
		return pb.http.Probe(req, timeout)
	}
	if p.TCPSocket != nil {
		port, err := probe.ResolveContainerPort(p.TCPSocket.Port, &container)
		...
		host := p.TCPSocket.Host
		if host == "" {
			host = status.PodIP
		}
		return pb.tcp.Probe(host, port, timeout)
	}
	if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.GRPCContainerProbe) && p.GRPC != nil {
		host := status.PodIP
		service := ""
		if p.GRPC.Service != nil {
			service = *p.GRPC.Service
		}
		return pb.grpc.Probe(host, service, int(p.GRPC.Port), timeout)
	}
	...
}
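
To see what a per-request timeout means for an HTTP check, here is a rough standalone sketch built on the Go standard library. It is not the kubelet's HTTP prober: httpProbe and slowServer are hypothetical helpers, the timeout is given in milliseconds to keep the demo short, and treating any status from 200 to 399 as success is a choice made for this sketch.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// httpProbe reports whether a GET to url returns a 2xx or 3xx status
// before the timeout expires.
func httpProbe(url string, timeoutMillis int) bool {
	client := &http.Client{Timeout: time.Duration(timeoutMillis) * time.Millisecond}
	resp, err := client.Get(url)
	if err != nil {
		return false // covers connection errors and the timeout case
	}
	defer resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode < 400
}

// slowServer starts a local test server that answers 200 after delayMillis.
func slowServer(delayMillis int) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(time.Duration(delayMillis) * time.Millisecond)
		w.WriteHeader(http.StatusOK)
	}))
}

func main() {
	srv := slowServer(200)
	defer srv.Close()
	fmt.Println(httpProbe(srv.URL, 50))   // timeout shorter than the server delay
	fmt.Println(httpProbe(srv.URL, 1000)) // timeout longer than the server delay
}
```

The timeout bounds a single request only. How often such a request is issued is decided elsewhere, by the ticker in the run loop.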

At this point we can piece together how periodSeconds and timeoutSeconds relate. The logic is roughly equivalent to the following code.

probeTicker := time.NewTicker(time.Duration(periodSeconds) * time.Second)
for {
	select {
	case <-probeTicker.C:
		// At most maxProbeRetries (3) attempts per tick.
		for i := 0; i < 3; i++ {
			if ok := probe(timeoutSeconds); ok {
				break
			}
		}
	}
}

Summary

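periodSeconds sets the period of the ticker that triggers the probe command, while timeoutSeconds is the timeout passed to the probe command.
There is no fixed relationship between timeoutSeconds and periodSeconds. If timeoutSeconds=10s and periodSeconds=5s, the current probe interval can be any value in [5s, 30s); it is not equal to timeoutSeconds, as the post quoted above claims. (That post was written 3 years ago; I checked the v1.19.10 code, and its logic is the same as in the current version.)
Most health checks are not very complex, for example checking whether a file exists or whether the service's /healthz HTTP endpoint is reachable, so it is advisable to set timeoutSeconds to a reasonable value smaller than periodSeconds.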
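
The interval range in the second point can be reproduced with a small arithmetic model. The sketch below follows the article's reading, in which an attempt that hits the timeout is retried within the same cycle; whether a given probe type reports a timeout in a way that triggers a retry is not shown in the excerpts, so that part is an assumption. The function name and parameters are hypothetical, and the cycle is assumed to start exactly on a tick.

```go
package main

import "fmt"

// nextProbeGap returns the number of seconds between the start of one probe
// cycle and the start of the next, for a cycle that begins on a tick.
// attemptSeconds lists how long each attempt would take if it were never cut
// off; an attempt that reaches timeoutSeconds is cut off and retried.
func nextProbeGap(periodSeconds, timeoutSeconds, maxRetries int, attemptSeconds []int) int {
	busy := 0
	for i := 0; i < maxRetries && i < len(attemptSeconds); i++ {
		d := attemptSeconds[i]
		if d < timeoutSeconds {
			busy += d // attempt finished in time, no further retries
			break
		}
		busy += timeoutSeconds // attempt hit the timeout, try again
	}
	if busy < periodSeconds {
		return periodSeconds // the loop waits for the next tick
	}
	return busy // a tick is already pending, so the next cycle starts at once
}

func main() {
	fmt.Println(nextProbeGap(5, 10, 3, []int{1}))          // 5
	fmt.Println(nextProbeGap(5, 10, 3, []int{60, 2}))      // 12
	fmt.Println(nextProbeGap(5, 10, 3, []int{60, 60, 60})) // 30
}
```

In this model the gap is bounded below by periodSeconds and above by three times timeoutSeconds, and it can land anywhere in between depending on how the individual attempts behave.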

The relationship between `failureThreshold/successThreshold` and `maxProbeRetries`

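maxProbeRetries defines the maximum number of attempts at the probe command within one probe period;
If the probe command returns success within a probe period, the count of consecutive successes, which is compared against successThreshold, increases by 1; otherwise the count of consecutive failures, which is compared against failureThreshold, increases by 1.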
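
The difference between the two levels (attempts within one cycle versus consecutive cycle results) may be easier to see with a small model of the threshold side. The tracker type below is a hypothetical simplification written for this article, not the kubelet's implementation: each call to observe stands for the final result of one probe cycle, after any retries.

```go
package main

import "fmt"

// tracker is a simplified model of threshold handling: the reported state
// flips only after enough consecutive cycle results of the other kind.
type tracker struct {
	healthy          bool
	lastOK           bool
	run              int // length of the current streak of identical results
	successThreshold int
	failureThreshold int
}

// observe records the outcome of one probe cycle and returns the reported state.
func (tr *tracker) observe(ok bool) bool {
	if ok == tr.lastOK {
		tr.run++
	} else {
		tr.lastOK = ok
		tr.run = 1 // a different result starts a new streak
	}
	if ok && tr.run >= tr.successThreshold {
		tr.healthy = true
	}
	if !ok && tr.run >= tr.failureThreshold {
		tr.healthy = false
	}
	return tr.healthy
}

func main() {
	tr := &tracker{healthy: true, lastOK: true, successThreshold: 1, failureThreshold: 3}
	for _, ok := range []bool{false, false, true, false, false, false} {
		fmt.Print(tr.observe(ok), " ")
	}
	fmt.Println() // only the sixth result reports false
}
```

With failureThreshold set to 3, two failed cycles followed by a success reset the streak, so the state only changes after three failed cycles in a row.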

Reposted from: charlieroro
Original link: https://www.cnblogs.com/charlieroro/p/18294255