kubernetes/k8s CRI Analysis: How kubelet Deletes a Pod

Published: 1 hour ago

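Related posts: "kubernetes/k8s CRI Analysis: The Container Runtime Interface" https://xie.infoq.cn/article/0d2bc16c22af9fc710b9f9c30

"kubernetes/k8s CRI Analysis: How kubelet Creates a Pod" https://xie.infoq.cn/article/d86b446f69ed5ab769aa6d3c9

The earlier posts introduced CRI and then analyzed kubelet's CRI-related source code in four parts: the CRI-related startup flags of the kubelet component, the CRI-related interfaces and structs, CRI-related initialization, and how kubelet calls CRI to create a pod. If you have not read them yet, follow the links above.

[Figure: the CRI architecture diagram from the earlier posts, reproduced here.]

This post analyzes how kubelet calls CRI to delete a pod.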

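CRI-related source code analysis in kubelet

The analysis of kubelet's CRI source code consists of the following parts:

(1) kubelet CRI-related startup flags;

(2) kubelet CRI-related interfaces and structs;

(3) kubelet CRI initialization;

(4) how kubelet calls CRI to create a pod;

(5) how kubelet calls CRI to delete a pod.

The previous two posts covered the first four parts; this post covers the fifth, how kubelet calls CRI to delete a pod.

Based on tag v1.17.4

https://github.com/kubernetes/kubernetes/releases/tag/v1.17.4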

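5. How kubelet calls CRI to delete a pod

Call flow for deleting a pod through CRI

The following uses the kubelet dockershim pod-deletion call flow as the example.

kubelet calls dockershim to stop containers; dockershim in turn calls docker to stop the containers and calls CNI to tear down the pod network.

Figure 1: kubelet dockershim pod-deletion call flow

dockershim is the CRI shim built into kubelet. The pod-deletion call flow of any other (remote) CRI shim is essentially the same as dockershim's. The only difference is that a different container engine operates on the containers; it is still the CRI shim that calls CNI to tear down the pod network.

The detailed source code analysis follows.

Start with the KillPod method of kubeGenericRuntimeManager, which is where the logic that calls CRI to delete a pod is triggered.

The method's code shows that kubelet deletes a pod as follows:

(1) first stop all containers that belong to the pod;

(2) then stop the pod sandbox container.

Note: this step only stops the containers; removing them is left to kubelet's garbage collection (gc).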

// pkg/kubelet/kuberuntime/kuberuntime_manager.go

// KillPod kills all the containers of a pod. Pod may be nil, running pod must not be.
// gracePeriodOverride if specified allows the caller to override the pod default grace period.
// only hard kill paths are allowed to specify a gracePeriodOverride in the kubelet in order to not corrupt user data.
// it is useful when doing SIGKILL for hard eviction scenarios, or max grace period during soft eviction scenarios.
func (m *kubeGenericRuntimeManager) KillPod(pod *v1.Pod, runningPod kubecontainer.Pod, gracePeriodOverride *int64) error {
    err := m.killPodWithSyncResult(pod, runningPod, gracePeriodOverride)
    return err.Error()
}

// killPodWithSyncResult kills a runningPod and returns SyncResult.
// Note: The pod passed in could be *nil* when kubelet restarted.
func (m *kubeGenericRuntimeManager) killPodWithSyncResult(pod *v1.Pod, runningPod kubecontainer.Pod, gracePeriodOverride *int64) (result kubecontainer.PodSyncResult) {
    killContainerResults := m.killContainersWithSyncResult(pod, runningPod, gracePeriodOverride)
    for _, containerResult := range killContainerResults {
        result.AddSyncResult(containerResult)
    }

    // stop sandbox, the sandbox will be removed in GarbageCollect
    killSandboxResult := kubecontainer.NewSyncResult(kubecontainer.KillPodSandbox, runningPod.ID)
    result.AddSyncResult(killSandboxResult)
    // Stop all sandboxes belongs to same pod
    for _, podSandbox := range runningPod.Sandboxes {
        if err := m.runtimeService.StopPodSandbox(podSandbox.ID.ID); err != nil {
            killSandboxResult.Fail(kubecontainer.ErrKillPodSandbox, err.Error())
            klog.Errorf("Failed to stop sandbox %q", podSandbox.ID)
        }
    }

    return
}
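
The ordering in KillPod can be tried out in isolation. The following is a minimal sketch, not kubelet code: the podRuntime interface, the recorder fake, and the killPod function are hypothetical names introduced for illustration, and containers are stopped sequentially here for simplicity (kubelet stops them concurrently, as section 5.1 shows). It only demonstrates that containers are stopped before sandboxes and that a failure is recorded without aborting the remaining steps.

```go
package main

import "fmt"

// podRuntime is a hypothetical stand-in for the two CRI calls used here.
// It is not the real kubelet RuntimeService interface.
type podRuntime interface {
	StopContainer(id string) error
	StopPodSandbox(id string) error
}

// recorder is a fake runtime that logs calls in the order they arrive.
type recorder struct {
	calls   []string
	failing map[string]bool // IDs whose stop call should fail
}

func (r *recorder) StopContainer(id string) error {
	r.calls = append(r.calls, "container:"+id)
	if r.failing[id] {
		return fmt.Errorf("stop container %s failed", id)
	}
	return nil
}

func (r *recorder) StopPodSandbox(id string) error {
	r.calls = append(r.calls, "sandbox:"+id)
	if r.failing[id] {
		return fmt.Errorf("stop sandbox %s failed", id)
	}
	return nil
}

// killPod stops every container first and every sandbox second.
// Errors are collected instead of returned early, so a failing
// container does not prevent the sandbox from being stopped.
func killPod(rt podRuntime, containers, sandboxes []string) []error {
	var errs []error
	for _, c := range containers {
		if err := rt.StopContainer(c); err != nil {
			errs = append(errs, err)
		}
	}
	for _, s := range sandboxes {
		if err := rt.StopPodSandbox(s); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}

func main() {
	rt := &recorder{failing: map[string]bool{"app": true}}
	errs := killPod(rt, []string{"app", "sidecar"}, []string{"sandbox-1"})
	fmt.Println("calls:", rt.calls)
	fmt.Println("errors:", errs)
}
```

Nothing in the sketch removes a container or sandbox, which mirrors the note above: removal is left to garbage collection.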

5.1 m.killContainersWithSyncResult

m.killContainersWithSyncResult stops all containers that belong to the pod.

Main logic: start one goroutine per container, each of which calls m.killContainer to stop its container.

// pkg/kubelet/kuberuntime/kuberuntime_container.go

// killContainersWithSyncResult kills all pod's containers with sync results.
func (m *kubeGenericRuntimeManager) killContainersWithSyncResult(pod *v1.Pod, runningPod kubecontainer.Pod, gracePeriodOverride *int64) (syncResults []*kubecontainer.SyncResult) {
    containerResults := make(chan *kubecontainer.SyncResult, len(runningPod.Containers))
    wg := sync.WaitGroup{}

    wg.Add(len(runningPod.Containers))
    for _, container := range runningPod.Containers {
        go func(container *kubecontainer.Container) {
            defer utilruntime.HandleCrash()
            defer wg.Done()

            killContainerResult := kubecontainer.NewSyncResult(kubecontainer.KillContainer, container.Name)
            if err := m.killContainer(pod, container.ID, container.Name, "", gracePeriodOverride); err != nil {
                killContainerResult.Fail(kubecontainer.ErrKillContainer, err.Error())
            }
            containerResults <- killContainerResult
        }(container)
    }
    wg.Wait()
    close(containerResults)

    for containerResult := range containerResults {
        syncResults = append(syncResults, containerResult)
    }
    return
}
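
The concurrency pattern used above (one goroutine per container, a channel buffered to the number of containers, a WaitGroup, then close and drain) works independently of kubelet. The sketch below reproduces only that pattern; result, stopAll, and errStuck are hypothetical names, and the stop callback stands in for m.killContainer.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// errStuck is an illustrative error for a container that refuses to stop.
var errStuck = errors.New("container did not stop")

// result is a simplified, hypothetical stand-in for kubecontainer.SyncResult.
type result struct {
	Name string
	Err  error
}

// stopAll calls stop once per name, each in its own goroutine,
// and returns one result per name (in no particular order).
func stopAll(names []string, stop func(name string) error) []result {
	// The buffer holds one slot per goroutine, so no sender ever blocks
	// even though nothing reads until all goroutines have finished.
	results := make(chan result, len(names))
	var wg sync.WaitGroup
	wg.Add(len(names))
	for _, name := range names {
		go func(name string) {
			defer wg.Done()
			results <- result{Name: name, Err: stop(name)}
		}(name)
	}
	wg.Wait()
	close(results) // safe: every sender is done

	var out []result
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	out := stopAll([]string{"a", "b", "c"}, func(name string) error {
		if name == "b" {
			return errStuck
		}
		return nil
	})
	// Sort only to make the printed order stable.
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	for _, r := range out {
		fmt.Println(r.Name, r.Err)
	}
}
```

Sizing the channel buffer to the number of goroutines is what allows wg.Wait() to run before any result is read; with an unbuffered channel the senders would block and the wait would never return.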

5.1.1 m.killContainer

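m.killContainer mainly calls m.runtimeService.StopContainer.

runtimeService is a RemoteRuntimeService. It implements the RuntimeService interface (the client side of the CRI shim's container runtime interface) and holds the client that communicates with the CRI shim's container runtime server. Calling m.runtimeService.StopContainer therefore amounts to calling the StopContainer method on the CRI shim server, which stops the container.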

// pkg/kubelet/kuberuntime/kuberuntime_container.go

// killContainer kills a container through the following steps:
// * Run the pre-stop lifecycle hooks (if applicable).
// * Stop the container.
func (m *kubeGenericRuntimeManager) killContainer(pod *v1.Pod, containerID kubecontainer.ContainerID, containerName string, message string, gracePeriodOverride *int64) error {
    ...

    klog.V(2).Infof("Killing container %q with %d second grace period", containerID.String(), gracePeriod)

    err := m.runtimeService.StopContainer(containerID.ID, gracePeriod)
    if err != nil {
        klog.Errorf("Container %q termination failed with gracePeriod %d: %v", containerID.String(), gracePeriod, err)
    } else {
        klog.V(3).Infof("Container %q exited normally", containerID.String())
    }

    m.containerRefManager.ClearRef(containerID)

    return err
}

m.runtimeService.StopContainer

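m.runtimeService.StopContainer calls r.runtimeClient.StopContainer; that is, it uses the CRI shim client to ask the CRI shim server to stop the container.

This completes the CRI-related calls inside kubelet. The analysis now moves into the CRI shim (using dockershim, the CRI shim built into kubelet, as the example) to see how the container is stopped.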

// pkg/kubelet/remote/remote_runtime.go

// StopContainer stops a running container with a grace period (i.e., timeout).
func (r *RemoteRuntimeService) StopContainer(containerID string, timeout int64) error {
    // Use timeout + default timeout (2 minutes) as timeout to leave extra time
    // for SIGKILL container and request latency.
    t := r.timeout + time.Duration(timeout)*time.Second
    ctx, cancel := getContextWithTimeout(t)
    defer cancel()

    r.logReduction.ClearID(containerID)
    _, err := r.runtimeClient.StopContainer(ctx, &runtimeapi.StopContainerRequest{
        ContainerId: containerID,
        Timeout:     timeout,
    })
    if err != nil {
        klog.Errorf("StopContainer %q from runtime service failed: %v", containerID, err)
        return err
    }

    return nil
}
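
The deadline arithmetic in StopContainer is worth a closer look: the gRPC call must outlive the container's grace period, so the grace period is added on top of the client's base timeout. The sketch below shows only that calculation. rpcTimeout is a hypothetical helper, and the 2-minute base is taken from the comment in the source above rather than from any configuration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// rpcTimeout returns the deadline budget for a stop request:
// the client's base timeout plus the container's grace period.
func rpcTimeout(base time.Duration, graceSeconds int64) time.Duration {
	return base + time.Duration(graceSeconds)*time.Second
}

func main() {
	base := 2 * time.Minute // assumed base timeout, per the source comment
	t := rpcTimeout(base, 30)

	// The context expires only after the grace period plus the base timeout,
	// leaving room for the runtime to send SIGKILL and reply.
	ctx, cancel := context.WithTimeout(context.Background(), t)
	defer cancel()

	deadline, ok := ctx.Deadline()
	fmt.Println("budget:", t, "deadline set:", ok, "remaining <= budget:", time.Until(deadline) <= t)
}
```

If the request deadline equaled the grace period alone, a container that uses its whole grace period would cause the call to time out on the kubelet side even though the runtime stopped it correctly.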

5.1.2 r.runtimeClient.StopContainer

Taking dockershim as the example, the analysis now enters the CRI shim to see how the container is stopped.

The r.runtimeClient.StopContainer call made by kubelet above lands in dockershim's StopContainer method.

// pkg/kubelet/dockershim/docker_container.go

// StopContainer stops a running container with a grace period (i.e., timeout).
func (ds *dockerService) StopContainer(_ context.Context, r *runtimeapi.StopContainerRequest) (*runtimeapi.StopContainerResponse, error) {
    err := ds.client.StopContainer(r.ContainerId, time.Duration(r.Timeout)*time.Second)
    if err != nil {
        return nil, err
    }
    return &runtimeapi.StopContainerResponse{}, nil
}

ds.client.StopContainer

It mainly calls d.client.ContainerStop.

// pkg/kubelet/dockershim/libdocker/kube_docker_client.go

// Stopping an already stopped container will not cause an error in dockerapi.
func (d *kubeDockerClient) StopContainer(id string, timeout time.Duration) error {
    ctx, cancel := d.getCustomTimeoutContext(timeout)
    defer cancel()
    err := d.client.ContainerStop(ctx, id, &timeout)
    if ctxErr := contextError(ctx); ctxErr != nil {
        return ctxErr
    }
    return err
}

d.client.ContainerStop

It builds the request parameters and sends an HTTP POST request to the docker endpoint for stopping the container.

// vendor/github.com/docker/docker/client/container_stop.go

// ContainerStop stops a container. In case the container fails to stop
// gracefully within a time frame specified by the timeout argument,
// it is forcefully terminated (killed).
//
// If the timeout is nil, the container's StopTimeout value is used, if set,
// otherwise the engine default. A negative timeout value can be specified,
// meaning no timeout, i.e. no forceful termination is performed.
func (cli *Client) ContainerStop(ctx context.Context, containerID string, timeout *time.Duration) error {
    query := url.Values{}
    if timeout != nil {
        query.Set("t", timetypes.DurationToSecondsString(*timeout))
    }
    resp, err := cli.post(ctx, "/containers/"+containerID+"/stop", query, nil, nil)
    ensureReaderClosed(resp)
    return err
}
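
To make the request shape concrete, the sketch below builds the same path and query string with the standard library only. stopPath and seconds are hypothetical helpers, and formatting the timeout as whole seconds with strconv.FormatFloat is an assumption about what timetypes.DurationToSecondsString produces, not something shown in the source above.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// seconds returns a pointer to a duration of n seconds (illustrative helper).
func seconds(n int64) *time.Duration {
	d := time.Duration(n) * time.Second
	return &d
}

// stopPath builds the path and query of a container stop request.
// A nil timeout omits the "t" parameter, leaving the choice to the engine.
func stopPath(containerID string, timeout *time.Duration) string {
	query := url.Values{}
	if timeout != nil {
		// Assumption: the timeout is sent as whole seconds.
		query.Set("t", strconv.FormatFloat(timeout.Seconds(), 'f', 0, 64))
	}
	path := "/containers/" + containerID + "/stop"
	if encoded := query.Encode(); encoded != "" {
		path += "?" + encoded
	}
	return path
}

func main() {
	fmt.Println(stopPath("abc123", seconds(10)))
	fmt.Println(stopPath("abc123", nil))
}
```

Passing the timeout as a pointer is what lets the caller distinguish "no timeout supplied" (nil) from an explicit value, including a negative one.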

5.2 m.runtimeService.StopPodSandbox

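In m.runtimeService.StopPodSandbox, runtimeService is again a RemoteRuntimeService. It implements the RuntimeService interface (the client side of the CRI shim's container runtime interface) and holds the client that communicates with the CRI shim's container runtime server. Calling m.runtimeService.StopPodSandbox therefore amounts to calling the StopPodSandbox method on the CRI shim server, which stops the pod sandbox.

This completes the CRI-related calls inside kubelet. The analysis now moves into the CRI shim (using dockershim, the CRI shim built into kubelet, as the example) to see how the pod sandbox is stopped.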

// pkg/kubelet/remote/remote_runtime.go

// StopPodSandbox stops the sandbox. If there are any running containers in the
// sandbox, they should be forced to termination.
func (r *RemoteRuntimeService) StopPodSandbox(podSandBoxID string) error {
    ctx, cancel := getContextWithTimeout(r.timeout)
    defer cancel()

    _, err := r.runtimeClient.StopPodSandbox(ctx, &runtimeapi.StopPodSandboxRequest{
        PodSandboxId: podSandBoxID,
    })
    if err != nil {
        klog.Errorf("StopPodSandbox %q from runtime service failed: %v", podSandBoxID, err)
        return err
    }

    return nil
}

5.2.1 r.runtimeClient.StopPodSandbox

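Taking dockershim as the example, the analysis now enters the CRI shim to see how the pod sandbox is stopped.

The r.runtimeClient.StopPodSandbox call made by kubelet above lands in dockershim's StopPodSandbox method.

Stopping a pod sandbox consists of two main steps:

(1) call ds.network.TearDownPod to tear down the pod network;

(2) call ds.client.StopContainer to stop the pod sandbox container.

Note that stopping the pod sandbox counts as successful only when both steps succeed. The order in which the two steps succeed does not matter.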

// pkg/kubelet/dockershim/docker_sandbox.go

// StopPodSandbox stops the sandbox. If there are any running containers in the
// sandbox, they should be force terminated.
// TODO: This function blocks sandbox teardown on networking teardown. Is it
// better to cut our losses assuming an out of band GC routine will cleanup
// after us?
func (ds *dockerService) StopPodSandbox(ctx context.Context, r *runtimeapi.StopPodSandboxRequest) (*runtimeapi.StopPodSandboxResponse, error) {
    var namespace, name string
    var hostNetwork bool

    podSandboxID := r.PodSandboxId
    resp := &runtimeapi.StopPodSandboxResponse{}

    // Try to retrieve minimal sandbox information from docker daemon or sandbox checkpoint.
    inspectResult, metadata, statusErr := ds.getPodSandboxDetails(podSandboxID)
    if statusErr == nil {
        namespace = metadata.Namespace
        name = metadata.Name
        hostNetwork = (networkNamespaceMode(inspectResult) == runtimeapi.NamespaceMode_NODE)
    } else {
        checkpoint := NewPodSandboxCheckpoint("", "", &CheckpointData{})
        checkpointErr := ds.checkpointManager.GetCheckpoint(podSandboxID, checkpoint)

        // Proceed if both sandbox container and checkpoint could not be found. This means that following
        // actions will only have sandbox ID and not have pod namespace and name information.
        // Return error if encounter any unexpected error.
        if checkpointErr != nil {
            if checkpointErr != errors.ErrCheckpointNotFound {
                err := ds.checkpointManager.RemoveCheckpoint(podSandboxID)
                if err != nil {
                    klog.Errorf("Failed to delete corrupt checkpoint for sandbox %q: %v", podSandboxID, err)
                }
            }
            if libdocker.IsContainerNotFoundError(statusErr) {
                klog.Warningf("Both sandbox container and checkpoint for id %q could not be found. "+
                    "Proceed without further sandbox information.", podSandboxID)
            } else {
                return nil, utilerrors.NewAggregate([]error{
                    fmt.Errorf("failed to get checkpoint for sandbox %q: %v", podSandboxID, checkpointErr),
                    fmt.Errorf("failed to get sandbox status: %v", statusErr)})
            }
        } else {
            _, name, namespace, _, hostNetwork = checkpoint.GetData()
        }
    }

    // WARNING: The following operations made the following assumption:
    // 1. kubelet will retry on any error returned by StopPodSandbox.
    // 2. tearing down network and stopping sandbox container can succeed in any sequence.
    // This depends on the implementation detail of network plugin and proper error handling.
    // For kubenet, if tearing down network failed and sandbox container is stopped, kubelet
    // will retry. On retry, kubenet will not be able to retrieve network namespace of the sandbox
    // since it is stopped. With empty network namespace, CNI bridge plugin will conduct best
    // effort clean up and will not return error.
    errList := []error{}
    ready, ok := ds.getNetworkReady(podSandboxID)
    if !hostNetwork && (ready || !ok) {
        // Only tear down the pod network if we haven't done so already
        cID := kubecontainer.BuildContainerID(runtimeName, podSandboxID)
        err := ds.network.TearDownPod(namespace, name, cID)
        if err == nil {
            ds.setNetworkReady(podSandboxID, false)
        } else {
            errList = append(errList, err)
        }
    }
    if err := ds.client.StopContainer(podSandboxID, defaultSandboxGracePeriod); err != nil {
        // Do not return error if the container does not exist
        if !libdocker.IsContainerNotFoundError(err) {
            klog.Errorf("Failed to stop sandbox %q: %v", podSandboxID, err)
            errList = append(errList, err)
        } else {
            // remove the checkpoint for any sandbox that is not found in the runtime
            ds.checkpointManager.RemoveCheckpoint(podSandboxID)
        }
    }

    if len(errList) == 0 {
        return resp, nil
    }

    // TODO: Stop all running containers in the sandbox.
    return nil, utilerrors.NewAggregate(errList)
}
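
The second half of StopPodSandbox relies on two properties: both steps are always attempted, and a network that was already torn down is not torn down again when kubelet retries. The sketch below models just that behavior in memory. The sandbox type, its fields, stopSandbox, and errStopFailed are hypothetical, and the readiness check is simplified to a single flag (the real code also tears the network down when the readiness state is unknown).

```go
package main

import (
	"errors"
	"fmt"
)

// errStopFailed is an illustrative error for a failed sandbox stop.
var errStopFailed = errors.New("stop sandbox container failed")

// sandbox is a hypothetical in-memory model of one pod sandbox.
type sandbox struct {
	networkReady  bool // true until the network has been torn down
	teardownCalls int
	stopCalls     int
	stopFailures  int // number of upcoming stop attempts that fail
}

func (s *sandbox) tearDownNetwork() error {
	s.teardownCalls++
	return nil
}

func (s *sandbox) stopContainer() error {
	s.stopCalls++
	if s.stopFailures > 0 {
		s.stopFailures--
		return errStopFailed
	}
	return nil
}

// stopSandbox attempts both steps and returns every error it saw.
// An empty result means the sandbox was stopped successfully.
func stopSandbox(s *sandbox, hostNetwork bool) []error {
	var errs []error
	if !hostNetwork && s.networkReady {
		if err := s.tearDownNetwork(); err == nil {
			// Remember success so a retry skips the teardown.
			s.networkReady = false
		} else {
			errs = append(errs, err)
		}
	}
	// The container stop is attempted regardless of the teardown outcome.
	if err := s.stopContainer(); err != nil {
		errs = append(errs, err)
	}
	return errs
}

func main() {
	s := &sandbox{networkReady: true, stopFailures: 1}
	fmt.Println("first attempt errors:", len(stopSandbox(s, false)))
	fmt.Println("retry errors:", len(stopSandbox(s, false)))
	fmt.Println("teardown calls:", s.teardownCalls, "stop calls:", s.stopCalls)
}
```

Because each step records its own success, the caller can simply repeat the whole operation until no errors come back, which is the retry assumption stated in the WARNING comment of the source.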

ds.client.StopContainer

It mainly calls d.client.ContainerStop.

// pkg/kubelet/dockershim/libdocker/kube_docker_client.go

// Stopping an already stopped container will not cause an error in dockerapi.
func (d *kubeDockerClient) StopContainer(id string, timeout time.Duration) error {
    ctx, cancel := d.getCustomTimeoutContext(timeout)
    defer cancel()
    err := d.client.ContainerStop(ctx, id, &timeout)
    if ctxErr := contextError(ctx); ctxErr != nil {
        return ctxErr
    }
    return err
}

d.client.ContainerStop

It builds the request parameters and sends an HTTP POST request to the docker endpoint for stopping the pod sandbox container.

// vendor/github.com/docker/docker/client/container_stop.go

// ContainerStop stops a container. In case the container fails to stop
// gracefully within a time frame specified by the timeout argument,
// it is forcefully terminated (killed).
//
// If the timeout is nil, the container's StopTimeout value is used, if set,
// otherwise the engine default. A negative timeout value can be specified,
// meaning no timeout, i.e. no forceful termination is performed.
func (cli *Client) ContainerStop(ctx context.Context, containerID string, timeout *time.Duration) error {
    query := url.Values{}
    if timeout != nil {
        query.Set("t", timetypes.DurationToSecondsString(*timeout))
    }
    resp, err := cli.post(ctx, "/containers/"+containerID+"/stop", query, nil, nil)
    ensureReaderClosed(resp)
    return err
}

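Summary

CRI architecture diagram

Under CRI there are two kinds of container runtime implementations:

(1) dockershim, built into kubelet, which supports the Docker container engine and CNI network plugins (including kubenet). The dockershim code lives inside kubelet and is invoked by it; dockershim runs as a separate server that acts as the CRI shim and exposes a gRPC server to kubelet;

(2) external container runtimes, which support container engines such as rkt and containerd.

How kubelet calls CRI to delete a pod

kubelet deletes a pod as follows:

(1) first stop all containers that belong to the pod;

(2) then stop the pod sandbox container (which includes tearing down the pod network).

Note: this only stops the containers; removing them is left to kubelet's garbage collection (gc).

Call flow for deleting a pod through CRI

Using the kubelet dockershim pod-deletion call flow as the example:

kubelet calls dockershim to stop containers; dockershim in turn calls docker to stop the containers and calls CNI to tear down the pod network.

Figure 1: kubelet dockershim pod-deletion call flow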

Original link: [http://xie.infoq.cn/article/eb4bd8baede86822af16f912c]. Please contact the author before reposting.

良凯尔
