Exploring cri-o Internals, Part 1


Published: April 18, 2021

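Overview:

Docker is the best-known container technology, but the container wave has also produced a number of other open-source projects, and cri-o is a notable one. cri-o became a CNCF incubating project in 2019 (https://www.cncf.io/blog/2019/04/08/cncf-to-host-cri-o/). It implements the Kubernetes Container Runtime Interface (CRI), so it integrates with Kubernetes directly and can replace docker in production. Unlike docker, cri-o does not try to cover every docker feature; it concentrates on running containers efficiently, quickly, and stably under Kubernetes. Work outside that core, such as building images or hosting them, is out of scope, and tools such as docker and podman can still be used for it.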

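Architecture:

At a high level, cri-o exposes a gRPC API that the kubelet calls to pull images and to configure and run containers. Internally it relies on mature open-source components for the general-purpose work rather than reinventing the wheel, for example: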

https://github.com/containers/storage
https://github.com/containers/image
...

By wrapping these existing open-source components, cri-o builds its own main parts:

image service manages images. It implements the following interface:

internal/storage/image.go:114

// ImageServer wraps up various CRI-related activities into a reusable
// implementation.
type ImageServer interface {
	// ListImages returns list of all images which match the filter.
	ListImages(systemContext *types.SystemContext, filter string) ([]ImageResult, error)
	// ImageStatus returns status of an image which matches the filter.
	ImageStatus(systemContext *types.SystemContext, filter string) (*ImageResult, error)
	// PrepareImage returns an Image where the config digest can be grabbed
	// for further analysis. Call Close() on the resulting image.
	PrepareImage(systemContext *types.SystemContext, imageName string) (types.ImageCloser, error)
	// PullImage imports an image from the specified location.
	PullImage(systemContext *types.SystemContext, imageName string, options *ImageCopyOptions) (types.ImageReference, error)
	// UntagImage removes a name from the specified image, and if it was
	// the only name the image had, removes the image.
	UntagImage(systemContext *types.SystemContext, imageName string) error
	// GetStore returns the reference to the storage library Store which
	// the image server uses to hold images, and is the destination used
	// when it's asked to pull an image.
	GetStore() storage.Store
	// ResolveNames takes an image reference and if it's unqualified (w/o hostname),
	// it uses crio's default registries to qualify it.
	ResolveNames(systemContext *types.SystemContext, imageName string) ([]string, error)
}


runtime service manages containers. It implements the following interface:

internal/storage/runtime.go:59

// RuntimeServer wraps up various CRI-related activities into a reusable
// implementation.
type RuntimeServer interface {
	// CreatePodSandbox creates a pod infrastructure container, using the
	// specified PodID for the infrastructure container's ID.  In the CRI
	// view of things, a sandbox is distinct from its containers, including
	// its infrastructure container, but at this level the sandbox is
	// essentially the same as its infrastructure container, with a
	// container's membership in a pod being signified by it listing the
	// same pod ID in its metadata that the pod's other members do, and
	// with the pod's infrastructure container having the same value for
	// both its pod's ID and its container ID.
	// Pointer arguments can be nil.  Either the image name or ID can be
	// omitted, but not both.  All other arguments are required.
	CreatePodSandbox(systemContext *types.SystemContext, podName, podID, imageName, imageAuthFile, imageID, containerName, metadataName, uid, namespace string, attempt uint32, idMappingsOptions *storage.IDMappingOptions, labelOptions []string, privileged bool) (ContainerInfo, error)
	// RemovePodSandbox deletes a pod sandbox's infrastructure container.
	// The CRI expects that a sandbox can't be removed unless its only
	// container is its infrastructure container, but we don't enforce that
	// here, since we're just keeping track of it for higher level APIs.
	RemovePodSandbox(idOrName string) error

	// GetContainerMetadata returns the metadata we've stored for a container.
	GetContainerMetadata(idOrName string) (RuntimeContainerMetadata, error)
	// SetContainerMetadata updates the metadata we've stored for a container.
	SetContainerMetadata(idOrName string, metadata *RuntimeContainerMetadata) error

	// CreateContainer creates a container with the specified ID.
	// Pointer arguments can be nil.  Either the image name or ID can be
	// omitted, but not both.  All other arguments are required.
	CreateContainer(systemContext *types.SystemContext, podName, podID, imageName, imageID, containerName, containerID, metadataName string, attempt uint32, idMappingsOptions *storage.IDMappingOptions, labelOptions []string, privileged bool) (ContainerInfo, error)
	// DeleteContainer deletes a container, unmounting it first if need be.
	DeleteContainer(idOrName string) error

	// StartContainer makes sure a container's filesystem is mounted, and
	// returns the location of its root filesystem, which is not guaranteed
	// by lower-level drivers to never change.
	StartContainer(idOrName string) (string, error)
	// StopContainer attempts to unmount a container's root filesystem,
	// freeing up any kernel resources which may be limited.
	StopContainer(idOrName string) error

	// GetWorkDir returns the path of a nonvolatile directory on the
	// filesystem (somewhere under the Store's Root directory) which can be
	// used to store arbitrary data that is specific to the container.  It
	// will be removed automatically when the container is deleted.
	GetWorkDir(id string) (string, error)
	// GetRunDir returns the path of a volatile directory (does not survive
	// the host rebooting, somewhere under the Store's RunRoot directory)
	// on the filesystem which can be used to store arbitrary data that is
	// specific to the container.  It will be removed automatically when
	// the container is deleted.
	GetRunDir(id string) (string, error)
}


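OCI generate refers to the data structure defined in github.com/opencontainers/runtime-tools/generate/generate.go:29. OCI stands for the Open Container Initiative standard. The generator produces an OCI-compliant configuration file that describes the container's runtime environment.
CNI stands for Container Network Interface. cri-o uses the CNI interface to work with various network plugins.
conmon sits between cri-o and the OCI runtime (runc or crun). When a container is created, its process is forked through conmon, so the container processes keep running even if the cri-o process dies for some reason.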

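As the architecture diagram shows, the container runtime part of cri-o follows the Kubernetes design and uses the pod concept. One or more container processes can run inside a pod. When a pod is created, an infra container (also called a pause container elsewhere) is used to hold the network namespace.

The container creation process, as depicted in the diagram, is as follows:

The kubelet calls the CRI interface.
Because Kubernetes is configured to use cri-o, the call goes to the gRPC interface of the cri-o daemon.
The cri-o daemon uses containers/image and containers/storage to pull images and to manage the images on disk.
The cri-o daemon uses containers/image to pull the image from a registry.
cri-o uses a compatible OCI runtime (usually runc) to start the container process.
runc uses the files in the image and the related configuration to have the Linux kernel start the container process in the appropriate namespaces and cgroups.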

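Key technical points:

Two concepts come up in almost any discussion of container technology: cgroups and namespaces. Together they are the foundation of containers. cgroups are the Linux mechanism for controlling how much of each resource (CPU, memory, and so on) a process may use. Namespaces partition processes into groups so that processes in different groups cannot see each other. The sections below look at how cri-o works with cgroups and namespaces.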
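
To make the cgroup half a little more concrete: on Linux, a process's cgroup membership can be read from /proc/<pid>/cgroup, one line per hierarchy in the form hierarchy-ID:controller-list:path. The sketch below parses that line format. The type and function names are invented for this example, and the sample paths are made up rather than taken from a real node.

```go
package main

import (
	"fmt"
	"strings"
)

// cgroupEntry is a hypothetical holder for one line of /proc/<pid>/cgroup.
type cgroupEntry struct {
	HierarchyID string
	Controllers []string
	Path        string
}

// parseCgroupLine splits "hierarchy-ID:controller-list:path" into its parts.
// On cgroup v2 the controller list is empty, e.g. "0::/some/path".
func parseCgroupLine(line string) (cgroupEntry, error) {
	parts := strings.SplitN(strings.TrimSpace(line), ":", 3)
	if len(parts) != 3 {
		return cgroupEntry{}, fmt.Errorf("malformed cgroup line %q", line)
	}
	var controllers []string
	if parts[1] != "" {
		controllers = strings.Split(parts[1], ",")
	}
	return cgroupEntry{HierarchyID: parts[0], Controllers: controllers, Path: parts[2]}, nil
}

func main() {
	samples := []string{
		"4:cpu,cpuacct:/kubepods.slice", // cgroup v1 style, made-up path
		"0::/system.slice/crio.service", // cgroup v2 style, made-up path
	}
	for _, s := range samples {
		e, err := parseCgroupLine(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("hierarchy=%s controllers=%v path=%s\n", e.HierarchyID, e.Controllers, e.Path)
	}
}
```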

Besides cgroups and namespaces, networking is another key part of container technology: it determines how containers communicate with one another.

Cgroup:

The cri-o configuration file has three settings that are closely related to cgroups. Let's look at each in turn:

# Cgroup management implementation used for the runtime.
cgroup_manager = "systemd"

# Specify whether the image pull must be performed in a separate cgroup.
separate_pull_cgroup = ""

# infra_ctr_cpuset determines what CPUs will be used to run infra containers.
# You can use linux CPU list format to specify desired CPUs.
# To get better isolation for guaranteed pods, set this parameter to be equal to kubelet reserved-cpus.
infra_ctr_cpuset = ""


1. cgroup_manager:

cri-o has two cgroup manager implementations (elsewhere also called cgroup drivers):

systemd
cgroupfs

internal/config/cgmgr/cgmgr.go:80

// SetCgroupManager takes a string and branches on it to return
// the type of cgroup manager configured
func SetCgroupManager(cgroupManager string) (CgroupManager, error) {
	switch cgroupManager {
	case systemdCgroupManager:
		systemdMgr := SystemdManager{
			memoryPath:    cgroupMemoryPathV1,
			memoryMaxFile: cgroupMemoryMaxFileV1,
		}
		if node.CgroupIsV2() {
			systemdMgr.memoryPath = cgroupMemoryPathV2
			systemdMgr.memoryMaxFile = cgroupMemoryMaxFileV2
		}
		return &systemdMgr, nil
	case cgroupfsCgroupManager:
		cgroupfsMgr := CgroupfsManager{
			memoryPath:    cgroupMemoryPathV1,
			memoryMaxFile: cgroupMemoryMaxFileV1,
		}
		if node.CgroupIsV2() {
			cgroupfsMgr.memoryPath = cgroupMemoryPathV2
			cgroupfsMgr.memoryMaxFile = cgroupMemoryMaxFileV2
		}
		return &cgroupfsMgr, nil
	default:
		return nil, errors.Errorf("invalid cgroup manager: %s", cgroupManager)
	}
}


Both are cgroup drivers supported by runc. According to the Kubernetes documentation, using cgroupfs as the cgroup driver on a Linux distribution whose init system is systemd can leave a node with two cgroup managers at once. The node then has two views of resource usage, which can make the Kubernetes system unstable. Since most Linux distributions now use systemd, systemd is the preferred choice, and cri-o uses it as the default cgroup driver.
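
One visible difference between the two drivers is naming. With cgroupfs a cgroup is addressed by a plain directory path, while with systemd it is addressed by a slice name, and systemd nests a slice named a-b.slice under a.slice. The sketch below expands a slice name into the nested path under that convention. expandSlice is a helper written for this article (runc has a similar function of its own), and it deliberately ignores the root slice "-.slice" and escaped characters.

```go
package main

import (
	"fmt"
	"strings"
)

// expandSlice turns a systemd slice name such as "kubepods-burstable.slice"
// into the nested cgroup path systemd uses for it. This is a simplified
// sketch: the root slice and escaped characters are not handled.
func expandSlice(slice string) (string, error) {
	const suffix = ".slice"
	if !strings.HasSuffix(slice, suffix) || strings.Contains(slice, "/") {
		return "", fmt.Errorf("invalid slice name %q", slice)
	}
	name := strings.TrimSuffix(slice, suffix)
	var path, prefix string
	for _, part := range strings.Split(name, "-") {
		if part == "" {
			return "", fmt.Errorf("invalid slice name %q", slice)
		}
		// Each dash-separated component adds one directory level.
		path += "/" + prefix + part + suffix
		prefix += part + "-"
	}
	return path, nil
}

func main() {
	for _, s := range []string{"kubepods.slice", "kubepods-burstable.slice", "not-a-slice"} {
		p, err := expandSlice(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", s, p)
	}
}
```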

2. separate_pull_cgroup:

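separate_pull_cgroup is an interesting setting: it controls whether an image pull should run inside a particular cgroup. It accepts three kinds of values:

An empty string: the pull is not placed in any particular cgroup.
pod: the image pull runs in the cgroup of the container that is about to start (in the code, the sandbox's cgroup).
A specific cgroup name: the image pull runs in that fixed cgroup.

Note also that this setting only works when systemd is the cgroup driver. With any other driver the pull fails with the error "--separate-pull-cgroup is supported only with systemd". As the code shows, a fixed cgroup name must also contain ".slice"; otherwise the pull is rejected as an invalid systemd cgroup.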

// pullImage performs the actual pull operation of PullImage. Used to separate
// the pull implementation from the pullCache logic in PullImage and improve
// readability and maintainability.
func (s *Server) pullImage(ctx context.Context, pullArgs *pullArguments) (string, error) {
	...
	cgroup := ""

	if s.config.SeparatePullCgroup != "" {
		if !s.config.CgroupManager().IsSystemd() {
			return "", errors.New("--separate-pull-cgroup is supported only with systemd")
		}
		if s.config.SeparatePullCgroup == "pod" {
			cgroup = pullArgs.sandboxCgroup
		} else {
			cgroup = s.config.SeparatePullCgroup
			if !strings.Contains(cgroup, ".slice") {
				return "", fmt.Errorf("invalid systemd cgroup %q", cgroup)
			}
		}
	}
	...
}
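
The decision logic in the excerpt above can be pulled out into a small standalone function, which makes the three cases easy to try out. This is a sketch rather than cri-o's code: resolvePullCgroup and its parameters are names chosen for this example, and the sample cgroup names in main are invented.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// resolvePullCgroup mirrors the branch shown in pullImage: it returns the
// cgroup an image pull should run in, or "" when no separate cgroup is used.
// The function and parameter names are hypothetical.
func resolvePullCgroup(setting string, isSystemd bool, sandboxCgroup string) (string, error) {
	if setting == "" {
		return "", nil // pull is not moved into any cgroup
	}
	if !isSystemd {
		return "", errors.New("--separate-pull-cgroup is supported only with systemd")
	}
	if setting == "pod" {
		return sandboxCgroup, nil // reuse the sandbox's cgroup
	}
	if !strings.Contains(setting, ".slice") {
		return "", fmt.Errorf("invalid systemd cgroup %q", setting)
	}
	return setting, nil // fixed, operator-chosen slice
}

func main() {
	cases := []struct {
		setting string
		systemd bool
	}{
		{"", true},
		{"pod", true},
		{"pulls.slice", true},
		{"pulls", true},
		{"pod", false},
	}
	for _, c := range cases {
		cg, err := resolvePullCgroup(c.setting, c.systemd, "example-pod.slice")
		fmt.Printf("setting=%q systemd=%v -> cgroup=%q err=%v\n", c.setting, c.systemd, cg, err)
	}
}
```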


3. infra_ctr_cpuset

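infra_ctr_cpuset determines which CPU cores the infra container runs on. The cores are specified in the Linux CPU list format; for example, 0-4,9 means cores 0, 1, 2, 3, 4, and 9. For pods with low-latency requirements, isolating the infra container's cpuset from that of the other containers reduces the extra latency that interrupts and similar events would otherwise add to the workload containers. As the code shows, when a sandbox (that is, a pod) is created and this setting has a value, the value is placed in the infra container's generator. The generator eventually writes the configuration out as config.json, the file runc uses to run the container.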

server/sandbox_run_linux.go:288

func (s *Server) runPodSandbox(ctx context.Context, req *types.RunPodSandboxRequest) (resp *types.RunPodSandboxResponse, retErr error) {
	...
	// When infra-ctr-cpuset specified, set the infra container CPU set
	if s.config.InfraCtrCPUSet != "" {
		log.Debugf(ctx, "Set the infra container cpuset to %q", s.config.InfraCtrCPUSet)
		g.SetLinuxResourcesCPUCpus(s.config.InfraCtrCPUSet)
	}
	...
}


Namespace:

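When a pod is created, configureGeneratorForSandboxNamespaces is called to configure the namespaces. cri-o currently defines four namespace modes, POD, CONTAINER, NODE, and TARGET, of which only POD and NODE are in use. POD means the containers in a pod share the same namespace; NODE means no namespace is created, so the inside of the pod or container is visible from the host. That is why configureGeneratorForSandboxNamespaces starts by checking whether the host PID namespace is requested (hostPID) and, if so, removes the PID namespace from the generator. For the namespaces managed in POD mode, it assembles a namespace configuration and passes it to NewPodNamespaces. The implementation of NewPodNamespaces shows that the namespaces are created by invoking the pinns command. pinns is a component inside cri-o, written in C and located in the pinns directory; its core logic creates namespaces with the unshare system call.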

server/cri/types/types.go:9

	NamespaceModePOD       NamespaceMode = 0
	NamespaceModeCONTAINER NamespaceMode = 1
	NamespaceModeNODE      NamespaceMode = 2
	NamespaceModeTARGET    NamespaceMode = 3

server/sandbox_run_linux.go:1003

// configureGeneratorForSandboxNamespaces set the linux namespaces for the generator, based on whether the pod is sharing namespaces with the host,
// as well as whether CRI-O should be managing the namespace lifecycle.
// it returns a slice of cleanup funcs, all of which are the respective NamespaceRemove() for the sandbox.
// The caller should defer the cleanup funcs if there is an error, to make sure each namespace we are managing is properly cleaned up.
func (s *Server) configureGeneratorForSandboxNamespaces(hostNetwork, hostIPC, hostPID bool, idMappings *idtools.IDMappings, sysctls map[string]string, sb *libsandbox.Sandbox, g *generate.Generator) (cleanupFuncs []func() error, retErr error) {
	// Since we need a process to hold open the PID namespace, CRI-O can't manage the NS lifecycle
	if hostPID {
		if err := g.RemoveLinuxNamespace(string(spec.PIDNamespace)); err != nil {
			return nil, err
		}
	}
	namespaceConfig := &nsmgr.PodNamespacesConfig{
		Sysctls:    sysctls,
		IDMappings: idMappings,
		Namespaces: []*nsmgr.PodNamespaceConfig{
			{
				Type: nsmgr.IPCNS,
				Host: hostIPC,
			},
			{
				Type: nsmgr.NETNS,
				Host: hostNetwork,
			},
			{
				Type: nsmgr.UTSNS, // there is no option for host UTSNS
			},
		},
	}
	if idMappings != nil {
		namespaceConfig.Namespaces = append(namespaceConfig.Namespaces, &nsmgr.PodNamespaceConfig{
			Type: nsmgr.USERNS,
		})
	}

	// now that we've configured the namespaces we're sharing, create them
	namespaces, err := s.config.NamespaceManager().NewPodNamespaces(namespaceConfig)
	if err != nil {
		return nil, err
	}

	sb.AddManagedNamespaces(namespaces)

	cleanupFuncs = append(cleanupFuncs, sb.RemoveManagedNamespaces)

	if err := configureGeneratorGivenNamespacePaths(sb.NamespacePaths(), g); err != nil {
		return cleanupFuncs, err
	}

	return cleanupFuncs, nil
}


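Networking:

One kind of namespace is the network namespace, and it is the basis for network virtualization. In container networking, "cables", "switches", and other network devices can be expressed in code. These devices are virtual, but the concepts behind them closely resemble those of physical networks.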

1. CNI:

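No discussion of container networking is complete without the CNI interface. CNI is a set of interfaces, used by Kubernetes, that defines operations on virtual networks, so that different network plugins can interact with Kubernetes through CNI. It mainly defines how network connections for pods are set up and torn down, how IPs are managed, and so on; the details are not covered here.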

vendor/github.com/cri-o/ocicni/pkg/ocicni/types.go:118

// CNIPlugin is the interface that needs to be implemented by a plugin
type CNIPlugin interface {
	// Name returns the plugin's name. This will be used when searching
	// for a plugin by name, e.g.
	Name() string

	// GetDefaultNetworkName returns the name of the plugin's default
	// network.
	GetDefaultNetworkName() string

	// SetUpPod is the method called after the sandbox container of
	// the pod has been created but before the other containers of the
	// pod are launched.
	SetUpPod(network PodNetwork) ([]NetResult, error)

	// SetUpPodWithContext is the same as SetUpPod but takes a context
	SetUpPodWithContext(ctx context.Context, network PodNetwork) ([]NetResult, error)

	// TearDownPod is the method called before a pod's sandbox container will be deleted
	TearDownPod(network PodNetwork) error

	// TearDownPodWithContext is the same as TearDownPod but takes a context
	TearDownPodWithContext(ctx context.Context, network PodNetwork) error

	// GetPodNetworkStatus is the method called to obtain the ipv4 or ipv6 addresses of the pod sandbox
	GetPodNetworkStatus(network PodNetwork) ([]NetResult, error)

	// GetPodNetworkStatusWithContext is the same as GetPodNetworkStatus but takes a context
	GetPodNetworkStatusWithContext(ctx context.Context, network PodNetwork) ([]NetResult, error)

	// NetworkStatus returns error if the network plugin is in error state
	Status() error

	// Shutdown terminates all driver operations
	Shutdown() error
}
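
The set-up and tear-down lifecycle in this interface is easier to picture with a toy implementation. The sketch below is an in-memory stand-in that hands out one address per sandbox from a fixed range and releases it on teardown. Everything in it is hypothetical: podNetwork and podNetworker are cut-down versions invented for this example, memoryPlugin is not a real CNI plugin, and no network device is ever touched.

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
)

// podNetwork is a hypothetical, pared-down stand-in for ocicni's PodNetwork.
type podNetwork struct {
	Namespace, Name, ID string
}

// podNetworker is a reduced, hypothetical version of the CNIPlugin interface.
type podNetworker interface {
	SetUpPod(pod podNetwork) (netip.Addr, error)
	TearDownPod(pod podNetwork) error
}

// memoryPlugin allocates addresses from a prefix and keeps leases in memory.
type memoryPlugin struct {
	prefix netip.Prefix
	leases map[string]netip.Addr // sandbox ID -> address
	inUse  map[netip.Addr]bool
}

func newMemoryPlugin(cidr string) (*memoryPlugin, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return nil, err
	}
	return &memoryPlugin{
		prefix: p.Masked(),
		leases: map[string]netip.Addr{},
		inUse:  map[netip.Addr]bool{},
	}, nil
}

// SetUpPod returns the pod's address, allocating the lowest free one if needed.
func (m *memoryPlugin) SetUpPod(pod podNetwork) (netip.Addr, error) {
	if addr, ok := m.leases[pod.ID]; ok {
		return addr, nil
	}
	// Start after the network address; a real plugin would also skip broadcast.
	for addr := m.prefix.Addr().Next(); m.prefix.Contains(addr); addr = addr.Next() {
		if !m.inUse[addr] {
			m.inUse[addr] = true
			m.leases[pod.ID] = addr
			return addr, nil
		}
	}
	return netip.Addr{}, errors.New("address pool exhausted")
}

// TearDownPod releases the pod's address; unknown pods are ignored.
func (m *memoryPlugin) TearDownPod(pod podNetwork) error {
	if addr, ok := m.leases[pod.ID]; ok {
		delete(m.inUse, addr)
		delete(m.leases, pod.ID)
	}
	return nil
}

func main() {
	plugin, err := newMemoryPlugin("10.88.0.0/29")
	if err != nil {
		panic(err)
	}
	var n podNetworker = plugin
	for _, id := range []string{"sandbox-a", "sandbox-b"} {
		addr, err := n.SetUpPod(podNetwork{Namespace: "default", Name: id, ID: id})
		fmt.Println(id, addr, err)
	}
}
```

Making SetUpPod idempotent per sandbox ID keeps a repeated call from leaking addresses, which is a design choice of this sketch and not something the article states about real plugins.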


2. HostPortManager:

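As the name suggests, HostPortManager sets up port mappings between the host machine and a pod: a request to a given host port is actually handled by a container process in the pod that listens on another port. A component like this is sometimes likened to a bridge, so it has only two operations: Add builds the bridge and Remove tears it down.

The cri-o repository contains a diagram that explains hostport.

One step of the Add method tries to open the port on the host. Strictly speaking, traffic would flow even without opening the host port, because the redirection is done by iptables. Suppose a telnet process connects to the host port: every packet, in either direction, passes through the NAT rules set up in iptables (with a conntrack lookup), so the packets never reach the host port itself and go to the pod's IP and port instead. Does that make opening the host port unnecessary? No. Many processes on the host already use ports, sshd for example. If port 22 on the host were mapped to port 22 in a pod and we changed iptables directly without first trying to open host port 22, the change could cut off ssh connections to the node, which is not what we want. Trying to open port 22 fails immediately because the port is already in use, and that prevents a misconfiguration from causing a more serious problem. The relevant code is: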

internal/hostport/hostport_manager.go:76

func (hm *hostportManager) Add(id string, podPortMapping *PodPortMapping, natInterfaceName string) (err error) {
	...
	// try to open hostports
	ports, err := hm.openHostports(podPortMapping)
	if err != nil {
		return err
	}
	...
}


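Summary:

This article covered the architecture and the main technical points of cri-o without going into much detail. As I dig deeper into the cri-o code, I will share more in future posts so that we can learn and improve together.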


Original link: http://xie.infoq.cn/article/a9fef4615c17268cefdd12f76

xumc
