
A First Try: Setting Up a TiDB Cluster on Kubernetes

  • 2024-10-25, Beijing

Author: 数据源的 TiDB 学习之路. Original source: https://tidb.net/blog/21d5b7d0


TiDB Operator is an automated operations system for TiDB clusters on Kubernetes. It provides full lifecycle management of TiDB, including deployment, upgrades, scaling, backup and restore, and configuration changes. With TiDB Operator, TiDB runs seamlessly on public-cloud or self-hosted Kubernetes clusters.


TiDB Operator offers several ways to deploy a TiDB cluster on Kubernetes. This post walks through creating a simple Kubernetes cluster in a Linux test environment, deploying TiDB Operator, and then using TiDB Operator to deploy a TiDB cluster.


1 Create a Kubernetes Cluster

This post uses kubeadm to deploy a local test Kubernetes cluster. Before creating the cluster with kubeadm, we need to install containerd (the container runtime), kubelet (the service that deploys and starts the main Kubernetes components as containers), kubeadm (the Kubernetes deployment tool), and kubectl (the Kubernetes command-line client). The kernel parameter net.ipv4.ip_forward must also be set to 1.


  • Install containerd


Since version 1.24, Kubernetes runs clusters on a CRI (Container Runtime Interface)-compatible container runtime such as containerd or CRI-O. The environment here installs Kubernetes 1.31 by default and uses containerd as the container runtime; the installation steps follow https://github.com/containerd/containerd/blob/main/docs/getting-started.md


First download the release package, choosing the x86 or ARM build to match your environment. Download from: https://github.com/containerd/containerd/releases


## Download and install containerd
tar -xzvf containerd-1.7.23-linux-arm64.tar.gz
mv bin/* /usr/local/bin/

## Set up systemd management
mkdir /usr/local/lib/systemd/system -p
cd /usr/local/lib/systemd/system
## Edit containerd.service; use the content from https://raw.githubusercontent.com/containerd/containerd/main/containerd.service
vi containerd.service
systemctl daemon-reload
systemctl enable --now containerd
## Start the containerd service
systemctl restart containerd

## Download and install runc from https://github.com/opencontainers/runc/releases
install -m 755 runc.arm64 /usr/local/sbin/runc


  • Set net.ipv4.ip_forward


To change it temporarily, run sysctl -w net.ipv4.ip_forward=1. To make the change permanent, add net.ipv4.ip_forward=1 to /etc/sysctl.conf and run sysctl -p to apply it.
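A minimal sketch of both approaches (run as root; /etc/sysctl.conf is the conventional location, though some distributions prefer a drop-in file under /etc/sysctl.d/):

# Temporary: takes effect immediately, lost on reboot
sysctl -w net.ipv4.ip_forward=1

# Permanent: persist the setting, then reload
echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf
sysctl -p

# Verify
cat /proc/sys/net/ipv4/ip_forward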


  • Install kubelet, kubeadm, and kubectl


Add the kubernetes.repo yum repository. If downloads are slow, you can substitute a domestic mirror of your choice.


# This overwrites any existing configuration in /etc/yum.repos.d/kubernetes.repo
cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.31/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.31/rpm/repodata/repomd.xml.key
exclude=kubelet kubeadm kubectl cri-tools kubernetes-cni
EOF


Then install the packages with yum install, start the kubelet service, and enable it to start on boot.


yum install -y kubelet kubeadm kubectl --disableexcludes=kubernetes
systemctl start kubelet
systemctl enable kubelet


  • Create the Kubernetes cluster


Creating a Kubernetes cluster involves two main steps: first, initialize the control plane with kubeadm init; second, add worker nodes with kubeadm join.


First pick a node to serve as the control-plane node and initialize the cluster on it with kubeadm init. Initialization requires a few mandatory parameters, which can be supplied in two ways: either edit an init.yaml file and pass it on the command line with --config init.yaml, or pass the parameters directly as name=value flags.


  1. Via --config init.yaml (a minimal sketch of such a file is shown below)

  2. Via name=value flags on the command line
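For the first approach, here is a minimal init.yaml sketch mirroring the flags used later in this post. The kubeadm v1beta3 API shown still works on 1.31 but is being superseded by v1beta4, so it is safer to generate a current template with kubeadm config print init-defaults and adjust it:

apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: xx.xx.x.151    # same as --apiserver-advertise-address
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.31.1
imageRepository: registry.aliyuncs.com/google_containers   # same as --image-repository
networking:
  serviceSubnet: 10.96.0.0/12    # same as --service-cidr
  podSubnet: 10.244.0.0/16       # same as --pod-network-cidr

Then run kubeadm init --config init.yaml.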


The --image-repository registry.aliyuncs.com/google_containers setting above tells kubeadm to pull images from the Aliyun mirror. The default registry Kubernetes points to, registry.k8s.io, may be unreachable, so it needs to be switched to a domestic mirror. You can view the default image list with kubeadm config images list:


## View the default image registry
kubeadm config images list
==============================================================
registry.k8s.io/kube-apiserver:v1.31.1
registry.k8s.io/kube-controller-manager:v1.31.1
registry.k8s.io/kube-scheduler:v1.31.1
registry.k8s.io/kube-proxy:v1.31.1
registry.k8s.io/coredns/coredns:v1.11.3
registry.k8s.io/pause:3.10
registry.k8s.io/etcd:3.5.15-0

## Image list when --image-repository is specified
kubeadm config images list --image-repository registry.aliyuncs.com/google_containers
==============================================================
registry.aliyuncs.com/google_containers/kube-apiserver:v1.31.1
registry.aliyuncs.com/google_containers/kube-controller-manager:v1.31.1
registry.aliyuncs.com/google_containers/kube-scheduler:v1.31.1
registry.aliyuncs.com/google_containers/kube-proxy:v1.31.1
registry.aliyuncs.com/google_containers/coredns:v1.11.3
registry.aliyuncs.com/google_containers/pause:3.10
registry.aliyuncs.com/google_containers/etcd:3.5.15-0


This step may still fail to pull images because of network issues, typically because containerd has no proxy configured. Configure a proxy for containerd as described in the problem list entry "kubeadm init fails with failed to pull image".


In addition, containerd's own configuration still points to registry.k8s.io as the default image registry, and it must be changed to the Aliyun mirror above. See the problem list entry "kubeadm init fails with context deadline exceeded" for the steps.


Now rerun kubeadm init and the Kubernetes cluster is created normally. The message Your Kubernetes control-plane has initialized successfully! confirms that the control plane is up.


## kubeadm reset clears any previously initialized control plane
kubeadm reset

kubeadm init --apiserver-advertise-address=xx.xx.x.151 --image-repository registry.aliyuncs.com/google_containers --service-cidr=10.96.0.0/12 --pod-network-cidr=10.244.0.0/16
========================================
[init] Using Kubernetes version: v1.31.1
...
Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join xx.xx.x.151:6443 --token hml2xs.agic16co7u1e8lki \
        --discovery-token-ca-cert-hash sha256:570bd607f60eac2d4bde3416dc84ebf9736fd25f20874c293ed372dde2f82f61


To keep using the cluster, follow the steps in the output above depending on whether you are running as root; the root user only needs to run export KUBECONFIG=/etc/kubernetes/admin.conf.


kubectl -n kube-system get configmap
====================================
NAME                                                    DATA   AGE
coredns                                                 1      7m5s
extension-apiserver-authentication                      6      7m8s
kube-apiserver-legacy-service-account-token-tracking    1      7m8s
kube-proxy                                              2      7m5s
kube-root-ca.crt                                        1      7m1s
kubeadm-config                                          1      7m6s
kubelet-config                                          1      7m6s


Although the control plane has been created successfully, the cluster has no worker nodes yet and the container network is not configured. Next we add worker nodes to the cluster.


First, install containerd, kubeadm, and kubelet on the new node in the same way as above, and start the containerd and kubelet services. Then join the node to the cluster with kubeadm join; the complete command can be copied from the kubeadm init output above, as shown below:


kubeadm join xx.xx.x.151:6443 --token w6cvcg.or8m1644t6vxlzub \
        --discovery-token-ca-cert-hash sha256:92793ee4cfd14610de745bc1a604557d54fd69fb2cd1dccc3cc6d24be74ff8cb


Note that the token and discovery-token-ca-cert-hash must be the control-plane node's current values; otherwise the join may fail. See the problem list entry "kubeadm join fails with couldn't validate the identity of the API Server".


[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-check] Waiting for a healthy kubelet at http://127.0.0.1:10248/healthz. This can take up to 4m0s
[kubelet-check] The kubelet is healthy after 501.642626ms
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.


The output above shows that the node has been added to the Kubernetes cluster; repeating the same steps on more machines adds more worker nodes. In this example two worker nodes are added, and kubectl get nodes shows:


kubectl get nodes
=================
NAME               STATUS     ROLES           AGE     VERSION
host-xx-xx-x-151   NotReady   control-plane   70m     v1.31.1
host-xx-xx-x-152   NotReady   <none>          5m43s   v1.31.1
host-xx-xx-x-153   NotReady   <none>          13s     v1.31.1


The nodes are in the NotReady state because the cluster has no CNI network plugin yet. The following kubectl apply command installs the CNI plugin in one step.


## Install the CNI plugin
kubectl apply -f "https://docs.projectcalico.org/manifests/calico.yaml"
=======================
...
clusterrolebinding.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrolebinding.rbac.authorization.k8s.io/calico-node created
daemonset.apps/calico-node created
deployment.apps/calico-kube-controllers created


Note that calico.yaml pulls images such as docker.io/calico/node:v3.25.0 by default, and ctr images pull docker.io/xxx may fail for network reasons. In that case replace docker.io with a domestic mirror such as dockerproxy.cn; see the problem list entry "kubectl describe node shows cni plugin not initialized".


Once the CNI network plugin is installed, kubectl get nodes shows all nodes in the Ready state.


## Remove the control-plane taint so workloads can also be scheduled on it
kubectl taint nodes host-xx-xx-xx-151 node-role.kubernetes.io/control-plane-
===============================
node/host-xx-xx-xx-151 untainted

kubectl get nodes
=================
NAME                STATUS   ROLES           AGE   VERSION
host-xx-xx-xx-151   Ready    control-plane   29h   v1.31.1
host-xx-xx-xx-152   Ready    <none>          28h   v1.31.1
host-xx-xx-xx-153   Ready    <none>          28h   v1.31.1

2 Deploy TiDB Operator

With the Kubernetes cluster in place, the next step is to deploy TiDB Operator, which takes two steps:


  1. Install the TiDB Operator CRDs


TiDB Operator includes a number of Custom Resource Definitions (CRDs) that implement the various components of a TiDB cluster. First download the TiDB Operator CRD file, then install the CRDs with kubectl create -f crd.yaml.


## Install the TiDB CRDs
curl -LO https://raw.githubusercontent.com/pingcap/tidb-operator/v1.6.0/manifests/crd.yaml
kubectl create -f crd.yaml

## Verify that the TiDB CRDs were created
kubectl get crd | grep tidb
==========================
tidbclusterautoscalers.pingcap.com   2024-10-21T06:23:55Z
tidbclusters.pingcap.com             2024-10-21T06:23:55Z
tidbdashboards.pingcap.com           2024-10-21T06:23:55Z
tidbinitializers.pingcap.com         2024-10-21T06:23:55Z
tidbmonitors.pingcap.com             2024-10-21T06:23:56Z
tidbngmonitorings.pingcap.com        2024-10-21T06:23:56Z


  2. Install TiDB Operator


This post installs TiDB Operator with Helm. Following the Helm documentation at https://helm.sh/docs/intro/install/, install Helm with the installer script:


## Download the helm installer script
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh

## Install helm
sh get_helm.sh
==============
Downloading https://get.helm.sh/helm-v3.16.2-linux-arm64.tar.gz
Verifying checksum... Done.
Preparing to install helm into /usr/local/bin
helm installed into /usr/local/bin/helm


Next, add the PingCAP chart repository with helm repo add pingcap https://charts.pingcap.org/:


helm repo add pingcap https://charts.pingcap.org/
=================================================
"pingcap" has been added to your repositories


Create a namespace for TiDB Operator by running kubectl create namespace tidb-admin:


## Create the tidb-admin namespace
kubectl create namespace tidb-admin
===================================
namespace/tidb-admin created

## List namespaces
kubectl get namespace
=====================
NAME              STATUS   AGE
default           Active   23h
kube-flannel      Active   19h
kube-node-lease   Active   23h
kube-public       Active   23h
kube-system       Active   23h
tidb-admin        Active   2m52s
tigera-operator   Active   19h


Install TiDB Operator with helm install:


helm install --namespace tidb-admin tidb-operator pingcap/tidb-operator --version v1.6.0 \
     --set operatorImage=registry.cn-beijing.aliyuncs.com/tidb/tidb-operator:v1.6.0 \
     --set tidbBackupManagerImage=registry.cn-beijing.aliyuncs.com/tidb/tidb-backup-manager:v1.6.0 \
     --set scheduler.kubeSchedulerImageName=registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler
=======================================
NAME: tidb-operator
LAST DEPLOYED: Mon Oct 21 14:42:59 2024
NAMESPACE: tidb-admin
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Make sure tidb-operator components are running:

kubectl get pods --namespace tidb-admin -l app.kubernetes.io/instance=tidb-operator


Check whether TiDB Operator is running by executing the command suggested above; the following output indicates TiDB Operator has been installed successfully.


kubectl get pods --namespace tidb-admin -l app.kubernetes.io/instance=tidb-operator
=========================================================================
NAME                                      READY   STATUS    RESTARTS   AGE
tidb-controller-manager-6cb84c5b5-r98m5   1/1     Running   0          97s

3 Deploy the TiDB Cluster and Monitoring

First, create a tidb-cluster namespace with kubectl create namespace tidb-cluster, then deploy the TiDB cluster with kubectl -n tidb-cluster apply -f tidb-cluster.yaml.


## Create the tidb-cluster namespace and fetch the example cluster definition
kubectl create namespace tidb-cluster
curl -LO https://raw.githubusercontent.com/pingcap/tidb-operator/v1.6.0/examples/advanced/tidb-cluster.yaml

## Create the TiDB cluster (remove any previous attempt first)
kubectl delete tc advanced-tidb -n tidb-cluster
kubectl -n tidb-cluster apply -f tidb-cluster.yaml
==================================================
tidbcluster.pingcap.com/advanced-tidb created


Note that in tidb-cluster.yaml you must at least configure the storageClassName parameter, because the pd, tidb, and tikv components all require persistent storage; otherwise you will run into the situation described in the problem list entry "kubectl get pods -n tidb-cluster shows basic-pd-0 in Pending state". The storageClassName setting depends on PersistentVolumes (PV), which must be created in advance.
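A rough sketch of the relevant fields in tidb-cluster.yaml, assuming the pd-storage and tikv-storage classes created by the local-volume-provisioner setup in the problem list below (sizes are placeholders; tidb itself only needs a storage class if it declares storageVolumes):

spec:
  pd:
    storageClassName: pd-storage    # must match an available StorageClass / PV
    requests:
      storage: "10Gi"
  tikv:
    storageClassName: tikv-storage
    requests:
      storage: "100Gi"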


Once the installation completes, kubectl get pods -n tidb-cluster shows the running TiDB component pods.


kubectl get pods -n tidb-cluster
================================
NAME                                      READY   STATUS    RESTARTS   AGE
advanced-tidb-discovery-b8ddc49c5-pm2l6   1/1     Running   0          8m9s
advanced-tidb-pd-0                        1/1     Running   0          8m9s
advanced-tidb-pd-1                        1/1     Running   0          8m9s
advanced-tidb-pd-2                        1/1     Running   0          8m9s
advanced-tidb-tidb-0                      2/2     Running   0          2m38s
advanced-tidb-tidb-1                      2/2     Running   0          3m12s
advanced-tidb-tidb-2                      2/2     Running   0          4m54s
advanced-tidb-tikv-0                      1/1     Running   0          8m2s
advanced-tidb-tikv-1                      1/1     Running   0          8m2s
advanced-tidb-tikv-2                      1/1     Running   0          8m2s

4 Initialize the TiDB Cluster

After the cluster is deployed, some initialization is usually needed, such as setting the root password, creating users, and restricting which hosts may connect.


  • Initialize the root password and create a new user


The following command sets the root password by storing it in a Secret named tidb-secret.


kubectl create secret generic tidb-secret --from-literal=root=root123 --namespace=tidb-cluster


The following command sets the root password and additionally creates a regular user developer with its own password; the developer user is created with only the USAGE privilege by default.


kubectl create secret generic tidb-secret --from-literal=root=root123 --from-literal=developer=developer123 --namespace=tidb-cluster


For any other initialization actions, edit the tidb-initializer.yaml file accordingly and run the command below to perform the initialization. This post assumes only the root password is initialized.
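For reference, a minimal tidb-initializer.yaml sketch along the lines of the TiDB Operator documentation, using the names from this post (cluster advanced-tidb, secret tidb-secret); adjust as needed:

apiVersion: pingcap.com/v1alpha1
kind: TidbInitializer
metadata:
  name: tidb-init
  namespace: tidb-cluster
spec:
  image: tnir/mysqlclient          # use kanshiori/mysqlclient-arm64 on ARM64 hosts (see the problem list)
  cluster:
    namespace: tidb-cluster
    name: advanced-tidb
  passwordSecret: tidb-secret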


kubectl apply -f tidb-initializer.yaml -n tidb-cluster
=====================================================
tidbinitializer.pingcap.com/tidb-init created


Note that initialization pulls the image specified by image: tnir/mysqlclient, which may fail to download; see the problem list for the fix. Also, on an ARM environment the image must be changed to image: kanshiori/mysqlclient-arm64; see the problem list entry "TiDB initializer pod in Init:Error state with standard_init_linux.go:219".

5 Connect to the TiDB Cluster

Once the TiDB cluster is up, kubectl get all -n tidb-cluster shows its resources, including the addresses exposed for access.


kubectl get all -n tidb-cluster
==============================
NAME                                          READY   STATUS    RESTARTS   AGE
pod/advanced-tidb-discovery-b8ddc49c5-pm2l6   1/1     Running   0          48m
pod/advanced-tidb-pd-0                        1/1     Running   0          48m
pod/advanced-tidb-pd-1                        1/1     Running   0          48m
pod/advanced-tidb-pd-2                        1/1     Running   0          48m
pod/advanced-tidb-tidb-0                      2/2     Running   0          43m
pod/advanced-tidb-tidb-1                      2/2     Running   0          43m
pod/advanced-tidb-tidb-2                      2/2     Running   0          45m
pod/advanced-tidb-tikv-0                      1/1     Running   0          48m
pod/advanced-tidb-tikv-1                      1/1     Running   0          48m
pod/advanced-tidb-tikv-2                      1/1     Running   0          48m

NAME                              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                          AGE
service/advanced-tidb-discovery   ClusterIP   10.96.132.251    <none>        10261/TCP,10262/TCP              48m
service/advanced-tidb-pd          ClusterIP   10.101.219.172   <none>        2379/TCP                         48m
service/advanced-tidb-pd-peer     ClusterIP   None             <none>        2380/TCP,2379/TCP                48m
service/advanced-tidb-tidb        NodePort    10.111.104.136   <none>        4000:31263/TCP,10080:32410/TCP   48m
service/advanced-tidb-tidb-peer   ClusterIP   None             <none>        10080/TCP                        48m
service/advanced-tidb-tikv-peer   ClusterIP   None             <none>        20160/TCP                        48m

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/advanced-tidb-discovery   1/1     1            1           48m

NAME                                                DESIRED   CURRENT   READY   AGE
replicaset.apps/advanced-tidb-discovery-b8ddc49c5   1         1         1       48m

NAME                                  READY   AGE
statefulset.apps/advanced-tidb-pd     3/3     48m
statefulset.apps/advanced-tidb-tidb   3/3     48m
statefulset.apps/advanced-tidb-tikv   3/3     48m


The output above shows that service/advanced-tidb-tidb is exposed as a NodePort service with cluster IP 10.111.104.136 and service port 4000 (NodePort 31263). We can connect to the TiDB database through this IP and port:


mysql -h10.111.104.136 -P4000 -uroot -c
ERROR 1045 (28000): Access denied for user 'root'@'10.244.115.128' (using password: NO)

mysql -h10.111.104.136 -P4000 -uroot -c -proot123
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3938453128
Server version: 8.0.11-TiDB-v8.1.0 TiDB Server (Apache License 2.0) Community Edition, MySQL 8.0 compatible

Copyright (c) 2000, 2023, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>


In the output above, the first attempt to log in as root without a password fails, which confirms that the root password set during initialization has taken effect; the second attempt, logging in as root with the password, connects to the cluster normally.

6 Problem List

kubeadm init fails with failed to pull image

kubeadm init --config=init.default.yaml
...
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action beforehand using 'kubeadm config images pull'
W1019 09:45:10.324749  794291 checks.go:846] detected that the sandbox image "registry.k8s.io/pause:3.8" of the container runtime is inconsistent with that used by kubeadm. It is recommended to use "registry.aliyuncs.com/google_containers/pause:3.10" as the CRI sandbox image.
error execution phase preflight: [preflight] Some fatal errors occurred
        [ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/kube-apiserver:v1.31.0: failed to pull image registry.aliyuncs.com/google_containers/kube-apiserver:v1.31.0: rpc error: code = DeadlineExceeded desc = failed to pull and unpack image "registry.aliyuncs.com/google_containers/kube-apiserver:v1.31.0": failed to resolve reference "registry.aliyuncs.com/google_containers/kube-apiserver:v1.31.0": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/kube-apiserver/manifests/v1.31.0": dial tcp 120.55.105.209:443: i/o timeout
        [ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/kube-controller-manager:v1.31.0: failed to pull image registry.aliyuncs.com/google_containers/kube-controller-manager:v1.31.0: rpc error: code = DeadlineExceeded desc = failed to pull and unpack image "registry.aliyuncs.com/google_containers/kube-controller-manager:v1.31.0": failed to resolve reference "registry.aliyuncs.com/google_containers/kube-controller-manager:v1.31.0": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/kube-controller-manager/manifests/v1.31.0": dial tcp 120.55.105.209:443: i/o timeout
        [ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/kube-scheduler:v1.31.0: failed to pull image registry.aliyuncs.com/google_containers/kube-scheduler:v1.31.0: rpc error: code = DeadlineExceeded desc = failed to pull and unpack image "registry.aliyuncs.com/google_containers/kube-scheduler:v1.31.0": failed to resolve reference "registry.aliyuncs.com/google_containers/kube-scheduler:v1.31.0": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/kube-scheduler/manifests/v1.31.0": dial tcp 120.55.105.209:443: i/o timeout
        [ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/kube-proxy:v1.31.0: failed to pull image registry.aliyuncs.com/google_containers/kube-proxy:v1.31.0: failed to pull and unpack image "registry.aliyuncs.com/google_containers/kube-proxy:v1.31.0": failed to resolve reference "registry.aliyuncs.com/google_containers/kube-proxy:v1.31.0": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/kube-proxy/manifests/v1.31.0": dial tcp 120.55.105.209:443: i/o timeout
        [ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/coredns:v1.11.3: failed to pull image registry.aliyuncs.com/google_containers/coredns:v1.11.3: failed to pull and unpack image "registry.aliyuncs.com/google_containers/coredns:v1.11.3": failed to resolve reference "registry.aliyuncs.com/google_containers/coredns:v1.11.3": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/coredns/manifests/v1.11.3": dial tcp 120.55.105.209:443: i/o timeout
        [ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/pause:3.10: failed to pull image registry.aliyuncs.com/google_containers/pause:3.10: failed to pull and unpack image "registry.aliyuncs.com/google_containers/pause:3.10": failed to resolve reference "registry.aliyuncs.com/google_containers/pause:3.10": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/pause/manifests/3.10": dial tcp 120.55.105.209:443: i/o timeout
        [ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/etcd:3.5.15-0: failed to pull image registry.aliyuncs.com/google_containers/etcd:3.5.15-0: rpc error: code = DeadlineExceeded desc = failed to pull and unpack image "registry.aliyuncs.com/google_containers/etcd:3.5.15-0": failed to resolve reference "registry.aliyuncs.com/google_containers/etcd:3.5.15-0": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/etcd/manifests/3.5.15-0": dial tcp 120.55.105.209:443: i/o timeout
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher


  • Solution:


In a Kubernetes environment that uses containerd as the container runtime, network restrictions mean that HTTPS_PROXY and NO_PROXY must be configured before images can be pulled successfully. Reference: https://blog.csdn.net/Beer_Do/article/details/113253618


## Configure a proxy for containerd
mkdir /etc/systemd/system/containerd.service.d
cat > /etc/systemd/system/containerd.service.d/http_proxy.conf << EOF
[Service]
Environment="HTTP_PROXY=xx.xx.x.x:3128"
Environment="HTTPS_PROXY=xx.xx.x.x:3128"
Environment="no_proxy=127.0.0.1,localhost,xx.xx.xx.151,10.96.0.0/12"
EOF

## Reload the configuration and restart the containerd service
systemctl daemon-reload
systemctl restart containerd

kubeadm init fails with context deadline exceeded

kubeadm init --config=init.default.yaml
...
[kubelet-check] Waiting for a healthy kubelet at http://127.0.0.1:10248/healthz. This can take up to 4m0s
[kubelet-check] The kubelet is healthy after 1.00191638s
[api-check] Waiting for a healthy API server. This can take up to 4m0s
[api-check] The API server is not healthy after 4m0.000418939s

Unfortunately, an error has occurred: context deadline exceeded

This error is likely caused by:
 - The kubelet is not running
 - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
 - 'systemctl status kubelet'
 - 'journalctl -xeu kubelet'
...


  • Solution: per https://blog.csdn.net/weixin_43205308/article/details/140554729, containerd's default image registry is registry.k8s.io and must be changed to the same domestic mirror used above, registry.aliyuncs.com/google_containers. Generate the /etc/containerd/config.toml file, modify the sandbox_image address in it, and restart the containerd service for the change to take effect, for example as sketched below.
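A sketch of these steps, assuming a containerd 1.7 default config (the pause tag in the generated file may differ, so check the sandbox_image line before editing):

## Generate the default configuration file
mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml

## Point sandbox_image at the Aliyun mirror (adjust the original tag to match your file)
sed -i 's#sandbox_image = "registry.k8s.io/pause:3.8"#sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.10"#' /etc/containerd/config.toml

## Restart containerd to apply the change
systemctl restart containerd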

kubeadm init fails with [ERROR FileContent--proc-sys-net-ipv4-ip_forward]

error execution phase preflight: [preflight] Some fatal errors occurred:
        [ERROR FileContent--proc-sys-net-ipv4-ip_forward]: /proc/sys/net/ipv4/ip_forward contents are not set to 1


  • Solution: this error means the sysctl -w net.ipv4.ip_forward=1 step described earlier was not applied; configure it as explained above.

kubeadm init warns [WARNING FileExisting-socat]

[WARNING FileExisting-socat]: socat not found in system path


  • Solution: socat is missing from the system; install it with yum install socat.

kubeadm init warns [WARNING Hostname]

 [WARNING Hostname]: hostname "host-xx-xx-x-151" could not be reached


  • Solution: add a hostname-to-IP mapping in /etc/hosts, for example as shown below.
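A one-line sketch, to be run on each node (the IP is a placeholder matching the masked addresses in this post):

echo "xx.xx.x.151 host-xx-xx-x-151" >> /etc/hosts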

kubeadm join fails with couldn't validate the identity of the API Server

[preflight] Running pre-flight checks
error execution phase preflight: couldn't validate the identity of the API Server: failed to request the cluster-info ConfigMap: client rate limiter Wait returned an error: context deadline exceeded
To see the stack trace of this error execute with --v=5 or higher


  • Solution: this error occurs because the token and discovery-token-ca-cert-hash are not the current values. Use the following commands on the control-plane node to look up the current token and discovery-token-ca-cert-hash, then substitute the latest values into the join command.
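A sketch of the standard ways to retrieve them (stock kubeadm and openssl invocations):

## Easiest: print a complete, up-to-date join command
kubeadm token create --print-join-command

## Or inspect existing tokens and recompute the CA cert hash
kubeadm token list
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt \
  | openssl rsa -pubin -outform der 2>/dev/null \
  | openssl dgst -sha256 -hex | sed 's/^.* //'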

kubectl describe node shows cni plugin not initialized

kubectl describe node host-xx-xx-x-151
======================================
...
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Sun, 20 Oct 2024 17:30:26 +0800   Sun, 20 Oct 2024 14:39:52 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Sun, 20 Oct 2024 17:30:26 +0800   Sun, 20 Oct 2024 14:39:52 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Sun, 20 Oct 2024 17:30:26 +0800   Sun, 20 Oct 2024 14:39:52 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Sun, 20 Oct 2024 17:30:26 +0800   Sun, 20 Oct 2024 14:39:52 +0800   KubeletNotReady              container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized


  • Solution: the root cause is that ctr images pull docker.io/calico/cni:v3.25.0 hits network problems; changing it to ctr images pull dockerproxy.cn/calico/cni:v3.25.0 pulls the image normally. So after downloading the calico.yaml file, replace docker.io with dockerproxy.cn throughout, for example as shown below.
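A sketch of the replacement (dockerproxy.cn is simply the mirror used in this post; substitute whichever registry proxy is reachable from your network):

curl -LO https://docs.projectcalico.org/manifests/calico.yaml
sed -i 's#docker.io/#dockerproxy.cn/#g' calico.yaml
kubectl apply -f calico.yaml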

kubectl get pods -n tidb-cluster shows basic-pd-0 in Pending state

kubectl get pods -n tidb-cluster
NAME                               READY   STATUS    RESTARTS   AGE
basic-discovery-85c8d6cd7f-wck48   1/1     Running   0          2m3s
basic-pd-0                         0/1     Pending   0          2m3s


  • Solution: kubectl describe pod shows that the pod has no PersistentVolumeClaim (PVC) bound.


Configure local storage as described in https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/configure-storage-class#%E6%9C%AC%E5%9C%B0-pv-%E9%85%8D%E7%BD%AE


Note that in the local-volume-provisioner.yaml configuration file, image: "quay.io/external_storage/local-volume-provisioner:v2.3.4" should be changed to image: "quay.io/external_storage/local-volume-provisioner:v2.5.0", because v2.3.4 is too old and may fail with no match for platform in manifest: not found.


## Create the directories on each node and bind-mount them
mkdir /data1/pdk8s/pd -p
mkdir /data1/tikvk8s/tikv -p
mkdir /data1/tidbk8s/tidb -p
mount --bind /data1/pdk8s/pd /data1/pdk8s/pd
mount --bind /data1/tikvk8s/tikv /data1/tikvk8s/tikv
mount --bind /data1/tidbk8s/tidb /data1/tidbk8s/tidb

## Configure local PVs
curl -LO https://raw.githubusercontent.com/pingcap/tidb-operator/v1.6.0/examples/local-pv/local-volume-provisioner.yaml
vi local-volume-provisioner.yaml
kubectl delete -f local-volume-provisioner.yaml
kubectl apply -f local-volume-provisioner.yaml

kubectl get po -n kube-system -l app=local-volume-provisioner
================================================================
NAME                             READY   STATUS    RESTARTS   AGE
local-volume-provisioner-8f7ms   1/1     Running   0          135m
local-volume-provisioner-xw7h7   1/1     Running   0          136m
local-volume-provisioner-zj27b   1/1     Running   0          8m58s

kubectl get pv
=============
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
local-pv-1d87b555   2459Gi     RWO            Delete           Available           tidb-storage   <unset>                          2m12s
local-pv-1fe2627e   2459Gi     RWO            Delete           Available           pd-storage     <unset>                          2m12s
local-pv-4aba16db   2459Gi     RWO            Delete           Available           pd-storage     <unset>                          2m12s
local-pv-4e85cc9d   2459Gi     RWO            Delete           Available           tikv-storage   <unset>                          2m12s
local-pv-547cf652   2459Gi     RWO            Delete           Available           tikv-storage   <unset>                          2m12s
local-pv-6870ef87   2459Gi     RWO            Delete           Available           tidb-storage   <unset>                          2m12s
local-pv-89a42df0   2459Gi     RWO            Delete           Available           tikv-storage   <unset>                          2m12s
local-pv-a898b600   2459Gi     RWO            Delete           Available           pd-storage     <unset>                          2m12s
local-pv-b039092a   2459Gi     RWO            Delete           Available           tidb-storage   <unset>                          2m12s

pd pods stuck in ImagePullBackOff after installing the TiDB cluster

kubectl describe pod advanced-tidb-pd-2 -n tidb-cluster
======================================================
...
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  19m                   default-scheduler  Successfully assigned tidb-cluster/advanced-tidb-pd-2 to host-xx-xx-xx-151
  Normal   Pulling    18m (x4 over 19m)     kubelet            Pulling image "pingcap/pd:v8.1.0"
  Warning  Failed     18m (x4 over 19m)     kubelet            Failed to pull image "pingcap/pd:v8.1.0": failed to pull and unpack image "docker.io/pingcap/pd:v8.1.0": failed to resolve reference "docker.io/pingcap/pd:v8.1.0": failed to do request: Head "https://registry-1.docker.io/v2/pingcap/pd/manifests/v8.1.0": EOF
  Warning  Failed     18m (x4 over 19m)     kubelet            Error: ErrImagePull
  Warning  Failed     17m (x6 over 19m)     kubelet            Error: ImagePullBackOff
  Normal   BackOff    4m23s (x66 over 19m)  kubelet            Back-off pulling image "pingcap/pd:v8.1.0"


  • Solution: the default registry docker.io is unreachable, so prefix the image-related entries in tidb-cluster.yaml with a domestic mirror. The places that need changing include those sketched below; among them, image: alpine:3.16.0 affects the image download for the TiDB server pods.
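An illustrative excerpt of the fields involved (field names follow the advanced tidb-cluster.yaml example; the dockerproxy.cn prefix is an assumption, so use whichever mirror is reachable from your network):

spec:
  helper:
    image: dockerproxy.cn/library/alpine:3.16.0   # was alpine:3.16.0
  pd:
    baseImage: dockerproxy.cn/pingcap/pd          # was pingcap/pd
  tikv:
    baseImage: dockerproxy.cn/pingcap/tikv        # was pingcap/tikv
  tidb:
    baseImage: dockerproxy.cn/pingcap/tidb        # was pingcap/tidb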

TiDB initializer pod in Init:Error state with standard_init_linux.go:219

kubectl get pods -n tidb-cluster
================================
NAME                                      READY   STATUS       RESTARTS   AGE
advanced-tidb-discovery-b8ddc49c5-pm2l6   1/1     Running      0          70m
advanced-tidb-pd-0                        1/1     Running      0          70m
advanced-tidb-pd-1                        1/1     Running      0          70m
advanced-tidb-pd-2                        1/1     Running      0          70m
advanced-tidb-tidb-0                      2/2     Running      0          65m
advanced-tidb-tidb-1                      2/2     Running      0          65m
advanced-tidb-tidb-2                      2/2     Running      0          67m
advanced-tidb-tidb-initializer-k5hjb      0/1     Init:Error   0          70s
advanced-tidb-tikv-0                      1/1     Running      0          70m
advanced-tidb-tikv-1                      1/1     Running      0          70m
advanced-tidb-tikv-2                      1/1     Running      0          70m

kubectl logs advanced-tidb-tidb-initializer-k5hjb -n tidb-cluster
=================================================================
Defaulted container "mysql-client" out of: mysql-client, wait (init)
Error from server (BadRequest): container "mysql-client" in pod "advanced-tidb-tidb-initializer-k5hjb" is waiting to start: PodInitializing

kubectl logs advanced-tidb-tidb-initializer-47x5r -n tidb-cluster -c wait
=========================================================================
standard_init_linux.go:219: exec user process caused "exec format error"
libcontainer: container start initialization failed: standard_init_linux.go:219: exec user process caused "exec format error"


  • Solution: this happens because the environment is ARM. Per https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/deploy-cluster-on-arm64#%E5%88%9D%E5%A7%8B%E5%8C%96-tidb-%E9%9B%86%E7%BE%A4, set the spec.image field in the TidbInitializer definition to an ARM64 image, e.g. image: kanshiori/mysqlclient-arm64.

