
A First Try: Setting Up a TiDB Cluster on Kubernetes

  • 2024-10-25, Beijing

Author: 数据源的 TiDB 学习之路. Original source: https://tidb.net/blog/21d5b7d0


TiDB Operator is an automated operations system for TiDB clusters on Kubernetes. It provides full lifecycle management of TiDB, including deployment, upgrades, scaling, backup and restore, and configuration changes. With TiDB Operator, TiDB runs seamlessly on public-cloud or self-hosted Kubernetes clusters.


TiDB Operator offers several ways to deploy a TiDB cluster on Kubernetes. This post walks through creating a simple Kubernetes cluster in a Linux test environment, deploying TiDB Operator, and then using TiDB Operator to deploy a TiDB cluster.


1 Create a Kubernetes Cluster

This post uses kubeadm to deploy a local test Kubernetes cluster. Before creating the cluster with kubeadm, we need to install containerd (the container runtime), kubelet (the service that deploys and starts the main Kubernetes components as containers), kubeadm (the Kubernetes deployment tool), and kubectl (the Kubernetes command-line client). The kernel parameter net.ipv4.ip_forward must also be set to 1.


  • Install containerd


Since version 1.24, Kubernetes runs clusters on a CRI (Container Runtime Interface)-compatible container runtime such as containerd or CRI-O. The environment here installs Kubernetes 1.31 by default and uses containerd as the container runtime; the installation steps follow https://github.com/containerd/containerd/blob/main/docs/getting-started.md


First download the release package, choosing the x86 or ARM build to match your environment. Download from: https://github.com/containerd/containerd/releases


## Download and install containerd
tar -xzvf containerd-1.7.23-linux-arm64.tar.gz
mv bin/* /usr/local/bin/

## Set up systemd management
mkdir /usr/local/lib/systemd/system -p
cd /usr/local/lib/systemd/system
## Edit containerd.service; use the content from https://raw.githubusercontent.com/containerd/containerd/main/containerd.service
vi containerd.service
systemctl daemon-reload
systemctl enable --now containerd
## Start the containerd service
systemctl restart containerd

## Download and install runc from https://github.com/opencontainers/runc/releases
install -m 755 runc.arm64 /usr/local/sbin/runc


  • Set net.ipv4.ip_forward


To change it temporarily, run sysctl -w net.ipv4.ip_forward=1. To make the change permanent, add net.ipv4.ip_forward=1 to /etc/sysctl.conf and run sysctl -p to apply it.
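A minimal sketch of both approaches (run as root; /etc/sysctl.conf is the conventional location, though some distributions prefer a drop-in file under /etc/sysctl.d/):

# Temporary: takes effect immediately, lost on reboot
sysctl -w net.ipv4.ip_forward=1

# Permanent: persist the setting, then reload
echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf
sysctl -p

# Verify
cat /proc/sys/net/ipv4/ip_forward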


  • Install kubelet, kubeadm, and kubectl


Add the kubernetes.repo yum repository. If downloads are slow, you can substitute a domestic mirror of your choice.


# This overwrites any existing configuration in /etc/yum.repos.d/kubernetes.repo
cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.31/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.31/rpm/repodata/repomd.xml.key
exclude=kubelet kubeadm kubectl cri-tools kubernetes-cni
EOF


Then install the packages with yum install, start the kubelet service, and enable it to start on boot.


yum install -y kubelet kubeadm kubectl --disableexcludes=kubernetes
systemctl start kubelet
systemctl enable kubelet


  • Create the Kubernetes cluster


Creating a Kubernetes cluster involves two main steps: first, initialize the control plane with kubeadm init; second, add worker nodes with kubeadm join.


First pick a node to serve as the control-plane node and initialize the cluster on it with kubeadm init. Initialization requires a few mandatory parameters, which can be supplied in two ways: either edit an init.yaml file and pass it on the command line with --config init.yaml, or pass the parameters directly as name=value flags.


  1. Via --config init.yaml (a minimal sketch of such a file is shown below)

  2. Via name=value flags on the command line
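For the first approach, here is a minimal init.yaml sketch mirroring the flags used later in this post. The kubeadm v1beta3 API shown still works on 1.31 but is being superseded by v1beta4, so it is safer to generate a current template with kubeadm config print init-defaults and adjust it:

apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: xx.xx.x.151    # same as --apiserver-advertise-address
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.31.1
imageRepository: registry.aliyuncs.com/google_containers   # same as --image-repository
networking:
  serviceSubnet: 10.96.0.0/12    # same as --service-cidr
  podSubnet: 10.244.0.0/16       # same as --pod-network-cidr

Then run kubeadm init --config init.yaml.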


The --image-repository registry.aliyuncs.com/google_containers setting above tells kubeadm to pull images from the Aliyun mirror. The default registry Kubernetes points to, registry.k8s.io, may be unreachable, so it needs to be switched to a domestic mirror. You can view the default image list with kubeadm config images list:


## View the default image registry
kubeadm config images list
==============================================================
registry.k8s.io/kube-apiserver:v1.31.1
registry.k8s.io/kube-controller-manager:v1.31.1
registry.k8s.io/kube-scheduler:v1.31.1
registry.k8s.io/kube-proxy:v1.31.1
registry.k8s.io/coredns/coredns:v1.11.3
registry.k8s.io/pause:3.10
registry.k8s.io/etcd:3.5.15-0

## Image list when --image-repository is specified
kubeadm config images list --image-repository registry.aliyuncs.com/google_containers
==============================================================
registry.aliyuncs.com/google_containers/kube-apiserver:v1.31.1
registry.aliyuncs.com/google_containers/kube-controller-manager:v1.31.1
registry.aliyuncs.com/google_containers/kube-scheduler:v1.31.1
registry.aliyuncs.com/google_containers/kube-proxy:v1.31.1
registry.aliyuncs.com/google_containers/coredns:v1.11.3
registry.aliyuncs.com/google_containers/pause:3.10
registry.aliyuncs.com/google_containers/etcd:3.5.15-0


This step may still fail to pull images because of network issues, typically because containerd has no proxy configured. Configure a proxy for containerd as described in the problem list entry "kubeadm init fails with failed to pull image".


In addition, containerd's own configuration still points to registry.k8s.io as the default image registry, and it must be changed to the Aliyun mirror above. See the problem list entry "kubeadm init fails with context deadline exceeded" for the steps.


Now rerun kubeadm init and the Kubernetes cluster is created normally. The message Your Kubernetes control-plane has initialized successfully! confirms that the control plane is up.


## kubeadm reset clears any previously initialized control plane
kubeadm reset

kubeadm init --apiserver-advertise-address=xx.xx.x.151 --image-repository registry.aliyuncs.com/google_containers --service-cidr=10.96.0.0/12 --pod-network-cidr=10.244.0.0/16
========================================
[init] Using Kubernetes version: v1.31.1
...
Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join xx.xx.x.151:6443 --token hml2xs.agic16co7u1e8lki \
        --discovery-token-ca-cert-hash sha256:570bd607f60eac2d4bde3416dc84ebf9736fd25f20874c293ed372dde2f82f61


To keep using the cluster, follow the steps in the output above depending on whether you are running as root; the root user only needs to run export KUBECONFIG=/etc/kubernetes/admin.conf.


kubectl -n kube-system get configmap
====================================
NAME                                                    DATA   AGE
coredns                                                 1      7m5s
extension-apiserver-authentication                      6      7m8s
kube-apiserver-legacy-service-account-token-tracking    1      7m8s
kube-proxy                                              2      7m5s
kube-root-ca.crt                                        1      7m1s
kubeadm-config                                          1      7m6s
kubelet-config                                          1      7m6s


Although the control plane has been created successfully, the cluster has no worker nodes yet and the container network is not configured. Next we add worker nodes to the cluster.


First, install containerd, kubeadm, and kubelet on the new node in the same way as above, and start the containerd and kubelet services. Then join the node to the cluster with kubeadm join; the complete command can be copied from the kubeadm init output above, as shown below:


kubeadm join xx.xx.x.151:6443 --token w6cvcg.or8m1644t6vxlzub \
        --discovery-token-ca-cert-hash sha256:92793ee4cfd14610de745bc1a604557d54fd69fb2cd1dccc3cc6d24be74ff8cb


Note that the token and discovery-token-ca-cert-hash must be the control-plane node's current values; otherwise the join may fail. See the problem list entry "kubeadm join fails with couldn't validate the identity of the API Server".


[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-check] Waiting for a healthy kubelet at http://127.0.0.1:10248/healthz. This can take up to 4m0s
[kubelet-check] The kubelet is healthy after 501.642626ms
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.


The output above shows that the node has been added to the Kubernetes cluster; repeating the same steps on more machines adds more worker nodes. In this example two worker nodes are added, and kubectl get nodes shows:


kubectl get nodes
=================
NAME               STATUS     ROLES           AGE     VERSION
host-xx-xx-x-151   NotReady   control-plane   70m     v1.31.1
host-xx-xx-x-152   NotReady   <none>          5m43s   v1.31.1
host-xx-xx-x-153   NotReady   <none>          13s     v1.31.1


The nodes are in the NotReady state because the cluster has no CNI network plugin yet. The following kubectl apply command installs the CNI plugin in one step.


## Install the CNI plugin
kubectl apply -f "https://docs.projectcalico.org/manifests/calico.yaml"
=======================
...
clusterrolebinding.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrolebinding.rbac.authorization.k8s.io/calico-node created
daemonset.apps/calico-node created
deployment.apps/calico-kube-controllers created


Note that calico.yaml pulls images such as docker.io/calico/node:v3.25.0 by default, and ctr images pull docker.io/xxx may fail for network reasons. In that case replace docker.io with a domestic mirror such as dockerproxy.cn; see the problem list entry "kubectl describe node shows cni plugin not initialized".


Once the CNI network plugin is installed, kubectl get nodes shows all nodes in the Ready state.


## Remove the control-plane taint so workloads can also be scheduled on it
kubectl taint nodes host-xx-xx-xx-151 node-role.kubernetes.io/control-plane-
===============================
node/host-xx-xx-xx-151 untainted

kubectl get nodes
=================
NAME                STATUS   ROLES           AGE   VERSION
host-xx-xx-xx-151   Ready    control-plane   29h   v1.31.1
host-xx-xx-xx-152   Ready    <none>          28h   v1.31.1
host-xx-xx-xx-153   Ready    <none>          28h   v1.31.1

2 Deploy TiDB Operator

With the Kubernetes cluster in place, the next step is to deploy TiDB Operator, which takes two steps:


  1. Install the TiDB Operator CRDs


TiDB Operator includes a number of Custom Resource Definitions (CRDs) that implement the various components of a TiDB cluster. First download the TiDB Operator CRD file, then install the CRDs with kubectl create -f crd.yaml.


## Install the TiDB CRDs
curl -LO https://raw.githubusercontent.com/pingcap/tidb-operator/v1.6.0/manifests/crd.yaml
kubectl create -f crd.yaml

## Verify that the TiDB CRDs were created
kubectl get crd | grep tidb
==========================
tidbclusterautoscalers.pingcap.com   2024-10-21T06:23:55Z
tidbclusters.pingcap.com             2024-10-21T06:23:55Z
tidbdashboards.pingcap.com           2024-10-21T06:23:55Z
tidbinitializers.pingcap.com         2024-10-21T06:23:55Z
tidbmonitors.pingcap.com             2024-10-21T06:23:56Z
tidbngmonitorings.pingcap.com        2024-10-21T06:23:56Z


  2. Install TiDB Operator


This post installs TiDB Operator with Helm. Following the Helm documentation at https://helm.sh/docs/intro/install/, install Helm with the installer script:


## Download the helm installer script
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh

## Install helm
sh get_helm.sh
==============
Downloading https://get.helm.sh/helm-v3.16.2-linux-arm64.tar.gz
Verifying checksum... Done.
Preparing to install helm into /usr/local/bin
helm installed into /usr/local/bin/helm


Next, add the PingCAP chart repository with helm repo add pingcap https://charts.pingcap.org/:


helm repo add pingcap https://charts.pingcap.org/
=================================================
"pingcap" has been added to your repositories


Create a namespace for TiDB Operator by running kubectl create namespace tidb-admin:


## Create the tidb-admin namespace
kubectl create namespace tidb-admin
===================================
namespace/tidb-admin created

## List namespaces
kubectl get namespace
=====================
NAME              STATUS   AGE
default           Active   23h
kube-flannel      Active   19h
kube-node-lease   Active   23h
kube-public       Active   23h
kube-system       Active   23h
tidb-admin        Active   2m52s
tigera-operator   Active   19h


Install TiDB Operator with helm install:


helm install --namespace tidb-admin tidb-operator pingcap/tidb-operator --version v1.6.0 \
     --set operatorImage=registry.cn-beijing.aliyuncs.com/tidb/tidb-operator:v1.6.0 \
     --set tidbBackupManagerImage=registry.cn-beijing.aliyuncs.com/tidb/tidb-backup-manager:v1.6.0 \
     --set scheduler.kubeSchedulerImageName=registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler
=======================================
NAME: tidb-operator
LAST DEPLOYED: Mon Oct 21 14:42:59 2024
NAMESPACE: tidb-admin
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Make sure tidb-operator components are running:

kubectl get pods --namespace tidb-admin -l app.kubernetes.io/instance=tidb-operator


Check whether TiDB Operator is running by executing the command suggested above; the following output indicates TiDB Operator has been installed successfully.


kubectl get pods --namespace tidb-admin -l app.kubernetes.io/instance=tidb-operator
=========================================================================
NAME                                      READY   STATUS    RESTARTS   AGE
tidb-controller-manager-6cb84c5b5-r98m5   1/1     Running   0          97s

3 Deploy the TiDB Cluster and Monitoring

First, create a tidb-cluster namespace with kubectl create namespace tidb-cluster, then deploy the TiDB cluster with kubectl -n tidb-cluster apply -f tidb-cluster.yaml.


## Create the tidb-cluster namespace and fetch the example cluster definition
kubectl create namespace tidb-cluster
curl -LO https://raw.githubusercontent.com/pingcap/tidb-operator/v1.6.0/examples/advanced/tidb-cluster.yaml

## Create the TiDB cluster (remove any previous attempt first)
kubectl delete tc advanced-tidb -n tidb-cluster
kubectl -n tidb-cluster apply -f tidb-cluster.yaml
==================================================
tidbcluster.pingcap.com/advanced-tidb created


Note that in tidb-cluster.yaml you must at least configure the storageClassName parameter, because the pd, tidb, and tikv components all require persistent storage; otherwise you will run into the situation described in the problem list entry "kubectl get pods -n tidb-cluster shows basic-pd-0 in Pending state". The storageClassName setting depends on PersistentVolumes (PV), which must be created in advance.
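A rough sketch of the relevant fields in tidb-cluster.yaml, assuming the pd-storage and tikv-storage classes created by the local-volume-provisioner setup in the problem list below (sizes are placeholders; tidb itself only needs a storage class if it declares storageVolumes):

spec:
  pd:
    storageClassName: pd-storage    # must match an available StorageClass / PV
    requests:
      storage: "10Gi"
  tikv:
    storageClassName: tikv-storage
    requests:
      storage: "100Gi"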


Once the installation completes, kubectl get pods -n tidb-cluster shows the running TiDB component pods.


kubectl get pods -n tidb-cluster
================================
NAME                                      READY   STATUS    RESTARTS   AGE
advanced-tidb-discovery-b8ddc49c5-pm2l6   1/1     Running   0          8m9s
advanced-tidb-pd-0                        1/1     Running   0          8m9s
advanced-tidb-pd-1                        1/1     Running   0          8m9s
advanced-tidb-pd-2                        1/1     Running   0          8m9s
advanced-tidb-tidb-0                      2/2     Running   0          2m38s
advanced-tidb-tidb-1                      2/2     Running   0          3m12s
advanced-tidb-tidb-2                      2/2     Running   0          4m54s
advanced-tidb-tikv-0                      1/1     Running   0          8m2s
advanced-tidb-tikv-1                      1/1     Running   0          8m2s
advanced-tidb-tikv-2                      1/1     Running   0          8m2s

4 Initialize the TiDB Cluster

After the cluster is deployed, some initialization is usually needed, such as setting the root password, creating users, and restricting which hosts may connect.


  • Initialize the root password and create a new user


The following command sets the root password by storing it in a Secret named tidb-secret.


kubectl create secret generic tidb-secret --from-literal=root=root123 --namespace=tidb-cluster


The following command sets the root password and additionally creates a regular user developer with its own password; the developer user is created with only the USAGE privilege by default.


kubectl create secret generic tidb-secret --from-literal=root=root123 --from-literal=developer=developer123 --namespace=tidb-cluster


For any other initialization actions, edit the tidb-initializer.yaml file accordingly and run the command below to perform the initialization. This post assumes only the root password is initialized.
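For reference, a minimal tidb-initializer.yaml sketch along the lines of the TiDB Operator documentation, using the names from this post (cluster advanced-tidb, secret tidb-secret); adjust as needed:

apiVersion: pingcap.com/v1alpha1
kind: TidbInitializer
metadata:
  name: tidb-init
  namespace: tidb-cluster
spec:
  image: tnir/mysqlclient          # use kanshiori/mysqlclient-arm64 on ARM64 hosts (see the problem list)
  cluster:
    namespace: tidb-cluster
    name: advanced-tidb
  passwordSecret: tidb-secret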


kubectl apply -f tidb-initializer.yaml -n tidb-cluster
=====================================================
tidbinitializer.pingcap.com/tidb-init created


Note that initialization pulls the image specified by image: tnir/mysqlclient, which may fail to download; see the problem list for the fix. Also, on an ARM environment the image must be changed to image: kanshiori/mysqlclient-arm64; see the problem list entry "TiDB initializer pod in Init:Error state with standard_init_linux.go:219".

5 Connect to the TiDB Cluster

Once the TiDB cluster is up, kubectl get all -n tidb-cluster shows its resources, including the addresses exposed for access.


kubectl get all -n tidb-cluster
==============================
NAME                                          READY   STATUS    RESTARTS   AGE
pod/advanced-tidb-discovery-b8ddc49c5-pm2l6   1/1     Running   0          48m
pod/advanced-tidb-pd-0                        1/1     Running   0          48m
pod/advanced-tidb-pd-1                        1/1     Running   0          48m
pod/advanced-tidb-pd-2                        1/1     Running   0          48m
pod/advanced-tidb-tidb-0                      2/2     Running   0          43m
pod/advanced-tidb-tidb-1                      2/2     Running   0          43m
pod/advanced-tidb-tidb-2                      2/2     Running   0          45m
pod/advanced-tidb-tikv-0                      1/1     Running   0          48m
pod/advanced-tidb-tikv-1                      1/1     Running   0          48m
pod/advanced-tidb-tikv-2                      1/1     Running   0          48m

NAME                              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                          AGE
service/advanced-tidb-discovery   ClusterIP   10.96.132.251    <none>        10261/TCP,10262/TCP              48m
service/advanced-tidb-pd          ClusterIP   10.101.219.172   <none>        2379/TCP                         48m
service/advanced-tidb-pd-peer     ClusterIP   None             <none>        2380/TCP,2379/TCP                48m
service/advanced-tidb-tidb        NodePort    10.111.104.136   <none>        4000:31263/TCP,10080:32410/TCP   48m
service/advanced-tidb-tidb-peer   ClusterIP   None             <none>        10080/TCP                        48m
service/advanced-tidb-tikv-peer   ClusterIP   None             <none>        20160/TCP                        48m

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/advanced-tidb-discovery   1/1     1            1           48m

NAME                                                DESIRED   CURRENT   READY   AGE
replicaset.apps/advanced-tidb-discovery-b8ddc49c5   1         1         1       48m

NAME                                  READY   AGE
statefulset.apps/advanced-tidb-pd     3/3     48m
statefulset.apps/advanced-tidb-tidb   3/3     48m
statefulset.apps/advanced-tidb-tikv   3/3     48m


The output above shows that service/advanced-tidb-tidb is exposed as a NodePort service with cluster IP 10.111.104.136 and service port 4000 (NodePort 31263). We can connect to the TiDB database through this IP and port:


mysql -h10.111.104.136 -P4000 -uroot -c
ERROR 1045 (28000): Access denied for user 'root'@'10.244.115.128' (using password: NO)

mysql -h10.111.104.136 -P4000 -uroot -c -proot123
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3938453128
Server version: 8.0.11-TiDB-v8.1.0 TiDB Server (Apache License 2.0) Community Edition, MySQL 8.0 compatible

Copyright (c) 2000, 2023, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>


In the output above, the first attempt to log in as root without a password fails, which confirms that the root password set during initialization has taken effect; the second attempt, logging in as root with the password, connects to the cluster normally.

6 Problem List

kubeadm init fails with failed to pull image

kubeadm init --config=init.default.yaml
...
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action beforehand using 'kubeadm config images pull'
W1019 09:45:10.324749  794291 checks.go:846] detected that the sandbox image "registry.k8s.io/pause:3.8" of the container runtime is inconsistent with that used by kubeadm. It is recommended to use "registry.aliyuncs.com/google_containers/pause:3.10" as the CRI sandbox image.
error execution phase preflight: [preflight] Some fatal errors occurred
        [ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/kube-apiserver:v1.31.0: failed to pull image registry.aliyuncs.com/google_containers/kube-apiserver:v1.31.0: rpc error: code = DeadlineExceeded desc = failed to pull and unpack image "registry.aliyuncs.com/google_containers/kube-apiserver:v1.31.0": failed to resolve reference "registry.aliyuncs.com/google_containers/kube-apiserver:v1.31.0": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/kube-apiserver/manifests/v1.31.0": dial tcp 120.55.105.209:443: i/o timeout
        [ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/kube-controller-manager:v1.31.0: failed to pull image registry.aliyuncs.com/google_containers/kube-controller-manager:v1.31.0: rpc error: code = DeadlineExceeded desc = failed to pull and unpack image "registry.aliyuncs.com/google_containers/kube-controller-manager:v1.31.0": failed to resolve reference "registry.aliyuncs.com/google_containers/kube-controller-manager:v1.31.0": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/kube-controller-manager/manifests/v1.31.0": dial tcp 120.55.105.209:443: i/o timeout
        [ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/kube-scheduler:v1.31.0: failed to pull image registry.aliyuncs.com/google_containers/kube-scheduler:v1.31.0: rpc error: code = DeadlineExceeded desc = failed to pull and unpack image "registry.aliyuncs.com/google_containers/kube-scheduler:v1.31.0": failed to resolve reference "registry.aliyuncs.com/google_containers/kube-scheduler:v1.31.0": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/kube-scheduler/manifests/v1.31.0": dial tcp 120.55.105.209:443: i/o timeout
        [ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/kube-proxy:v1.31.0: failed to pull image registry.aliyuncs.com/google_containers/kube-proxy:v1.31.0: failed to pull and unpack image "registry.aliyuncs.com/google_containers/kube-proxy:v1.31.0": failed to resolve reference "registry.aliyuncs.com/google_containers/kube-proxy:v1.31.0": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/kube-proxy/manifests/v1.31.0": dial tcp 120.55.105.209:443: i/o timeout
        [ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/coredns:v1.11.3: failed to pull image registry.aliyuncs.com/google_containers/coredns:v1.11.3: failed to pull and unpack image "registry.aliyuncs.com/google_containers/coredns:v1.11.3": failed to resolve reference "registry.aliyuncs.com/google_containers/coredns:v1.11.3": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/coredns/manifests/v1.11.3": dial tcp 120.55.105.209:443: i/o timeout
        [ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/pause:3.10: failed to pull image registry.aliyuncs.com/google_containers/pause:3.10: failed to pull and unpack image "registry.aliyuncs.com/google_containers/pause:3.10": failed to resolve reference "registry.aliyuncs.com/google_containers/pause:3.10": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/pause/manifests/3.10": dial tcp 120.55.105.209:443: i/o timeout
        [ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/etcd:3.5.15-0: failed to pull image registry.aliyuncs.com/google_containers/etcd:3.5.15-0: rpc error: code = DeadlineExceeded desc = failed to pull and unpack image "registry.aliyuncs.com/google_containers/etcd:3.5.15-0": failed to resolve reference "registry.aliyuncs.com/google_containers/etcd:3.5.15-0": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/etcd/manifests/3.5.15-0": dial tcp 120.55.105.209:443: i/o timeout
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher


  • Solution:


In a Kubernetes environment that uses containerd as the container runtime, network restrictions mean that HTTPS_PROXY and NO_PROXY must be configured before images can be pulled successfully. Reference: https://blog.csdn.net/Beer_Do/article/details/113253618


## Configure a proxy for containerd
mkdir /etc/systemd/system/containerd.service.d
cat > /etc/systemd/system/containerd.service.d/http_proxy.conf << EOF
[Service]
Environment="HTTP_PROXY=xx.xx.x.x:3128"
Environment="HTTPS_PROXY=xx.xx.x.x:3128"
Environment="no_proxy=127.0.0.1,localhost,xx.xx.xx.151,10.96.0.0/12"
EOF

## Reload the configuration and restart the containerd service
systemctl daemon-reload
systemctl restart containerd

kubeadm init fails with context deadline exceeded

kubeadm init --config=init.default.yaml
...
[kubelet-check] Waiting for a healthy kubelet at http://127.0.0.1:10248/healthz. This can take up to 4m0s
[kubelet-check] The kubelet is healthy after 1.00191638s
[api-check] Waiting for a healthy API server. This can take up to 4m0s
[api-check] The API server is not healthy after 4m0.000418939s

Unfortunately, an error has occurred: context deadline exceeded

This error is likely caused by:
 - The kubelet is not running
 - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
 - 'systemctl status kubelet'
 - 'journalctl -xeu kubelet'
...


  • Solution: per https://blog.csdn.net/weixin_43205308/article/details/140554729, containerd's default image registry is registry.k8s.io and must be changed to the same domestic mirror used above, registry.aliyuncs.com/google_containers. Generate the /etc/containerd/config.toml file, modify the sandbox_image address in it, and restart the containerd service for the change to take effect, for example as sketched below.
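A sketch of these steps, assuming a containerd 1.7 default config (the pause tag in the generated file may differ, so check the sandbox_image line before editing):

## Generate the default configuration file
mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml

## Point sandbox_image at the Aliyun mirror (adjust the original tag to match your file)
sed -i 's#sandbox_image = "registry.k8s.io/pause:3.8"#sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.10"#' /etc/containerd/config.toml

## Restart containerd to apply the change
systemctl restart containerd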

kubeadm init fails with [ERROR FileContent--proc-sys-net-ipv4-ip_forward]

error execution phase preflight: [preflight] Some fatal errors occurred:
        [ERROR FileContent--proc-sys-net-ipv4-ip_forward]: /proc/sys/net/ipv4/ip_forward contents are not set to 1


  • Solution: this error means the sysctl -w net.ipv4.ip_forward=1 step described earlier was not applied; configure it as explained above.

kubeadm init warns [WARNING FileExisting-socat]

[WARNING FileExisting-socat]: socat not found in system path


  • Solution: socat is missing from the system; install it with yum install socat.

kubeadm init warns [WARNING Hostname]

 [WARNING Hostname]: hostname "host-xx-xx-x-151" could not be reached


  • Solution: add a hostname-to-IP mapping in /etc/hosts, for example as shown below.
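A one-line sketch, to be run on each node (the IP is a placeholder matching the masked addresses in this post):

echo "xx.xx.x.151 host-xx-xx-x-151" >> /etc/hosts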

kubeadm join fails with couldn't validate the identity of the API Server

[preflight] Running pre-flight checks
error execution phase preflight: couldn't validate the identity of the API Server: failed to request the cluster-info ConfigMap: client rate limiter Wait returned an error: context deadline exceeded
To see the stack trace of this error execute with --v=5 or higher


  • Solution: this error occurs because the token and discovery-token-ca-cert-hash are not the current values. Use the following commands on the control-plane node to look up the current token and discovery-token-ca-cert-hash, then substitute the latest values into the join command.
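A sketch of the standard ways to retrieve them (stock kubeadm and openssl invocations):

## Easiest: print a complete, up-to-date join command
kubeadm token create --print-join-command

## Or inspect existing tokens and recompute the CA cert hash
kubeadm token list
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt \
  | openssl rsa -pubin -outform der 2>/dev/null \
  | openssl dgst -sha256 -hex | sed 's/^.* //'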

kubectl describe node shows cni plugin not initialized

kubectl describe node host-xx-xx-x-151
======================================
...
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Sun, 20 Oct 2024 17:30:26 +0800   Sun, 20 Oct 2024 14:39:52 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Sun, 20 Oct 2024 17:30:26 +0800   Sun, 20 Oct 2024 14:39:52 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Sun, 20 Oct 2024 17:30:26 +0800   Sun, 20 Oct 2024 14:39:52 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Sun, 20 Oct 2024 17:30:26 +0800   Sun, 20 Oct 2024 14:39:52 +0800   KubeletNotReady              container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized


  • Solution: the root cause is that ctr images pull docker.io/calico/cni:v3.25.0 hits network problems; changing it to ctr images pull dockerproxy.cn/calico/cni:v3.25.0 pulls the image normally. So after downloading the calico.yaml file, replace docker.io with dockerproxy.cn throughout, for example as shown below.
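A sketch of the replacement (dockerproxy.cn is simply the mirror used in this post; substitute whichever registry proxy is reachable from your network):

curl -LO https://docs.projectcalico.org/manifests/calico.yaml
sed -i 's#docker.io/#dockerproxy.cn/#g' calico.yaml
kubectl apply -f calico.yaml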

kubectl get pods -n tidb-cluster shows basic-pd-0 in Pending state

kubectl get pods -n tidb-cluster
NAME                               READY   STATUS    RESTARTS   AGE
basic-discovery-85c8d6cd7f-wck48   1/1     Running   0          2m3s
basic-pd-0                         0/1     Pending   0          2m3s


  • Solution: kubectl describe pod shows that the pod has no PersistentVolumeClaim (PVC) bound.


Configure local storage as described in https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/configure-storage-class#%E6%9C%AC%E5%9C%B0-pv-%E9%85%8D%E7%BD%AE


Note that in the local-volume-provisioner.yaml configuration file, image: "quay.io/external_storage/local-volume-provisioner:v2.3.4" should be changed to image: "quay.io/external_storage/local-volume-provisioner:v2.5.0", because v2.3.4 is too old and may fail with no match for platform in manifest: not found.


## Create the directories on each node and bind-mount them
mkdir /data1/pdk8s/pd -p
mkdir /data1/tikvk8s/tikv -p
mkdir /data1/tidbk8s/tidb -p
mount --bind /data1/pdk8s/pd /data1/pdk8s/pd
mount --bind /data1/tikvk8s/tikv /data1/tikvk8s/tikv
mount --bind /data1/tidbk8s/tidb /data1/tidbk8s/tidb

## Configure local PVs
curl -LO https://raw.githubusercontent.com/pingcap/tidb-operator/v1.6.0/examples/local-pv/local-volume-provisioner.yaml
vi local-volume-provisioner.yaml
kubectl delete -f local-volume-provisioner.yaml
kubectl apply -f local-volume-provisioner.yaml

kubectl get po -n kube-system -l app=local-volume-provisioner
================================================================
NAME                             READY   STATUS    RESTARTS   AGE
local-volume-provisioner-8f7ms   1/1     Running   0          135m
local-volume-provisioner-xw7h7   1/1     Running   0          136m
local-volume-provisioner-zj27b   1/1     Running   0          8m58s

kubectl get pv
=============
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
local-pv-1d87b555   2459Gi     RWO            Delete           Available           tidb-storage   <unset>                          2m12s
local-pv-1fe2627e   2459Gi     RWO            Delete           Available           pd-storage     <unset>                          2m12s
local-pv-4aba16db   2459Gi     RWO            Delete           Available           pd-storage     <unset>                          2m12s
local-pv-4e85cc9d   2459Gi     RWO            Delete           Available           tikv-storage   <unset>                          2m12s
local-pv-547cf652   2459Gi     RWO            Delete           Available           tikv-storage   <unset>                          2m12s
local-pv-6870ef87   2459Gi     RWO            Delete           Available           tidb-storage   <unset>                          2m12s
local-pv-89a42df0   2459Gi     RWO            Delete           Available           tikv-storage   <unset>                          2m12s
local-pv-a898b600   2459Gi     RWO            Delete           Available           pd-storage     <unset>                          2m12s
local-pv-b039092a   2459Gi     RWO            Delete           Available           tidb-storage   <unset>                          2m12s

pd pods stuck in ImagePullBackOff after installing the TiDB cluster

kubectl describe pod advanced-tidb-pd-2 -n tidb-cluster
======================================================
...
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  19m                   default-scheduler  Successfully assigned tidb-cluster/advanced-tidb-pd-2 to host-xx-xx-xx-151
  Normal   Pulling    18m (x4 over 19m)     kubelet            Pulling image "pingcap/pd:v8.1.0"
  Warning  Failed     18m (x4 over 19m)     kubelet            Failed to pull image "pingcap/pd:v8.1.0": failed to pull and unpack image "docker.io/pingcap/pd:v8.1.0": failed to resolve reference "docker.io/pingcap/pd:v8.1.0": failed to do request: Head "https://registry-1.docker.io/v2/pingcap/pd/manifests/v8.1.0": EOF
  Warning  Failed     18m (x4 over 19m)     kubelet            Error: ErrImagePull
  Warning  Failed     17m (x6 over 19m)     kubelet            Error: ImagePullBackOff
  Normal   BackOff    4m23s (x66 over 19m)  kubelet            Back-off pulling image "pingcap/pd:v8.1.0"


  • Solution: the default registry docker.io is unreachable, so prefix the image-related entries in tidb-cluster.yaml with a domestic mirror. The places that need changing include those sketched below; among them, image: alpine:3.16.0 affects the image download for the TiDB server pods.
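An illustrative excerpt of the fields involved (field names follow the advanced tidb-cluster.yaml example; the dockerproxy.cn prefix is an assumption, so use whichever mirror is reachable from your network):

spec:
  helper:
    image: dockerproxy.cn/library/alpine:3.16.0   # was alpine:3.16.0
  pd:
    baseImage: dockerproxy.cn/pingcap/pd          # was pingcap/pd
  tikv:
    baseImage: dockerproxy.cn/pingcap/tikv        # was pingcap/tikv
  tidb:
    baseImage: dockerproxy.cn/pingcap/tidb        # was pingcap/tidb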

TiDB initializer pod in Init:Error state with standard_init_linux.go:219

kubectl get pods -n tidb-cluster
================================
NAME                                      READY   STATUS       RESTARTS   AGE
advanced-tidb-discovery-b8ddc49c5-pm2l6   1/1     Running      0          70m
advanced-tidb-pd-0                        1/1     Running      0          70m
advanced-tidb-pd-1                        1/1     Running      0          70m
advanced-tidb-pd-2                        1/1     Running      0          70m
advanced-tidb-tidb-0                      2/2     Running      0          65m
advanced-tidb-tidb-1                      2/2     Running      0          65m
advanced-tidb-tidb-2                      2/2     Running      0          67m
advanced-tidb-tidb-initializer-k5hjb      0/1     Init:Error   0          70s
advanced-tidb-tikv-0                      1/1     Running      0          70m
advanced-tidb-tikv-1                      1/1     Running      0          70m
advanced-tidb-tikv-2                      1/1     Running      0          70m

kubectl logs advanced-tidb-tidb-initializer-k5hjb -n tidb-cluster
=================================================================
Defaulted container "mysql-client" out of: mysql-client, wait (init)
Error from server (BadRequest): container "mysql-client" in pod "advanced-tidb-tidb-initializer-k5hjb" is waiting to start: PodInitializing

kubectl logs advanced-tidb-tidb-initializer-47x5r -n tidb-cluster -c wait
=========================================================================
standard_init_linux.go:219: exec user process caused "exec format error"
libcontainer: container start initialization failed: standard_init_linux.go:219: exec user process caused "exec format error"


  • Solution: this happens because the environment is ARM. Per https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/deploy-cluster-on-arm64#%E5%88%9D%E5%A7%8B%E5%8C%96-tidb-%E9%9B%86%E7%BE%A4, set the spec.image field in the TidbInitializer definition to an ARM64 image, e.g. image: kanshiori/mysqlclient-arm64.

