Author: 数据源的 TiDB 学习之路 | Original source: https://tidb.net/blog/21d5b7d0
TiDB Operator is an automatic operation and maintenance system for TiDB clusters on Kubernetes. It provides full lifecycle management of TiDB, including deployment, upgrades, scaling, backup and restore, and configuration changes. With TiDB Operator, TiDB can run seamlessly on public-cloud or self-hosted Kubernetes clusters.
TiDB Operator offers several ways to deploy TiDB clusters on Kubernetes. This article shows how to create a simple Kubernetes cluster in a Linux test environment, deploy TiDB Operator, and then use TiDB Operator to deploy a TiDB cluster.
1 Create a Kubernetes Cluster
This article uses kubeadm to deploy a local test Kubernetes cluster. Before creating the cluster with kubeadm, we need to install containerd (the container runtime), kubelet (the node agent that runs the Kubernetes components as containers), kubeadm (the Kubernetes deployment tool), and kubectl (the Kubernetes command-line client), and set the kernel parameter net.ipv4.ip_forward to 1.
Since version 1.24, Kubernetes requires a CRI (Container Runtime Interface) compatible container runtime such as containerd or CRI-O. The environment here installs Kubernetes 1.31 by default and uses containerd as the runtime; the installation steps follow https://github.com/containerd/containerd/blob/main/docs/getting-started.md
First download the release archive, choosing the build that matches your architecture (x86 or ARM), from https://github.com/containerd/containerd/releases
## Download and install containerd
tar -xzvf containerd-1.7.23-linux-arm64.tar.gz
mv bin/* /usr/local/bin/
## Set up systemd management for containerd
mkdir /usr/local/lib/systemd/system -p
cd /usr/local/lib/systemd/system
## Create containerd.service; for its content see https://raw.githubusercontent.com/containerd/containerd/main/containerd.service
vi containerd.service
systemctl daemon-reload
systemctl enable --now containerd
## Start the containerd service
systemctl restart containerd
## Download and install runc from https://github.com/opencontainers/runc/releases
install -m 755 runc.arm64 /usr/local/sbin/runc
To change it temporarily, run sysctl -w net.ipv4.ip_forward=1; to make it permanent, add net.ipv4.ip_forward=1 to /etc/sysctl.conf and run sysctl -p.
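In commands:
## Enable IP forwarding for the current boot
sysctl -w net.ipv4.ip_forward=1
## Persist the setting across reboots
echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf
sysctl -p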
Add the kubernetes.repo yum repository; if downloads are slow, you can substitute a domestic yum mirror of your choice.
# This overwrites any existing configuration in /etc/yum.repos.d/kubernetes.repo
cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.31/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.31/rpm/repodata/repomd.xml.key
exclude=kubelet kubeadm kubectl cri-tools kubernetes-cni
EOF
Then install the packages with yum install, start the kubelet service, and enable it at boot:
yum install -y kubelet kubeadm kubectl --disableexcludes=kubernetes
systemctl start kubelet
systemctl enable kubelet
Creating a Kubernetes cluster takes two main steps: first initialize the control plane with kubeadm init, then add worker nodes with kubeadm join.
First, pick a node to act as the control-plane node and run kubeadm init on it. Initialization needs a few required parameters, which can be supplied in two ways: edit an init.yaml file and pass it with --config init.yaml, or pass each parameter directly as name=value flags.
Method 1: pass --config init.yaml (see the sketch below)
Method 2: pass the flags directly as name=value (the full command appears further down)
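A sketch of the --config approach: kubeadm can generate a default configuration file, which is then edited before use. The field values listed in the comments are the ones used by the flag-based command later in this article; adapt them to your environment.
## Generate a default config file, then edit it
kubeadm config print init-defaults > init.yaml
## In init.yaml, set at least the following fields before running kubeadm init:
##   localAPIEndpoint.advertiseAddress: xx.xx.x.151   (this control-plane node's IP)
##   imageRepository: registry.aliyuncs.com/google_containers
##   kubernetesVersion: v1.31.1
##   networking.serviceSubnet: 10.96.0.0/12
##   networking.podSubnet: 10.244.0.0/16
kubeadm init --config init.yaml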
The flag --image-repository registry.aliyuncs.com/google_containers above tells kubeadm to pull images from the Alibaba Cloud mirror hosted in China; the default registry that Kubernetes points at, registry.k8s.io, may be unreachable, so it has to be replaced with a domestic mirror. The default image addresses can be listed with kubeadm config images list:
## List the default image addresses
kubeadm config images list
==============================================================
registry.k8s.io/kube-apiserver:v1.31.1
registry.k8s.io/kube-controller-manager:v1.31.1
registry.k8s.io/kube-scheduler:v1.31.1
registry.k8s.io/kube-proxy:v1.31.1
registry.k8s.io/coredns/coredns:v1.11.3
registry.k8s.io/pause:3.10
registry.k8s.io/etcd:3.5.15-0
## Image addresses when --image-repository is specified
kubeadm config images list --image-repository registry.aliyuncs.com/google_containers
==============================================================
registry.aliyuncs.com/google_containers/kube-apiserver:v1.31.1
registry.aliyuncs.com/google_containers/kube-controller-manager:v1.31.1
registry.aliyuncs.com/google_containers/kube-scheduler:v1.31.1
registry.aliyuncs.com/google_containers/kube-proxy:v1.31.1
registry.aliyuncs.com/google_containers/coredns:v1.11.3
registry.aliyuncs.com/google_containers/pause:3.10
registry.aliyuncs.com/google_containers/etcd:3.5.15-0
The step above may still fail to pull images because of network problems, typically because containerd has no proxy configured; see the problem list entry "kubeadm init fails with failed to pull image" for how to configure a proxy for containerd.
In addition, the default sandbox image address in the containerd configuration still points at registry.k8s.io, and it needs to be changed to the Alibaba Cloud mirror above; see the problem list entry "kubeadm init fails with context deadline exceeded" for the details.
Now rerun kubeadm init and the Kubernetes cluster is created normally; the line "Your Kubernetes control-plane has initialized successfully!" in the output confirms that the control plane has been initialized.
## kubeadm reset wipes the previously initialized control plane
kubeadm reset
kubeadm init --apiserver-advertise-address=xx.xx.x.151 --image-repository registry.aliyuncs.com/google_containers --service-cidr=10.96.0.0/12 --pod-network-cidr=10.244.0.0/16
========================================
[init] Using Kubernetes version: v1.31.1
...
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Alternatively, if you are the root user, you can run:
export KUBECONFIG=/etc/kubernetes/admin.conf
You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
https://kubernetes.io/docs/concepts/cluster-administration/addons/
Then you can join any number of worker nodes by running the following on each as root:
kubeadm join xx.xx.x.151:6443 --token hml2xs.agic16co7u1e8lki \
--discovery-token-ca-cert-hash sha256:570bd607f60eac2d4bde3416dc84ebf9736fd25f20874c293ed372dde2f82f61
To start using the cluster, follow the steps in the output above according to whether you are running as root; as root you only need to run export KUBECONFIG=/etc/kubernetes/admin.conf. After that, kubectl commands work against the new cluster:
kubectl -n kube-system get configmap
====================================
NAME DATA AGE
coredns 1 7m5s
extension-apiserver-authentication 6 7m8s
kube-apiserver-legacy-service-account-token-tracking 1 7m8s
kube-proxy 2 7m5s
kube-root-ca.crt 1 7m1s
kubeadm-config 1 7m6s
kubelet-config 1 7m6s
Although the control plane of the Kubernetes cluster has been created, there are no usable worker nodes yet and the container network has not been configured. The next step is to add nodes to the cluster.
First, install containerd, kubeadm, and kubelet on each new node the same way as above and start the containerd and kubelet services. Then join the node to the cluster with kubeadm join; the complete command can be copied from the kubeadm init output above, for example:
kubeadm join xx.xx.x.151:6443 --token w6cvcg.or8m1644t6vxlzub \
--discovery-token-ca-cert-hash sha256:92793ee4cfd14610de745bc1a604557d54fd69fb2cd1dccc3cc6d24be74ff8cb
Note that the token and discovery-token-ca-cert-hash must be the values currently valid on the control-plane node, otherwise the join may fail; see the problem list entry "kubeadm join fails with couldn't validate the identity of the API Server".
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-check] Waiting for a healthy kubelet at http://127.0.0.1:10248/healthz. This can take up to 4m0s
[kubelet-check] The kubelet is healthy after 501.642626ms
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap
This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.
Run 'kubectl get nodes' on the control-plane to see this node join the cluster.
The output above confirms that the node has been added to the Kubernetes cluster; running the same steps on more machines adds more worker nodes. This example adds two worker nodes, and kubectl get nodes then shows:
kubectl get nodes
=================
NAME STATUS ROLES AGE VERSION
host-xx-xx-x-151 NotReady control-plane 70m v1.31.1
host-xx-xx-x-152 NotReady <none> 5m43s v1.31.1
host-xx-xx-x-153 NotReady <none> 13s v1.31.1
The kubectl get nodes output above shows every node as NotReady because the cluster has no CNI network plugin installed yet. Install one with a single kubectl apply command:
## Install the CNI plugin (Calico)
kubectl apply -f "https://docs.projectcalico.org/manifests/calico.yaml"
=======================
...
clusterrolebinding.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrolebinding.rbac.authorization.k8s.io/calico-node created
daemonset.apps/calico-node created
deployment.apps/calico-kube-controllers created
Note that calico.yaml pulls its images from docker.io (for example docker.io/calico/node:v3.25.0) by default, and ctr images pull docker.io/xxx may fail because of network problems. In that case change docker.io to a domestic mirror such as dockerproxy.cn; see the problem list entry "kubectl describe node shows cni plugin not initialized".
Once the CNI plugin is installed, kubectl get nodes shows all nodes in Ready status.
## Remove the control-plane taint so that node can also schedule workloads
kubectl taint nodes host-xx-xx-xx-151 node-role.kubernetes.io/control-plane-
===============================
node/host-xx-xx-xx-151 untainted
kubectl get nodes
=================
NAME STATUS ROLES AGE VERSION
host-xx-xx-xx-151 Ready control-plane 29h v1.31.1
host-xx-xx-xx-152 Ready <none> 28h v1.31.1
host-xx-xx-xx-153 Ready <none> 28h v1.31.1
2 Deploy TiDB Operator
With the Kubernetes cluster in place, the next step is to deploy TiDB Operator, which takes two steps:
Install TiDB Operator CRDs
TiDB Operator includes a number of custom resource definitions (CRDs) that implement the various components of a TiDB cluster. First download the TiDB Operator CRD file, then install it with kubectl create -f crd.yaml.
## Install the TiDB CRDs
curl -o crd.yaml https://raw.githubusercontent.com/pingcap/tidb-operator/v1.6.0/manifests/crd.yaml
kubectl create -f crd.yaml
## Verify that the TiDB CRDs were created
kubectl get crd | grep tidb
==========================
tidbclusterautoscalers.pingcap.com 2024-10-21T06:23:55Z
tidbclusters.pingcap.com 2024-10-21T06:23:55Z
tidbdashboards.pingcap.com 2024-10-21T06:23:55Z
tidbinitializers.pingcap.com 2024-10-21T06:23:55Z
tidbmonitors.pingcap.com 2024-10-21T06:23:56Z
tidbngmonitorings.pingcap.com 2024-10-21T06:23:56Z
Install TiDB Operator
This article installs TiDB Operator with Helm. Following the Helm documentation at https://helm.sh/docs/intro/install/, install Helm using the installer script:
## Download the Helm install script
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
## Install Helm
sh get_helm.sh
==============
Downloading https://get.helm.sh/helm-v3.16.2-linux-arm64.tar.gz
Verifying checksum... Done.
Preparing to install helm into /usr/local/bin
helm installed into /usr/local/bin/helm
Next, add the PingCAP chart repository with helm repo add pingcap https://charts.pingcap.org/:
helm repo add pingcap https://charts.pingcap.org/
=================================================
"pingcap" has been added to your repositories
Create a namespace for TiDB Operator by running kubectl create namespace tidb-admin:
## Create the tidb-admin namespace
kubectl create namespace tidb-admin
===================================
namespace/tidb-admin created
## List namespaces
kubectl get namespace
=====================
NAME STATUS AGE
default Active 23h
kube-flannel Active 19h
kube-node-lease Active 23h
kube-public Active 23h
kube-system Active 23h
tidb-admin Active 2m52s
tigera-operator Active 19h
Install TiDB Operator with helm install:
helm install --namespace tidb-admin tidb-operator pingcap/tidb-operator --version v1.6.0 \
--set operatorImage=registry.cn-beijing.aliyuncs.com/tidb/tidb-operator:v1.6.0 \
--set tidbBackupManagerImage=registry.cn-beijing.aliyuncs.com/tidb/tidb-backup-manager:v1.6.0 \
--set scheduler.kubeSchedulerImageName=registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler
=======================================
NAME: tidb-operator
LAST DEPLOYED: Mon Oct 21 14:42:59 2024
NAMESPACE: tidb-admin
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Make sure tidb-operator components are running:
kubectl get pods --namespace tidb-admin -l app.kubernetes.io/instance=tidb-operator
Check that TiDB Operator is running by executing the command suggested in the notes above; the following output shows that TiDB Operator has been installed successfully.
kubectl get pods --namespace tidb-admin -l app.kubernetes.io/instance=tidb-operator
=========================================================================
NAME READY STATUS RESTARTS AGE
tidb-controller-manager-6cb84c5b5-r98m5 1/1 Running 0 97s
3 Deploy the TiDB Cluster and Monitoring
First create a tidb-cluster namespace with kubectl create namespace tidb-cluster, then deploy the TiDB cluster with kubectl -n tidb-cluster apply -f tidb-cluster.yaml.
## Create the tidb-cluster namespace
kubectl create namespace tidb-cluster
## Download the example manifest (use the raw file rather than the GitHub HTML page)
curl -o tidb-cluster.yaml https://raw.githubusercontent.com/pingcap/tidb-operator/v1.6.0/examples/advanced/tidb-cluster.yaml
## Create the TiDB cluster (remove any previous attempt first)
kubectl delete tc advanced-tidb -n tidb-cluster
kubectl -n tidb-cluster apply -f tidb-cluster.yaml
==================================================
tidbcluster.pingcap.com/advanced-tidb created
Note that in tidb-cluster.yaml you must at least set the storageClassName parameter, because the pd, tidb, and tikv components all need persistent storage; otherwise you will hit the situation in the problem list entry "kubectl get pods -n tidb-cluster shows basic-pd-0 as Pending". The storageClassName setting relies on PersistentVolumes (PVs), which must be created in advance.
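For reference, the relevant fields in tidb-cluster.yaml look roughly like the excerpt below (a sketch only; the class names pd-storage and tikv-storage come from the local-PV setup in the problem list, and the storage sizes are placeholders):
spec:
  pd:
    requests:
      storage: "10Gi"
    storageClassName: pd-storage
  tikv:
    requests:
      storage: "100Gi"
    storageClassName: tikv-storage
  # any other component that requests persistent storage is configured the same way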
Once the deployment completes, kubectl get pods -n tidb-cluster shows the TiDB component pods running.
kubectl get pods -n tidb-cluster
================================
NAME READY STATUS RESTARTS AGE
advanced-tidb-discovery-b8ddc49c5-pm2l6 1/1 Running 0 8m9s
advanced-tidb-pd-0 1/1 Running 0 8m9s
advanced-tidb-pd-1 1/1 Running 0 8m9s
advanced-tidb-pd-2 1/1 Running 0 8m9s
advanced-tidb-tidb-0 2/2 Running 0 2m38s
advanced-tidb-tidb-1 2/2 Running 0 3m12s
advanced-tidb-tidb-2 2/2 Running 0 4m54s
advanced-tidb-tikv-0 1/1 Running 0 8m2s
advanced-tidb-tikv-1 1/1 Running 0 8m2s
advanced-tidb-tikv-2 1/1 Running 0 8m2s
4 Initialize the TiDB Cluster
After the cluster is deployed, some initialization is usually required, such as setting the root password, creating users, and restricting which hosts are allowed to connect.
The following command sets the root password; it stores the password in a Secret named tidb-secret.
kubectl create secret generic tidb-secret --from-literal=root=root123 --namespace=tidb-cluster
The following command sets the root password and additionally creates a regular user named developer with its own password; the developer user only gets the USAGE privilege by default.
kubectl create secret generic tidb-secret --from-literal=root=root123 --from-literal=developer=developer123 --namespace=tidb-cluster
For any other initialization work, edit the tidb-initializer.yaml file by hand and run the command below to perform the initialization. This article assumes only the root password is being initialized:
kubectl apply -f tidb-initializer.yaml -n tidb-cluster
=====================================================
tidbinitializer.pingcap.com/tidb-init created
Note that the initialization pulls the image specified by image: tnir/mysqlclient, which may also fail to download; see the problem list for the workaround. On ARM environments, change it to image: kanshiori/mysqlclient-arm64, as described in the problem list entry "TiDB initializer pod ends in Init:Error with standard_init_linux.go:219".
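For reference, a minimal tidb-initializer.yaml that only sets the root password might look like the following sketch based on the TidbInitializer CRD; the cluster name advanced-tidb and the secret name tidb-secret match the objects created above:
apiVersion: pingcap.com/v1alpha1
kind: TidbInitializer
metadata:
  name: tidb-init
  namespace: tidb-cluster
spec:
  image: tnir/mysqlclient   # on ARM hosts use kanshiori/mysqlclient-arm64
  cluster:
    namespace: tidb-cluster
    name: advanced-tidb
  passwordSecret: tidb-secret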
5 Connect to the TiDB Cluster
After the TiDB cluster is created, kubectl get all -n tidb-cluster shows its resources, including the service address exposed for client access.
kubectl get all -n tidb-cluster
==============================
NAME READY STATUS RESTARTS AGE
pod/advanced-tidb-discovery-b8ddc49c5-pm2l6 1/1 Running 0 48m
pod/advanced-tidb-pd-0 1/1 Running 0 48m
pod/advanced-tidb-pd-1 1/1 Running 0 48m
pod/advanced-tidb-pd-2 1/1 Running 0 48m
pod/advanced-tidb-tidb-0 2/2 Running 0 43m
pod/advanced-tidb-tidb-1 2/2 Running 0 43m
pod/advanced-tidb-tidb-2 2/2 Running 0 45m
pod/advanced-tidb-tikv-0 1/1 Running 0 48m
pod/advanced-tidb-tikv-1 1/1 Running 0 48m
pod/advanced-tidb-tikv-2 1/1 Running 0 48m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/advanced-tidb-discovery ClusterIP 10.96.132.251 <none> 10261/TCP,10262/TCP 48m
service/advanced-tidb-pd ClusterIP 10.101.219.172 <none> 2379/TCP 48m
service/advanced-tidb-pd-peer ClusterIP None <none> 2380/TCP,2379/TCP 48m
service/advanced-tidb-tidb NodePort 10.111.104.136 <none> 4000:31263/TCP,10080:32410/TCP 48m
service/advanced-tidb-tidb-peer ClusterIP None <none> 10080/TCP 48m
service/advanced-tidb-tikv-peer ClusterIP None <none> 20160/TCP 48m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/advanced-tidb-discovery 1/1 1 1 48m
NAME DESIRED CURRENT READY AGE
replicaset.apps/advanced-tidb-discovery-b8ddc49c5 1 1 1 48m
NAME READY AGE
statefulset.apps/advanced-tidb-pd 3/3 48m
statefulset.apps/advanced-tidb-tidb 3/3 48m
statefulset.apps/advanced-tidb-tikv 3/3 48m
The output above shows that service/advanced-tidb-tidb is a NodePort service whose cluster IP is 10.111.104.136 and whose service port is 4000 (mapped to node port 31263 for access from outside the cluster). Using this IP and port, we can connect to the TiDB database:
mysql -h10.111.104.136 -P4000 -uroot -c
ERROR 1045 (28000): Access denied for user 'root'@'10.244.115.128' (using password: NO)
mysql -h10.111.104.136 -P4000 -uroot -c -proot123
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 3938453128
Server version: 8.0.11-TiDB-v8.1.0 TiDB Server (Apache License 2.0) Community Edition, MySQL 8.0 compatible
Copyright (c) 2000, 2023, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql>
In the output above, the first login attempt as root without a password fails, which confirms that the root password set during initialization has taken effect; the second attempt, with the password, connects to the cluster normally.
6 Problem List
kubeadm init fails with "failed to pull image"
kubeadm init --config=init.default.yaml
...
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action beforehand using 'kubeadm config images pull'
W1019 09:45:10.324749 794291 checks.go:846] detected that the sandbox image "registry.k8s.io/pause:3.8" of the container runtime is inconsistent with that used by kubeadm.It is recommended to use "registry.aliyuncs.com/google_containers/pause:3.10" as the CRI sandbox image.
error execution phase preflight: [preflight] Some fatal errors occurred
[ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/kube-apiserver:v1.31.0: failed to pull image registry.aliyuncs.com/google_containers/kube-apiserver:v1.31.0: rpc error: code = DeadlineExceeded desc = failed to pull and unpack image "registry.aliyuncs.com/google_containers/kube-apiserver:v1.31.0": failed to resolve reference "registry.aliyuncs.com/google_containers/kube-apiserver:v1.31.0": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/kube-apiserver/manifests/v1.31.0": dial tcp 120.55.105.209:443: i/o timeout
[ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/kube-controller-manager:v1.31.0: failed to pull image registry.aliyuncs.com/google_containers/kube-controller-manager:v1.31.0: rpc error: code = DeadlineExceeded desc = failed to pull and unpack image "registry.aliyuncs.com/google_containers/kube-controller-manager:v1.31.0": failed to resolve reference "registry.aliyuncs.com/google_containers/kube-controller-manager:v1.31.0": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/kube-controller-manager/manifests/v1.31.0": dial tcp 120.55.105.209:443: i/o timeout
[ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/kube-scheduler:v1.31.0: failed to pull image registry.aliyuncs.com/google_containers/kube-scheduler:v1.31.0: rpc error: code = DeadlineExceeded desc = failed to pull and unpack image "registry.aliyuncs.com/google_containers/kube-scheduler:v1.31.0": failed to resolve reference "registry.aliyuncs.com/google_containers/kube-scheduler:v1.31.0": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/kube-scheduler/manifests/v1.31.0": dial tcp 120.55.105.209:443: i/o timeout
[ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/kube-proxy:v1.31.0: failed to pull image registry.aliyuncs.com/google_containers/kube-proxy:v1.31.0: failed to pull and unpack image "registry.aliyuncs.com/google_containers/kube-proxy:v1.31.0": failed to resolve reference "registry.aliyuncs.com/google_containers/kube-proxy:v1.31.0": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/kube-proxy/manifests/v1.31.0": dial tcp 120.55.105.209:443: i/o timeout
[ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/coredns:v1.11.3: failed to pull image registry.aliyuncs.com/google_containers/coredns:v1.11.3: failed to pull and unpack image "registry.aliyuncs.com/google_containers/coredns:v1.11.3": failed to resolve reference "registry.aliyuncs.com/google_containers/coredns:v1.11.3": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/coredns/manifests/v1.11.3": dial tcp 120.55.105.209:443: i/o timeout
[ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/pause:3.10: failed to pull image registry.aliyuncs.com/google_containers/pause:3.10: failed to pull and unpack image "registry.aliyuncs.com/google_containers/pause:3.10": failed to resolve reference "registry.aliyuncs.com/google_containers/pause:3.10": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/pause/manifests/3.10": dial tcp 120.55.105.209:443: i/o timeout
[ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/etcd:3.5.15-0: failed to pull image registry.aliyuncs.com/google_containers/etcd:3.5.15-0: rpc error: code = DeadlineExceeded desc = failed to pull and unpack image "registry.aliyuncs.com/google_containers/etcd:3.5.15-0": failed to resolve reference "registry.aliyuncs.com/google_containers/etcd:3.5.15-0": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/etcd/manifests/3.5.15-0": dial tcp 120.55.105.209:443: i/o timeout
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
In a Kubernetes environment that uses containerd as the container runtime, network restrictions make it necessary to configure HTTPS_PROXY and NO_PROXY for containerd so that images can be pulled. Reference: https://blog.csdn.net/Beer_Do/article/details/113253618
## Configure a proxy for containerd
mkdir /etc/systemd/system/containerd.service.d
cat > /etc/systemd/system/containerd.service.d/http_proxy.conf << EOF
[Service]
Environment="HTTP_PROXY=xx.xx.x.x:3128"
Environment="HTTPS_PROXY=xx.xx.x.x:3128"
Environment="no_proxy=127.0.0.1,localhost,xx.xx.xx.151,10.96.0.0/12"
EOF
## Reload the systemd configuration and restart the containerd service
systemctl daemon-reload
systemctl restart containerd
kubeadm init fails with "context deadline exceeded"
kubeadm init --config=init.default.yaml
...
[kubelet-check] Waiting for a healthy kubelet at http://127.0.0.1:10248/healthz. This can take up to 4m0s
[kubelet-check] The kubelet is healthy after 1.00191638s
[api-check] Waiting for a healthy API server. This can take up to 4m0s
[api-check] The API server is not healthy after 4m0.000418939s
Unfortunately, an error has occurred:
context deadline exceeded
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'
...
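In this environment the cause was the sandbox image mentioned in section 1: the containerd configuration still points at registry.k8s.io, which is unreachable. A sketch of the fix (file path and setting name per the containerd 1.7 defaults; verify against your installation):
## Generate the default containerd configuration, then edit it
mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml
## In /etc/containerd/config.toml, under [plugins."io.containerd.grpc.v1.cri"], set:
##   sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.10"
systemctl restart containerd
## Then rerun kubeadm reset and kubeadm init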
kubeadm init fails with [ERROR FileContent--proc-sys-net-ipv4-ip_forward]
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR FileContent--proc-sys-net-ipv4-ip_forward]: /proc/sys/net/ipv4/ip_forward contents are not set to 1
kubeadm init warns [WARNING FileExisting-socat]
[WARNING FileExisting-socat]: socat not found in system path
kubeadm init warns [WARNING Hostname]
[WARNING Hostname]: hostname "host-xx-xx-x-151" could not be reached
kubeadm join fails with "couldn't validate the identity of the API Server"
[preflight] Running pre-flight checks
error execution phase preflight: couldn't validate the identity of the API Server: failed to request the cluster-info ConfigMap: client rate limiter Wait returned an error: context deadline exceeded
To see the stack trace of this error execute with --v=5 or higher
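As noted in section 1, this usually means the token or CA cert hash in the join command is no longer valid on the control plane (bootstrap tokens expire after 24 hours by default). A fresh join command can be printed on the control-plane node with:
kubeadm token create --print-join-command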
kubectl describe node shows "cni plugin not initialized"
kubectl describe node host-xx-xx-x-151
======================================
...
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Sun, 20 Oct 2024 17:30:26 +0800 Sun, 20 Oct 2024 14:39:52 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sun, 20 Oct 2024 17:30:26 +0800 Sun, 20 Oct 2024 14:39:52 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Sun, 20 Oct 2024 17:30:26 +0800 Sun, 20 Oct 2024 14:39:52 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready False Sun, 20 Oct 2024 17:30:26 +0800 Sun, 20 Oct 2024 14:39:52 +0800 KubeletNotReady container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
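A sketch of the workaround mentioned in section 1: download the manifest, rewrite the docker.io references to a reachable mirror (dockerproxy.cn is only an example), and apply it again.
curl -o calico.yaml https://docs.projectcalico.org/manifests/calico.yaml
sed -i 's#docker.io/#dockerproxy.cn/#g' calico.yaml
kubectl apply -f calico.yaml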
kubectl get pods -n tidb-cluster shows basic-pd-0 as Pending
kubectl get pods -n tidb-cluster
NAME READY STATUS RESTARTS AGE
basic-discovery-85c8d6cd7f-wck48 1/1 Running 0 2m3s
basic-pd-0 0/1 Pending 0 2m3s
Configure local storage by following https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/configure-storage-class#%E6%9C%AC%E5%9C%B0-pv-%E9%85%8D%E7%BD%AE (the local PV section).
Note that image: "quay.io/external_storage/local-volume-provisioner:v2.3.4" in local-volume-provisioner.yaml needs to be changed to image: "quay.io/external_storage/local-volume-provisioner:v2.5.0", because v2.3.4 is too old and can fail with no match for platform in manifest: not found.
## Create the directories on each node and bind-mount them
mkdir /data1/pdk8s/pd -p
mkdir /data1/tikvk8s/tikv -p
mkdir /data1/tidbk8s/tidb -p
mount --bind /data1/pdk8s/pd /data1/pdk8s/pd
mount --bind /data1/tikvk8s/tikv /data1/tikvk8s/tikv
mount --bind /data1/tidbk8s/tidb /data1/tidbk8s/tidb
## Configure local-storage PVs
curl -o local-volume-provisioner.yaml https://raw.githubusercontent.com/pingcap/tidb-operator/v1.6.0/examples/local-pv/local-volume-provisioner.yaml
vi local-volume-provisioner.yaml
kubectl delete -f local-volume-provisioner.yaml
kubectl apply -f local-volume-provisioner.yaml
kubectl get po -n kube-system -l app=local-volume-provisioner
================================================================
NAME READY STATUS RESTARTS AGE
local-volume-provisioner-8f7ms 1/1 Running 0 135m
local-volume-provisioner-xw7h7 1/1 Running 0 136m
local-volume-provisioner-zj27b 1/1 Running 0 8m58s
kubectl get pv
=============
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS VOLUMEATTRIBUTESCLASS REASON AGE
local-pv-1d87b555 2459Gi RWO Delete Available tidb-storage <unset> 2m12s
local-pv-1fe2627e 2459Gi RWO Delete Available pd-storage <unset> 2m12s
local-pv-4aba16db 2459Gi RWO Delete Available pd-storage <unset> 2m12s
local-pv-4e85cc9d 2459Gi RWO Delete Available tikv-storage <unset> 2m12s
local-pv-547cf652 2459Gi RWO Delete Available tikv-storage <unset> 2m12s
local-pv-6870ef87 2459Gi RWO Delete Available tidb-storage <unset> 2m12s
local-pv-89a42df0 2459Gi RWO Delete Available tikv-storage <unset> 2m12s
local-pv-a898b600 2459Gi RWO Delete Available pd-storage <unset> 2m12s
local-pv-b039092a 2459Gi RWO Delete Available tidb-storage <unset> 2m12s
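Besides the image tag, the storageClassMap in local-volume-provisioner.yaml also has to point at the directories mounted above. A sketch of that section (the layout follows sig-storage-local-static-provisioner; the class names are taken from the PV output above, and matching StorageClass objects are defined in the same manifest):
data:
  storageClassMap: |
    pd-storage:
      hostDir: /data1/pdk8s
      mountDir: /data1/pdk8s
    tikv-storage:
      hostDir: /data1/tikvk8s
      mountDir: /data1/tikvk8s
    tidb-storage:
      hostDir: /data1/tidbk8s
      mountDir: /data1/tidbk8s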
pd pods show ImagePullBackOff after deploying the TiDB cluster
kubectl describe pod advanced-tidb-pd-2 -n tidb-cluster
======================================================
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 19m default-scheduler Successfully assigned tidb-cluster/advanced-tidb-pd-2 to host-xx-xx-xx-151
Normal Pulling 18m (x4 over 19m) kubelet Pulling image "pingcap/pd:v8.1.0"
Warning Failed 18m (x4 over 19m) kubelet Failed to pull image "pingcap/pd:v8.1.0": failed to pull and unpack image "docker.io/pingcap/pd:v8.1.0": failed to resolve reference "docker.io/pingcap/pd:v8.1.0": failed to do request: Head "https://registry-1.docker.io/v2/pingcap/pd/manifests/v8.1.0": EOF
Warning Failed 18m (x4 over 19m) kubelet Error: ErrImagePull
Warning Failed 17m (x6 over 19m) kubelet Error: ImagePullBackOff
Normal BackOff 4m23s (x66 over 19m) kubelet Back-off pulling image "pingcap/pd:v8.1.0"
TiDB initializer pod ends in Init:Error with standard_init_linux.go:219
kubectl get pods -n tidb-cluster
================================
NAME READY STATUS RESTARTS AGE
advanced-tidb-discovery-b8ddc49c5-pm2l6 1/1 Running 0 70m
advanced-tidb-pd-0 1/1 Running 0 70m
advanced-tidb-pd-1 1/1 Running 0 70m
advanced-tidb-pd-2 1/1 Running 0 70m
advanced-tidb-tidb-0 2/2 Running 0 65m
advanced-tidb-tidb-1 2/2 Running 0 65m
advanced-tidb-tidb-2 2/2 Running 0 67m
advanced-tidb-tidb-initializer-k5hjb 0/1 Init:Error 0 70s
advanced-tidb-tikv-0 1/1 Running 0 70m
advanced-tidb-tikv-1 1/1 Running 0 70m
advanced-tidb-tikv-2 1/1 Running 0 70m
kubectl logs advanced-tidb-tidb-initializer-k5hjb -n tidb-cluster
=================================================================
Defaulted container "mysql-client" out of: mysql-client, wait (init)
Error from server (BadRequest): container "mysql-client" in pod "advanced-tidb-tidb-initializer-k5hjb" is waiting to start: PodInitializing
kubectl logs advanced-tidb-tidb-initializer-47x5r -n tidb-cluster -c wait
=========================================================================
standard_init_linux.go:219: exec user process caused "exec format error"
libcontainer: container start initialization failed: standard_init_linux.go:219: exec user process caused "exec format error"