Kubernetes Cluster Upgrade, Backup, and Failure Recovery (kubeadm)

小小文
Published on: July 20, 2020

1. Cluster version upgrade

Five machines in total: three master nodes and two worker nodes.

OS: CentOS Linux release 7.5  Memory: 8 GB  Disk: 50 GB

Minimal installation.

10.103.22.231 master01 haproxy keepalived
10.103.22.232 master02 haproxy keepalived
10.103.22.233 master03 haproxy keepalived
10.103.22.234 node04
10.103.22.235 node05

The cluster is already up and running at v1.17.8 and will be upgraded to v1.18.5 (see the previous article for how the cluster was installed).

etcd (v3.4.9) was installed from binaries and is not upgraded here.
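Since etcd holds all of the cluster state, it is worth taking a snapshot before touching anything, even though etcd itself is not being upgraded. A minimal sketch, reusing the etcdctl binary and certificate paths that appear later in this article (the snapshot path is just an example):

# Run on any etcd member; the snapshot file name/location is arbitrary
ETCDCTL_API=3 /opt/kubernetes/bin/etcdctl \
  --endpoints=https://10.103.22.231:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.pem \
  --cert=/etc/kubernetes/pki/etcd/etcd.pem \
  --key=/etc/kubernetes/pki/etcd/etcd-key.pem \
  snapshot save /opt/backup/etcd-snapshot-$(date +%Y%m%d).db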

Version comparison before and after the upgrade:



Component            Before     After
API Server           v1.17.8    v1.18.5
Controller Manager   v1.17.8    v1.18.5
Scheduler            v1.17.8    v1.18.5
Kube Proxy           v1.17.8    v1.18.5
Kubelet              v1.17.8    v1.18.5
Kubectl              v1.17.8    v1.18.5
Kubeadm              v1.17.8    v1.18.5
CoreDNS              1.6.5      1.6.7

1.1 Upgrade master01

Install the new versions of kubeadm and kubelet:

export VERSION=v1.18.5 # or manually specify another released Kubernetes version
export ARCH=amd64 # or: arm, arm64, ppc64le, s390x
curl -sSL https://dl.k8s.io/release/${VERSION}/bin/linux/${ARCH}/kubeadm > /usr/bin/kubeadm
chmod a+rx /usr/bin/kubeadm

# If the network is unreachable here, download the kubernetes-node 1.18.5 package instead and copy the kubeadm binary from it into /usr/bin
# Check the kubeadm version
kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.5", GitCommit:"e6503f8d8f769ace2f338794c914a96fc335df0f", GitTreeState:"clean", BuildDate:"2020-06-26T03:45:16Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

# Or install via yum
yum install -y kubeadm-1.18.5 kubectl-1.18.5 kubelet-1.18.5
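If the binary is downloaded manually as above, it can also be checked against the published sha256 file before use; a small sketch, assuming the same VERSION and ARCH variables, the /usr/bin/kubeadm target path, and that a .sha256 file is published alongside the binary for this release:

# Prints "/usr/bin/kubeadm: OK" when the checksum matches
echo "$(curl -sSL https://dl.k8s.io/release/${VERSION}/bin/linux/${ARCH}/kubeadm.sha256)  /usr/bin/kubeadm" | sha256sum --check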

Drain the node

If your master node is also running workloads as a regular node, run the following command to evict those pods and put the node into maintenance mode (scheduling disabled).

# Replace NODE_NAME with the name of the master node
kubectl drain NODE_NAME --ignore-daemonsets
kubectl drain master01 --ignore-daemonsets

node/master01 cordoned
evicting pod "coredns-6955765f44-2m8sl"
evicting pod "coredns-6955765f44-t8v9d"
pod/coredns-6955765f44-2m8sl evicted
pod/coredns-6955765f44-t8v9d evicted
node/master01 evicted

# Check the nodes
kubectl get nodes

NAME STATUS ROLES AGE VERSION
master01 Ready,SchedulingDisabled master 6h37m v1.18.5
master02 Ready master 6h23m v1.18.5
master03 Ready master 6h26m v1.17.8
node04 Ready <none> 6h13m v1.17.8
node05 Ready <none> 6h17m v1.17.8

 

Review the upgrade plan

The upgrade plan can be viewed with the following command; it lists all the steps that will be performed during the upgrade along with any warnings, so review it carefully.

kubeadm upgrade plan v1.18.5

[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks.
[upgrade] Running cluster health checks
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: v1.17.8
[upgrade/versions] kubeadm version: v1.18.5
[upgrade/versions] Latest stable version: v1.18.5
[upgrade/versions] Latest stable version: v1.18.5
[upgrade/versions] Latest version in the v1.17 series: v1.17.8
[upgrade/versions] Latest version in the v1.17 series: v1.17.8

External components that should be upgraded manually before you upgrade the control plane with 'kubeadm upgrade apply':
COMPONENT CURRENT AVAILABLE
Etcd 3.4.9 3.4.3-0

Components that must be upgraded manually after you have upgraded the control plane with 'kubeadm upgrade apply':
COMPONENT CURRENT AVAILABLE
Kubelet 5 x v1.17.8 v1.18.5

Upgrade to the latest stable version:

COMPONENT CURRENT AVAILABLE
API Server v1.17.8 v1.18.5
Controller Manager v1.17.8 v1.18.5
Scheduler v1.17.8 v1.18.5
Kube Proxy v1.17.8 v1.18.5
CoreDNS 1.6.5 1.6.7

You can now apply the upgrade by executing the following command:

kubeadm upgrade apply v1.18.5

Once the upgrade plan has been confirmed, a single command is all it takes to upgrade the current master node to the target version.

Adjust the Kubernetes version in the configuration file

# Edit the kubeadm-init.yaml file
vim kubeadm-init.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
#kubernetesVersion: v1.17.8
kubernetesVersion: v1.18.5
#imageRepository: registry.aliyuncs.com/google_containers
controlPlaneEndpoint: 10.103.22.236:8443
apiServer:
  certSANs:
  - 10.103.22.231
  - 10.103.22.232
  - 10.103.22.233
  - 10.103.22.236
  - 127.0.0.1
etcd:
  external:
    endpoints:
    - https://10.103.22.231:2379
    - https://10.103.22.232:2379
    - https://10.103.22.233:2379
    caFile: /etc/kubernetes/pki/etcd/ca.pem
    certFile: /etc/kubernetes/pki/etcd/etcd.pem
    keyFile: /etc/kubernetes/pki/etcd/etcd-key.pem
networking:
  dnsDomain: cluster.local
  #serviceSubnet: 10.96.0.0/12
  serviceSubnet: 192.96.0.1/16
  podSubnet: 192.98.0.1/16
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs
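Before running the real upgrade, kubeadm can also show what it would do without changing any cluster state by adding --dry-run to the apply command below; an optional sketch using the same config file:

kubeadm upgrade apply v1.18.5 --config kubeadm-init.yaml --dry-run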

Run the upgrade command

kubeadm upgrade apply v1.18.5 --config kubeadm-init.yaml
# Or run it directly without the config file
kubeadm upgrade apply v1.18.5

W0717 16:00:23.537910 24663 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]
[upgrade/config] Making sure the configuration is correct:
W0717 16:00:23.561047 24663 common.go:94] WARNING: Usage of the --config flag for reconfiguring the cluster during upgrade is not recommended!
W0717 16:00:23.562556 24663 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]
[preflight] Running pre-flight checks.
[upgrade] Running cluster health checks
[upgrade/version] You have chosen to change the cluster version to "v1.18.5"
[upgrade/versions] Cluster version: v1.17.8
[upgrade/versions] kubeadm version: v1.18.5
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/prepull] Prepulling image for component kube-scheduler.
[upgrade/prepull] Prepulling image for component kube-apiserver.
[upgrade/prepull] Prepulling image for component kube-controller-manager.
[apiclient] Found 3 Pods for label selector k8s-app=upgrade-prepull-kube-apiserver
[apiclient] Found 1 Pods for label selector k8s-app=upgrade-prepull-kube-controller-manager
[apiclient] Found 0 Pods for label selector k8s-app=upgrade-prepull-kube-scheduler
[apiclient] Found 3 Pods for label selector k8s-app=upgrade-prepull-kube-scheduler
[apiclient] Found 3 Pods for label selector k8s-app=upgrade-prepull-kube-controller-manager
[upgrade/prepull] Prepulled image for component kube-apiserver.
[upgrade/prepull] Prepulled image for component kube-controller-manager.
[upgrade/prepull] Prepulled image for component kube-scheduler.
[upgrade/prepull] Successfully prepulled the images for all the control plane components
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.18.5"...
Static pod: kube-apiserver-master01 hash: ac5b27a793558b141550f6d29ce07c7b
Static pod: kube-controller-manager-master01 hash: 8055e9e49e89f81a31a76272b0be7213
Static pod: kube-scheduler-master01 hash: d152a6203d96ca8caf9351d3e5c0b743
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests540134147"
[upgrade/staticpods] Preparing for "kube-apiserver" upgrade
[upgrade/staticpods] Renewing apiserver certificate
[upgrade/staticpods] Renewing apiserver-kubelet-client certificate
[upgrade/staticpods] Renewing front-proxy-client certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2020-07-17-16-00-43/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-apiserver-master01 hash: ac5b27a793558b141550f6d29ce07c7b
Static pod: kube-apiserver-master01 hash: ac5b27a793558b141550f6d29ce07c7b
Static pod: kube-apiserver-master01 hash: ac5b27a793558b141550f6d29ce07c7b
Static pod: kube-apiserver-master01 hash: 2164549833210857ac953cad7bed31c3
[apiclient] Found 3 Pods for label selector component=kube-apiserver
[upgrade/staticpods] Component "kube-apiserver" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-controller-manager" upgrade
[upgrade/staticpods] Renewing controller-manager.conf certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-controller-manager.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2020-07-17-16-00-43/kube-controller-manager.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-controller-manager-master01 hash: 8055e9e49e89f81a31a76272b0be7213
Static pod: kube-controller-manager-master01 hash: 28ab4b90a4349f032601b7702c459bca
[apiclient] Found 3 Pods for label selector component=kube-controller-manager
[upgrade/staticpods] Component "kube-controller-manager" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-scheduler" upgrade
[upgrade/staticpods] Renewing scheduler.conf certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-scheduler.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2020-07-17-16-00-43/kube-scheduler.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-scheduler-master01 hash: d152a6203d96ca8caf9351d3e5c0b743
Static pod: kube-scheduler-master01 hash: 2185f16ef8b60b3ef814c66cb76287c2
[apiclient] Found 3 Pods for label selector component=kube-scheduler
[upgrade/staticpods] Component "kube-scheduler" upgraded successfully!
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config-1.18" in namespace kube-system with the configuration for the kubelets in the cluster
[kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.18" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[bootstrap-token] configured RBAC rules to allow Node Bootstrap tokens to get nodes
[bootstrap-token] configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[addons] Applied essential addon: CoreDNS
[endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address
[addons] Applied essential addon: kube-proxy

[upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.18.5". Enjoy!

[upgrade/kubelet] Now that your control plane is upgraded, please proceed with upgrading your kubelets if you haven't already done so.

Restart the kubelet

systemctl daemon-reload && systemctl restart kubelet
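Note that master01 was drained (cordoned) at the start of this section; once the kubelet is back up and the node reports v1.18.5, uncordon it so it can accept workloads again, just as the following sections do for the other nodes:

kubectl uncordon master01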

1.2 Upgrade master02

# Drain the node
kubectl drain master02 --ignore-daemonsets
# Upgrade kubeadm, kubectl, and kubelet
yum install -y kubeadm-1.18.5 kubectl-1.18.5 kubelet-1.18.5
# Run the upgrade command
kubeadm upgrade node
# Restart the kubelet
systemctl daemon-reload && systemctl restart kubelet
# Uncordon the node
kubectl uncordon master02

node/master02 uncordoned

1.3 Upgrade master03

# Drain the node
kubectl drain master03 --ignore-daemonsets

WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-xcjft, kube-system/kube-proxy-4r9d7
node/master03 drained

# Upgrade kubeadm, kubectl, and kubelet
yum install -y kubeadm-1.18.5 kubectl-1.18.5 kubelet-1.18.5
# Run the upgrade command
kubeadm upgrade node
# Restart the kubelet
systemctl daemon-reload && systemctl restart kubelet
# Uncordon the node
kubectl uncordon master03

node/master03 uncordoned

1.4 Upgrade node04

# Drain the node
kubectl drain node04 --ignore-daemonsets

node/node04 cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-f4dx8, kube-system/kube-proxy-zb8th
evicting pod default/tomcat-7989d99887-bn9c7
evicting pod kube-system/coredns-684f7f6cb4-96qpj
pod/tomcat-7989d99887-bn9c7 evicted
pod/coredns-684f7f6cb4-96qpj evicted
node/node04 evicted

# Upgrade kubeadm, kubectl, and kubelet
yum install -y kubeadm-1.18.5 kubectl-1.18.5 kubelet-1.18.5
# Run the upgrade command
kubeadm upgrade node
# Restart the kubelet
systemctl daemon-reload && systemctl restart kubelet
# Uncordon the node
kubectl uncordon node04

1.5 Upgrade node05

# Drain the node
kubectl drain node05 --ignore-daemonsets

node/node05 cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-sw5lf, kube-system/kube-proxy-lm68f
evicting pod default/tomcat-7989d99887-2rtp9
evicting pod kube-system/calico-kube-controllers-f64c4b8fb-f7cjw
evicting pod default/centos-node05-758c6ddfc8-858m7
evicting pod kube-system/coredns-684f7f6cb4-q9b9g
pod/calico-kube-controllers-f64c4b8fb-f7cjw evicted
pod/coredns-684f7f6cb4-q9b9g evicted
pod/tomcat-7989d99887-2rtp9 evicted
pod/centos-node05-758c6ddfc8-858m7 evicted
node/node05 evicted

# Upgrade kubeadm, kubectl, and kubelet
yum install -y kubeadm-1.18.5 kubectl-1.18.5 kubelet-1.18.5
# Run the upgrade command
kubeadm upgrade node
# Restart the kubelet
systemctl daemon-reload && systemctl restart kubelet
# Uncordon the node
kubectl uncordon node05

1.6 Check the cluster upgrade status

kubectl get node

NAME STATUS ROLES AGE VERSION
master01 Ready master 7h16m v1.18.5
master02 Ready master 7h2m v1.18.5
master03 Ready master 7h5m v1.18.5
node04 Ready <none> 6h52m v1.18.5
node05 Ready <none> 6h56m v1.18.5

The cluster has been upgraded from v1.17.8 to v1.18.5.
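Besides the node versions, the control-plane pod images can be checked directly, reusing the component labels that appear in the upgrade output above; a quick sketch:

# Each command should print three v1.18.5 images, one per master
kubectl get pods -n kube-system -l component=kube-apiserver -o jsonpath='{.items[*].spec.containers[*].image}{"\n"}'
kubectl get pods -n kube-system -l component=kube-controller-manager -o jsonpath='{.items[*].spec.containers[*].image}{"\n"}'
kubectl get pods -n kube-system -l component=kube-scheduler -o jsonpath='{.items[*].spec.containers[*].image}{"\n"}'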

2. Kubernetes failure handling

2.1 Simulate an etcd failure and recover the etcd cluster

Simulate the failure

# Delete the etcd data directory
rm -rf /opt/lib/etcd/member
# The etcd log starts reporting errors
Jul 18 21:38:14 master03 etcd: store.index: compact 357144
Jul 18 21:38:14 master03 etcd: finished scheduled compaction at 357144 (took 74.15366ms)
Jul 18 21:38:53 master03 systemd-logind: New session 280 of user root.
Jul 18 21:38:53 master03 systemd: Started Session 280 of user root.
Jul 18 21:38:53 master03 systemd: Starting Session 280 of user root.
Jul 18 21:39:19 master03 etcd: failed to purge snap db file open /opt/lib/etcd/member/snap: no such file or directory
Jul 18 21:39:19 master03 systemd: etcd.service: main process exited, code=exited, status=1/FAILURE
Jul 18 21:39:19 master03 systemd: Unit etcd.service entered failed state.
Jul 18 21:39:19 master03 systemd: etcd.service failed.
Jul 18 21:39:25 master03 systemd: etcd.service holdoff time over, scheduling restart.
Jul 18 21:39:25 master03 systemd: Starting Etcd Server...
Jul 18 21:39:25 master03 etcd: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
Jul 18 21:39:25 master03 etcd: etcd Version: 3.4.9
Jul 18 21:39:25 master03 etcd: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
Jul 18 21:39:25 master03 etcd: Git SHA: 54ba95891
Jul 18 21:39:25 master03 etcd: Go Version: go1.12.17
Jul 18 21:39:25 master03 etcd: Go OS/Arch: linux/amd64
Jul 18 21:39:25 master03 etcd: setting maximum number of CPUs to 8, total number of available CPUs is 8
Jul 18 21:39:25 master03 etcd: found invalid file/dir member.bak under data dir /opt/lib/etcd (Ignore this if you are upgrading etcd)
Jul 18 21:39:25 master03 etcd: peerTLS: cert = /etc/kubernetes/pki/etcd/etcd.pem, key = /etc/kubernetes/pki/etcd/etcd-key.pem, trusted-ca = /etc/kubernetes/pki/etcd/ca.pem, client-cert-auth = true, crl-file =

# The data files can no longer be found
etcd: failed to purge snap db file open /opt/lib/etcd/member/snap: no such file or directory

Recover the cluster

# Check the Kubernetes cluster component status
kubectl get cs

etcd-2 Unhealthy Get https://10.103.22.233:2379/health: dial tcp 10.103.22.233:2379: connect: connection refused
scheduler Healthy ok
controller-manager Healthy ok
etcd-0 Healthy {"health":"true"}
etcd-1 Healthy {"health":"true"}
# etcd-2 is shown as unavailable

# List all etcd members
etcdctl member list
3a71c9f1368858b8, started, etcd0, https://10.103.22.231:2380, https://10.103.22.231:2379, false
5cb39603b2e01de0, started, etcd1, https://10.103.22.232:2380, https://10.103.22.232:2379, false
9540dd3ad80b71ff, started, etcd2, https://10.103.22.233:2380, https://10.103.22.233:2379, false
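The etcdctl member list/remove/add commands in this section assume etcdctl can already reach the cluster over TLS; if they fail with connection or certificate errors, the same settings used in the health-check loop below can be exported once as environment variables (a convenience sketch with this cluster's paths):

export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://10.103.22.231:2379,https://10.103.22.232:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.pem
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/etcd.pem
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/etcd-key.pem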

# Check the health of each member
NODE_IPS=("10.103.22.231" "10.103.22.232" "10.103.22.233")
for node_ip in "${NODE_IPS[@]}"; do
  echo ">>> ${node_ip}"
  ETCDCTL_API=3 /opt/kubernetes/bin/etcdctl \
    --endpoints=https://${node_ip}:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.pem \
    --cert=/etc/kubernetes/pki/etcd/etcd.pem \
    --key=/etc/kubernetes/pki/etcd/etcd-key.pem endpoint health
done

# Remove the failed member
etcdctl member remove 9540dd3ad80b71ff

# Repair the failed member (run on the failed node)
# Edit the etcd.service unit file
vim /etc/systemd/system/etcd.service
# Change --initial-cluster-state=new to --initial-cluster-state=existing

# Re-add the member (run on a healthy node) so that etcd2 rejoins the cluster
etcdctl member add etcd2 --peer-urls=https://10.103.22.233:2380

Member 4e32023da83ac173 added to cluster a5ff38a50855db37
ETCD_NAME="etcd2"
ETCD_INITIAL_CLUSTER="etcd0=https://10.103.22.231:2380,etcd2=https://10.103.22.233:2380,etcd1=https://10.103.22.232:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.103.22.233:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
The member was added successfully.

# Restart etcd (run on the failed node)
systemctl restart etcd

Jul 18 21:58:47 master03 etcd: name = etcd2
Jul 18 21:58:47 master03 etcd: data dir = /opt/lib/etcd
Jul 18 21:58:47 master03 etcd: member dir = /opt/lib/etcd/member
Jul 18 21:58:47 master03 etcd: heartbeat = 100ms
Jul 18 21:58:47 master03 etcd: election = 1000ms
Jul 18 21:58:47 master03 etcd: snapshot count = 100000
Jul 18 21:58:47 master03 etcd: advertise client URLs = https://10.103.22.233:2379
Jul 18 21:58:47 master03 etcd: starting member 4e32023da83ac173 in cluster a5ff38a50855db37
Jul 18 21:58:47 master03 etcd: raft2020/07/18 21:58:47 INFO: 4e32023da83ac173 switched to configuration voters=()
Jul 18 21:58:47 master03 etcd: raft2020/07/18 21:58:47 INFO: 4e32023da83ac173 became follower at term 0
Jul 18 21:58:47 master03 etcd: raft2020/07/18 21:58:47 INFO: newRaft 4e32023da83ac173 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
Jul 18 21:58:47 master03 etcd: simple token is not cryptographically signed
Jul 18 21:58:47 master03 etcd: started HTTP pipelining with peer 3a71c9f1368858b8
Jul 18 21:58:47 master03 etcd: started HTTP pipelining with peer 5cb39603b2e01de0
Jul 18 21:58:47 master03 etcd: starting peer 3a71c9f1368858b8...
Jul 18 21:58:47 master03 etcd: started HTTP pipelining with peer 3a71c9f1368858b8
Jul 18 21:58:47 master03 etcd: started streaming with peer 3a71c9f1368858b8 (writer)
Jul 18 21:58:47 master03 etcd: started streaming with peer 3a71c9f1368858b8 (writer)
Jul 18 21:58:47 master03 etcd: started peer 3a71c9f1368858b8
Jul 18 21:58:47 master03 etcd: added peer 3a71c9f1368858b8
Jul 18 21:58:47 master03 etcd: started streaming with peer 3a71c9f1368858b8 (stream Message reader)
Jul 18 21:58:47 master03 etcd: started streaming with peer 3a71c9f1368858b8 (stream MsgApp v2 reader)
Jul 18 21:58:47 master03 etcd: starting peer 5cb39603b2e01de0...
Jul 18 21:58:47 master03 etcd: started HTTP pipelining with peer 5cb39603b2e01de0
Jul 18 21:58:47 master03 etcd: started streaming with peer 5cb39603b2e01de0 (writer)
Jul 18 21:58:47 master03 etcd: started peer 5cb39603b2e01de0
Jul 18 21:58:47 master03 etcd: added peer 5cb39603b2e01de0
Jul 18 21:58:47 master03 etcd: started streaming with peer 5cb39603b2e01de0 (stream MsgApp v2 reader)
Jul 18 21:58:47 master03 etcd: started streaming with peer 5cb39603b2e01de0 (stream Message reader)
Jul 18 21:58:47 master03 etcd: starting server... [version: 3.4.9, cluster version: to_be_decided]
Jul 18 21:58:47 master03 etcd: started streaming with peer 5cb39603b2e01de0 (writer)
Jul 18 21:58:47 master03 etcd: ClientTLS: cert = /etc/kubernetes/pki/etcd/etcd.pem, key = /etc/kubernetes/pki/etcd/etcd-key.pem, trusted-ca = /etc/kubernetes/pki/etcd/ca.pem, client-cert-auth = true, crl-file =
Jul 18 21:58:47 master03 etcd: listening for peers on 10.103.22.233:2380
Jul 18 21:58:47 master03 etcd: peer 5cb39603b2e01de0 became active
The startup log shows no problems.

# Check the etcd member list after recovery
etcdctl member list
3a71c9f1368858b8, started, etcd0, https://10.103.22.231:2380, https://10.103.22.231:2379, false
4e32023da83ac173, started, etcd2, https://10.103.22.233:2380, https://10.103.22.233:2379, false
5cb39603b2e01de0, started, etcd1, https://10.103.22.232:2380, https://10.103.22.232:2379, false
All three members are up and running.
# Check the Kubernetes cluster component status
kubectl get cs
NAME STATUS MESSAGE ERROR
scheduler Healthy ok
controller-manager Healthy ok
etcd-0 Healthy {"health":"true"}
etcd-2 Healthy {"health":"true"}
etcd-1 Healthy {"health":"true"}

# Check the health of each member
NODE_IPS=("10.103.22.231" "10.103.22.232" "10.103.22.233")
for node_ip in "${NODE_IPS[@]}"; do
  echo ">>> ${node_ip}"
  ETCDCTL_API=3 /opt/kubernetes/bin/etcdctl \
    --endpoints=https://${node_ip}:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.pem \
    --cert=/etc/kubernetes/pki/etcd/etcd.pem \
    --key=/etc/kubernetes/pki/etcd/etcd-key.pem endpoint health
done

>>> 10.103.22.231
https://10.103.22.231:2379 is healthy: successfully committed proposal: took = 36.375259ms
>>> 10.103.22.232
https://10.103.22.232:2379 is healthy: successfully committed proposal: took = 22.921767ms
>>> 10.103.22.233
https://10.103.22.233:2379 is healthy: successfully committed proposal: took = 19.169257ms
etcd is back to a healthy state.
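The procedure above rebuilds a single lost member from its healthy peers. If the whole etcd cluster were lost instead, it could be rebuilt from a previously saved snapshot (see the backup note in section 1). A rough sketch for one member, assuming a hypothetical snapshot file; each member runs the same restore from the same file with its own --name and peer URL, with etcd stopped and the data directory empty, and etcd is then started on all of them:

# Hypothetical snapshot file; adjust --name and peer URL per member
ETCDCTL_API=3 /opt/kubernetes/bin/etcdctl snapshot restore /opt/backup/etcd-snapshot-20200717.db \
  --name etcd2 \
  --data-dir /opt/lib/etcd \
  --initial-cluster etcd0=https://10.103.22.231:2380,etcd1=https://10.103.22.232:2380,etcd2=https://10.103.22.233:2380 \
  --initial-advertise-peer-urls https://10.103.22.233:2380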

 

2.2 Simulate a Kubernetes master-node failure and recover the cluster

Run the following command on master03 to reset the node:
kubeadm reset -f
Check the node status:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
master01 Ready master 3d4h v1.18.5
master02 Ready master 3d3h v1.18.5
master03 NotReady master 3d3h v1.18.5
node04 Ready <none> 3d3h v1.18.5
node05 Ready <none> 3d3h v1.18.5

master03 is now shown as NotReady.

Rejoin the cluster

First delete the node:
kubectl delete node master03
node "master03" deleted

kubectl get nodes
NAME STATUS ROLES AGE VERSION
master01 Ready master 3d4h v1.18.5
master02 Ready master 3d4h v1.18.5
node04 Ready <none> 3d3h v1.18.5
node05 Ready <none> 3d3h v1.18.5
Node master03 has been deleted.

Check the pods that were running on master03:
kubectl get pods -n kube-system -o wide |grep master03
calico-node-xcjft 1/1 Running 0 3d4h 10.103.22.233 master03 <none> <none>
kube-apiserver-master03 1/1 Terminating 0 2d21h 10.103.22.233 master03 <none> <none>
kube-controller-manager-master03 1/1 Terminating 0 2d21h 10.103.22.233 master03 <none> <none>
kube-proxy-4r9d7 1/1 Running 0 2d21h 10.103.22.233 master03 <none> <none>
kube-scheduler-master03 1/1 Terminating 0 2d21h 10.103.22.233 master03 <none> <none>
They are currently terminating and will all be cleaned up shortly.
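If any of these leftover mirror pods stay stuck in Terminating, they can also be removed from the API server by hand; a hedged sketch using one of the pod names from the listing above:

kubectl -n kube-system delete pod kube-apiserver-master03 --force --grace-period=0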

Generate a new join token:
kubeadm token create --print-join-command
kubeadm join 10.103.22.236:8443 --token ljfifa.s04hny45lrmreynk --discovery-token-ca-cert-hash sha256:b5cf22ae5d3b745f5351127e6cfc78a3d1dfd52ee8454a07e25216d0dfc9e8a6
Generate a new certificate key:
kubeadm init phase upload-certs --upload-certs --config kubeadm-init.yaml
W0720 14:24:47.205467 16728 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
78e210b13d14480895d445d15b4185085f2dcc1a7480ed663e8cd0d45d5a3d4f

Rejoin the cluster:
kubeadm join 10.103.22.236:8443 --token ljfifa.s04hny45lrmreynk \
--discovery-token-ca-cert-hash sha256:b5cf22ae5d3b745f5351127e6cfc78a3d1dfd52ee8454a07e25216d0dfc9e8a6 \
--control-plane --certificate-key 78e210b13d14480895d445d15b4185085f2dcc1a7480ed663e8cd0d45d5a3d4f

[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks before initializing the new control plane instance
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[download-certs] Downloading the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [master03 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [192.96.0.1 10.103.22.233 10.103.22.236 10.103.22.231 10.103.22.232 10.103.22.233 10.103.22.236 127.0.0.1]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[certs] Generating "sa" key and public key
[kubeconfig] Generating kubeconfig files
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address
error execution phase control-plane-prepare/kubeconfig: error generating kubeconfig files: a kubeconfig file "/etc/kubernetes/admin.conf" exists already but has got the wrong CA cert
To see the stack trace of this error execute with --v=5 or higher
[root@master03 ~]# rm /etc/kubernetes/admin.conf
rm: remove regular file ‘/etc/kubernetes/admin.conf’? y
[root@master03 ~]# kubeadm join 10.103.22.236:8443 --token ljfifa.s04hny45lrmreynk --discovery-token-ca-cert-hash sha256:b5cf22ae5d3b745f5351127e6cfc78a3d1dfd52ee8454a07e25216d0dfc9e8a6 --control-plane --certificate-key 78e210b13d14480895d445d15b4185085f2dcc1a7480ed663e8cd0d45d5a3d4f
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks before initializing the new control plane instance
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[download-certs] Downloading the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Using the existing "apiserver-kubelet-client" certificate and key
[certs] Using the existing "apiserver" certificate and key
[certs] Using the existing "front-proxy-client" certificate and key
[certs] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[certs] Using the existing "sa" key
[kubeconfig] Generating kubeconfig files
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
W0720 14:26:01.528901 16975 manifests.go:225] the default kube-apiserver authorization-mode is "Node,RBAC"; using "Node,RBAC"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
W0720 14:26:01.546734 16975 manifests.go:225] the default kube-apiserver authorization-mode is "Node,RBAC"; using "Node,RBAC"
[control-plane] Creating static Pod manifest for "kube-scheduler"
W0720 14:26:01.549638 16975 manifests.go:225] the default kube-apiserver authorization-mode is "Node,RBAC"; using "Node,RBAC"
[check-etcd] Skipping etcd check in external mode
[kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.18" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
[control-plane-join] using external etcd - no local stacked instance added
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[mark-control-plane] Marking the node master03 as control-plane by adding the label "node-role.kubernetes.io/master=''"
[mark-control-plane] Marking the node master03 as control-plane by adding the taints [node-role.kubernetes.io/master:NoSchedule]

This node has joined the cluster and a new control plane instance was created:

* Certificate signing request was sent to apiserver and approval was received.
* The Kubelet was informed of the new secure connection details.
* Control plane (master) label and taint were applied to the new node.
* The Kubernetes control plane instances scaled up.
To start administering your cluster from this node, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Run 'kubectl get nodes' to see this node join the cluster.

kubectl get nodes
NAME STATUS ROLES AGE VERSION
master01 Ready master 3d4h v1.18.5
master02 Ready master 3d4h v1.18.5
master03 Ready master 11m v1.18.5
node04 Ready <none> 3d4h v1.18.5
node05 Ready <none> 3d4h v1.18.5

After the node has successfully rejoined the cluster, replace the kubeconfig file used by kubectl (copy the new /etc/kubernetes/admin.conf to $HOME/.kube/config, as shown in the join output above).

If the master03 machine itself fails and cannot be recovered, simply prepare a new node and join it to the cluster (note that this host also runs an external etcd member, so the etcd recovery steps in section 2.1 apply as well).
See the previous article for the installation steps.

2.3 Simulate a worker node failure and recover it

Reset node05:

kubeadm reset -f
Check the node status:
NAME STATUS ROLES AGE VERSION
master01 Ready master 3d5h v1.18.5
master02 Ready master 3d4h v1.18.5
master03 Ready master 17m v1.18.5
node04 Ready <none> 3d4h v1.18.5
node05 NotReady <none> 3d4h v1.18.5

node05 is now shown as NotReady.


Rejoin the cluster

First delete the node:
kubectl delete node node05
node "node05" deleted

kubectl get nodes
NAME STATUS ROLES AGE VERSION
master01 Ready master 3d5h v1.18.5
master02 Ready master 3d4h v1.18.5
master03 Ready master 19m v1.18.5
node04 Ready <none> 3d4h v1.18.5
Node node05 has been deleted.

Check the pods on node05; they have all been cleaned up:
kubectl get pods -n kube-system -o wide |grep node05

Generate a new join token:
kubeadm token create --print-join-command
kubeadm join 10.103.22.236:8443 --token ljfifa.s04hny45lrmreynk --discovery-token-ca-cert-hash sha256:b5cf22ae5d3b745f5351127e6cfc78a3d1dfd52ee8454a07e25216d0dfc9e8

Rejoin the cluster:
kubeadm join 10.103.22.236:8443 --token ljfifa.s04hny45lrmreynk \
--discovery-token-ca-cert-hash sha256:b5cf22ae5d3b745f5351127e6cfc78a3d1dfd52ee8454a07e25216d0dfc9e8a6

[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.18" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.

Check the node status:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
master01 Ready master 3d5h v1.18.5
master02 Ready master 3d4h v1.18.5
master03 Ready master 25m v1.18.5
node04 Ready <none> 3d4h v1.18.5
node05 Ready <none> 60s v1.18.5
Node node05 has been restored.
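After the node is Ready again, the DaemonSet pods (calico-node and kube-proxy) should be scheduled back onto it automatically; a quick check (sketch):

kubectl get pods -A -o wide --field-selector spec.nodeName=node05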

If the node05 machine itself fails and cannot be recovered, simply prepare a new node and join it to the cluster.
See the previous article for the installation steps.


