Kubernetes Cluster Upgrade, Backup, and Failure Recovery (kubeadm)
Published: July 20, 2020
1. Cluster Version Upgrade
Five machines in total: 3 master nodes and 2 worker nodes.
OS: CentOS Linux release 7.5, Memory: 8 GB, Disk: 50 GB
Minimal installation.
10.103.22.231  master01  haproxy keepalived
10.103.22.232  master02  haproxy keepalived
10.103.22.233  master03  haproxy keepalived
10.103.22.234  node04
10.103.22.235  node05
The cluster is already up and running on v1.17.8 and will be upgraded to v1.18.5 (see the previous article for how the cluster was installed).
etcd (v3.4.9) was installed from binaries and will not be upgraded.
Version comparison before and after the upgrade:
Component            Before    After
API Server           v1.17.8   v1.18.5
Controller Manager   v1.17.8   v1.18.5
Scheduler            v1.17.8   v1.18.5
Kube Proxy           v1.17.8   v1.18.5
Kubelet              v1.17.8   v1.18.5
Kubectl              v1.17.8   v1.18.5
Kubeadm              v1.17.8   v1.18.5
CoreDNS              1.6.5     1.6.7
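Although etcd itself is not being upgraded, it holds all cluster state, so it is prudent to take a snapshot before touching the control plane. A minimal sketch using etcdctl v3 and the certificate paths used later in this article; the /opt/backup directory is just an example location:

# Sketch: back up etcd before the upgrade (adjust the endpoint and output path to your environment)
mkdir -p /opt/backup
ETCDCTL_API=3 /opt/kubernetes/bin/etcdctl \
  --endpoints=https://10.103.22.231:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.pem \
  --cert=/etc/kubernetes/pki/etcd/etcd.pem \
  --key=/etc/kubernetes/pki/etcd/etcd-key.pem \
  snapshot save /opt/backup/etcd-snapshot-$(date +%Y%m%d).db
# Verify that the snapshot is readable
ETCDCTL_API=3 /opt/kubernetes/bin/etcdctl snapshot status /opt/backup/etcd-snapshot-$(date +%Y%m%d).db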
1.1 Upgrading master01
Install the target versions of kubeadm and kubelet:
export VERSION=1.18.5   # or manually specify a released Kubernetes version
export ARCH=amd64       # or: arm, arm64, ppc64le, s390x
curl -sSL https://dl.k8s.io/release/${VERSION}/bin/linux/${ARCH}/kubeadm > /usr/bin/kubeadm
chmod a+rx /usr/bin/kubeadm
# If this download is blocked, download the kubernetes-node 1.18.5 package instead
# and copy the kubeadm binary from it into /usr/bin.

# Check the kubeadm version
kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.5", GitCommit:"e6503f8d8f769ace2f338794c914a96fc335df0f", GitTreeState:"clean", BuildDate:"2020-06-26T03:45:16Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

# Install via yum
yum install -y kubeadm-1.18.5 kubectl-1.18.5 kubelet-1.18.5
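Optionally, to keep a routine `yum update` from pulling these packages forward again, keep them excluded in the repo file and only install them explicitly when you intend to upgrade. This is a sketch; the repo id "kubernetes" and the repo file path are assumptions based on a typical setup:

# Sketch: keep kubelet/kubeadm/kubectl out of normal updates.
# /etc/yum.repos.d/kubernetes.repo is assumed to contain:
#   exclude=kubelet kubeadm kubectl
# Then install pinned versions explicitly when upgrading:
yum install -y --disableexcludes=kubernetes kubeadm-1.18.5 kubectl-1.18.5 kubelet-1.18.5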
Drain the node
If your master node is also running workloads as a regular node, run the following command to evict those pods and put the node into maintenance mode (scheduling disabled).
# Replace NODE_NAME with the master node's name
kubectl drain NODE_NAME --ignore-daemonsets

kubectl drain master01 --ignore-daemonsets
node/master01 cordoned
evicting pod "coredns-6955765f44-2m8sl"
evicting pod "coredns-6955765f44-t8v9d"
pod/coredns-6955765f44-2m8sl evicted
pod/coredns-6955765f44-t8v9d evicted
node/master01 evicted

# Check node status
kubectl get nodes
NAME       STATUS                     ROLES    AGE     VERSION
master01   Ready,SchedulingDisabled   master   6h37m   v1.18.5
master02   Ready                      master   6h23m   v1.18.5
master03   Ready                      master   6h26m   v1.17.8
node04     Ready                      <none>   6h13m   v1.17.8
node05     Ready                      <none>   6h17m   v1.17.8
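Before continuing, you can double-check that only DaemonSet-managed pods (such as calico-node and kube-proxy) are left on the drained node; a quick check from any node with admin credentials:

# List whatever is still scheduled on master01 after the drain
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=master01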
Check the upgrade plan
You can view the upgrade plan with the following command. The plan lists every step that will be performed during the upgrade, along with related warnings, so review it carefully.
kubeadm upgrade plan v1.18.5
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks.
[upgrade] Running cluster health checks
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: v1.17.8
[upgrade/versions] kubeadm version: v1.18.5
[upgrade/versions] Latest stable version: v1.18.5
[upgrade/versions] Latest stable version: v1.18.5
[upgrade/versions] Latest version in the v1.17 series: v1.17.8
[upgrade/versions] Latest version in the v1.17 series: v1.17.8

External components that should be upgraded manually before you upgrade the control plane with 'kubeadm upgrade apply':
COMPONENT   CURRENT   AVAILABLE
Etcd        3.4.9     3.4.3-0

Components that must be upgraded manually after you have upgraded the control plane with 'kubeadm upgrade apply':
COMPONENT   CURRENT       AVAILABLE
Kubelet     5 x v1.17.8   v1.18.5

Upgrade to the latest stable version:

COMPONENT            CURRENT   AVAILABLE
API Server           v1.17.8   v1.18.5
Controller Manager   v1.17.8   v1.18.5
Scheduler            v1.17.8   v1.18.5
Kube Proxy           v1.17.8   v1.18.5
CoreDNS              1.6.5     1.6.7

You can now apply the upgrade by executing the following command:

        kubeadm upgrade apply v1.18.5
Once you have reviewed and confirmed the upgrade plan, a single command is all it takes to upgrade the current master node to the target version.
Update the Kubernetes version in the configuration file
# Edit the kubeadm-init.yaml file
vim kubeadm-init.yaml

apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
#kubernetesVersion: v1.17.8
kubernetesVersion: v1.18.5
#imageRepository: registry.aliyuncs.com/google_containers
controlPlaneEndpoint: 10.103.22.236:8443
apiServer:
  certSANs:
  - 10.103.22.231
  - 10.103.22.232
  - 10.103.22.233
  - 10.103.22.236
  - 127.0.0.1
etcd:
  external:
    endpoints:
    - https://10.103.22.231:2379
    - https://10.103.22.232:2379
    - https://10.103.22.233:2379
    caFile: /etc/kubernetes/pki/etcd/ca.pem
    certFile: /etc/kubernetes/pki/etcd/etcd.pem
    keyFile: /etc/kubernetes/pki/etcd/etcd-key.pem
networking:
  dnsDomain: cluster.local
  #serviceSubnet: 10.96.0.0/12
  serviceSubnet: 192.96.0.1/16
  podSubnet: 192.98.0.1/16
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs
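If you want to validate the edited file before changing any cluster state, kubeadm can run the upgrade in dry-run mode. A sketch; confirm the flag is available in your build with `kubeadm upgrade apply --help`:

# Preview the planned changes without applying them
kubeadm upgrade apply v1.18.5 --config kubeadm-init.yaml --dry-run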
Run the upgrade command
kubeadm upgrade apply v1.18.5 --config kubeadm-init.yaml
# Alternatively, you can run kubeadm upgrade apply v1.18.5 directly
W0717 16:00:23.537910   24663 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]
[upgrade/config] Making sure the configuration is correct:
W0717 16:00:23.561047   24663 common.go:94] WARNING: Usage of the --config flag for reconfiguring the cluster during upgrade is not recommended!
W0717 16:00:23.562556   24663 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]
[preflight] Running pre-flight checks.
[upgrade] Running cluster health checks
[upgrade/version] You have chosen to change the cluster version to "v1.18.5"
[upgrade/versions] Cluster version: v1.17.8
[upgrade/versions] kubeadm version: v1.18.5
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/prepull] Prepulling image for component kube-scheduler.
[upgrade/prepull] Prepulling image for component kube-apiserver.
[upgrade/prepull] Prepulling image for component kube-controller-manager.
[apiclient] Found 3 Pods for label selector k8s-app=upgrade-prepull-kube-apiserver
[apiclient] Found 1 Pods for label selector k8s-app=upgrade-prepull-kube-controller-manager
[apiclient] Found 0 Pods for label selector k8s-app=upgrade-prepull-kube-scheduler
[apiclient] Found 3 Pods for label selector k8s-app=upgrade-prepull-kube-scheduler
[apiclient] Found 3 Pods for label selector k8s-app=upgrade-prepull-kube-controller-manager
[upgrade/prepull] Prepulled image for component kube-apiserver.
[upgrade/prepull] Prepulled image for component kube-controller-manager.
[upgrade/prepull] Prepulled image for component kube-scheduler.
[upgrade/prepull] Successfully prepulled the images for all the control plane components
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.18.5"...
Static pod: kube-apiserver-master01 hash: ac5b27a793558b141550f6d29ce07c7b
Static pod: kube-controller-manager-master01 hash: 8055e9e49e89f81a31a76272b0be7213
Static pod: kube-scheduler-master01 hash: d152a6203d96ca8caf9351d3e5c0b743
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests540134147"
[upgrade/staticpods] Preparing for "kube-apiserver" upgrade
[upgrade/staticpods] Renewing apiserver certificate
[upgrade/staticpods] Renewing apiserver-kubelet-client certificate
[upgrade/staticpods] Renewing front-proxy-client certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2020-07-17-16-00-43/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-apiserver-master01 hash: ac5b27a793558b141550f6d29ce07c7b
Static pod: kube-apiserver-master01 hash: ac5b27a793558b141550f6d29ce07c7b
Static pod: kube-apiserver-master01 hash: ac5b27a793558b141550f6d29ce07c7b
Static pod: kube-apiserver-master01 hash: 2164549833210857ac953cad7bed31c3
[apiclient] Found 3 Pods for label selector component=kube-apiserver
[upgrade/staticpods] Component "kube-apiserver" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-controller-manager" upgrade
[upgrade/staticpods] Renewing controller-manager.conf certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-controller-manager.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2020-07-17-16-00-43/kube-controller-manager.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-controller-manager-master01 hash: 8055e9e49e89f81a31a76272b0be7213
Static pod: kube-controller-manager-master01 hash: 28ab4b90a4349f032601b7702c459bca
[apiclient] Found 3 Pods for label selector component=kube-controller-manager
[upgrade/staticpods] Component "kube-controller-manager" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-scheduler" upgrade
[upgrade/staticpods] Renewing scheduler.conf certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-scheduler.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2020-07-17-16-00-43/kube-scheduler.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-scheduler-master01 hash: d152a6203d96ca8caf9351d3e5c0b743
Static pod: kube-scheduler-master01 hash: 2185f16ef8b60b3ef814c66cb76287c2
[apiclient] Found 3 Pods for label selector component=kube-scheduler
[upgrade/staticpods] Component "kube-scheduler" upgraded successfully!
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config-1.18" in namespace kube-system with the configuration for the kubelets in the cluster
[kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.18" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[bootstrap-token] configured RBAC rules to allow Node Bootstrap tokens to get nodes
[bootstrap-token] configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[addons] Applied essential addon: CoreDNS
[endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address
[addons] Applied essential addon: kube-proxy
[upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.18.5". Enjoy!
[upgrade/kubelet] Now that your control plane is upgraded, please proceed with upgrading your kubelets if you haven't already done so.
Restart kubelet
systemctl daemon-reload && systemctl restart kubelet
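The original steps do not show it, but master01 is still cordoned from the earlier drain, so make it schedulable again before moving on, mirroring the uncordon step used for the other nodes below:

# Allow workloads to be scheduled on master01 again
kubectl uncordon master01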
1.2 Upgrading master02
# Drain the node
kubectl drain master02 --ignore-daemonsets

# Upgrade kubeadm, kubectl, and kubelet
yum install -y kubeadm-1.18.5 kubectl-1.18.5 kubelet-1.18.5

# Run the upgrade command
kubeadm upgrade node

# Restart kubelet
systemctl daemon-reload && systemctl restart kubelet

# Make the node schedulable again
kubectl uncordon master02
node/master02 uncordoned
1.3 Upgrading master03
# Drain the node
kubectl drain master03 --ignore-daemonsets
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-xcjft, kube-system/kube-proxy-4r9d7
node/master03 drained

# Upgrade kubeadm, kubectl, and kubelet
yum install -y kubeadm-1.18.5 kubectl-1.18.5 kubelet-1.18.5

# Run the upgrade command
kubeadm upgrade node

# Restart kubelet
systemctl daemon-reload && systemctl restart kubelet

# Make the node schedulable again
kubectl uncordon master03
node/master03 uncordoned
1.4 Upgrading node04
# Drain the node
kubectl drain node04 --ignore-daemonsets
node/node04 cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-f4dx8, kube-system/kube-proxy-zb8th
evicting pod default/tomcat-7989d99887-bn9c7
evicting pod kube-system/coredns-684f7f6cb4-96qpj
pod/tomcat-7989d99887-bn9c7 evicted
pod/coredns-684f7f6cb4-96qpj evicted
node/node04 evicted

# Upgrade kubeadm, kubectl, and kubelet
yum install -y kubeadm-1.18.5 kubectl-1.18.5 kubelet-1.18.5

# Run the upgrade command
kubeadm upgrade node

# Restart kubelet
systemctl daemon-reload && systemctl restart kubelet

# Make the node schedulable again
kubectl uncordon node04
1.5 Upgrading node05
# Drain the node
kubectl drain node05 --ignore-daemonsets
node/node05 cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-sw5lf, kube-system/kube-proxy-lm68f
evicting pod default/tomcat-7989d99887-2rtp9
evicting pod kube-system/calico-kube-controllers-f64c4b8fb-f7cjw
evicting pod default/centos-node05-758c6ddfc8-858m7
evicting pod kube-system/coredns-684f7f6cb4-q9b9g
pod/calico-kube-controllers-f64c4b8fb-f7cjw evicted
pod/coredns-684f7f6cb4-q9b9g evicted
pod/tomcat-7989d99887-2rtp9 evicted
pod/centos-node05-758c6ddfc8-858m7 evicted
node/node05 evicted

# Upgrade kubeadm, kubectl, and kubelet
yum install -y kubeadm-1.18.5 kubectl-1.18.5 kubelet-1.18.5

# Run the upgrade command
kubeadm upgrade node

# Restart kubelet
systemctl daemon-reload && systemctl restart kubelet

# Make the node schedulable again
kubectl uncordon node05
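Sections 1.2 through 1.5 repeat the same drain / package upgrade / `kubeadm upgrade node` / kubelet restart / uncordon cycle. A small wrapper script can reduce copy-and-paste mistakes; this is only a sketch, and it assumes passwordless root SSH to each node and kubectl admin credentials on the machine running it:

#!/usr/bin/env bash
# Sketch: upgrade the remaining nodes one at a time.
# Assumptions: root SSH access to each node, node names below match your cluster,
# and kubectl has admin credentials locally.
set -euo pipefail
VERSION=1.18.5
NODES=("master02" "master03" "node04" "node05")

for node in "${NODES[@]}"; do
  echo ">>> upgrading ${node}"
  kubectl drain "${node}" --ignore-daemonsets
  ssh "root@${node}" "yum install -y kubeadm-${VERSION} kubectl-${VERSION} kubelet-${VERSION} \
    && kubeadm upgrade node \
    && systemctl daemon-reload && systemctl restart kubelet"
  kubectl uncordon "${node}"
  # Wait for the node to report Ready again before moving on to the next one
  kubectl wait --for=condition=Ready "node/${node}" --timeout=300s
done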
1.6 Verify the cluster upgrade
kubectl get node
NAME       STATUS   ROLES    AGE     VERSION
master01   Ready    master   7h16m   v1.18.5
master02   Ready    master   7h2m    v1.18.5
master03   Ready    master   7h5m    v1.18.5
node04     Ready    <none>   6h52m   v1.18.5
node05     Ready    <none>   6h56m   v1.18.5
The cluster has been upgraded from v1.17.8 to v1.18.5.
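Besides the node versions, you can also confirm that the control-plane static pods and CoreDNS are running the expected images; a quick check, assuming the default kubeadm labels in kube-system:

# Control-plane component images (kubeadm labels its static pods with tier=control-plane)
kubectl -n kube-system get pods -l tier=control-plane \
  -o custom-columns=NAME:.metadata.name,IMAGE:.spec.containers[0].image
# CoreDNS image (should now be 1.6.7)
kubectl -n kube-system get deployment coredns \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'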
2. Kubernetes Failure Handling
2.1 Simulating an etcd failure and recovering the etcd cluster
Simulate the failure
# Delete etcd's data directory
rm -rf /opt/lib/etcd/member

# Errors appear in the log
Jul 18 21:38:14 master03 etcd: store.index: compact 357144
Jul 18 21:38:14 master03 etcd: finished scheduled compaction at 357144 (took 74.15366ms)
Jul 18 21:38:53 master03 systemd-logind: New session 280 of user root.
Jul 18 21:38:53 master03 systemd: Started Session 280 of user root.
Jul 18 21:38:53 master03 systemd: Starting Session 280 of user root.
Jul 18 21:39:19 master03 etcd: failed to purge snap db file open /opt/lib/etcd/member/snap: no such file or directory
Jul 18 21:39:19 master03 systemd: etcd.service: main process exited, code=exited, status=1/FAILURE
Jul 18 21:39:19 master03 systemd: Unit etcd.service entered failed state.
Jul 18 21:39:19 master03 systemd: etcd.service failed.
Jul 18 21:39:25 master03 systemd: etcd.service holdoff time over, scheduling restart.
Jul 18 21:39:25 master03 systemd: Starting Etcd Server...
Jul 18 21:39:25 master03 etcd: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
Jul 18 21:39:25 master03 etcd: etcd Version: 3.4.9
Jul 18 21:39:25 master03 etcd: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
Jul 18 21:39:25 master03 etcd: Git SHA: 54ba95891
Jul 18 21:39:25 master03 etcd: Go Version: go1.12.17
Jul 18 21:39:25 master03 etcd: Go OS/Arch: linux/amd64
Jul 18 21:39:25 master03 etcd: setting maximum number of CPUs to 8, total number of available CPUs is 8
Jul 18 21:39:25 master03 etcd: found invalid file/dir member.bak under data dir /opt/lib/etcd (Ignore this if you are upgrading etcd)
Jul 18 21:39:25 master03 etcd: peerTLS: cert = /etc/kubernetes/pki/etcd/etcd.pem, key = /etc/kubernetes/pki/etcd/etcd-key.pem, trusted-ca = /etc/kubernetes/pki/etcd/ca.pem, client-cert-auth = true, crl-file =

# The log shows that the data files cannot be found:
etcd: failed to purge snap db file open /opt/lib/etcd/member/snap: no such file or directory
Recover the cluster
# Check the Kubernetes cluster status
kubectl get cs
etcd-2               Unhealthy   Get https://10.103.22.233:2379/health: dial tcp 10.103.22.233:2379: connect: connection refused
scheduler            Healthy     ok
controller-manager   Healthy     ok
etcd-0               Healthy     {"health":"true"}
etcd-1               Healthy     {"health":"true"}
# The etcd-2 member is shown as unavailable.

# List all etcd members
etcdctl member list
3a71c9f1368858b8, started, etcd0, https://10.103.22.231:2380, https://10.103.22.231:2379, false
5cb39603b2e01de0, started, etcd1, https://10.103.22.232:2380, https://10.103.22.232:2379, false
9540dd3ad80b71ff, started, etcd2, https://10.103.22.233:2380, https://10.103.22.233:2379, false

# Check the health of each endpoint
NODE_IPS=("10.103.22.231" "10.103.22.232" "10.103.22.233")
for node_ip in ${NODE_IPS[@]}; do
  echo ">>> ${node_ip}"
  ETCDCTL_API=3 /opt/kubernetes/bin/etcdctl \
    --endpoints=https://$node_ip:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.pem \
    --cert=/etc/kubernetes/pki/etcd/etcd.pem \
    --key=/etc/kubernetes/pki/etcd/etcd-key.pem endpoint health
done

# Remove the failed member
etcdctl member remove 9540dd3ad80b71ff

# Repair the failed node (run on the failed node)
# Edit the etcd.service unit file
vim /etc/systemd/system/etcd.service
# Change --initial-cluster-state=new to --initial-cluster-state=existing

# Re-add the member (run on a healthy node) so that etcd2 rejoins the cluster
etcdctl member add etcd2 --peer-urls=https://10.103.22.233:2380
Member 4e32023da83ac173 added to cluster a5ff38a50855db37
ETCD_NAME="etcd2"
ETCD_INITIAL_CLUSTER="etcd0=https://10.103.22.231:2380,etcd2=https://10.103.22.233:2380,etcd1=https://10.103.22.232:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.103.22.233:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
# The member was added successfully.

# Restart etcd on the failed node
systemctl restart etcd
Jul 18 21:58:47 master03 etcd: name = etcd2
Jul 18 21:58:47 master03 etcd: data dir = /opt/lib/etcd
Jul 18 21:58:47 master03 etcd: member dir = /opt/lib/etcd/member
Jul 18 21:58:47 master03 etcd: heartbeat = 100ms
Jul 18 21:58:47 master03 etcd: election = 1000ms
Jul 18 21:58:47 master03 etcd: snapshot count = 100000
Jul 18 21:58:47 master03 etcd: advertise client URLs = https://10.103.22.233:2379
Jul 18 21:58:47 master03 etcd: starting member 4e32023da83ac173 in cluster a5ff38a50855db37
Jul 18 21:58:47 master03 etcd: raft2020/07/18 21:58:47 INFO: 4e32023da83ac173 switched to configuration voters=()
Jul 18 21:58:47 master03 etcd: raft2020/07/18 21:58:47 INFO: 4e32023da83ac173 became follower at term 0
Jul 18 21:58:47 master03 etcd: raft2020/07/18 21:58:47 INFO: newRaft 4e32023da83ac173 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
Jul 18 21:58:47 master03 etcd: simple token is not cryptographically signed
Jul 18 21:58:47 master03 etcd: started HTTP pipelining with peer 3a71c9f1368858b8
Jul 18 21:58:47 master03 etcd: started HTTP pipelining with peer 5cb39603b2e01de0
Jul 18 21:58:47 master03 etcd: starting peer 3a71c9f1368858b8...
Jul 18 21:58:47 master03 etcd: started HTTP pipelining with peer 3a71c9f1368858b8
Jul 18 21:58:47 master03 etcd: started streaming with peer 3a71c9f1368858b8 (writer)
Jul 18 21:58:47 master03 etcd: started streaming with peer 3a71c9f1368858b8 (writer)
Jul 18 21:58:47 master03 etcd: started peer 3a71c9f1368858b8
Jul 18 21:58:47 master03 etcd: added peer 3a71c9f1368858b8
Jul 18 21:58:47 master03 etcd: started streaming with peer 3a71c9f1368858b8 (stream Message reader)
Jul 18 21:58:47 master03 etcd: started streaming with peer 3a71c9f1368858b8 (stream MsgApp v2 reader)
Jul 18 21:58:47 master03 etcd: starting peer 5cb39603b2e01de0...
Jul 18 21:58:47 master03 etcd: started HTTP pipelining with peer 5cb39603b2e01de0
Jul 18 21:58:47 master03 etcd: started streaming with peer 5cb39603b2e01de0 (writer)
Jul 18 21:58:47 master03 etcd: started peer 5cb39603b2e01de0
Jul 18 21:58:47 master03 etcd: added peer 5cb39603b2e01de0
Jul 18 21:58:47 master03 etcd: started streaming with peer 5cb39603b2e01de0 (stream MsgApp v2 reader)
Jul 18 21:58:47 master03 etcd: started streaming with peer 5cb39603b2e01de0 (stream Message reader)
Jul 18 21:58:47 master03 etcd: starting server... [version: 3.4.9, cluster version: to_be_decided]
Jul 18 21:58:47 master03 etcd: started streaming with peer 5cb39603b2e01de0 (writer)
Jul 18 21:58:47 master03 etcd: ClientTLS: cert = /etc/kubernetes/pki/etcd/etcd.pem, key = /etc/kubernetes/pki/etcd/etcd-key.pem, trusted-ca = /etc/kubernetes/pki/etcd/ca.pem, client-cert-auth = true, crl-file =
Jul 18 21:58:47 master03 etcd: listening for peers on 10.103.22.233:2380
Jul 18 21:58:47 master03 etcd: peer 5cb39603b2e01de0 became active
# The startup log looks clean.

# Check the etcd members after recovery
etcdctl member list
3a71c9f1368858b8, started, etcd0, https://10.103.22.231:2380, https://10.103.22.231:2379, false
4e32023da83ac173, started, etcd2, https://10.103.22.233:2380, https://10.103.22.233:2379, false
5cb39603b2e01de0, started, etcd1, https://10.103.22.232:2380, https://10.103.22.232:2379, false
# All three members are started.

# Check the Kubernetes cluster status
kubectl get cs
NAME                 STATUS    MESSAGE             ERROR
scheduler            Healthy   ok
controller-manager   Healthy   ok
etcd-0               Healthy   {"health":"true"}
etcd-2               Healthy   {"health":"true"}
etcd-1               Healthy   {"health":"true"}

# Check endpoint health again
NODE_IPS=("10.103.22.231" "10.103.22.232" "10.103.22.233")
for node_ip in ${NODE_IPS[@]}; do
  echo ">>> ${node_ip}"
  ETCDCTL_API=3 /opt/kubernetes/bin/etcdctl \
    --endpoints=https://$node_ip:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.pem \
    --cert=/etc/kubernetes/pki/etcd/etcd.pem \
    --key=/etc/kubernetes/pki/etcd/etcd-key.pem endpoint health
done
>>> 10.103.22.231
https://10.103.22.231:2379 is healthy: successfully committed proposal: took = 36.375259ms
>>> 10.103.22.232
https://10.103.22.232:2379 is healthy: successfully committed proposal: took = 22.921767ms
>>> 10.103.22.233
https://10.103.22.233:2379 is healthy: successfully committed proposal: took = 19.169257ms
# etcd has recovered.
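The procedure above re-adds a single member whose data was lost while the rest of the cluster stayed healthy. If every member lost its data, you would instead restore from a snapshot such as the one taken before the upgrade. A heavily simplified sketch for one member (the snapshot path is an assumption; repeat on each node with its own --name and peer URL, and make sure the target data directory is empty first):

# Sketch: disaster recovery of member etcd2 from a snapshot
systemctl stop etcd
mv /opt/lib/etcd /opt/lib/etcd.bak
ETCDCTL_API=3 /opt/kubernetes/bin/etcdctl snapshot restore /opt/backup/etcd-snapshot-20200718.db \
  --name etcd2 \
  --data-dir /opt/lib/etcd \
  --initial-cluster "etcd0=https://10.103.22.231:2380,etcd1=https://10.103.22.232:2380,etcd2=https://10.103.22.233:2380" \
  --initial-advertise-peer-urls https://10.103.22.233:2380
systemctl start etcd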
2.2 Simulating a control-plane node failure and recovering the cluster
# On master03, reset the node
kubeadm reset -f

# Check node status
kubectl get nodes
NAME       STATUS     ROLES    AGE    VERSION
master01   Ready      master   3d4h   v1.18.5
master02   Ready      master   3d3h   v1.18.5
master03   NotReady   master   3d3h   v1.18.5
node04     Ready      <none>   3d3h   v1.18.5
node05     Ready      <none>   3d3h   v1.18.5
# master03 is shown as NotReady.
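`kubeadm reset` leaves some state behind (iptables/IPVS rules, CNI configuration, old kubeconfig files), which is also why the first join attempt below fails on a stale /etc/kubernetes/admin.conf. A sketch of the typical cleanup to run on the reset node before rejoining:

# Sketch: clean residue left by kubeadm reset (run on the reset node)
rm -rf /etc/kubernetes/admin.conf /etc/cni/net.d $HOME/.kube/config
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
ipvsadm --clear   # only relevant because kube-proxy runs in ipvs mode here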
Rejoin the cluster
# First delete the node
kubectl delete node master03
node "master03" deleted

kubectl get nodes
NAME       STATUS   ROLES    AGE    VERSION
master01   Ready    master   3d4h   v1.18.5
master02   Ready    master   3d4h   v1.18.5
node04     Ready    <none>   3d3h   v1.18.5
node05     Ready    <none>   3d3h   v1.18.5
# Node master03 has been deleted.

# Check the pods that were running on master03
kubectl get pods -n kube-system -o wide | grep master03
calico-node-xcjft                  1/1   Running       0   3d4h    10.103.22.233   master03   <none>   <none>
kube-apiserver-master03            1/1   Terminating   0   2d21h   10.103.22.233   master03   <none>   <none>
kube-controller-manager-master03   1/1   Terminating   0   2d21h   10.103.22.233   master03   <none>   <none>
kube-proxy-4r9d7                   1/1   Running       0   2d21h   10.103.22.233   master03   <none>   <none>
kube-scheduler-master03            1/1   Terminating   0   2d21h   10.103.22.233   master03   <none>   <none>
# They are currently terminating and will all be cleaned up shortly.

# Generate a new join token
kubeadm token create --print-join-command
kubeadm join 10.103.22.236:8443 --token ljfifa.s04hny45lrmreynk --discovery-token-ca-cert-hash sha256:b5cf22ae5d3b745f5351127e6cfc78a3d1dfd52ee8454a07e25216d0dfc9e8a6

# Generate a new certificate key
kubeadm init phase upload-certs --upload-certs --config kubeadm-init.yaml
W0720 14:24:47.205467   16728 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
78e210b13d14480895d445d15b4185085f2dcc1a7480ed663e8cd0d45d5a3d4f

# Rejoin the cluster as a control-plane node
kubeadm join 10.103.22.236:8443 --token ljfifa.s04hny45lrmreynk \
  --discovery-token-ca-cert-hash sha256:b5cf22ae5d3b745f5351127e6cfc78a3d1dfd52ee8454a07e25216d0dfc9e8a6 \
  --control-plane --certificate-key 78e210b13d14480895d445d15b4185085f2dcc1a7480ed663e8cd0d45d5a3d4f
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks before initializing the new control plane instance
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[download-certs] Downloading the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [master03 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [192.96.0.1 10.103.22.233 10.103.22.236 10.103.22.231 10.103.22.232 10.103.22.233 10.103.22.236 127.0.0.1]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[certs] Generating "sa" key and public key
[kubeconfig] Generating kubeconfig files
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address
error execution phase control-plane-prepare/kubeconfig: error generating kubeconfig files: a kubeconfig file "/etc/kubernetes/admin.conf" exists already but has got the wrong CA cert
To see the stack trace of this error execute with --v=5 or higher

# The join fails because of a stale admin.conf; remove it and run the join command again.
[root@master03 ~]# rm /etc/kubernetes/admin.conf
rm: remove regular file ‘/etc/kubernetes/admin.conf’? y
[root@master03 ~]# kubeadm join 10.103.22.236:8443 --token ljfifa.s04hny45lrmreynk --discovery-token-ca-cert-hash sha256:b5cf22ae5d3b745f5351127e6cfc78a3d1dfd52ee8454a07e25216d0dfc9e8a6 --control-plane --certificate-key 78e210b13d14480895d445d15b4185085f2dcc1a7480ed663e8cd0d45d5a3d4f
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks before initializing the new control plane instance
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[download-certs] Downloading the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Using the existing "apiserver-kubelet-client" certificate and key
[certs] Using the existing "apiserver" certificate and key
[certs] Using the existing "front-proxy-client" certificate and key
[certs] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[certs] Using the existing "sa" key
[kubeconfig] Generating kubeconfig files
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
W0720 14:26:01.528901   16975 manifests.go:225] the default kube-apiserver authorization-mode is "Node,RBAC"; using "Node,RBAC"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
W0720 14:26:01.546734   16975 manifests.go:225] the default kube-apiserver authorization-mode is "Node,RBAC"; using "Node,RBAC"
[control-plane] Creating static Pod manifest for "kube-scheduler"
W0720 14:26:01.549638   16975 manifests.go:225] the default kube-apiserver authorization-mode is "Node,RBAC"; using "Node,RBAC"
[check-etcd] Skipping etcd check in external mode
[kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.18" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
[control-plane-join] using external etcd - no local stacked instance added
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[mark-control-plane] Marking the node master03 as control-plane by adding the label "node-role.kubernetes.io/master=''"
[mark-control-plane] Marking the node master03 as control-plane by adding the taints [node-role.kubernetes.io/master:NoSchedule]

This node has joined the cluster and a new control plane instance was created:

* Certificate signing request was sent to apiserver and approval was received.
* The Kubelet was informed of the new secure connection details.
* Control plane (master) label and taint were applied to the new node.
* The Kubernetes control plane instances scaled up.

To start administering your cluster from this node, you need to run the following as a regular user:

        mkdir -p $HOME/.kube
        sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
        sudo chown $(id -u):$(id -g) $HOME/.kube/config

Run 'kubectl get nodes' to see this node join the cluster.

kubectl get nodes
NAME       STATUS   ROLES    AGE    VERSION
master01   Ready    master   3d4h   v1.18.5
master02   Ready    master   3d4h   v1.18.5
master03   Ready    master   11m    v1.18.5
node04     Ready    <none>   3d4h   v1.18.5
node05     Ready    <none>   3d4h   v1.18.5
The node has rejoined the cluster successfully. After joining, refresh the kubeconfig used by kubectl on master03 (copy the new admin.conf as shown above). If the master03 machine itself had failed and could not be recovered, you could simply prepare a new machine and join it to the cluster in the same way; see the previous article for the installation steps.
2.3 Simulating a worker node failure and recovering it
Reset node05
# On node05, reset the node
kubeadm reset -f

# Check node status
kubectl get nodes
NAME       STATUS     ROLES    AGE    VERSION
master01   Ready      master   3d5h   v1.18.5
master02   Ready      master   3d4h   v1.18.5
master03   Ready      master   17m    v1.18.5
node04     Ready      <none>   3d4h   v1.18.5
node05     NotReady   <none>   3d4h   v1.18.5
# node05 is shown as NotReady.
Rejoin the cluster
# First delete the node
kubectl delete node node05
node "node05" deleted

kubectl get nodes
NAME       STATUS   ROLES    AGE    VERSION
master01   Ready    master   3d5h   v1.18.5
master02   Ready    master   3d4h   v1.18.5
master03   Ready    master   19m    v1.18.5
node04     Ready    <none>   3d4h   v1.18.5
# Node node05 has been deleted.

# The pods that were on node05 have already been cleaned up.
kubectl get pods -n kube-system -o wide | grep node05

# Generate a new join token
kubeadm token create --print-join-command
kubeadm join 10.103.22.236:8443 --token ljfifa.s04hny45lrmreynk --discovery-token-ca-cert-hash sha256:b5cf22ae5d3b745f5351127e6cfc78a3d1dfd52ee8454a07e25216d0dfc9e8a6

# Rejoin the cluster as a worker node
kubeadm join 10.103.22.236:8443 --token ljfifa.s04hny45lrmreynk \
  --discovery-token-ca-cert-hash sha256:b5cf22ae5d3b745f5351127e6cfc78a3d1dfd52ee8454a07e25216d0dfc9e8a6
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.18" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.

# Check node status
kubectl get nodes
NAME       STATUS   ROLES    AGE    VERSION
master01   Ready    master   3d5h   v1.18.5
master02   Ready    master   3d4h   v1.18.5
master03   Ready    master   25m    v1.18.5
node04     Ready    <none>   3d4h   v1.18.5
node05     Ready    <none>   60s    v1.18.5
node05 has been recovered. If the node05 machine itself had failed and could not be recovered, you could simply prepare a new machine and join it to the cluster in the same way; see the previous article for the installation steps.
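After the node rejoins, a final end-to-end check helps confirm the cluster is fully healthy again; a simple sketch:

kubectl get nodes -o wide
kubectl get cs
# Any pod not in the Running phase deserves a closer look
kubectl get pods --all-namespaces --field-selector status.phase!=Running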