
A Case Study: A Failed TLS Enablement on TiDB Caused a PD Scale-Out Failure

  • 2024-07-12, Beijing
  • Word count: 4,276

    Estimated reading time: about 14 minutes

Author: Dora. Original source: https://tidb.net/blog/8ee8f295

Problem Background

  1. The cluster's TiUP directory had been deleted earlier, so the TLS certificates were lost and TLS needed to be enabled again.

  2. The TLS enablement procedure was rehearsed in the test environment, after which scaling the two PD nodes back out failed. The steps were as follows (a rough command sketch is given after this list):

     • Scale in two PD nodes

     • Enable TLS

     • Scale the original two PD nodes back out; the PD instances reported errors on startup

     • Restart the cluster

After the cluster restart, all three PD nodes were down, and recovery required either pd-recover or destroying and rebuilding the cluster.
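
For context, pd-recover rebuilds PD from a single newly started PD instance. A minimal sketch of its usage, assuming the tool is installed through TiUP; the cluster ID and alloc ID are placeholders that must be taken from the old cluster (for example, from the PD logs), and when TLS is on the endpoint and certificate options have to match the cluster's TLS setup.

# One-off installation of the tool
tiup install pd-recover

# Recover against a freshly started single PD (placeholder IDs)
tiup pd-recover -endpoints http://172.16.201.73:52379 \
  -cluster-id <old-cluster-id> \
  -alloc-id <value-larger-than-any-previously-allocated-id>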


Note: enabling TLS requires keeping only one PD node; if there are multiple PD nodes, scale in to a single PD first.

Investigation

Checking the TiUP logs

  1. The TiUP logs show an error reported while TLS was being enabled.
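
A sketch of where to look, assuming a default TiUP installation under ~/.tiup (the audit ID is a placeholder):

# List past tiup-cluster operations and find the TLS-enable run
tiup cluster audit

# Show the full log of that operation
tiup cluster audit <audit-id>

# The debug logs can also be grepped directly
grep -i error ~/.tiup/logs/tiup-cluster-debug-*.log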

Checking the PD logs

One of the PD instances started with http rather than https.



At this point the problem is largely clear: the error during TLS enablement left the subsequent PD configuration in a bad state.
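
One quick way to confirm this on the PD host is to check which URLs the PD process is started with; a sketch assuming the deploy layout shown later in this post (/tidb-deploy/cc/pd-52379):

# run_pd.sh is the startup script generated by TiUP
grep -E "peer-urls|client-urls" /tidb-deploy/cc/pd-52379/scripts/run_pd.sh

The peer URL registered in the PD membership itself can be checked with pd-ctl, as shown in the reproduction below.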

Reproduction

Scale in PD

tiup cluster scale-in tidb-cc -N 172.16.201.159:52379
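After the scale-in, it is worth confirming that only one PD node remains before enabling TLS; a simple check:

# The output should list exactly one pd instance
tiup cluster display tidb-cc | grep -w pd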

Enable TLS

While enabling TLS, note that the final step updates the PD member configuration:


+ [ Serial ] - Reload PD Members Update pd-172.16.201.73-52379 peerURLs [https://172.16.201.73:52380]


[tidb@vm172-16-201-73 /tidb-deploy/cc/pd-52379/scripts]$ tiup cluster tls tidb-cc enable
tiup is checking updates for component cluster ...
A new version of cluster is available:
   The latest version:         v1.14.1
   Local installed version:    v1.12.3
   Update current component:   tiup update cluster
   Update all components:      tiup update --all
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.12.3/tiup-cluster tls tidb-cc enable
Enable/Disable TLS will stop and restart the cluster `tidb-cc`
Do you want to continue? [y/N]:(default=N) y
Generate certificate: /home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.159
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.159
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.99
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ Copy certificate to remote host
  - Generate certificate pd -> 172.16.201.73:52379 ... Done
  - Generate certificate tikv -> 172.16.201.73:25160 ... Done
  - Generate certificate tikv -> 172.16.201.159:25160 ... Done
  - Generate certificate tikv -> 172.16.201.99:25160 ... Done
  - Generate certificate tidb -> 172.16.201.73:54000 ... Done
  - Generate certificate tidb -> 172.16.201.159:54000 ... Done
  - Generate certificate prometheus -> 172.16.201.73:59090 ... Done
  - Generate certificate grafana -> 172.16.201.73:43000 ... Done
  - Generate certificate alertmanager -> 172.16.201.73:59093 ... Done
+ Copy monitor certificate to remote host
  - Generate certificate node_exporter -> 172.16.201.159 ... Done
  - Generate certificate node_exporter -> 172.16.201.99 ... Done
  - Generate certificate node_exporter -> 172.16.201.73 ... Done
  - Generate certificate blackbox_exporter -> 172.16.201.73 ... Done
  - Generate certificate blackbox_exporter -> 172.16.201.159 ... Done
  - Generate certificate blackbox_exporter -> 172.16.201.99 ... Done
+ Refresh instance configs
  - Generate config pd -> 172.16.201.73:52379 ... Done
  - Generate config tikv -> 172.16.201.73:25160 ... Done
  - Generate config tikv -> 172.16.201.159:25160 ... Done
  - Generate config tikv -> 172.16.201.99:25160 ... Done
  - Generate config tidb -> 172.16.201.73:54000 ... Done
  - Generate config tidb -> 172.16.201.159:54000 ... Done
  - Generate config prometheus -> 172.16.201.73:59090 ... Done
  - Generate config grafana -> 172.16.201.73:43000 ... Done
  - Generate config alertmanager -> 172.16.201.73:59093 ... Done
+ Refresh monitor configs
  - Generate config node_exporter -> 172.16.201.73 ... Done
  - Generate config node_exporter -> 172.16.201.159 ... Done
  - Generate config node_exporter -> 172.16.201.99 ... Done
  - Generate config blackbox_exporter -> 172.16.201.73 ... Done
  - Generate config blackbox_exporter -> 172.16.201.159 ... Done
  - Generate config blackbox_exporter -> 172.16.201.99 ... Done
+ [ Serial ] - Save meta
+ [ Serial ] - Restart Cluster
Stopping component alertmanager
  Stopping instance 172.16.201.73
  Stop alertmanager 172.16.201.73:59093 success
Stopping component grafana
  Stopping instance 172.16.201.73
  Stop grafana 172.16.201.73:43000 success
Stopping component prometheus
  Stopping instance 172.16.201.73
  Stop prometheus 172.16.201.73:59090 success
Stopping component tidb
  Stopping instance 172.16.201.159
  Stopping instance 172.16.201.73
  Stop tidb 172.16.201.159:54000 success
  Stop tidb 172.16.201.73:54000 success
Stopping component tikv
  Stopping instance 172.16.201.99
  Stopping instance 172.16.201.73
  Stopping instance 172.16.201.159
  Stop tikv 172.16.201.73:25160 success
  Stop tikv 172.16.201.99:25160 success
  Stop tikv 172.16.201.159:25160 success
Stopping component pd
  Stopping instance 172.16.201.73
  Stop pd 172.16.201.73:52379 success
Stopping component node_exporter
  Stopping instance 172.16.201.99
  Stopping instance 172.16.201.73
  Stopping instance 172.16.201.159
  Stop 172.16.201.73 success
  Stop 172.16.201.99 success
  Stop 172.16.201.159 success
Stopping component blackbox_exporter
  Stopping instance 172.16.201.99
  Stopping instance 172.16.201.73
  Stopping instance 172.16.201.159
  Stop 172.16.201.73 success
  Stop 172.16.201.99 success
  Stop 172.16.201.159 success
Starting component pd
  Starting instance 172.16.201.73:52379
  Start instance 172.16.201.73:52379 success
Starting component tikv
  Starting instance 172.16.201.99:25160
  Starting instance 172.16.201.73:25160
  Starting instance 172.16.201.159:25160
  Start instance 172.16.201.99:25160 success
  Start instance 172.16.201.73:25160 success
  Start instance 172.16.201.159:25160 success
Starting component tidb
  Starting instance 172.16.201.159:54000
  Starting instance 172.16.201.73:54000
  Start instance 172.16.201.159:54000 success
  Start instance 172.16.201.73:54000 success
Starting component prometheus
  Starting instance 172.16.201.73:59090
  Start instance 172.16.201.73:59090 success
Starting component grafana
  Starting instance 172.16.201.73:43000
  Start instance 172.16.201.73:43000 success
Starting component alertmanager
  Starting instance 172.16.201.73:59093
  Start instance 172.16.201.73:59093 success
Starting component node_exporter
  Starting instance 172.16.201.159
  Starting instance 172.16.201.99
  Starting instance 172.16.201.73
  Start 172.16.201.73 success
  Start 172.16.201.99 success
  Start 172.16.201.159 success
Starting component blackbox_exporter
  Starting instance 172.16.201.159
  Starting instance 172.16.201.99
  Starting instance 172.16.201.73
  Start 172.16.201.73 success
  Start 172.16.201.99 success
  Start 172.16.201.159 success
+ [ Serial ] - Reload PD Members Update pd-172.16.201.73-52379 peerURLs: [https://172.16.201.73:52380]
Enabled TLS between TiDB components for cluster `tidb-cc` successfully


Simulating a node_exporter stop failure during TLS enablement
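
The post does not show how the stop failure was simulated. One hypothetical way is to freeze the node_exporter process with SIGSTOP so that it cannot respond to the stop signal and the stop step hangs until it times out; this is only a sketch, not necessarily the method used in the original post.

# Hypothetical simulation: freeze node_exporter so the stop step hangs (run on the target host)
kill -STOP $(pgrep -f node_exporter)

# Undo the simulation afterwards
kill -CONT $(pgrep -f node_exporter)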


Check the member information


[tidb@vm172-16-201-73 ~]$ tiup ctl:v7.1.1 pd -u https://172.16.201.73:52379 --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/client.crt member
{ "header": { "cluster_id": 7299695983639846267 }, "members": [ { "name": "pd-172.16.201.73-52379", "member_id": 781118713925753452, "peer_urls": [ "http://172.16.201.73:52380" ], "client_urls": [ "https://172.16.201.73:52379" ],


The peer_urls entry in the member list is indeed http rather than https, so the issue is reproduced.

Summary

  1. The TLS enablement failed because stopping node_exporter failed; this was assumed to be harmless, so the remaining steps were carried out anyway.

  2. The final step of TLS enablement is the update of the PD member configuration; the failed node_exporter stop meant this step was never executed properly.

  3. Going forward, make sure TLS has been enabled successfully before taking the next step; pd-ctl can also be used to check the member information.

  4. If this situation comes up again, disable TLS first and then enable it again. If stopping node_exporter takes too long during enablement, add the --wait-timeout option to the tiup command and manually kill the node_exporter processes on the hosts so the TLS operation can proceed (a sketch follows).
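
A sketch of that recovery path; the --wait-timeout value (in seconds) and the kill step need to be adapted to the environment.

# Turn TLS off, then on again, with a longer per-step timeout
tiup cluster tls tidb-cc disable --wait-timeout 300
tiup cluster tls tidb-cc enable --wait-timeout 300

# If stopping node_exporter still hangs, kill it manually on each host
pkill -f node_exporter

# Verify that the PD member peer_urls are now https (same pd-ctl command as above)
tiup ctl:v7.1.1 pd -u https://172.16.201.73:52379 \
  --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/ca.crt \
  --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/client.pem \
  --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/client.crt member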

