A Case Where a Failed TLS Enablement on TiDB Caused PD Scale-Out to Fail
- 2024-07-12, Beijing
Author: Dora  Original source: https://tidb.net/blog/8ee8f295
Problem Background
The cluster had previously lost its TLS certificates because the TiUP directory was deleted, so TLS had to be re-enabled later.
While rehearsing the TLS enablement procedure in a test environment, the subsequent scale-out of two PD nodes failed. The steps were:
1. Scale in two PD nodes
2. Enable TLS
3. Scale out the original two PD nodes; PD reported an error on startup
4. Restart the cluster
5. After the restart, all three PD nodes were down, and recovery required pd-recover or destroying and rebuilding the cluster
Note: enabling TLS requires keeping only one PD node; if there are multiple PD nodes, scale in to a single one first. The intended sequence is sketched below.
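For reference, a minimal sketch of the intended sequence with tiup, assuming the cluster name tidb-cc and the PD address used later in this article; the scale-out topology file name is hypothetical:

# 1. Scale down to a single PD before touching TLS
tiup cluster scale-in tidb-cc -N 172.16.201.159:52379
# 2. Enable TLS (this stops and restarts the whole cluster)
tiup cluster tls tidb-cc enable
# 3. Only after confirming TLS is fully enabled, add the other PD nodes back;
#    scale-out-pd.yaml is a hypothetical topology file describing those PD nodes
tiup cluster scale-out tidb-cc scale-out-pd.yaml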
Troubleshooting
TiUP log inspection
The TiUP logs show an error at the time TLS was being enabled.
PD log inspection
One of the PD nodes started with http instead of https.
The picture is now fairly clear: the error during TLS enablement left the subsequent PD configuration broken.
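A rough sketch of where to look, assuming default TiUP paths and the PD deploy directory /tidb-deploy/cc/pd-52379 shown later in this article (log locations may differ in your deployment):

# TiUP side: the tiup-cluster debug logs record every step of `tls enable`
ls -lt ~/.tiup/logs/ | head
# PD side: check whether the startup script and the PD log still use http
grep -i 'peer-urls' /tidb-deploy/cc/pd-52379/scripts/run_pd.sh
grep -i 'http://' /tidb-deploy/cc/pd-52379/log/pd.log | tail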
Reproduction
Scale in PD
tiup cluster scale-in tidb-cc -N 172.16.201.159:52379
Enable TLS
When enabling TLS, the last step turns out to be updating the PD member configuration:
+ [ Serial ] - Reload PD Members
Update pd-172.16.201.73-52379 peerURLs: [https://172.16.201.73:52380]
[tidb@vm172-16-201-73 /tidb-deploy/cc/pd-52379/scripts]$ tiup cluster tls tidb-cc enable
tiup is checking updates for component cluster ...
A new version of cluster is available:
The latest version: v1.14.1
Local installed version: v1.12.3
Update current component: tiup update cluster
Update all components: tiup update --all
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.12.3/tiup-cluster tls tidb-cc enable
Enable/Disable TLS will stop and restart the cluster `tidb-cc`
Do you want to continue? [y/N]:(default=N) y
Generate certificate: /home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.159
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.159
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.99
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ Copy certificate to remote host
- Generate certificate pd -> 172.16.201.73:52379 ... Done
- Generate certificate tikv -> 172.16.201.73:25160 ... Done
- Generate certificate tikv -> 172.16.201.159:25160 ... Done
- Generate certificate tikv -> 172.16.201.99:25160 ... Done
- Generate certificate tidb -> 172.16.201.73:54000 ... Done
- Generate certificate tidb -> 172.16.201.159:54000 ... Done
- Generate certificate prometheus -> 172.16.201.73:59090 ... Done
- Generate certificate grafana -> 172.16.201.73:43000 ... Done
- Generate certificate alertmanager -> 172.16.201.73:59093 ... Done
+ Copy monitor certificate to remote host
- Generate certificate node_exporter -> 172.16.201.159 ... Done
- Generate certificate node_exporter -> 172.16.201.99 ... Done
- Generate certificate node_exporter -> 172.16.201.73 ... Done
- Generate certificate blackbox_exporter -> 172.16.201.73 ... Done
- Generate certificate blackbox_exporter -> 172.16.201.159 ... Done
- Generate certificate blackbox_exporter -> 172.16.201.99 ... Done
+ Refresh instance configs
- Generate config pd -> 172.16.201.73:52379 ... Done
- Generate config tikv -> 172.16.201.73:25160 ... Done
- Generate config tikv -> 172.16.201.159:25160 ... Done
- Generate config tikv -> 172.16.201.99:25160 ... Done
- Generate config tidb -> 172.16.201.73:54000 ... Done
- Generate config tidb -> 172.16.201.159:54000 ... Done
- Generate config prometheus -> 172.16.201.73:59090 ... Done
- Generate config grafana -> 172.16.201.73:43000 ... Done
- Generate config alertmanager -> 172.16.201.73:59093 ... Done
+ Refresh monitor configs
- Generate config node_exporter -> 172.16.201.73 ... Done
- Generate config node_exporter -> 172.16.201.159 ... Done
- Generate config node_exporter -> 172.16.201.99 ... Done
- Generate config blackbox_exporter -> 172.16.201.73 ... Done
- Generate config blackbox_exporter -> 172.16.201.159 ... Done
- Generate config blackbox_exporter -> 172.16.201.99 ... Done
+ [ Serial ] - Save meta
+ [ Serial ] - Restart Cluster
Stopping component alertmanager
Stopping instance 172.16.201.73
Stop alertmanager 172.16.201.73:59093 success
Stopping component grafana
Stopping instance 172.16.201.73
Stop grafana 172.16.201.73:43000 success
Stopping component prometheus
Stopping instance 172.16.201.73
Stop prometheus 172.16.201.73:59090 success
Stopping component tidb
Stopping instance 172.16.201.159
Stopping instance 172.16.201.73
Stop tidb 172.16.201.159:54000 success
Stop tidb 172.16.201.73:54000 success
Stopping component tikv
Stopping instance 172.16.201.99
Stopping instance 172.16.201.73
Stopping instance 172.16.201.159
Stop tikv 172.16.201.73:25160 success
Stop tikv 172.16.201.99:25160 success
Stop tikv 172.16.201.159:25160 success
Stopping component pd
Stopping instance 172.16.201.73
Stop pd 172.16.201.73:52379 success
Stopping component node_exporter
Stopping instance 172.16.201.99
Stopping instance 172.16.201.73
Stopping instance 172.16.201.159
Stop 172.16.201.73 success
Stop 172.16.201.99 success
Stop 172.16.201.159 success
Stopping component blackbox_exporter
Stopping instance 172.16.201.99
Stopping instance 172.16.201.73
Stopping instance 172.16.201.159
Stop 172.16.201.73 success
Stop 172.16.201.99 success
Stop 172.16.201.159 success
Starting component pd
Starting instance 172.16.201.73:52379
Start instance 172.16.201.73:52379 success
Starting component tikv
Starting instance 172.16.201.99:25160
Starting instance 172.16.201.73:25160
Starting instance 172.16.201.159:25160
Start instance 172.16.201.99:25160 success
Start instance 172.16.201.73:25160 success
Start instance 172.16.201.159:25160 success
Starting component tidb
Starting instance 172.16.201.159:54000
Starting instance 172.16.201.73:54000
Start instance 172.16.201.159:54000 success
Start instance 172.16.201.73:54000 success
Starting component prometheus
Starting instance 172.16.201.73:59090
Start instance 172.16.201.73:59090 success
Starting component grafana
Starting instance 172.16.201.73:43000
Start instance 172.16.201.73:43000 success
Starting component alertmanager
Starting instance 172.16.201.73:59093
Start instance 172.16.201.73:59093 success
Starting component node_exporter
Starting instance 172.16.201.159
Starting instance 172.16.201.99
Starting instance 172.16.201.73
Start 172.16.201.73 success
Start 172.16.201.99 success
Start 172.16.201.159 success
Starting component blackbox_exporter
Starting instance 172.16.201.159
Starting instance 172.16.201.99
Starting instance 172.16.201.73
Start 172.16.201.73 success
Start 172.16.201.99 success
Start 172.16.201.159 success
+ [ Serial ] - Reload PD Members
Update pd-172.16.201.73-52379 peerURLs: [https://172.16.201.73:52380]
Enabled TLS between TiDB components for cluster `tidb-cc` successfully
Simulate a failure to stop node_exporter while TLS is being enabled
Check the member information:
[tidb@vm172-16-201-73 ~]$ tiup ctl:v7.1.1 pd -u https://172.16.201.73:52379 --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/client.crt member
{
  "header": {
    "cluster_id": 7299695983639846267
  },
  "members": [
    {
      "name": "pd-172.16.201.73-52379",
      "member_id": 781118713925753452,
      "peer_urls": [
        "http://172.16.201.73:52380"
      ],
      "client_urls": [
        "https://172.16.201.73:52379"
      ],
The peer_urls in the members list is indeed http rather than https, so the issue is reproduced.
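To spot this without reading the whole JSON, the same pd-ctl output can be piped through jq (assuming jq is installed on the control machine); after a successful TLS enablement every peer URL should start with https:

tiup ctl:v7.1.1 pd -u https://172.16.201.73:52379 \
  --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/ca.crt \
  --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/client.pem \
  --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/client.crt \
  member | jq -r '.members[].peer_urls[]'
# Expected after a successful enablement: https://172.16.201.73:52380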
Summary
The TLS enablement failed because stopping node_exporter failed; this was assumed to be harmless, so the remaining steps were carried out anyway.
The last step of TLS enablement is the PD member configuration update, and the failure to stop node_exporter meant this step never ran correctly.
Going forward, make sure TLS has been enabled successfully before taking the next step; pd-ctl can also be used to check the member status.
If this happens again, disable TLS first and then re-enable it. If stopping node_exporter takes too long during enablement, add the --wait-timeout option to the tiup command and manually kill the node_exporter processes on the hosts so that the TLS operation can proceed, as sketched below.
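A minimal sketch of that recovery path, assuming the cluster name tidb-cc; the timeout value and the manual kill are illustrative only:

# Disable and re-enable TLS, giving tiup more time to stop/start components
tiup cluster tls tidb-cc disable
tiup cluster tls tidb-cc enable --wait-timeout 300
# If stopping node_exporter still hangs, kill it manually on the affected host,
# then verify the result with `pd-ctl member` as shown above
pkill -f node_exporter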
Copyright notice: this is an original article by the InfoQ author 【TiDB 社区干货传送门】.
Original link: http://xie.infoq.cn/article/22b38f9fe2ca3486dffcc8423. Please contact the author before reposting.