Some TiKV Nodes Fail to Connect to PD After pd-recover
Author: szy2010214517. Original source: https://tidb.net/blog/2bbc4879
1. Background

All PD nodes in the cluster lost their data directories, making the cluster unavailable. The first attempt to restore the cluster with pd-recover filled in a wrong cluster id. At that point one TiKV node also lost its data directory. After running pd-recover a second time to restore the whole cluster, that TiKV node could no longer join the cluster, because it had already persisted the wrong cluster id from the first recovery.
2. Environment

TiDB version: 3.0.8

IP / cluster roles:
192.168.50.193: pd + tikv + tidb-ansible + monitor
192.168.50.218: tidb + pd + tikv
192.168.50.135: tidb + pd + tikv
3. Steps

3.1 Query the cluster id while the cluster is healthy

Get the cluster id from the PD log:
[tidb@tidb01 log]$ cat pd.log | grep "init cluster id"
[2021/01/07 10:44:54.437 +08:00] [INFO] [server.go:214] ["init cluster id"] [cluster-id=6914843637332658230]
Get the allocated id from the PD log:
[tidb@tidb01 log]$ cat pd* | grep "allocates"
[2021/01/07 10:44:56.464 +08:00] [INFO] [id.go:91] ["idAllocator allocates a new id"] [alloc-id=1000]
At this point the correct cluster id is 6914843637332658230 and the allocated id is 1000.
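The two grep commands above can be narrowed to print only the IDs themselves. A minimal sketch, with the sample log lines from this section inlined so it runs anywhere (point the greps at your real pd.log instead):

```shell
# Sample PD log lines from section 3.1, inlined for illustration only.
cat > pd.log.sample <<'EOF'
[2021/01/07 10:44:54.437 +08:00] [INFO] [server.go:214] ["init cluster id"] [cluster-id=6914843637332658230]
[2021/01/07 10:44:56.464 +08:00] [INFO] [id.go:91] ["idAllocator allocates a new id"] [alloc-id=1000]
EOF

# Print just the two values pd-recover needs later.
grep -o 'cluster-id=[0-9]*' pd.log.sample   # cluster-id=6914843637332658230
grep -o 'alloc-id=[0-9]*' pd.log.sample     # alloc-id=1000
```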
3.2 Simulate losing the PD data files

Delete the data.pd directory on all three PD nodes.
After the deletion, tikv.log shows:
[2021/01/07 10:51:41.235 +08:00] [INFO] [subchannel.cc:760] ["Subchannel 0x7f19eb3dc400: Retry in 999 milliseconds"]
[2021/01/07 10:51:41.236 +08:00] [ERROR] [util.rs:444] ["connect failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))"] [endpoints=http://192.168.50.218:2379]
[2021/01/07 10:51:41.236 +08:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=http://192.168.50.135:2379]
[2021/01/07 10:51:41.236 +08:00] [INFO] [subchannel.cc:878] ["Connect failed: {"created":"@1609987901.236921049","description":"Failed to connect to remote host: OS Error","errno":111,"file":"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.7/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":207,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:192.168.50.135:2379"}"]
[2021/01/07 10:51:41.237 +08:00] [INFO] [subchannel.cc:760] ["Subchannel 0x7f19eb3f9e00: Retry in 1000 milliseconds"]
[2021/01/07 10:51:41.237 +08:00] [ERROR] [util.rs:444] ["connect failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))"] [endpoints=http://192.168.50.135:2379]
[2021/01/07 10:51:41.237 +08:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=http://192.168.50.193:2379]
[2021/01/07 10:51:42.239 +08:00] [ERROR] [util.rs:287] ["request failed, retry"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("rpc error: code = Unavailable desc = not leader") }))"]
[2021/01/07 10:51:42.239 +08:00] [INFO] [util.rs:233] ["updating PD client, block the tokio core"]
[2021/01/07 10:51:42.239 +08:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=http://192.168.50.218:2379]
[2021/01/07 10:51:42.240 +08:00] [INFO] [subchannel.cc:878] ["Connect failed: {"created":"@1609987902.240082753","description":"Failed to connect to remote host: OS Error","errno":111,"file":"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.7/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":207,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:192.168.50.218:2379"}"]
The log shows TiKV repeatedly failing to connect to PD.
3.3 Restore the cluster with pd-recover, using a wrong cluster id

Run pd-recover with a wrong cluster id:
[tidb@tidb01 bin]$ ./pd-recover -endpoints http://192.168.50.193:2379 -cluster-id 6914843637332658222 -alloc-id 2000
recover success! please restart the PD cluster
Restart the PD cluster:
[tidb@tidb01 tidb-ansible]$ ansible-playbook stop.yml --tags=pd
[tidb@tidb01 tidb-ansible]$ ansible-playbook start.yml --tags=pd
Check the TiKV logs; tikv.log on one of the nodes shows:
[2021/01/07 11:05:35.834 +08:00] [WARN] [client.rs:55] ["validate PD endpoints failed"] [err="Other("[src/pd/util.rs:366]: PD response cluster_id mismatch, want 6914845454507868152, got 6914843637332658222")"]
3.4 Simulate losing one TiKV node's data files

Delete the data directory of the TiKV node on 192.168.50.135.

3.5 Run pd-recover again, this time with the correct cluster id

./pd-recover -endpoints http://192.168.50.193:2379 -cluster-id 6914843637332658230 -alloc-id 3000

Note that the alloc-id grows with each run (1000, 2000, 3000): pd-recover must be given an alloc-id larger than any ID the cluster has already allocated.
3.6 The cluster comes back, but one TiKV node fails to recover

At this point the TiKV nodes on 192.168.50.193 and 192.168.50.218 are serving again. Check the store info of the healthy nodes:
[tidb@tidb01 bin]$ ./pd-ctl -u http://192.168.50.193:2379 -i
» store
{
  "count": 2,
  "stores": [
    {
      "store": {
        "id": 7,
        "address": "192.168.50.193:20160",
        "version": "3.0.8",
        "state_name": "Down"
      },
      "status": {
        "capacity": "26.98GiB",
        "available": "12.73GiB",
        "leader_count": 6,
        "leader_weight": 1,
        "leader_score": 6,
        "leader_size": 6,
        "region_count": 24,
        "region_weight": 1,
        "region_score": 24,
        "region_size": 24,
        "start_ts": "2021-01-07T13:52:30+08:00",
        "last_heartbeat_ts": "2021-01-07T14:20:45.062857036+08:00",
        "uptime": "28m15.062857036s"
      }
    },
    {
      "store": {
        "id": 1,
        "address": "192.168.50.218:20160",
        "version": "3.0.8",
        "state_name": "Up"
      },
      "status": {
        "capacity": "26.98GiB",
        "available": "24.06GiB",
        "leader_count": 18,
        "leader_weight": 1,
        "leader_score": 18,
        "leader_size": 18,
        "region_count": 24,
        "region_weight": 1,
        "region_score": 24,
        "region_size": 24,
        "start_ts": "2021-01-07T11:49:51+08:00",
        "last_heartbeat_ts": "2021-01-07T15:06:02.802625597+08:00",
        "uptime": "3h16m11.802625597s"
      }
    }
  ]
}
Check tikv.log on the 192.168.50.135 node:
[2021/01/07 15:08:16.356 +08:00] [INFO] [mod.rs:334] ["starting working thread"] [worker=region-collector-worker]
[2021/01/07 15:08:16.356 +08:00] [FATAL] [server.rs:264] ["failed to start node: "[src/server/node.rs:188]: cluster ID mismatch, local 6914843637332658222 != remote 6914843637332658230, you are trying to connect to another cluster, please reconnect to the correct PD""]
The log shows that this TiKV node has cluster id 6914843637332658222 on disk: after the first pd-recover wrote the wrong cluster id, this node (whose data files had been lost) re-pulled data through the Raft protocol and persisted the wrong cluster id.
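The startup check that aborts here boils down to a simple ID comparison. An illustrative sketch only (the two IDs are hard-coded from this incident, not values the script discovers):

```shell
# The ID persisted in the TiKV data directory after the first (wrong) pd-recover.
local_id=6914843637332658222
# The correct ID restored by the second pd-recover and now reported by PD.
remote_id=6914843637332658230

# TiKV refuses to start when the two disagree.
if [ "$local_id" != "$remote_id" ]; then
  echo "cluster ID mismatch, local $local_id != remote $remote_id"
fi
```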
Check tikv.log on the healthy nodes:
[2021/01/07 15:16:08.662 +08:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("invalid store ID 2, not found") }))"]
[2021/01/07 15:16:08.663 +08:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("invalid store ID 2, not found") }))"]
The healthy stores still hold Regions with a peer on store ID 2 (presumably the wiped 192.168.50.135 node), which the rebuilt PD no longer knows about.

Solution:
1. Stop the cluster.
2. On every healthy store, remove the failed store's peers from the Regions.
Log in to 192.168.50.193:
[tidb@tidb01 bin]$ ./tikv-ctl --db /home/tidb/deploy/data/db unsafe-recover remove-fail-stores -s 2 --all-regions
removing stores [2] from configrations...
success
Log in to 192.168.50.218:
[tidb@tidb02 bin]$ ./tikv-ctl --db /home/tidb/deploy/data/db unsafe-recover remove-fail-stores -s 2 --all-regions
removing stores [2] from configrations...
success
3. Delete the data directory of the failed TiKV node.
4. Start the cluster.
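The four steps above can be collected into one script. This is a sketch for this article's environment only: the playbook names, the --db path, the failed store ID 2, and the data-directory path are taken (or assumed) from the steps above, so adjust them for your own deployment. By default it only prints the commands instead of running them.

```shell
#!/bin/sh
set -e
RUN=${RUN:-echo}                    # default: only print commands; set RUN= (empty) to execute
FAILED_STORE=2                      # store ID of the wiped TiKV node
DB_DIR=/home/tidb/deploy/data/db    # TiKV RocksDB path used in this article

# 1. Stop the whole cluster.
$RUN ansible-playbook stop.yml

# 2. On EVERY healthy TiKV node, remove the failed store from all Regions.
$RUN ./tikv-ctl --db "$DB_DIR" unsafe-recover remove-fail-stores -s "$FAILED_STORE" --all-regions

# 3. On the failed node only: wipe the TiKV data directory (path is an assumption).
$RUN rm -rf /home/tidb/deploy/data

# 4. Start the cluster again.
$RUN ansible-playbook start.yml
```

Running step 2 on each healthy node separately (as the article does) is required: tikv-ctl operates on the local RocksDB instance, not on the cluster as a whole.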
Copyright notice: this article was originally written by the InfoQ author "TiDB 社区干货传送门" (TiDB Community).
Original link: http://xie.infoq.cn/article/57649df5eb1a9d144439c446f. Please contact the author before reprinting.