Some TiKV Nodes Fail to Connect to PD After pd-recover
Author: szy2010214517. Original source: https://tidb.net/blog/2bbc4879
1. Background

All PD nodes in the cluster lost their data directories, making the cluster unavailable. The first attempt to restore the cluster with pd-recover filled in a wrong cluster id. At that point one TiKV node also lost its data directory. After running pd-recover a second time to restore the whole cluster, that TiKV node could no longer join the cluster, because it had already persisted the wrong cluster id from the first recovery.
2. Environment

TiDB version: 3.0.8

IP / cluster roles:
192.168.50.193: pd + tikv + tidb-ansible + monitor
192.168.50.218: tidb + pd + tikv
192.168.50.135: tidb + pd + tikv
3. Steps

3.1 Query the cluster id while the cluster is healthy

Get the cluster id from the PD log:
[tidb@tidb01 log]$ cat pd.log | grep "init cluster id"
[2021/01/07 10:44:54.437 +08:00] [INFO] [server.go:214] ["init cluster id"] [cluster-id=6914843637332658230]
Get the allocated id from the PD log:
[tidb@tidb01 log]$ cat pd* | grep "allocates"
[2021/01/07 10:44:56.464 +08:00] [INFO] [id.go:91] ["idAllocator allocates a new id"] [alloc-id=1000]
At this point the correct cluster id is 6914843637332658230 and the allocated id is 1000.
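The two grep commands above can be narrowed to print only the IDs themselves. A minimal sketch, with the sample log lines from this section inlined so it runs anywhere (point the greps at your real pd.log instead):

```shell
# Sample PD log lines from section 3.1, inlined for illustration only.
cat > pd.log.sample <<'EOF'
[2021/01/07 10:44:54.437 +08:00] [INFO] [server.go:214] ["init cluster id"] [cluster-id=6914843637332658230]
[2021/01/07 10:44:56.464 +08:00] [INFO] [id.go:91] ["idAllocator allocates a new id"] [alloc-id=1000]
EOF

# Print just the two values pd-recover needs later.
grep -o 'cluster-id=[0-9]*' pd.log.sample   # cluster-id=6914843637332658230
grep -o 'alloc-id=[0-9]*' pd.log.sample     # alloc-id=1000
```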
3.2 Simulate losing the PD data files

Delete the data.pd directory on all three PD nodes.
After the deletion, tikv.log shows:
[2021/01/07 10:51:41.235 +08:00] [INFO] [subchannel.cc:760] ["Subchannel 0x7f19eb3dc400: Retry in 999 milliseconds"]
[2021/01/07 10:51:41.236 +08:00] [ERROR] [util.rs:444] ["connect failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))"] [endpoints=http://192.168.50.218:2379]
[2021/01/07 10:51:41.236 +08:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=http://192.168.50.135:2379]
[2021/01/07 10:51:41.236 +08:00] [INFO] [subchannel.cc:878] ["Connect failed: {"created":"@1609987901.236921049","description":"Failed to connect to remote host: OS Error","errno":111,"file":"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.7/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":207,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:192.168.50.135:2379"}"]
[2021/01/07 10:51:41.237 +08:00] [INFO] [subchannel.cc:760] ["Subchannel 0x7f19eb3f9e00: Retry in 1000 milliseconds"]
[2021/01/07 10:51:41.237 +08:00] [ERROR] [util.rs:444] ["connect failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))"] [endpoints=http://192.168.50.135:2379]
[2021/01/07 10:51:41.237 +08:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=http://192.168.50.193:2379]
[2021/01/07 10:51:42.239 +08:00] [ERROR] [util.rs:287] ["request failed, retry"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("rpc error: code = Unavailable desc = not leader") }))"]
[2021/01/07 10:51:42.239 +08:00] [INFO] [util.rs:233] ["updating PD client, block the tokio core"]
[2021/01/07 10:51:42.239 +08:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=http://192.168.50.218:2379]
[2021/01/07 10:51:42.240 +08:00] [INFO] [subchannel.cc:878] ["Connect failed: {"created":"@1609987902.240082753","description":"Failed to connect to remote host: OS Error","errno":111,"file":"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.7/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":207,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:192.168.50.218:2379"}"]
The log shows TiKV repeatedly failing to connect to PD.
3.3 Restore the cluster with pd-recover, using a wrong cluster id

Run pd-recover with a wrong cluster id:
[tidb@tidb01 bin]$ ./pd-recover -endpoints http://192.168.50.193:2379 -cluster-id 6914843637332658222 -alloc-id 2000
recover success! please restart the PD cluster
Restart the PD cluster:
[tidb@tidb01 tidb-ansible]$ ansible-playbook stop.yml --tags=pd
[tidb@tidb01 tidb-ansible]$ ansible-playbook start.yml --tags=pd
Check the TiKV logs; tikv.log on one of the nodes shows:
[2021/01/07 11:05:35.834 +08:00] [WARN] [client.rs:55] ["validate PD endpoints failed"] [err="Other("[src/pd/util.rs:366]: PD response cluster_id mismatch, want 6914845454507868152, got 6914843637332658222")"]
3.4 Simulate losing one TiKV node's data files

Delete the data directory of the TiKV node on 192.168.50.135.

3.5 Run pd-recover again, this time with the correct cluster id

./pd-recover -endpoints http://192.168.50.193:2379 -cluster-id 6914843637332658230 -alloc-id 3000

Note that the alloc-id grows with each run (1000, 2000, 3000): pd-recover must be given an alloc-id larger than any ID the cluster has already allocated.
3.6 The cluster comes back, but one TiKV node fails to recover

At this point the TiKV nodes on 192.168.50.193 and 192.168.50.218 are serving again. Check the store info of the healthy nodes:
[tidb@tidb01 bin]$ ./pd-ctl -u http://192.168.50.193:2379 -i
» store
{
  "count": 2,
  "stores": [
    {
      "store": {
        "id": 7,
        "address": "192.168.50.193:20160",
        "version": "3.0.8",
        "state_name": "Down"
      },
      "status": {
        "capacity": "26.98GiB",
        "available": "12.73GiB",
        "leader_count": 6,
        "leader_weight": 1,
        "leader_score": 6,
        "leader_size": 6,
        "region_count": 24,
        "region_weight": 1,
        "region_score": 24,
        "region_size": 24,
        "start_ts": "2021-01-07T13:52:30+08:00",
        "last_heartbeat_ts": "2021-01-07T14:20:45.062857036+08:00",
        "uptime": "28m15.062857036s"
      }
    },
    {
      "store": {
        "id": 1,
        "address": "192.168.50.218:20160",
        "version": "3.0.8",
        "state_name": "Up"
      },
      "status": {
        "capacity": "26.98GiB",
        "available": "24.06GiB",
        "leader_count": 18,
        "leader_weight": 1,
        "leader_score": 18,
        "leader_size": 18,
        "region_count": 24,
        "region_weight": 1,
        "region_score": 24,
        "region_size": 24,
        "start_ts": "2021-01-07T11:49:51+08:00",
        "last_heartbeat_ts": "2021-01-07T15:06:02.802625597+08:00",
        "uptime": "3h16m11.802625597s"
      }
    }
  ]
}
Check tikv.log on the 192.168.50.135 node:
[2021/01/07 15:08:16.356 +08:00] [INFO] [mod.rs:334] ["starting working thread"] [worker=region-collector-worker]
[2021/01/07 15:08:16.356 +08:00] [FATAL] [server.rs:264] ["failed to start node: "[src/server/node.rs:188]: cluster ID mismatch, local 6914843637332658222 != remote 6914843637332658230, you are trying to connect to another cluster, please reconnect to the correct PD""]
The log shows that this TiKV node has cluster id 6914843637332658222 on disk: after the first pd-recover wrote the wrong cluster id, this node (whose data files had been lost) re-pulled data through the Raft protocol and persisted the wrong cluster id.
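The startup check that aborts here boils down to a simple ID comparison. An illustrative sketch only (the two IDs are hard-coded from this incident, not values the script discovers):

```shell
# The ID persisted in the TiKV data directory after the first (wrong) pd-recover.
local_id=6914843637332658222
# The correct ID restored by the second pd-recover and now reported by PD.
remote_id=6914843637332658230

# TiKV refuses to start when the two disagree.
if [ "$local_id" != "$remote_id" ]; then
  echo "cluster ID mismatch, local $local_id != remote $remote_id"
fi
```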
Check tikv.log on the healthy nodes:
[2021/01/07 15:16:08.662 +08:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("invalid store ID 2, not found") }))"]
[2021/01/07 15:16:08.663 +08:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("invalid store ID 2, not found") }))"]
The healthy stores still hold Regions with a peer on store ID 2 (presumably the wiped 192.168.50.135 node), which the rebuilt PD no longer knows about.

Solution:
1. Stop the cluster.
2. On every healthy store, remove the failed store's peers from the Regions.
Log in to 192.168.50.193:
[tidb@tidb01 bin]$ ./tikv-ctl --db /home/tidb/deploy/data/db unsafe-recover remove-fail-stores -s 2 --all-regions
removing stores [2] from configrations...
success
Log in to 192.168.50.218:
[tidb@tidb02 bin]$ ./tikv-ctl --db /home/tidb/deploy/data/db unsafe-recover remove-fail-stores -s 2 --all-regions
removing stores [2] from configrations...
success
3. Delete the data directory of the failed TiKV node.
4. Start the cluster.
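The four steps above can be collected into one script. This is a sketch for this article's environment only: the playbook names, the --db path, the failed store ID 2, and the data-directory path are taken (or assumed) from the steps above, so adjust them for your own deployment. By default it only prints the commands instead of running them.

```shell
#!/bin/sh
set -e
RUN=${RUN:-echo}                    # default: only print commands; set RUN= (empty) to execute
FAILED_STORE=2                      # store ID of the wiped TiKV node
DB_DIR=/home/tidb/deploy/data/db    # TiKV RocksDB path used in this article

# 1. Stop the whole cluster.
$RUN ansible-playbook stop.yml

# 2. On EVERY healthy TiKV node, remove the failed store from all Regions.
$RUN ./tikv-ctl --db "$DB_DIR" unsafe-recover remove-fail-stores -s "$FAILED_STORE" --all-regions

# 3. On the failed node only: wipe the TiKV data directory (path is an assumption).
$RUN rm -rf /home/tidb/deploy/data

# 4. Start the cluster again.
$RUN ansible-playbook start.yml
```

Running step 2 on each healthy node separately (as the article does) is required: tikv-ctl operates on the local RocksDB instance, not on the cluster as a whole.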
Copyright notice: this article was originally written by the InfoQ author "TiDB 社区干货传送门" (TiDB Community).
Original link: http://xie.infoq.cn/article/57649df5eb1a9d144439c446f. Please contact the author before reprinting.