
After pd-recover, some TiKV nodes fail to connect to PD

  • July 11, 2022

Author: szy2010214517. Original source: https://tidb.net/blog/2bbc4879


1. Background


All PD nodes in the cluster lost their data directories, making the cluster unavailable. The first attempt to restore the cluster with pd-recover used a wrong cluster id. At that point one TiKV node also lost its data directory. After pd-recover was run again to restore the whole cluster, the TiKV node that had lost its data had already recorded the wrong cluster id from the first pd-recover run, so it could not rejoin the cluster.


2. Environment


TiDB version: 3.0.8


IP and cluster roles:


192.168.50.193 pd + tikv + tidb-ansible + monitor


192.168.50.218 tidb + pd + tikv


192.168.50.135 tidb + pd + tikv


3. Procedure


3.1 Query the cluster id while the cluster is healthy


Get the cluster id from the PD log:

[tidb@tidb01 log]$ cat pd.log | grep "init cluster id"

[2021/01/07 10:44:54.437 +08:00] [INFO] [server.go:214] ["init cluster id"] [cluster-id=6914843637332658230]

Get the allocated id from the PD logs:

[tidb@tidb01 log]$ cat pd* | grep "allocates"

[2021/01/07 10:44:56.464 +08:00] [INFO] [id.go:91] ["idAllocator allocates a new id"] [alloc-id=1000]


At this point the correct cluster id is 6914843637332658230 and the current allocated id is 1000.
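
The alloc-id passed to pd-recover later has to be larger than the largest id PD has already allocated, so it is safer to scan all rotated PD log files rather than a single one. A minimal sketch of that check (the log file names are an assumption based on this deployment's layout):

# Scan every PD log file and keep the largest alloc-id that was ever handed out.
[tidb@tidb01 log]$ grep -h "idAllocator allocates a new id" pd* | grep -o "alloc-id=[0-9]*" | cut -d= -f2 | sort -n | tail -1
1000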


3.2 Simulate loss of the PD data files


Delete the data.pd directory on all three PD nodes.
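
A sketch of this simulation step, assuming the default tidb-ansible deploy path (/home/tidb/deploy); moving the directory aside instead of deleting it keeps a way back:

# On each of the three PD nodes, move the PD data directory aside to simulate its loss.
[tidb@tidb01 ~]$ mv /home/tidb/deploy/data.pd /home/tidb/deploy/data.pd.bak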


After the deletion, tikv.log looks like this:


[2021/01/07 10:51:41.235 +08:00] [INFO] [subchannel.cc:760] ["Subchannel 0x7f19eb3dc400: Retry in 999 milliseconds"]
[2021/01/07 10:51:41.236 +08:00] [ERROR] [util.rs:444] ["connect failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unavailable, details: Some(\"Connect Failed\") }))"] [endpoints=http://192.168.50.218:2379]
[2021/01/07 10:51:41.236 +08:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=http://192.168.50.135:2379]
[2021/01/07 10:51:41.236 +08:00] [INFO] [subchannel.cc:878] ["Connect failed: {\"created\":\"@1609987901.236921049\",\"description\":\"Failed to connect to remote host: OS Error\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.7/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":207,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:192.168.50.135:2379\"}"]
[2021/01/07 10:51:41.237 +08:00] [INFO] [subchannel.cc:760] ["Subchannel 0x7f19eb3f9e00: Retry in 1000 milliseconds"]
[2021/01/07 10:51:41.237 +08:00] [ERROR] [util.rs:444] ["connect failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unavailable, details: Some(\"Connect Failed\") }))"] [endpoints=http://192.168.50.135:2379]
[2021/01/07 10:51:41.237 +08:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=http://192.168.50.193:2379]
[2021/01/07 10:51:42.239 +08:00] [ERROR] [util.rs:287] ["request failed, retry"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some(\"rpc error: code = Unavailable desc = not leader\") }))"]
[2021/01/07 10:51:42.239 +08:00] [INFO] [util.rs:233] ["updating PD client, block the tokio core"]
[2021/01/07 10:51:42.239 +08:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=http://192.168.50.218:2379]
[2021/01/07 10:51:42.240 +08:00] [INFO] [subchannel.cc:878] ["Connect failed: {\"created\":\"@1609987902.240082753\",\"description\":\"Failed to connect to remote host: OS Error\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.7/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":207,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:192.168.50.218:2379\"}"]


The logs show that TiKV cannot connect to PD.
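
A quick way to confirm that PD itself is down (rather than a TiKV-side problem) is to probe a PD endpoint directly. A sketch, assuming curl is available on the node; the exact error text may differ by curl version:

# Probe one PD endpoint; "Connection refused" confirms PD is not serving on port 2379.
[tidb@tidb01 ~]$ curl http://192.168.50.193:2379/pd/api/v1/members
curl: (7) Failed to connect to 192.168.50.193 port 2379: Connection refused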


3.3 Run pd-recover with a wrong cluster id


Run pd-recover, deliberately passing a wrong cluster id:

[tidb@tidb01 bin]$ ./pd-recover -endpoints http://192.168.50.193:2379 -cluster-id 6914843637332658222 -alloc-id 2000

recover success! please restart the PD cluster

Restart the PD cluster:

[tidb@tidb01 tidb-ansible]$ ansible-playbook stop.yml --tags=pd
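
Only the stop step is shown above; to actually bring PD back, the matching start playbook has to be run as well. A sketch, assuming the standard tidb-ansible layout:

[tidb@tidb01 tidb-ansible]$ ansible-playbook start.yml --tags=pd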


The TiKV logs then show, for example on one node:


[2021/01/07 11:05:35.834 +08:00] [WARN] [client.rs:55] ["validate PD endpoints failed"] [err="Other(\"[src/pd/util.rs:366]: PD response cluster_id mismatch, want 6914845454507868152, got 6914843637332658222\")"]
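
To double-check which cluster id PD is now serving, PD can be queried directly. A sketch using pd-ctl (output abridged; the id shown is the wrong one injected in the previous step):

[tidb@tidb01 bin]$ ./pd-ctl -u http://192.168.50.193:2379 -i
» cluster
{
  "id": 6914843637332658222,
  ...
}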


3.4 Simulate loss of one TiKV node's data files


Delete the data directory of the TiKV node on 192.168.50.135.
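
The TiKV data directory is also where the node persists its cluster id, which is why this node will later hold on to the wrong id. A sketch of the step, assuming the same deploy path as the tikv-ctl commands further down:

# On 192.168.50.135, move the TiKV data directory aside to simulate its loss.
$ mv /home/tidb/deploy/data /home/tidb/deploy/data.bak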


3.5 Run pd-recover again with the correct cluster id


./pd-recover -endpoints http://192.168.50.193:2379 -cluster-id 6914843637332658230 -alloc-id 3000


3.6 The cluster is back in service, but one TiKV node fails to recover


The TiKV nodes on 192.168.50.193 and 192.168.50.218 are back in service. Check the store information of the healthy nodes:


[tidb@tidb01 bin]$ ./pd-ctl -u http://192.168.50.193:2379 -i

» store

{
  "count": 2,
  "stores": [
    {
      "store": {
        "id": 7,
        "address": "192.168.50.193:20160",
        "version": "3.0.8",
        "state_name": "Down"
      },
      "status": {
        "capacity": "26.98GiB",
        "available": "12.73GiB",
        "leader_count": 6,
        "leader_weight": 1,
        "leader_score": 6,
        "leader_size": 6,
        "region_count": 24,
        "region_weight": 1,
        "region_score": 24,
        "region_size": 24,
        "start_ts": "2021-01-07T13:52:30+08:00",
        "last_heartbeat_ts": "2021-01-07T14:20:45.062857036+08:00",
        "uptime": "28m15.062857036s"
      }
    },
    {
      "store": {
        "id": 1,
        "address": "192.168.50.218:20160",
        "version": "3.0.8",
        "state_name": "Up"
      },
      "status": {
        "capacity": "26.98GiB",
        "available": "24.06GiB",
        "leader_count": 18,
        "leader_weight": 1,
        "leader_score": 18,
        "leader_size": 18,
        "region_count": 24,
        "region_weight": 1,
        "region_score": 24,
        "region_size": 24,
        "start_ts": "2021-01-07T11:49:51+08:00",
        "last_heartbeat_ts": "2021-01-07T15:06:02.802625597+08:00",
        "uptime": "3h16m11.802625597s"
      }
    }
  ]
}


Check tikv.log on the 192.168.50.135 node:


[2021/01/07 15:08:16.356 +08:00] [INFO] [mod.rs:334] ["starting working thread"] [worker=region-collector-worker]
[2021/01/07 15:08:16.356 +08:00] [FATAL] [server.rs:264] ["failed to start node: \"[src/server/node.rs:188]: cluster ID mismatch, local 6914843637332658222 != remote 6914843637332658230, you are trying to connect to another cluster, please reconnect to the correct PD\""]


The log shows that this TiKV node has recorded cluster id 6914843637332658222. After the first pd-recover run restored the wrong cluster id, this node's data files had been lost, so when it rebuilt its data by pulling Regions back through Raft it persisted that wrong cluster id. Now that PD has been recovered with the correct id, the mismatch prevents the node from starting.


The tikv.log on the healthy nodes shows:


[2021/01/07 15:16:08.662 +08:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some(\"invalid store ID 2, not found\") }))"]
[2021/01/07 15:16:08.663 +08:00] [ERROR] [util.rs:327] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some(\"invalid store ID 2, not found\") }))"]
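
These errors reference store id 2, presumably the failed node's old store id, which no longer exists in the rebuilt PD metadata. This can be confirmed from pd-ctl; a sketch (the exact error text may differ between versions):

[tidb@tidb01 bin]$ ./pd-ctl -u http://192.168.50.193:2379 -i
» store 2
Failed to get store: [404] "invalid store ID 2, not found"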


Solution:


1. Stop the cluster.


2. On every healthy store, remove the failed store from all Region configurations.


Log in to 192.168.50.193:

[tidb@tidb01 bin]$ ./tikv-ctl --db /home/tidb/deploy/data/db unsafe-recover remove-fail-stores -s 2 --all-regions

removing stores [2] from configrations…

success

Log in to 192.168.50.218:

[tidb@tidb02 bin]$ ./tikv-ctl --db /home/tidb/deploy/data/db unsafe-recover remove-fail-stores -s 2 --all-regions

removing stores [2] from configrations…

success


3. Delete the data directory of the failed TiKV node.


4. Start the cluster.
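
Once the cluster is back up, it is worth checking that the surviving stores report Up and that the wiped node re-registers as a brand-new store with a new store id; a minimal check with pd-ctl:

# List all stores; the rebuilt 192.168.50.135 node should appear with a new store id and state_name "Up".
[tidb@tidb01 bin]$ ./pd-ctl -u http://192.168.50.193:2379 -i
» store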

