
Server power outage, PD cluster fails to start: repairing the PD cluster and fixing the cluster ID mismatch problem

Author: BingLong  Original source: https://tidb.net/blog/278cfe2a

1. Background

The company suddenly lost power and the UPS drained as well. After power was restored and the machines were booted, the TiDB cluster would not come up: every start attempt reported PD as started successfully (it had in fact failed) and then hung at the TiKV stage. Inspecting the pd and tikv logs under tidb-deploy showed that PD had never actually started.
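For reference, this is roughly how the logs were inspected. The PD log path is the one used in step 3 below; the tikv-20160 directory name is an assumption based on TiKV's default port 20160:

# PD log (path taken from step 3 of this post)
tail -n 100 /www/tidb-deploy/pd-2379/log/pd.log
# TiKV log (tikv-20160 assumes the default port; adjust to your topology)
tail -n 100 /www/tidb-deploy/tikv-20160/log/tikv.log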


The TiKV error log:


[2025/06/25 10:53:30.904 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.111:2379] [thread_id=1]
[2025/06/25 10:53:30.904 +08:00] [INFO] [<unknown>] ["subchannel 0x7f555a8cd000 {address=ipv4:192.168.0.111:2379, args=grpc.client_channel_factory=0x7f555a89cbf0, grpc.default_authority=192.168.0.111:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f555a8344e0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f555a90d510, grpc.server_uri=dns:///192.168.0.111:2379}: connect failed: {\"created\":\"@1750820010.904417027\",\"description\":\"Failed to connect to remote host: No route to host\",\"errno\":113,\"file\":\"/workspace/.cargo/registry/src/mirrors.tuna.tsinghua.edu.cn-df7c3c540f42cdbd/grpcio-sys-0.10.3+1.44.0-patched/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":205,\"os_error\":\"No route to host\",\"syscall\":\"getsockopt(SO_ERROR)\",\"target_address\":\"ipv4:192.168.0.111:2379\"}"] [thread_id=13]
[2025/06/25 10:53:30.904 +08:00] [INFO] [<unknown>] ["subchannel 0x7f555a8cd000 {address=ipv4:192.168.0.111:2379, args=grpc.client_channel_factory=0x7f555a89cbf0, grpc.default_authority=192.168.0.111:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f555a8344e0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f555a90d510, grpc.server_uri=dns:///192.168.0.111:2379}: Retry in 1000 milliseconds"] [thread_id=13]
[2025/06/25 10:53:30.904 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [endpoints=192.168.0.111:2379] [thread_id=1]
[2025/06/25 10:53:30.904 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.112:2379] [thread_id=1]
[2025/06/25 10:53:32.905 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.112:2379] [thread_id=1]
[2025/06/25 10:53:33.206 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.110:2379] [thread_id=1]
[2025/06/25 10:53:35.207 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.110:2379] [thread_id=1]
[2025/06/25 10:53:35.207 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.111:2379] [thread_id=1]
[2025/06/25 10:53:37.208 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.111:2379] [thread_id=1]
[2025/06/25 10:53:37.208 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.112:2379] [thread_id=1]
[2025/06/25 10:53:39.209 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.112:2379] [thread_id=1]
[2025/06/25 10:53:39.510 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.110:2379] [thread_id=1]
[2025/06/25 10:53:41.511 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.110:2379] [thread_id=1]
[2025/06/25 10:53:41.511 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.111:2379] [thread_id=1]
[2025/06/25 10:53:43.512 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.111:2379] [thread_id=1]
[2025/06/25 10:53:43.512 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.112:2379] [thread_id=1]
[2025/06/25 10:53:45.513 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.112:2379] [thread_id=1]
[2025/06/25 10:53:45.814 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.110:2379] [thread_id=1]
[2025/06/25 10:53:47.815 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.110:2379] [thread_id=1]
[2025/06/25 10:53:47.815 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.111:2379] [thread_id=1]
[2025/06/25 10:53:47.815 +08:00] [INFO] [<unknown>] ["subchannel 0x7f555a9ca800 {address=ipv4:192.168.0.111:2379, args=grpc.client_channel_factory=0x7f555a89cbf0, grpc.default_authority=192.168.0.111:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f555a8344e0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f555a90d510, grpc.server_uri=dns:///192.168.0.111:2379}: connect failed: {\"created\":\"@1750820027.815459575\",\"description\":\"Failed to connect to remote host: No route to host\",\"errno\":113,\"file\":\"/workspace/.cargo/registry/src/mirrors.tuna.tsinghua.edu.cn-df7c3c540f42cdbd/grpcio-sys-0.10.3+1.44.0-patched/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":205,\"os_error\":\"No route to host\",\"syscall\":\"getsockopt(SO_ERROR)\",\"target_address\":\"ipv4:192.168.0.111:2379\"}"] [thread_id=13]
[2025/06/25 10:53:47.815 +08:00] [INFO] [<unknown>] ["subchannel 0x7f555a9ca800 {address=ipv4:192.168.0.111:2379, args=grpc.client_channel_factory=0x7f555a89cbf0, grpc.default_authority=192.168.0.111:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f555a8344e0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f555a90d510, grpc.server_uri=dns:///192.168.0.111:2379}: Retry in 1000 milliseconds"] [thread_id=13]
[2025/06/25 10:53:47.815 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [endpoints=192.168.0.111:2379] [thread_id=1]
[2025/06/25 10:53:47.815 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.112:2379] [thread_id=1]
[2025/06/25 10:53:47.815 +08:00] [INFO] [<unknown>] ["subchannel 0x7f555a8c0800 {address=ipv4:192.168.0.112:2379, args=grpc.client_channel_factory=0x7f555a89cbf0, grpc.default_authority=192.168.0.112:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f555a8344e0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f555a90d510, grpc.server_uri=dns:///192.168.0.112:2379}: connect failed: {\"created\":\"@1750820027.815866425\",\"description\":\"Failed to connect to remote host: No route to host\",\"errno\":113,\"file\":\"/workspace/.cargo/registry/src/mirrors.tuna.tsinghua.edu.cn-df7c3c540f42cdbd/grpcio-sys-0.10.3+1.44.0-patched/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":205,\"os_error\":\"No route to host\",\"syscall\":\"getsockopt(SO_ERROR)\",\"target_address\":\"ipv4:192.168.0.112:2379\"}"] [thread_id=13]
[2025/06/25 10:53:47.815 +08:00] [INFO] [<unknown>] ["subchannel 0x7f555a8c0800 {address=ipv4:192.168.0.112:2379, args=grpc.client_channel_factory=0x7f555a89cbf0, grpc.default_authority=192.168.0.112:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f555a8344e0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f555a90d510, grpc.server_uri=dns:///192.168.0.112:2379}: Retry in 999 milliseconds"] [thread_id=13]
[2025/06/25 10:53:47.816 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [endpoints=192.168.0.112:2379] [thread_id=1]
[2025/06/25 10:53:48.116 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.110:2379] [thread_id=1]
[2025/06/25 10:53:50.117 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.110:2379] [thread_id=1]
[2025/06/25 10:53:50.117 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.111:2379] [thread_id=1]
[2025/06/25 10:53:52.118 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.111:2379] [thread_id=1]
[2025/06/25 10:53:52.118 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.112:2379] [thread_id=1]
[2025/06/25 10:53:54.119 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.112:2379] [thread_id=1]
[2025/06/25 10:53:54.420 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.110:2379] [thread_id=1]
[2025/06/25 10:53:56.421 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.110:2379] [thread_id=1]


Checking PD confirmed that the PD cluster had not started successfully: this member kept launching Raft pre-vote rounds, but its two peers (192.168.0.111 and 192.168.0.112 on port 2380) were unreachable ("no route to host"), so no leader could be elected.


[2025/06/25 11:09:33.494 +08:00] [INFO] [raft] [zap_raft.go:77] ["c269c4a75aa6e6c1 is starting a new election at term 2"]
[2025/06/25 11:09:33.494 +08:00] [INFO] [raft] [zap_raft.go:77] ["c269c4a75aa6e6c1 became pre-candidate at term 2"]
[2025/06/25 11:09:33.494 +08:00] [INFO] [raft] [zap_raft.go:77] ["c269c4a75aa6e6c1 received MsgPreVoteResp from c269c4a75aa6e6c1 at term 2"]
[2025/06/25 11:09:33.494 +08:00] [INFO] [raft] [zap_raft.go:77] ["c269c4a75aa6e6c1 [logterm: 2, index: 1900314] sent MsgPreVote request to 2a308483ede69c8d at term 2"]
[2025/06/25 11:09:33.494 +08:00] [INFO] [raft] [zap_raft.go:77] ["c269c4a75aa6e6c1 [logterm: 2, index: 1900314] sent MsgPreVote request to bb1260d81c39ab2f at term 2"]
[2025/06/25 11:09:36.591 +08:00] [WARN] [probing_status.go:68] ["prober detected unhealthy status"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=2a308483ede69c8d] [rtt=0s] [error="dial tcp 192.168.0.111:2380: connect: no route to host"]
[2025/06/25 11:09:36.591 +08:00] [WARN] [probing_status.go:68] ["prober detected unhealthy status"] [round-tripper-name=ROUND_TRIPPER_SNAPSHOT] [remote-peer-id=2a308483ede69c8d] [rtt=0s] [error="dial tcp 192.168.0.111:2380: connect: no route to host"]
[2025/06/25 11:09:36.594 +08:00] [WARN] [probing_status.go:68] ["prober detected unhealthy status"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=bb1260d81c39ab2f] [rtt=0s] [error="dial tcp 192.168.0.112:2380: connect: no route to host"]
[2025/06/25 11:09:36.594 +08:00] [WARN] [probing_status.go:68] ["prober detected unhealthy status"] [round-tripper-name=ROUND_TRIPPER_SNAPSHOT] [remote-peer-id=bb1260d81c39ab2f] [rtt=0s] [error="dial tcp 192.168.0.112:2380: connect: no route to host"]

2. Following tutorials found online, I tried to recover by removing two of the three PD nodes. The attempt did not succeed here, but it is still worth trying first: it is simple and carries a low risk of data loss (a hedged sketch of such an attempt follows the link below).

https://docs.pingcap.com/zh/tidb/stable/production-deployment-using-tiup/
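For the record, a minimal sketch of what that removal attempt can look like with tiup. The cluster name tidb-ly is taken from step 7, and the two node addresses are the unreachable PD members from the logs above; this is just the shape of the attempt that failed here, not a recommendation:

# Force-remove the two dead PD members; --force drops nodes without a
# graceful leave (it did not recover this cluster)
tiup cluster scale-in tidb-ly --node 192.168.0.111:2379 --force
tiup cluster scale-in tidb-ly --node 192.168.0.112:2379 --force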

3. Reinstall the TiDB cluster. Before doing so, record the current cluster ID and the largest allocated ID, and back up the TiKV data disks (tidb-data); the data backup is the most important part (a copy sketch follows the commands below).

# Current cluster ID
cat /www/tidb-deploy/pd-2379/log/pd.log | grep "cluster id"
cluster-id=7510129864311934607

# Largest allocated ID. Add 1000 to the value printed here to get the
# alloc-id used in step 6.
grep "idAllocator allocates a new id" {{/path/to}}/pd*.log | awk -F'=' '{print $2}' | awk -F']' '{print $1}' | sort -r -n | head -n 1
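The backup itself can be as simple as copying each TiKV data directory to a separate disk while the cluster is stopped. A sketch, assuming the data lives under /www/tidb-data (matching the /www/tidb-deploy prefix above) with the default tikv-20160 directory name; the /backup destination is illustrative:

# Stop the cluster so the TiKV (RocksDB) files are quiescent
tiup cluster stop tidb-ly
# Copy the TiKV data directory to a safe location
rsync -a /www/tidb-data/tikv-20160/ /backup/tidb-data/tikv-20160/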

4. Wipe PD's old data disks, then reinstall the TiDB cluster with tiup. Once the new cluster is installed, move the old TiKV data into the new cluster's TiKV tidb-data directory and restart the cluster (a condensed sketch follows the documentation link below).

Official deployment documentation:


https://docs.pingcap.com/zh/tidb/stable/production-deployment-using-tiup/
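A condensed sketch of this step under assumed names: the cluster name tidb-ly is reused from step 7, while topology.yaml, the version string, and the paths are placeholders to adapt; follow the document above for the real procedure:

# Redeploy and bootstrap the new cluster (version/topology are assumptions)
tiup cluster deploy tidb-ly v8.1.0 topology.yaml
tiup cluster start tidb-ly
# Stop it again before swapping in the old TiKV data
tiup cluster stop tidb-ly
rsync -a /backup/tidb-data/tikv-20160/ /www/tidb-data/tikv-20160/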

5. After migrating the old TiKV data into the new cluster, TiKV would not start; its log reported cluster ID mismatch. Each TiKV store persists the cluster ID it was bootstrapped with, and the freshly deployed PD generated a new one, so the old stores refused to join.

6. Change the cluster ID: use pd-recover to rewrite the new cluster's cluster ID back to the old cluster's ID.

tiup pd-recover -endpoints http://192.168.0.110:2379 -cluster-id 7510129864311934607 -alloc-id 6000
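Here -cluster-id is the value recorded in step 3, and -alloc-id must be larger than any ID the old cluster ever allocated (which is why step 3 takes the queried maximum plus 1000); otherwise PD could hand out IDs that collide with existing Stores and Regions. The change takes effect once PD is restarted, which the full-cluster restart in step 7 covers.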

7. Restart the cluster. It started successfully and the problem was fixed.

tiup cluster restart tidb-ly
# Check cluster status
tiup cluster display tidb-ly


