
Notes on Two GC Failure Repair Cases

Author: 小老板努力变强 · Original source: https://tidb.net/blog/bb482f4d

Case 1:

Background: GC on the cluster was not working: drop table did not reclaim disk space, and the GC safepoint was stuck at 2024-05-22. Cluster version: v5.3.1.

1. Errors found in the logs and via admin check table

```
mysql> admin check table SuperXXXall;
ERROR 1105 (HY000): unexpected resolve err: commit_ts_expired:<start_ts:450046591374983318 attempted_commit_ts:450046591977652390 key:"t\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365" min_commit_ts:450046592187367871 > , lock: key: 7480000000000431df5f698000000000000002040000000077b02bb303f80000012adbadf5, primary: 7480000000000431df5f698000000000000001038000000013daff1103886518f58e543f44, txnStartTS: 450046591374983318, lockForUpdateTS:4500465913

start_ts:
mysql> select tidb_parse_tso(450046591374983318);
+------------------------------------+
| tidb_parse_tso(450046591374983318) |
+------------------------------------+
| 2024-05-27 14:31:41.522000         |
+------------------------------------+
1 row in set (0.00 sec)

attempted_commit_ts:
mysql> select tidb_parse_tso(450046591977652390);
+------------------------------------+
| tidb_parse_tso(450046591977652390) |
+------------------------------------+
| 2024-05-27 14:31:43.821000         |
+------------------------------------+
1 row in set (0.00 sec)

min_commit_ts:
mysql> select tidb_parse_tso(450046592187367871);
+------------------------------------+
| tidb_parse_tso(450046592187367871) |
+------------------------------------+
| 2024-05-27 14:31:44.621000         |
+------------------------------------+
1 row in set (0.01 sec)
```
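As a reference, what tidb_parse_tso does can be reproduced outside MySQL: a TiDB TSO packs a millisecond wall-clock timestamp in the high bits and an 18-bit logical counter in the low bits. A minimal Python sketch (the helper name is mine; the cluster above reports times in UTC+8):

```python
from datetime import datetime, timedelta, timezone

def parse_tso(ts: int):
    """Split a TiDB TSO into its physical (ms) and logical parts."""
    physical_ms = ts >> 18      # high bits: milliseconds since the Unix epoch
    logical = ts & 0x3FFFF      # low 18 bits: logical counter
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # use timedelta arithmetic so the millisecond part stays exact
    wall = (epoch + timedelta(milliseconds=physical_ms)).astimezone(
        timezone(timedelta(hours=8)))
    return wall, logical

wall, logical = parse_tso(450046591374983318)
print(wall.strftime("%Y-%m-%d %H:%M:%S.%f"), logical)
# 2024-05-27 14:31:41.522000 150
```

This matches the tidb_parse_tso output above for the stuck start_ts.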

2. Confirm the data content

```
lock key:
mysql> select tidb_decode_key('7480000000000431df5f698000000000000002040000000077b02bb303f80000012adbadf5');
+-----------------------------------------------------------------------------------------------+
| tidb_decode_key('7480000000000431df5f698000000000000002040000000077b02bb303f80000012adbadf5') |
+-----------------------------------------------------------------------------------------------+
| {"index_id":2,"index_vals":{"ipnum":"3213214421"},"table_id":274911}                          |
+-----------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)

primary:
mysql> select tidb_decode_key('7480000000000431df5f698000000000000001038000000013daff1103886518f58e543f44');
+-----------------------------------------------------------------------------------------------+
| tidb_decode_key('7480000000000431df5f698000000000000001038000000013daff1103886518f58e543f44') |
+-----------------------------------------------------------------------------------------------+
| {"index_id":1,"index_vals":{"aid":"111111111","sn":"111111111111111111"},"table_id":274911}   |
+-----------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
```

index_id 1 corresponds to uk_aid_sn; index_id 2 corresponds to index(ip).
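The fixed-width prefix that tidb_decode_key parses is simple enough to decode by hand: `t` + big-endian int64 table_id + `_i` + big-endian int64 index_id, where the int64s are memcomparable-encoded by flipping the sign bit. A minimal sketch (decoding of the index values themselves is omitted; the function name is mine):

```python
def decode_index_key_prefix(key_hex: str):
    """Extract (table_id, index_id) from a TiDB index key."""
    key = bytes.fromhex(key_hex)
    assert key[0:1] == b"t", "not a table key"
    # memcomparable int64: big-endian with the sign bit flipped
    table_id = int.from_bytes(key[1:9], "big") ^ (1 << 63)
    assert key[9:11] == b"_i", "not an index key"
    index_id = int.from_bytes(key[11:19], "big") ^ (1 << 63)
    return table_id, index_id

print(decode_index_key_prefix(
    "7480000000000431df5f698000000000000002040000000077b02bb303f80000012adbadf5"))
# (274911, 2)
```

Running it on both keys above gives table_id 274911 with index_id 2 (the lock key) and index_id 1 (the primary), matching the SQL output.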

3. Confirm the region information

Find the region for each key:

```
lock:
fdc@fdc-tidb04-onlinetidb:~$ curl 'http://{ip}:10080/mvcc/hex/7480000000000431df5f698000000000000002040000000077b02bb303f80000012adbadf5'
{
 "key": "7480000000000431DF5F698000000000000002040000000077B02BB303F80000012ADBADF5",
 "region_id": 366711639,
 "value": {
  "info": {
   "lock": {
    "start_ts": 450046591374983318,
    "primary": "dIAAAAAABDHfX2mAAAAAAAAAAQOAAAAAE9r/EQOIZRj1jlQ/RA==",
    "short_value": "MA=="
   },
   "writes": [
    {
     "start_ts": 450046591374983318,
     "commit_ts": 450046591977652390,
     "short_value": "MA=="
    }
   ]
  }
 }
}

primary:
fdc@fdc-tidb04-onlinetidb:~$ curl 'http://{ip}:10080/mvcc/hex/7480000000000431df5f698000000000000001038000000013daff1103886518f58e543f44'
{
 "key": "7480000000000431DF5F698000000000000001038000000013DAFF1103886518F58E543F44",
 "region_id": 40507174,
 "value": {
  "info": {
   "writes": [
    {
     "start_ts": 450046591374983318,
     "commit_ts": 450046591977652390,
     "short_value": "eAAAASrbrfM="
    }
   ]
  }
 }
}
```

A lock on the key is confirmed.

4. Confirm the data distribution

```
tiup ctl:v5.3.1 pd -u {ip}:2379 region key 7480000000000431DF5F698000000000000002040000000077B02BB303F80000012ADBADF5
{
  "id": 344684358,
  "start_key": "748000000000042DFF8400000000000000F8",
  "end_key": "7480000000000431FFDF5F698000000000FF0000010380000000FF0000854D03885C91FFF79BAF31BF000000FC",
  "epoch": {
    "conf_ver": 18923,
    "version": 231376
  },
  "peers": [
    { "id": 1389106238, "store_id": 348337315, "role_name": "Voter" },
    { "id": 1506170971, "store_id": 477311915, "role_name": "Voter" },
    { "id": 1619480854, "store_id": 935248854, "role_name": "Voter" }
  ],
  "leader": { "id": 1506170971, "store_id": 477311915, "role_name": "Voter" },
  "written_bytes": 2826,
  "read_bytes": 0,
  "written_keys": 2,
  "read_keys": 0,
  "approximate_size": 83,
  "approximate_keys": 1176620
}
```
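What matters in this JSON for the next step is where the replicas sit and which store holds the leader (that is the store to evict). A small Python sketch pulling those fields out of an abbreviated copy of the output above:

```python
import json

# abbreviated from the pd-ctl output above; only the fields we need
region = json.loads("""
{
  "id": 344684358,
  "peers": [
    {"id": 1389106238, "store_id": 348337315, "role_name": "Voter"},
    {"id": 1506170971, "store_id": 477311915, "role_name": "Voter"},
    {"id": 1619480854, "store_id": 935248854, "role_name": "Voter"}
  ],
  "leader": {"id": 1506170971, "store_id": 477311915, "role_name": "Voter"}
}
""")

replica_stores = [p["store_id"] for p in region["peers"]]   # all stores holding a copy
leader_store = region["leader"]["store_id"]                  # store to evict
print(replica_stores, leader_store)
# [348337315, 477311915, 935248854] 477311915
```

Store 477311915 is exactly the one targeted by the evict-leader-scheduler command in the next step.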

5. Evict the leader to reduce the impact on the business

```
tiup ctl:v5.3.1 pd -u 10.90.230.8:2379 scheduler add evict-leader-scheduler 477311915
# remember to remove the scheduler after the eviction is done
tiup ctl:v5.3.1 pd -u 10.90.230.8:2379 scheduler remove evict-leader-scheduler
```

6. Confirm the MVCC information

Encode the key:

```
fdc@fdc-tidb04-onlinetidb:~$ tiup ctl:v5.3.1 tikv --to-escaped "7480000000000431df5f698000000000000002040000000077b02bb303f80000012adbadf5"
t\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365
```

Check the MVCC info:

```
fdc@fdc-tidb02-onlinetikv:~$ tiup ctl:v5.3.1 tikv --data-dir=/ssd1/tidb/tikv/deploy1/data mvcc -k "zt\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365" --show-cf=lock,write,default
Starting component ctl: /home/fdc/.tiup/components/ctl/v5.3.1/ctl tikv --data-dir=/ssd1/tidb/tikv/deploy1/data mvcc -k zt\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365 --show-cf=lock,write,default
[2025/09/22 16:13:06.667 +08:00] [INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[2025/09/22 16:13:06.667 +08:00] [INFO] [mod.rs:479] ["encryption is disabled."]
[2025/09/22 16:13:06.671 +08:00] [WARN] [config.rs:587] ["compaction guard is disabled due to region info provider not available"]
[2025/09/22 16:13:06.671 +08:00] [WARN] [config.rs:682] ["compaction guard is disabled due to region info provider not available"]
no mvcc infos for zt\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365
no mvcc infos
```

Check with recover-mvcc in read-only mode:

```
fdc@fdc-tidb02-onlinetikv:~$ tiup ctl:v5.3.1 tikv --data-dir=/ssd1/tidb/tikv/deploy1/data recover-mvcc --read-only -r 40507174 -p 10.90.230.8:2379
Starting component ctl: /home/fdc/.tiup/components/ctl/v5.3.1/ctl tikv --data-dir=/ssd1/tidb/tikv/deploy1/data recover-mvcc --read-only -r 40507174 -p ******
[2025/09/22 16:17:42.552 +08:00] [INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[2025/09/22 16:17:42.552 +08:00] [INFO] [mod.rs:479] ["encryption is disabled."]
[2025/09/22 16:17:42.556 +08:00] [WARN] [config.rs:587] ["compaction guard is disabled due to region info provider not available"]
[2025/09/22 16:17:42.556 +08:00] [WARN] [config.rs:682] ["compaction guard is disabled due to region info provider not available"]
Recover regions: [40507174], pd: ["10.90.230.8:2379"], read_only: true
[2025/09/22 16:17:45.303 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=10.90.230.8:2379]
[2025/09/22 16:17:45.304 +08:00] [INFO] [<unknown>] ["TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter"]
[2025/09/22 16:17:45.307 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fa62ac05100 for subchannel 0x7fa66a612280"]
[2025/09/22 16:17:45.309 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://10.191.128.76:2379]
[2025/09/22 16:17:45.309 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fa62ac05280 for subchannel 0x7fa66a612600"]
[2025/09/22 16:17:45.310 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://10.90.230.8:2379]
[2025/09/22 16:17:45.313 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fa62ac05400 for subchannel 0x7fa66a612280"]
[2025/09/22 16:17:45.315 +08:00] [INFO] [util.rs:668] ["connected to PD member"] [endpoints=http://10.90.230.8:2379]
[2025/09/22 16:17:45.315 +08:00] [INFO] [util.rs:536] ["all PD endpoints are consistent"] [endpoints="[\"10.90.230.8:2379\"]"]
[2025/09/22 16:17:45.328 +08:00] [INFO] [debug.rs:956] ["thread 0: skip write 0 rows"]
[2025/09/22 16:17:45.328 +08:00] [INFO] [debug.rs:959] ["thread 0: total fix default: 0, lock: 0, write: 0"]
[2025/09/22 16:17:45.328 +08:00] [INFO] [debug.rs:968] ["thread 0 has finished working."]
success!
```

But no concrete MVCC lock was shown. This was puzzling: were we looking in the wrong place?
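The --to-escaped conversion is simply TiKV's byte-escaping: printable ASCII bytes are kept as-is and everything else becomes a three-digit octal escape. A rough Python equivalent (the exact handling of quote and backslash bytes is my assumption, not confirmed against the tikv-ctl source):

```python
def tikv_escape(raw: bytes) -> str:
    """Approximate tikv-ctl's escaped key representation."""
    out = []
    for b in raw:
        if b == 0x5C:             # backslash (assumed doubled)
            out.append("\\\\")
        elif b == 0x22:           # double quote (assumed escaped)
            out.append("\\\"")
        elif 0x20 <= b <= 0x7E:   # printable ASCII kept verbatim
            out.append(chr(b))
        else:
            out.append("\\%03o" % b)  # three-digit octal escape
    return "".join(out)

key = bytes.fromhex(
    "7480000000000431df5f698000000000000002040000000077b02bb303f80000012adbadf5")
print(tikv_escape(key))
# t\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365
```

The output reproduces the escaped key used in the mvcc commands above (remember to prepend the `z` data prefix when passing it to tikv-ctl).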

7. Confirm the region status

```
mysql> select * from TIKV_REGION_STATUS where region_id in (366711639, 40507174)\G
*************************** 1. row ***************************
                REGION_ID: 366711639
                START_KEY: 7480000000000431FFDF5F698000000000FF0000020400000000FF77A751AA03B80000FF0021B1B123000000FC
                  END_KEY: 7480000000000431FFDF5F698000000000FF0000020400000000FF77B1949A03980000FF0014C54AF8000000FC
                 TABLE_ID: 274911
                  DB_NAME: DB
               TABLE_NAME: SuperXXXall
                 IS_INDEX: 1
                 INDEX_ID: 2
               INDEX_NAME: ip
           EPOCH_CONF_VER: 18971
            EPOCH_VERSION: 210340
            WRITTEN_BYTES: 0
               READ_BYTES: 0
         APPROXIMATE_SIZE: 70
         APPROXIMATE_KEYS: 1108205
  REPLICATIONSTATUS_STATE: NULL
REPLICATIONSTATUS_STATEID: NULL
*************************** 2. row ***************************
                REGION_ID: 40507174
                START_KEY: 7480000000000431FFDF5F698000000000FF0000010380000000FF13DAFBF103886518FF703DBBA158000000FC
                  END_KEY: 7480000000000431FFDF5F698000000000FF0000010380000000FF13DB043303886A23FF907234F2FC000000FC
                 TABLE_ID: 274911
                  DB_NAME: DB
               TABLE_NAME: SuperXXXall
                 IS_INDEX: 1
                 INDEX_ID: 1
               INDEX_NAME: uk_aid_sn
           EPOCH_CONF_VER: 19460
            EPOCH_VERSION: 210350
            WRITTEN_BYTES: 0
               READ_BYTES: 0
         APPROXIMATE_SIZE: 68
         APPROXIMATE_KEYS: 973219
  REPLICATIONSTATUS_STATE: NULL
REPLICATIONSTATUS_STATEID: NULL
2 rows in set (1 min 19.87 sec)
```

Both regions hold index data. If nothing else worked, we could tombstone the regions or recreate them as empty regions directly; in theory that would not affect the original row data.

8. Confirm the region location

```
tiup ctl:v5.3.1 pd region 366711639
Starting component `ctl`: /home/fdc/.tiup/components/ctl/v5.3.1/ctl pd region 366711639
{
  "id": 366711639,
  "start_key": "7480000000000431FFDF5F698000000000FF0000020400000000FF77A751AA03B80000FF0021B1B123000000FC",
  "end_key": "7480000000000431FFDF5F698000000000FF0000020400000000FF77B1949A03980000FF0014C54AF8000000FC",
  "epoch": {
    "conf_ver": 18971,
    "version": 210340
  },
  "peers": [
    { "id": 737290763, "store_id": 400136022, "role_name": "Voter" },
    { "id": 1073621434, "store_id": 456470117, "role_name": "Voter" },
    { "id": 1553665102, "store_id": 477311917, "role_name": "Voter" }
  ],
  "leader": { "id": 1073621434, "store_id": 456470117, "role_name": "Voter" },
  "written_bytes": 0,
  "read_bytes": 0,
  "written_keys": 0,
  "read_keys": 0,
  "approximate_size": 70,
  "approximate_keys": 1108205
}
```


Try the same steps on a different TiKV instance


Encode the key and check the MVCC info on the new instance:

```
fdc@fdc-tidb02-onlinetikv:~$ tiup ctl:v5.3.1 tikv --data-dir=/ssd1/tikv/deploy/data mvcc -k "zt\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365" --show-cf=lock,write,default
Starting component ctl: /home/fdc/.tiup/components/ctl/v5.3.1/ctl tikv --data-dir=/ssd1/tidb/tikv/deploy1/data mvcc -k zt\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365 --show-cf=lock,write,default
[2025/09/22 16:13:06.667 +08:00] [INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[2025/09/22 16:13:06.667 +08:00] [INFO] [mod.rs:479] ["encryption is disabled."]
[2025/09/22 16:13:06.671 +08:00] [WARN] [config.rs:587] ["compaction guard is disabled due to region info provider not available"]
[2025/09/22 16:13:06.671 +08:00] [WARN] [config.rs:682] ["compaction guard is disabled due to region info provider not available"]
no mvcc infos for zt\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365
no mvcc infos
```

Check with recover-mvcc in read-only mode:

```
fdc@fdc-tidb06-onlinetikv:~$ tiup ctl:v5.3.1 tikv --data-dir=/ssd1/tikv/deploy/data recover-mvcc --read-only -r 366711639 -p 10.90.230.8:2379
Starting component ctl: /home/fdc/.tiup/components/ctl/v5.3.1/ctl tikv --data-dir=/ssd1/tikv/deploy/data recover-mvcc --read-only -r 366711639 -p ******
[2025/09/25 11:29:04.220 +08:00] [INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[2025/09/25 11:29:04.220 +08:00] [INFO] [mod.rs:479] ["encryption is disabled."]
[2025/09/25 11:29:04.224 +08:00] [WARN] [config.rs:587] ["compaction guard is disabled due to region info provider not available"]
[2025/09/25 11:29:04.224 +08:00] [WARN] [config.rs:682] ["compaction guard is disabled due to region info provider not available"]
Recover regions: [366711639], pd: ["10.90.230.8:2379"], read_only: true
[2025/09/25 11:29:07.042 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=10.90.230.8:2379]
[2025/09/25 11:29:07.044 +08:00] [INFO] [<unknown>] ["Disabling AF_INET6 sockets because socket() failed."]
[2025/09/25 11:29:07.044 +08:00] [INFO] [<unknown>] ["TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter"]
[2025/09/25 11:29:07.047 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7f9eb3606680 for subchannel 0x7f9ef7211f00"]
[2025/09/25 11:29:07.049 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://10.191.128.76:2379]
[2025/09/25 11:29:07.050 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7f9eb3606800 for subchannel 0x7f9ef7212280"]
[2025/09/25 11:29:07.050 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://10.90.230.8:2379]
[2025/09/25 11:29:07.054 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7f9eb3606980 for subchannel 0x7f9ef7211f00"]
[2025/09/25 11:29:07.056 +08:00] [INFO] [util.rs:668] ["connected to PD member"] [endpoints=http://10.90.230.8:2379]
[2025/09/25 11:29:07.056 +08:00] [INFO] [util.rs:536] ["all PD endpoints are consistent"] [endpoints="[\"10.90.230.8:2379\"]"]
[2025/09/25 11:29:07.257 +08:00] [INFO] [debug.rs:1098] ["thread 0: LOCK for_update_ts is less than WRITE ts, key: 7480000000000431FFDF5F698000000000FF0000020400000000FF77B02BB303F80000FF012ADBADF5000000FC, for_update_ts: 450046591374983318, commit_ts: 450046591977652390"]
[2025/09/25 11:29:07.580 +08:00] [INFO] [debug.rs:1063] ["thread 0: scan 1000000 rows"]
[2025/09/25 11:29:07.614 +08:00] [INFO] [debug.rs:956] ["thread 0: skip write 1 rows"]
[2025/09/25 11:29:07.614 +08:00] [INFO] [debug.rs:959] ["thread 0: total fix default: 0, lock: 1, write: 0"]
[2025/09/25 11:29:07.614 +08:00] [INFO] [debug.rs:968] ["thread 0 has finished working."]
success!
```

The mvcc subcommand still did not show the concrete lock.


At least the lock on the key was now visible (recover-mvcc reported "lock: 1"). We attempted the repair here first, then did the same on the other stores. After processing all three stores, the problem was resolved: GC obtained the lock normally and space was reclaimed.

Case 2:

1. Confirm the errors from the logs

```
["resolve store address failed"] [err_code=KV:Unknown] [err="Other(\"[src/server/resolve.rs:124]: unknown error \\\"[components/pd_client/src/util.rs:954]: invalid store ID 8769963922, not found\\\"\")"] [store_id=8769963922] [thread_id=203]

[raft_client.rs:829] ["resolve store address failed"] [err_code=KV:Unknown] [err="Other(\"[src/server/resolve.rs:124]: unknown error \\\"[components/pd_client/src/util.rs:954]: invalid store ID 32996937412, not found\\\"\")"] [store_id=32996937412] [thread_id=203]

[raft_client.rs:829] ["resolve store address failed"] [err_code=KV:Unknown] [err="Other(\"[src/server/resolve.rs:124]: unknown error \\\"[components/pd_client/src/util.rs:954]: invalid store ID 15351649554, not found\\\"\")"] [store_id=15351649554] [thread_id=203]

2025-09-11 11:20:23 (UTC+08:00)  PD 10.191.0.46:2379  [operator_controller.go:944] ["invalid store ID"] [store-id=8769963922]

2025-09-11 11:21:56 (UTC+08:00)  PD 10.191.0.46:2379  [operator_controller.go:944] ["invalid store ID"] [store-id=15351649554]
```

2. Try restarting TiKV

A similar case on the forum was temporarily resolved by restarting TiKV: https://asktug.com/t/topic/1045115/1

The restart had no effect here; the TiKV log still reported the errors:

```
[2025/09/11 14:45:58.229 +08:00] [INFO] [resolve.rs:121] ["resolve store not found"] [store_id=32996937412] [thread_id=18]
[2025/09/11 14:45:58.229 +08:00] [ERROR] [raft_client.rs:829] ["resolve store address failed"] [err_code=KV:Unknown] [err="Other(\"[src/server/resolve.rs:124]: unknown error \\\"[components/pd_client/src/util.rs:954]: invalid store ID 32996937412, not found\\\"\")"] [store_id=32996937412] [thread_id=202]
```

3. Check for abnormal regions

```
tiup ctl:v7.5.4 pd -u 10.191.0.46:2379 region check down-peer
{
  "count": 2,
  "regions": [
    {
      "id": 20278184100,
      "start_key": "7480000000000365FF955F698000000000FF0000020146334535FF35444144FF464645FF4441353938FF4644FF453136423143FF41FF31374538393446FFFF0000000000000000FFF703800000281D38FF963E000000000000F9",
      "end_key": "7480000000000365FF955F698000000000FF0000020146383942FF41383535FF303333FF3843363630FF3333FF464243333245FF37FF39423243463832FFFF0000000000000000FFF703800000281B55FFFC7B000000000000F9",
      "epoch": { "conf_ver": 9382, "version": 103309 },
      "peers": [
        { "role_name": "Learner", "is_learner": true, "id": 21571927681, "store_id": 8769963922, "role": 1 },
        { "role_name": "Voter", "id": 22597784773, "store_id": 23 },
        { "role_name": "Learner", "is_learner": true, "id": 22603334967, "store_id": 8769963926, "role": 1 },
        { "role_name": "Voter", "id": 33232384663, "store_id": 32996937412 }
      ],
      "leader": { "role_name": "Voter", "id": 22597784773, "store_id": 23 },
      "down_peers": [
        { "peer": { "role_name": "Learner", "is_learner": true, "id": 21571927681, "store_id": 8769963922, "role": 1 }, "down_seconds": 8285684 },
        { "peer": { "role_name": "Learner", "is_learner": true, "id": 22603334967, "store_id": 8769963926, "role": 1 }, "down_seconds": 8285684 },
        { "peer": { "role_name": "Voter", "id": 33232384663, "store_id": 32996937412 }, "down_seconds": 8285684 }
      ],
      "pending_peers": [
        { "role_name": "Learner", "is_learner": true, "id": 21571927681, "store_id": 8769963922, "role": 1 },
        { "role_name": "Learner", "is_learner": true, "id": 22603334967, "store_id": 8769963926, "role": 1 },
        { "role_name": "Voter", "id": 33232384663, "store_id": 32996937412 }
      ],
      "cpu_usage": 0,
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 465,
      "approximate_keys": 4353659
    },
    {
      "id": 29486864661,
      "start_key": "74800000000004E2FF605F728000003695FF505EE90000000000FA",
      "end_key": "74800000000004E2FF605F728000003695FF65A3D10000000000FA",
      "epoch": { "conf_ver": 13178, "version": 143231 },
      "peers": [
        { "role_name": "Voter", "id": 29963431801, "store_id": 21961705754 },
        { "role_name": "Learner", "is_learner": true, "id": 29976105937, "store_id": 15351649554, "role": 1 },
        { "role_name": "Learner", "is_learner": true, "id": 29998759908, "store_id": 8769963926, "role": 1 },
        { "role_name": "Voter", "id": 33232575337, "store_id": 32996937412 }
      ],
      "leader": { "role_name": "Voter", "id": 29963431801, "store_id": 21961705754 },
      "down_peers": [
        { "peer": { "role_name": "Learner", "is_learner": true, "id": 29976105937, "store_id": 15351649554, "role": 1 }, "down_seconds": 8285666 },
        { "peer": { "role_name": "Learner", "is_learner": true, "id": 29998759908, "store_id": 8769963926, "role": 1 }, "down_seconds": 8285666 },
        { "peer": { "role_name": "Voter", "id": 33232575337, "store_id": 32996937412 }, "down_seconds": 8285666 }
      ],
      "pending_peers": [
        { "role_name": "Learner", "is_learner": true, "id": 29976105937, "store_id": 15351649554, "role": 1 },
        { "role_name": "Learner", "is_learner": true, "id": 29998759908, "store_id": 8769963926, "role": 1 },
        { "role_name": "Voter", "id": 33232575337, "store_id": 32996937412 }
      ],
      "cpu_usage": 0,
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 767,
      "approximate_keys": 1350662
    }
  ]
}
```


Try adding the replicas back


Both abnormal regions had peers on the failed stores, so we tried adding replicas back on other nodes. For the first region, 20278184100, only the leader replica remained, so we tried to add replicas:
```
# reference usage:
>> operator add add-peer 1 2       // add a replica of region 1 on store 2
>> operator add add-learner 1 2    // add a learner replica of region 1 on store 2

# add replicas on stores 16 and 33472595794
tiup ctl:v7.5.4 pd -u {ip}:2379 operator add add-peer 20278184100 16
Success! The operator is created.

# operators for one region cannot run in parallel; run them store by store
tiup ctl:v7.5.4 pd -u {ip}:2379 operator add add-peer 20278184100 33472595794
```


Retry adding replicas back and confirm the GC-related configuration


Tables involved:

```
mysql> select * from TIKV_REGION_STATUS where REGION_ID = 20278184100\G
*************************** 1. row ***************************
                REGION_ID: 20278184100
                START_KEY: 7480000000000365FF955F698000000000FF0000020146334535FF35444144FF464645FF4441353938FF4644FF453136423143FF41FF31374538393446FFFF0000000000000000FFF703800000281D38FF963E000000000000F9
                  END_KEY: 7480000000000365FF955F698000000000FF0000020146383942FF41383535FF303333FF3843363630FF3333FF464243333245FF37FF39423243463832FFFF0000000000000000FFF703800000281B55FFFC7B000000000000F9
                 TABLE_ID: NULL
                  DB_NAME: NULL
               TABLE_NAME: NULL
                 IS_INDEX: 0
                 INDEX_ID: NULL
               INDEX_NAME: NULL
             IS_PARTITION: 0
             PARTITION_ID: NULL
           PARTITION_NAME: NULL
           EPOCH_CONF_VER: 9382
            EPOCH_VERSION: 103309
            WRITTEN_BYTES: 0
               READ_BYTES: 0
         APPROXIMATE_SIZE: 465
         APPROXIMATE_KEYS: 4353659
  REPLICATIONSTATUS_STATE: NULL
REPLICATIONSTATUS_STATEID: NULL
1 row in set (32.32 sec)

mysql> select * from TIKV_REGION_STATUS where REGION_ID = 29486864661\G
*************************** 1. row ***************************
                REGION_ID: 29486864661
                START_KEY: 74800000000004E2FF605F728000003695FF505EE90000000000FA
                  END_KEY: 74800000000004E2FF605F728000003695FF65A3D10000000000FA
                 TABLE_ID: 10505
                  DB_NAME: mars_p1log
               TABLE_NAME: loginrole
                 IS_INDEX: 0
                 INDEX_ID: NULL
               INDEX_NAME: NULL
             IS_PARTITION: 1
             PARTITION_ID: 320096
           PARTITION_NAME: p20240815
           EPOCH_CONF_VER: 13178
            EPOCH_VERSION: 143231
            WRITTEN_BYTES: 0
               READ_BYTES: 0
         APPROXIMATE_SIZE: 767
         APPROXIMATE_KEYS: 1350662
  REPLICATIONSTATUS_STATE: NULL
REPLICATIONSTATUS_STATEID: NULL
1 row in set (31.31 sec)
```
The operator eventually timed out after 51m. Try remove-peer instead:

```
# reference usage:
>> operator add remove-peer 1 2    // remove the replica of region 1 on store 2

tiup ctl:v7.5.4 pd -u {ip}:2379 operator check 20278184100
tiup ctl:v7.5.4 pd -u {ip}:2379 operator remove 20278184100
tiup ctl:v7.5.4 pd -u {ip}:2379 operator add remove-peer 20278184100 8769963922
```

For the first (empty) region this also timed out in the end:

```
[2025/09/11 16:21:55.912 +08:00] [INFO] [operator_controller.go:659] ["operator timeout"] [region-id=20278184100] [takes=4m39.750985291s] [operator="\"admin-remove-peer {rm peer: store [8769963922]} (kind:admin,region, region:20278184100(103309, 9382), createAt:2025-09-11 16:17:16.161505793 +0800 CST m=+17048519.309387956, startAt:2025-09-11 16:17:16.161565328 +0800 CST m=+17048519.309447485, currentStep:0, size:465, steps:[0:{remove peer on store 8769963922}], timeout:[4m39s]) timeout\""] [additional-info="{\"cancel-reason\":\"timeout\"}"]
```

Adding a peer for the second region timed out as well:

```
fdc@fdc-tidb01-tidbp1:~$ tiup ctl:v7.5.4 pd -u {ip}:2379 operator add add-peer 29486864661 16
Success! The operator is created.
```
Check the relevant configurations:

```
mysql> show config where name like '%enable-remove-down-replica';
+------+------------------+-------------------------------------+-------+
| Type | Instance         | Name                                | Value |
+------+------------------+-------------------------------------+-------+
| pd   | {ip}:2379        | schedule.enable-remove-down-replica | true  |
| pd   | {ip}:2379        | schedule.enable-remove-down-replica | true  |
| pd   | {ip}:2379        | schedule.enable-remove-down-replica | true  |
+------+------------------+-------------------------------------+-------+
3 rows in set (0.06 sec)

mysql> show config where name like '%enable-replace-offline-replica';
+------+------------------+-----------------------------------------+-------+
| Type | Instance         | Name                                    | Value |
+------+------------------+-----------------------------------------+-------+
| pd   | {ip}:2379        | schedule.enable-replace-offline-replica | true  |
| pd   | {ip}:2379        | schedule.enable-replace-offline-replica | true  |
| pd   | {ip}:2379        | schedule.enable-replace-offline-replica | true  |
+------+------------------+-----------------------------------------+-------+
```

Both scheduling options were confirmed to be enabled.


Prepare to run lossy (unsafe) recovery, which also refused to run


```
fdc@fdc-tidb01-tidbp1:~/tipd/deploy/log$ tiup ctl:v7.5.4 pd -u {ip}:2379 store 32996937412
Failed to get store: [404] "[PD:core:ErrStoreNotFound]store 32996937412 not found"
fdc@fdc-tidb01-tidbp1:~/tipd/deploy/log$ tiup ctl:v7.5.4 pd -u {ip}:2379 store 8769963922
Failed to get store: [404] "[PD:core:ErrStoreNotFound]store 8769963922 not found"
fdc@fdc-tidb01-tidbp1:~/tipd/deploy/log$ tiup ctl:v7.5.4 pd -u {ip}:2379 store 8769963926
Failed to get store: [404] "[PD:core:ErrStoreNotFound]store 8769963926 not found"
----------------
tiup ctl:v7.5.4 pd -u {ip}:2379 unsafe remove-failed-stores 8769963922,8769963926,32996937412
Failed! [500] "[PD:unsaferecovery:ErrUnsafeRecoveryInvalidInput]invalid input store 32996937412 doesn't exist"
```

5. Try recreate-region

Reference: https://tidb.net/blog/ddef26a5#4%C2%A0%C2%A0%20%E5%BC%82%E5%B8%B8%E5%A4%84%E7%90%86%E4%B8%89%E6%9D%BF%E6%96%A7/4.3%20%E7%AC%AC%E4%B8%89%E6%8B%9B%EF%BC%9A%E9%87%8D%E5%BB%BAregion

Quoted from that article, "4.3 Option 3: rebuild the region": if all replicas of a region have been lost, or a small number of data-free empty regions cannot elect a leader, the region can be rebuilt with recreate-region.

(1) All replicas lost, after multi-replica failure recovery has been executed. Check for regions whose replicas all sit on the failed TiKV stores (put the failed store_ids inside the if clause):

```
pd-ctl region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(4,5,7) then . else empty end) | length>$total-length)}' | sort
```

(2) A few regions hold no data, cannot elect a leader, and nothing has been done to the cluster yet. Use curl http://tidb_ip:10080/regions/{region_id} to inspect the objects in the region. If the frames field is empty, the region holds no data and rebuilding it is harmless; otherwise data will be lost.

(3) Rebuild the region: stop the surviving TiKV instances the region lives on, then run on one of the healthy TiKVs:

```
tikv-ctl --data-dir /data/tidb-data/tikv-20160 recreate-region -p 'pd_ip:pd_port' -r <region_id>
```

Note: older versions take --db instead of --data-dir; point it at the healthy TiKV's directory. When copy-pasting the command, also check that quotes and dashes have not been turned into full-width characters.
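The quoted jq filter selects regions where the peers on failed stores outnumber the healthy ones. The same check as a small Python sketch over pd-ctl region JSON (the region list below is abbreviated from our case, and the helper name is mine):

```python
def regions_with_majority_lost(regions, failed_stores):
    """Return ids of regions whose peers on failed stores form a majority."""
    lost = []
    for region in regions:
        stores = [p["store_id"] for p in region["peers"]]
        failed = sum(1 for s in stores if s in failed_stores)
        if failed > len(stores) - failed:  # more failed peers than healthy ones
            lost.append(region["id"])
    return lost

# abbreviated pd-ctl output for illustration
regions = [
    {"id": 29486864661, "peers": [{"store_id": 21961705754},
                                  {"store_id": 15351649554},
                                  {"store_id": 8769963926},
                                  {"store_id": 32996937412}]},
    {"id": 1, "peers": [{"store_id": 23}, {"store_id": 16},
                        {"store_id": 21961705754}]},
]
failed = {8769963922, 8769963926, 15351649554, 32996937412}
print(regions_with_majority_lost(regions, failed))
# [29486864661]
```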


Confirm the region contents


```
curl 'http://{ip}:10081/regions/20278184100'
{
 "start_key": "dIAAAAAAA2WVX2mAAAAAAAAAAgFGM0U1NURBRP9GRkVEQTU5OP9GREUxNkIxQ/9BMTdFODk0Rv8AAAAAAAAAAPcDgAAAKB04lj4=",
 "end_key": "dIAAAAAAA2WVX2mAAAAAAAAAAgFGODlCQTg1Nf8wMzM4QzY2MP8zM0ZCQzMyRf83OUIyQ0Y4Mv8AAAAAAAAAAPcDgAAAKBtV/Hs=",
 "start_key_hex": "7480000000000365955f698000000000000002014633453535444144ff4646454441353938ff4644453136423143ff4131374538393446ff0000000000000000f703800000281d38963e",
 "end_key_hex": "7480000000000365955f698000000000000002014638394241383535ff3033333843363630ff3333464243333245ff3739423243463832ff0000000000000000f703800000281b55fc7b",
 "region_id": 20278184100,
 "frames": null
}

curl 'http://{ip}:10081/regions/29486864661'
{
 "start_key": "dIAAAAAABOJgX3KAAAA2lVBe6Q==",
 "end_key": "dIAAAAAABOJgX3KAAAA2lWWj0Q==",
 "start_key_hex": "74800000000004e2605f728000003695505ee9",
 "end_key_hex": "74800000000004e2605f72800000369565a3d1",
 "region_id": 29486864661,
 "frames": [
  {
   "db_name": "db",
   "table_name": "table",
   "table_id": 320096,
   "is_record": true,    (contains actual records, not an index)
   "record_id": 234433306345
  }
 ]
}
```

The region whose frames field is empty can be rebuilt without impact. In the end, though, this did not help either.

6. unsafe remove-failed-stores --auto-detect: the final solution

Reference: https://asktug.com/t/topic/1029821/105

```
tiup ctl:v7.5.4 pd -u {ip}:2379 unsafe remove-failed-stores --auto-detect

fdc@fdc-tidb01-tidbp1:~$ tiup ctl:v7.5.4 pd -u {ip}:2379 unsafe remove-failed-stores show
Starting component ctl: /home/fdc/.tiup/components/ctl/v7.5.4/ctl pd -u 10.191.0.46:2379 unsafe remove-failed-stores show
[
  {
    "info": "Unsafe recovery enters collect report stage",
    "time": "2025-09-23 09:40:55.189",
    "details": [ "auto detect mode with no specified Failed stores" ]
  },
  {
    "info": "Unsafe recovery enters demote Failed voter stage",
    "time": "2025-09-23 09:41:44.825",
    "actions": {
      "store 21961705754": [ "tombstone the peer of region 29486864661" ],
      "store 23": [ "tombstone the peer of region 20278184100" ]
    }
  },
  {
    "info": "Unsafe recovery Finished",
    "time": "2025-09-23 09:43:13.212",
    "details": [
      "affected table ids: 320096, 222613",
      "no newly created empty regions"
    ]
  }
]
```


After this completed, GC could obtain the locks normally and run.

Summary

Recovery approaches:


  1. Use the tikv-ctl recover-mvcc tool to clean up locks that GC cannot resolve

  2. Use unsafe remove-failed-stores --auto-detect to clean up leftover store ID metadata


Operational takeaways:


  1. When decommissioning TiKV nodes, avoid force scale-in whenever possible. Allow extra time and follow the normal offline process.

  2. Upgrade the cluster version in a timely manner.

  3. Monitor the GC safepoints of all components. In both cases the TiDB and PD safepoints looked normal while TiKV was never actually able to run GC and reclaim space; monitor for this and handle it promptly.
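A sketch of such a safepoint-lag check, reusing the TSO split from Case 1. The JSON shape here mimics what pd-ctl service-gc-safepoint returns, but both the field names and the threshold are assumptions to verify against your PD version:

```python
import json
import time

def stale_safepoints(pd_json: str, max_lag_hours: float = 24):
    """Flag service GC safepoints whose physical time lags too far behind now."""
    data = json.loads(pd_json)
    now_ms = time.time() * 1000
    stale = []
    for sp in data.get("service_gc_safe_points", []):
        physical_ms = sp["safe_point"] >> 18   # TSO high bits = ms timestamp
        lag_hours = (now_ms - physical_ms) / 3_600_000
        if lag_hours > max_lag_hours:
            stale.append((sp["service_id"], round(lag_hours, 1)))
    return stale

# example: a safepoint stuck at 2024-05-27, as in Case 1
sample = json.dumps({"service_gc_safe_points": [
    {"service_id": "gc_worker", "safe_point": 450046591374983318}]})
print(stale_safepoints(sample))
```

Wiring this up to alerting would have caught both incidents long before the safepoint fell weeks behind.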

