Troubleshooting Uneven TiKV Storage Usage Caused by a GC Bug

2022-12-30
Beijing

Author: OnTheRoad. Original source: https://tidb.net/blog/82d51e46

1. Problem Description

1.1. Environment

1.2. Symptoms

1.2.1. Dashboard Logs

TiDB Dashboard shows a large number of ERROR-level log entries from the GC worker, such as:

[gc_worker.go:713] ["[gc worker] delete range failed on range"] [uuid=60a807a27f00012] [startKey=7480000000000017ac] [endKey=7480000000000017ad] [error="[gc worker] destroy range finished with errors: [unsafe destroy range failed on store 1: gc worker is too busy unsafe destroy range failed on store 3: gc worker is too busy unsafe destroy range failed on store 2: gc worker is too busy]"]
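
The startKey and endKey in this log entry are hex-encoded TiKV keys. As a minimal sketch, assuming they follow TiDB's table-prefix key layout (the byte `t` followed by an 8-byte big-endian integer whose sign bit is flipped), the table ID behind a failed range can be recovered as shown below. The helper name `decode_table_id` is hypothetical and only serves this illustration.

```python
def decode_table_id(hex_key: str) -> int:
    """Decode the table ID from a hex-encoded table-prefix key.

    Assumes the layout 't' + 8-byte big-endian int64 with the sign bit flipped.
    """
    raw = bytes.fromhex(hex_key)
    if len(raw) < 9 or raw[:1] != b"t":
        raise ValueError("not a table-prefix key: " + hex_key)
    # Flip the sign bit back to obtain the original 64-bit value.
    value = int.from_bytes(raw[1:9], "big") ^ (1 << 63)
    # Reinterpret as a signed int64 (table IDs are positive in practice).
    if value >= 1 << 63:
        value -= 1 << 64
    return value


if __name__ == "__main__":
    print(decode_table_id("7480000000000017ac"))  # 6060
    print(decode_table_id("7480000000000017ad"))  # 6061
```

Under that assumption, the range in the log covers exactly one table ID (6060 up to, but not including, 6061), which is consistent with a range left behind by a DROP or TRUNCATE of a single table.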


1.2.2. Uneven TiKV Space Usage

The Grafana dashboard (path: kruidb-PD -> Statistics-balance) shows that disk usage differs considerably across the 3 TiKV nodes.

2. Problem Analysis

2.1. Checking Region Distribution via the System View

mysql> SELECT t1.store_id,
              sum(case when t1.is_leader = 1 then 1 else 0 end) leader_cnt,
              count(t1.peer_id) region_cnt
         FROM information_schema.tikv_region_peers t1
        GROUP BY t1.store_id;
+----------+------------+------------+
| store_id | leader_cnt | region_cnt |
+----------+------------+------------+
|        1 |      27292 |      81867 |
|        2 |      27292 |      81867 |
|        3 |      27292 |      81867 |
+----------+------------+------------+
3 rows in set (0.00 sec)


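The system view information_schema.tikv_region_peers shows how Leaders and Region replicas are distributed across the TiKV nodes. The result shows that both Leaders and Region replicas are evenly distributed across the nodes, so the imbalance is not in the replica counts.

2.2. Checking Region Distribution via the Monitoring Dashboard

The Grafana dashboard (path: kruidb-Overview -> TiKV) also shows the distribution of Leaders and Region replicas across the TiKV nodes. What it displays is consistent with the result from information_schema.tikv_region_peers.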

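2.3. System Tables gc_delete_range

mysql.gc_delete_range: key-value ranges that, after a DROP/TRUNCATE, are waiting to be physically deleted by the GC worker on its periodic runs;
mysql.gc_delete_range_done: key-value ranges that the GC worker has already physically deleted.

These tables show a large amount of data that should have been physically deleted but was not, because the GC worker kept failing.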
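
To put a number on that backlog, the two tables can simply be counted. The following is a minimal sketch, assuming `cursor` is any Python DB-API 2.0 cursor already connected to the TiDB server (the choice of driver is left open); the function name `count_delete_ranges` is hypothetical.

```python
def count_delete_ranges(cursor):
    """Return (pending, done) row counts from the GC delete-range tables.

    `cursor` is assumed to be a DB-API 2.0 cursor connected to TiDB.
    """
    # Ranges still waiting for physical deletion.
    cursor.execute("SELECT COUNT(*) FROM mysql.gc_delete_range")
    pending = cursor.fetchone()[0]
    # Ranges the GC worker has already deleted.
    cursor.execute("SELECT COUNT(*) FROM mysql.gc_delete_range_done")
    done = cursor.fetchone()[0]
    return pending, done
```

A pending count that keeps growing while the done count stays flat would match the delete range failures seen in the logs.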

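2.4. Checking Store Scores

The Grafana dashboard (path: kruidb-PD -> Statistics-balance) shows the Leader score and Region score of each TiKV node. The PD scheduler preferentially moves Leaders and Regions to TiKV nodes with lower scores.

The dashboard shows that Leader scores are fairly balanced across the TiKV nodes, while store-2 and store-3 have higher Region scores because they are short on free space.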

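3. Resolution

Based on the system views and dashboards above, the preliminary conclusion is that the GC worker failures left a large amount of data that should have been physically deleted still on disk, where it occupies a large amount of storage space.

A search on AskTUG confirmed that the cluster had hit GC bug #11903. See the AskTUG thread "TiDB 节点大量[gc worker] delete range failed 报错信息" ("Many [gc worker] delete range failed errors on TiDB nodes").

Temporary workaround: disable gc.enable-compaction-filter and restart the cluster.
Permanent fix: upgrade the TiDB cluster to a version in which the bug is fixed.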

3.1. Disabling gc.enable-compaction-filter

Modify the TiKV configuration online
Modify the persistent configuration file
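
The first step, the online change, could look like the sketch below. It assumes `cursor` is a Python DB-API 2.0 cursor connected to TiDB and that gc.enable-compaction-filter can be changed at runtime on the cluster's version; the function name `disable_compaction_filter` is hypothetical. It uses TiDB's `SET CONFIG` and `SHOW CONFIG` statements.

```python
SET_SQL = "SET CONFIG tikv `gc.enable-compaction-filter` = false"
SHOW_SQL = (
    "SHOW CONFIG WHERE type = 'tikv' "
    "AND name = 'gc.enable-compaction-filter'"
)


def disable_compaction_filter(cursor):
    """Disable the GC compaction filter on all TiKV nodes and read it back.

    Returns a dict mapping each TiKV instance address to the current value.
    """
    cursor.execute(SET_SQL)
    cursor.execute(SHOW_SQL)
    # SHOW CONFIG rows are (Type, Instance, Name, Value).
    return {row[1]: row[3] for row in cursor.fetchall()}
```

Reading the value back per instance is a cheap way to confirm that every TiKV node picked up the change before moving on to the persistent configuration.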

A change made online with SET CONFIG would be overwritten by the next tiup reload, so the persistent configuration file must be modified as well.

[tidb@tiup-console ~]$ tiup cluster edit-config kruidb
server_configs:
  tikv:
    gc.enable-compaction-filter: false


[tidb@tiup-console ~]$ tiup cluster reload kruidb
[tidb@tiup-console ~]$ tiup cluster stop  kruidb
[tidb@tiup-console ~]$ tiup cluster start kruidb


3.2. Increasing Scheduling

Adjust the PD scheduling parameters to speed up rebalancing.

[tidb@tiup-console ~]$ find ./ |grep pd-ctl
./.tiup/components/ctl/v5.3.0/pd-ctl
[tidb@tiup-console ~]$ tiup ctl:v6.1.0 pd -u http://192.168.72.11:2379 -i
## 1. Show the scheduler configuration
» config show scheduler
{
  "replication": {
    "enable-placement-rules": "true",
    "enable-placement-rules-cache": "false",
    "isolation-level": "",
    "location-labels": "",
    "max-replicas": 3,
    "strictly-match-label": "false"
  },
  "schedule": {
    "enable-cross-table-merge": "true",
    "enable-joint-consensus": "true",
    "high-space-ratio": 0.7,
    "hot-region-cache-hits-threshold": 3,
    "hot-region-schedule-limit": 4,
    "hot-regions-reserved-days": 7,
    "hot-regions-write-interval": "10m0s",
    "leader-schedule-limit": 4,
    "leader-schedule-policy": "count",
    "low-space-ratio": 0.8,
    "max-merge-region-keys": 200000,
    "max-merge-region-size": 20,
    "max-pending-peer-count": 64,
    "max-snapshot-count": 64,
    "max-store-down-time": "30m0s",
    "max-store-preparing-time": "48h0m0s",
    "merge-schedule-limit": 8,
    "patrol-region-interval": "10ms",
    "region-schedule-limit": 2048,
    "region-score-formula-version": "v2",
    "replica-schedule-limit": 64,
    "split-merge-interval": "1h0m0s",
    "tolerant-size-ratio": 0
  }
}
»
## 2. Raise the limit on concurrent leader scheduling tasks
» config set leader-schedule-limit 8
## 3. Raise the limit on concurrent region scheduling tasks
» config set region-schedule-limit 4096


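3.3. Verifying the Result

After 12 hours, the Grafana dashboards showed that storage usage had evened out across the TiKV nodes, and that space usage had dropped from the original 3 TB to 500 GB.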


Original link: http://xie.infoq.cn/article/47d9e3e9ba8501ab08ba9f32d. Please contact the author before republishing this article.
