Author: Aunt-Shirly. Original source: https://tidb.net/blog/c25328d1
As a distributed database, scaling out and scaling in are among the most common operational tasks for a TiDB cluster. In this series, based on v7.5.0, we walk through how scaling actually works, the relevant configuration, and common problems. Viewed from the cluster's perspective, the goal of scaling is that, once it completes, scheduling rebalances the data so that resource usage across all online TiKV nodes reaches an even state. Two key steps are involved: PD generates the scheduling operators, and TiKV consumes (executes) those operators by actually moving data.
This series is organized around these two steps, covering the core modules involved in scaling and the common problems around them, split into several parts.
In this post we focus on how TiKV consumes scheduling operators during scaling and the problems you may hit along the way. The analysis applies to any scheduling that moves data between TiKV nodes.
TiKV replica migration: principles and common problems
Replica migration between TiKV nodes typically happens in these cases:
- Scaling out: data must be rebalanced across the old and new TiKV nodes, so regions move from the old nodes to the new ones. When the number of existing TiKV nodes is large relative to the number of new nodes, write pressure on the new nodes is the most likely bottleneck.
- Scaling in: data on the node being taken offline must be migrated to the remaining TiKV nodes.
- Hotspot scheduling and routine balance-region scheduling.
- A sudden placement-rule change (i.e. the replica rules are modified).
The complete replica-migration procedure was covered in chapter one; below we focus on the add learner and remove learner operations.
Overview of the add learner steps
Next, let's look at how an add learner peer operation is executed, from the perspective of TiKV's thread pools:
1. The pd-worker thread in TiKV handles the heartbeat exchange with PD. When PD responds with an add learner peer request, pd-worker hands the message to raftstore.
2. On receiving the add learner request, raftstore, acting as the region's leader, does two things:
   - Broadcasts the basic add learner peer info within the raft-group; the TiKV node that will host the learner also creates this region's metadata locally.
   - Hands the add learner admin message to the apply thread pool to be applied.
3. After the apply pool receives the add learner message and updates the local region metadata, snapshot generation starts and the snapshot is sent to the target TiKV. The following thread pools are involved:
   - Once apply is notified that snap-generator has finished generating the snapshot, it prepares a Task::Send message for snap-handler; snap-handler processes the task and ultimately hands it to the snap-sender pool, which sends the snapshot to the learner node.
   - The gRPC thread pool of the learner's TiKV receives the snapshot message and prepares a Task::Recv task for snap-handler; snap-handler processes it and hands the actual receiving work over to snap-sender.
   - After snap-sender finishes receiving the snapshot data, it notifies the apply thread pool that the snapshot files are complete.
4. The apply thread sends a task to region-worker, asking it to apply the snapshot into RocksDB.
5. region-worker processes the apply-snapshot message and has RocksDB apply the snapshot locally.
If any one of these steps stalls, the add learner operation slows down. In general, when we find add-learner timing out, we can pull the logs from the key nodes involved in the data migration and analyze them.
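When doing this, a quick way to pull every relevant line for one region is to filter by its region id and snapshot key. A minimal sketch (the log file name and region id below are placeholders for your deployment; the patterns match the log formats shown in the examples that follow):

# Run on both the leader's and the learner's TiKV node.
# REGION_ID comes from the stuck operator; tikv.log is wherever your
# deployment writes TiKV logs.
REGION_ID=142725
grep -E "region_id=${REGION_ID}|snap_key=${REGION_ID}_" tikv.log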
Example logs from the leader's node during add learner
Step 1: receive the AddLearner operator from PD
// thread-id:33 name: pd-worker
// handle heartbeat from PD:change peer: add learner peer 166543143 store_id: 166543141
// broadcast messages to raft-group
[2024/07/09 16:25:23.106 +08:00] [INFO] [pd.rs:1648] ["try to change peer"] [changes="[peer { id: 166543143 store_id: 166543141 role: Learner } change_type: AddLearnerNode]"] [region_id=142725] [thread_id=33]
Step 2: broadcast config change
// thread-id:367 name: raftstore_* ??
[2024/07/09 16:25:23.106 +08:00] [INFO] [peer.rs:4801] ["propose conf change peer"] [kind=Simple] [changes="[change_type: AddLearnerNode peer { id: 166543143 store_id: 166543141 role: Learner }]"] [peer_id=142728] [region_id=142725] [thread_id=367]
// thread-id:369 name: apply-?
[2024/07/09 16:25:23.107 +08:00] [INFO] [apply.rs:1699] ["execute admin command"] [command="cmd_type: ChangePeerV2 change_peer_v2 { changes { change_type: AddLearnerNode peer { id: 166543143 store_id: 166543141 role: Learner } } }"] [index=18] [term=14] [peer_id=142728] [region_id=142725] [thread_id=369]
[2024/07/09 16:25:23.107 +08:00] [INFO] [apply.rs:2293] ["exec ConfChangeV2"] [epoch="conf_ver: 5 version: 575"] [kind=Simple] [peer_id=142728] [region_id=142725] [thread_id=369]
[2024/07/09 16:25:23.107 +08:00] [INFO] [apply.rs:2474] ["conf change successfully"] ["current region"="id: 142725 start_key: …end_key: region_epoch { conf_ver: 6 version: 575 } peers { id: 142726 store_id: 1 } peers { id: 142727 store_id: 2 } peers { id: 142728 store_id: 3 } peers { id: 166543143 store_id: 166543141 role: Learner }"] ["original region"="id: 142725 start_key: end_key: region_epoch { conf_ver: 5 version: 575 } peers { id: 142726 store_id: 1 } peers { id: 142727 store_id: 2 } peers { id: 142728 store_id: 3 }"] [changes="[change_type: AddLearnerNode peer { id: 166543143 store_id: 166543141 role: Learner }]"] [peer_id=142728] [region_id=142725] [thread_id=369]
[2024/07/09 16:25:23.107 +08:00] [INFO] [raft.rs:2660] ["switched to configuration"] [config="Configuration { voters: Configuration { incoming: Configuration { voters: {142728, 142726, 142727} }, outgoing: Configuration { voters: {} } }, learners: {166543143}, learners_next: {}, auto_leave: false }"] [raft_id=142728] [region_id=142725] [thread_id=367]
[2024/07/09 16:25:23.107 +08:00] [INFO] [peer.rs:4055] ["notify pd with change peer region"] [region="id: 142725 start_key: end_key: region_epoch { conf_ver: 6 version: 575 } peers { id: 142726 store_id: 1 } peers { id: 142727 store_id: 2 } peers { id: 142728 store_id: 3 } peers { id: 166543143 store_id: 166543141 role: Learner }"] [peer_id=142728] [region_id=142725] [thread_id=367]
Step 3: prepare snapshot
// raftstore: process request get_snapshot
[2024/07/09 16:25:23.109 +08:00] [INFO] [peer_storage.rs:549] ["requesting snapshot"] [request_peer=166543143] [request_index=0] [peer_id=142728] [region_id=142725] [thread_id=367]
Step 3.1: generate snapshot
// thread-id:349 name: snap-generator thread-num:snap_generator_pool_size
[2024/07/09 16:25:23.111 +08:00] [INFO] [snap.rs:916] ["scan snapshot of one cf"] [size=0] [key_count=0] [cf=default] [snapshot=/DATA/disk2/shirly/tiup/tidb-data/tikv-20161/snap/gen_142725_14_18_(default|lock|write).sst] [region_id=142725] [thread_id=349]
[2024/07/09 16:25:23.111 +08:00] [INFO] [snap.rs:916] ["scan snapshot of one cf"] [size=0] [key_count=0] [cf=lock] [snapshot=/DATA/disk2/shirly/tiup/tidb-data/tikv-20161/snap/gen_142725_14_18_(default|lock|write).sst] [region_id=142725] [thread_id=349]
[2024/07/09 16:25:24.089 +08:00] [INFO] [io.rs:231] ["build_sst_cf_file_list builds 1 files in cf write. Total keys 645299, total size 100663198. raw_size_per_file 104857600, total takes 977.622872ms"] [thread_id=349]
[2024/07/09 16:25:24.097 +08:00] [INFO] [snap.rs:916] ["scan snapshot of one cf"] [size=100663198] [key_count=645299] [cf=write] [snapshot=/DATA/disk2/shirly/tiup/tidb-data/tikv-20161/snap/gen_142725_14_18_(default|lock|write).sst] [region_id=142725] [thread_id=349]
[2024/07/09 16:25:24.097 +08:00] [INFO] [snap.rs:1090] ["scan snapshot"] [takes=987.083966ms] [size=14988542] [key_count=645299] [snapshot=/DATA/disk2/shirly/tiup/tidb-data/tikv-20161/snap/gen_142725_14_18_(default|lock|write).sst] [region_id=142725] [thread_id=349]
Step 3.2: send snapshot
// thread_id:366 name: raftstore
[2024/07/09 16:25:24.098 +08:00] [INFO] [peer_storage.rs:510] ["start sending snapshot"] [request_peer=166543143] [peer_id=142728] [region_id=142725] [thread_id=366]
// thread-id:328 name: snap-generator
[2024/07/09 16:25:24.099 +08:00] [INFO] [snap.rs:679] ["set_snapshot_meta total cf files count: 3"] [thread_id=328]
// thread_id:396, name: snap-sender, pool_size: 4 (not configurable)
[2024/07/09 16:25:24.239 +08:00] [INFO] [snap.rs:584] ["sent snapshot"] [duration=137.340853ms] [size=14988542] [snap_key=142725_14_18] [region_id=142725] [thread_id=396]
[2024/07/09 16:25:24.239 +08:00] [INFO] [peer.rs:2024] ["report snapshot status"] [status=Finish] [to="id: 166543143 store_id: 166543141 role: Learner"] [peer_id=142728] [region_id=142725] [thread_id=367]
Example logs from the learner's node during add learner
Step 1: receive a raft message from the leader and create an empty peer
// thread_id:362 name: raftstore: handle raft messages
[2024/07/09 16:25:23.108 +08:00] [INFO] [peer.rs:325] ["replicate peer"] [create_by_peer_store_id=3] [create_by_peer_id=142728] [store_id=166543141] [peer_id=166543143] [region_id=142725] [thread_id=362]
[2024/07/09 16:25:23.108 +08:00] [INFO] [raft.rs:2660] ["switched to configuration"] [config="Configuration { voters: Configuration { incoming: Configuration { voters: {} }, outgoing: Configuration { voters: {} } }, learners: {}, learners_next: {}, auto_leave: false }"] [raft_id=166543143] [region_id=142725] [thread_id=362]
[2024/07/09 16:25:23.108 +08:00] [INFO] [raft.rs:1127] ["became follower at term 0"] [term=0] [raft_id=166543143] [region_id=142725] [thread_id=362]
[2024/07/09 16:25:23.108 +08:00] [INFO] [raft.rs:388] [newRaft] [peers="Configuration { incoming: Configuration { voters: {} }, outgoing: Configuration { voters: {} } }"] ["last term"=0] ["last index"=0] [applied=0] [commit=0] [term=0] [raft_id=166543143] [region_id=142725] [thread_id=362]
[2024/07/09 16:25:23.108 +08:00] [INFO] [raw_node.rs:315] ["RawNode created with id 166543143."] [id=166543143] [raft_id=166543143] [region_id=142725] [thread_id=362]
[2024/07/09 16:25:23.108 +08:00] [INFO] [raft.rs:1362] ["received a message with higher term from 142728"] ["msg type"=MsgHeartbeat] [message_term=14] [term=0] [from=142728] [raft_id=166543143] [region_id=142725] [thread_id=362]
[2024/07/09 16:25:23.108 +08:00] [INFO] [raft.rs:1127] ["became follower at term 14"] [term=14] [raft_id=166543143] [region_id=142725] [thread_id=362]
Step 2: receive snapshot file from leader
// thread-id:392 , name:snap_sender
[2024/07/09 16:25:24.224 +08:00] [INFO] [snap.rs:342] ["saving snapshot file"] [file=/DATA/disk1/shirly/tiup/tidb-data/tikv-20164/snap/rev_142725_14_18_(default|lock|write).sst] [snap_key=142725_14_18] [thread_id=392]
[2024/07/09 16:25:24.235 +08:00] [INFO] [snap.rs:350] ["saving all snapshot files"] [takes=134.603179ms] [snap_key=142725_14_18] [thread_id=392]
Step 3: restore snapshot file from leader
// thread-id:362,raftstore
[2024/07/09 16:25:24.235 +08:00] [INFO] [raft_log.rs:662] ["log [committed=0, persisted=0, applied=0, unstable.offset=1, unstable.entries.len()=0] starts to restore snapshot [index: 18, term: 14]"] [snapshot_term=14] [snapshot_index=18] [log="committed=0, persisted=0, applied=0, unstable.offset=1, unstable.entries.len()=0"] [raft_id=166543143] [region_id=142725] [thread_id=362]
[2024/07/09 16:25:24.235 +08:00] [INFO] [raft.rs:2660] ["switched to configuration"] [config="Configuration { voters: Configuration { incoming: Configuration { voters: {142728, 142726, 142727} }, outgoing: Configuration { voters: {} } }, learners: {166543143}, learners_next: {}, auto_leave: false }"] [raft_id=166543143] [region_id=142725] [thread_id=362]
[2024/07/09 16:25:24.235 +08:00] [INFO] [raft.rs:2644] ["restored snapshot"] [snapshot_term=14] [snapshot_index=18] [last_term=14] [last_index=18] [commit=18] [raft_id=166543143] [region_id=142725] [thread_id=362]
[2024/07/09 16:25:24.235 +08:00] [INFO] [raft.rs:2525] ["[commit: 18, term: 14] restored snapshot [index: 18, term: 14]"] [snapshot_term=14] [snapshot_index=18] [commit=18] [term=14] [raft_id=166543143] [region_id=142725] [thread_id=362]
[2024/07/09 16:25:24.235 +08:00] [INFO] [peer_storage.rs:607] ["begin to apply snapshot"] [peer_id=166543143] [region_id=142725] [thread_id=362]
[2024/07/09 16:25:24.235 +08:00] [INFO] [peer_storage.rs:690] ["apply snapshot with state ok"] [for_witness=false] [state="applied_index: 18 truncated_state { index: 18 term: 14 }"] [region="id: 142725 start_key:end_key: region_epoch { conf_ver: 6 version: 575 } peers { id: 142726 store_id: 1 } peers { id: 142727 store_id: 2 } peers { id: 142728 store_id: 3 } peers { id: 166543143 store_id: 166543141 role: Learner }"] [peer_id=166543143] [region_id=142725] [thread_id=362]
[2024/07/09 16:25:24.236 +08:00] [INFO] [peer.rs:5041] ["snapshot is persisted"] [destroy_regions="[]"] [region="id: 142725 start_key: end_key: region_epoch { conf_ver: 6 version: 575 } peers { id: 142726 store_id: 1 } peers { id: 142727 store_id: 2 } peers { id: 142728 store_id: 3 } peers { id: 166543143 store_id: 166543141 role: Learner }"] [peer_id=166543143] [region_id=142725] [thread_id=362]
Step 4: apply snapshot to rocksdb
// thread-id:341 name: region_worker, 1 thread (not configurable)
[2024/07/09 16:25:24.236 +08:00] [INFO] [region.rs:455] ["begin apply snap data"] [peer_id=166543143] [region_id=142725] [thread_id=341]
[2024/07/09 16:25:24.248 +08:00] [INFO] [region.rs:503] ["apply new data"] [time_takes=11.639116ms] [region_id=142725] [thread_id=341]
Common problems
The most common causes of a stuck add learner are the following:
Add learner falls into a loop
A stuck add learner can push TiKV into an endless loop that deadlocks replica migration. Taking node removal (scale-in) as an example:
1. When the PD rule checker runs, fixing the unhealthy peer generates a replace-rule-offline(-leader)-peer operator.
2. PD issues the add learner peer command.
3. TiKV starts executing add learner peer.
4. After 20 minutes (a timeout estimated from the region size), PD still has not heard that add learner peer succeeded, so it treats the task as timed out and cancels the current replace-rule-offline-peer operator.
5. TiKV completes add learner after 30 minutes and reports a heartbeat to PD; the region now carries a learner peer.
6. When the rule checker next runs, it first checks for orphan peers and removes the learner that was just added successfully.
7. The next rule-checker run takes us back to step 1, and the loop is closed. (A pd-ctl sketch for observing this follows.)
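A hedged way to watch this loop from the PD side (the tiup invocation, PD address, and region id below are placeholders for your deployment, not values from this incident):

# List running operators; the signature of the loop is the same region
# repeatedly getting a replace-rule-offline-peer operator that never finishes.
tiup ctl:v7.5.0 pd -u http://127.0.0.1:2379 operator show
# Inspect the region itself; a learner peer that appears and then vanishes
# between invocations also points to this loop.
tiup ctl:v7.5.0 pd -u http://127.0.0.1:2379 region 142725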
Thread-pool-related problems
Problems related to the region_worker thread pool
The region worker's main job is handling snapshot generate/apply/destroy messages:
- Thread name: region-worker
- Pending-tasks label: "snapshot-worker"
- Thread count: 1 (not configurable)
- Tasks handled (Task<EK::Snapshot>):
  - Task::Gen: generate a snapshot; forwarded to the snap-generator thread pool.
  - Task::Apply: apply a snapshot; has RocksDB apply the snapshot files that have already been received.
  - Task::Destroy: delete a snapshot, i.e. remove a contiguous key range; used when a learner is removed.
Common problems:
region-worker CPU becomes the bottleneck:
Symptoms:
- At the CPU level the region worker thread is saturated (there is no dedicated panel for it yet; you can pick a panel under TiKV -> CPU and edit its formula to plot it).
- In TiKV-Details -> Task -> Worker pending tasks, the snapshot-worker count keeps piling up.
Workaround: since the region worker has exactly one thread and that cannot be changed, there is no great general fix. Throttle according to what is actually saturating the region worker CPU (see the pd-ctl sketch after this list):
- Heavy snapshot generation: transfer some leaders of the regions being migrated off this node; in extreme cases use evict-leader.
- Heavy snapshot apply: throttle on the PD side with store limit add-peer.
- Heavy snapshot deletion: throttle on the PD side with store limit remove-peer.
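A hedged pd-ctl sketch of the throttles above (store id, rates, and the PD address are placeholders, not values from this article):

# Slow down snapshot apply on store 4 by limiting how fast peers are added to it.
tiup ctl:v7.5.0 pd -u http://127.0.0.1:2379 store limit 4 5 add-peer
# Slow down snapshot deletion on store 4 by limiting how fast peers are removed.
tiup ctl:v7.5.0 pd -u http://127.0.0.1:2379 store limit 4 5 remove-peer
# Extreme case: evict all leaders from store 4 so it stops generating snapshots.
tiup ctl:v7.5.0 pd -u http://127.0.0.1:2379 scheduler add evict-leader-scheduler 4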
Problems related to the snap-generator thread pool
This thread pool generates the actual snapshots. Two bottlenecks deserve attention: snapshot I/O bandwidth, capped by server.snap-io-max-bytes-per-sec, and the number of generator threads, set by raftstore.snap-generator-pool-size. When you see either of them saturated (for example, in the snapshot panels of TiKV-Details), adjust the corresponding config item. Both can be changed through online config, with no TiKV instance restart required:
// Set snap-io-max-bytes-per-sec online
mysql> set config "127.0.0.1:20165" `server.snap-io-max-bytes-per-sec`= '200MiB';
Query OK, 0 rows affected (0.01 sec)
// Show configs
mysql> show config where type='tikv' and name like '%snap-io-max-bytes-per-sec%';
+------+-----------------+----------------------------------+--------+
| Type | Instance | Name | Value |
+------+-----------------+----------------------------------+--------+
| tikv | 127.0.0.1:20160 | server.snap-io-max-bytes-per-sec | 100MiB |
| tikv | 127.0.0.1:20162 | server.snap-io-max-bytes-per-sec | 100MiB |
| tikv | 127.0.0.1:20161 | server.snap-io-max-bytes-per-sec | 100MiB |
| tikv | 127.0.0.1:20164 | server.snap-io-max-bytes-per-sec | 100MiB |
| tikv | 127.0.0.1:20165 | server.snap-io-max-bytes-per-sec | 200MiB |
+------+-----------------+----------------------------------+--------+
5 rows in set (0.02 sec)
// Set snap-generator-pool-size online
mysql> set config "127.0.0.1:20165" `raftstore.snap-generator-pool-size`=6;
Query OK, 0 rows affected (0.01 sec)
// Show configs
mysql> show config where type='tikv' and name like '%snap-generator-pool-size%';
+------+-----------------+------------------------------------+-------+
| Type | Instance | Name | Value |
+------+-----------------+------------------------------------+-------+
| tikv | 127.0.0.1:20160 | raftstore.snap-generator-pool-size | 2 |
| tikv | 127.0.0.1:20162 | raftstore.snap-generator-pool-size | 2 |
| tikv | 127.0.0.1:20161 | raftstore.snap-generator-pool-size | 2 |
| tikv | 127.0.0.1:20164 | raftstore.snap-generator-pool-size | 2 |
| tikv | 127.0.0.1:20165 | raftstore.snap-generator-pool-size | 6 |
+------+-----------------+------------------------------------+-------+
5 rows in set (0.02 sec)
Problems related to the snap-handler and snap_sender thread pools
The snap-handler thread pool dispatches the send/receive-snapshot tasks; the actual sending and receiving of snapshots is done by its child pool, snap_sender.
- snap-handler has 1 thread, hard-coded and not configurable.
- snap-handler processes three task types:
  - Task::Recv — from kv_service.snap_scheduler, processed by the snap_sender pool
  - Task::Send — from AsyncRaftSender.snap_scheduler, processed by the snap_sender pool
  - Task::RefreshConfigEvent — from server.snap_scheduler
- The concurrency of snapshot sending and receiving is controlled by:
  - concurrent-send-snap-limit, 32 by default
  - concurrent-recv-snap-limit, 32 by default
- The snap_sender child pool has 4 threads, also hard-coded and not configurable.
- Related monitoring: TiKV-Details -> Snapshot -> snapshot transport speed & snapshot state count.
Summary
Today the TiKV-side thread pools expose exactly four tunable parameters; when you hit one of the problems above, adjust them according to the symptoms described:
- server.snap-io-max-bytes-per-sec
- raftstore.snap-generator-pool-size
- server.concurrent-send-snap-limit
- server.concurrent-recv-snap-limit
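As a quick check, all four can be read with one SHOW CONFIG query, in the same style as the earlier examples. A sketch (host, port, and credentials are placeholders for your TiDB endpoint):

# Read the four snapshot-related knobs from every TiKV in the cluster.
mysql -h 127.0.0.1 -P 4000 -u root -e "
  SHOW CONFIG WHERE type='tikv' AND (
    name LIKE '%snap-io-max-bytes-per-sec%' OR
    name LIKE '%snap-generator-pool-size%' OR
    name LIKE '%concurrent-send-snap-limit%' OR
    name LIKE '%concurrent-recv-snap-limit%');"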
Ingest SST drives up apply wait duration
Symptoms: apply wait duration rises, yet apply duration itself is not high, apply CPU is not busy, and increasing apply-pool-size does not help.
Cause: ingest SST takes RocksDB's global mutex (tikv#5911), which pushes apply wait up. When data is migrated from old TiKV nodes to new ones, deleting the old regions also goes through ingest SST (tikv#7794).
Workaround: slow data migration down during business peaks; when necessary, drop store limit all the way to 0.000001 (a pd-ctl example follows). Optimization work in this area is still in progress.
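A hedged example of that extreme throttle via pd-ctl (the PD address is a placeholder; remember to restore the limits after the peak):

# Effectively pause peer movement cluster-wide during business peaks.
tiup ctl:v7.5.0 pd -u http://127.0.0.1:2379 store limit all 0.000001 add-peer
tiup ctl:v7.5.0 pd -u http://127.0.0.1:2379 store limit all 0.000001 remove-peer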