
Three Case Studies from Upgrading TiDB Clusters to v6.1.5, and the Benefits of Upgrading

  • 2023-12-22 · Beijing

Author: Yifei_Jia. Original source: https://tidb.net/blog/087114d3


The team had been planning the TiDB version upgrade for a long time. We kept delaying because we were not sure we could handle all the unexpected problems that might come up.


These notes were written at the end of August; today, while doing a write-up, I came across them again and am sharing them as a reference for version upgrades.


Our production TiDB clusters hold a lot of data, tens of TB per cluster, and the business on them is critical, so we follow a "don't upgrade unless necessary" principle.


Still, as TiDB iterated, the new features, performance improvements, and fixes for bugs we had actually hit kept nudging us toward the decision to upgrade.


The three cases recorded here come from a plan to upgrade 16 TiDB clusters, in which we upgraded the clusters that were under heavy performance pressure and met the upgrade criteria to v6.1.5.

Problem 1: after the upgrade finished, one TiFlash node could not restart and kept OOMing


Troubleshooting process:


Error reported from the development side:

```
1. uncategorized SQLException; SQL state [HY000]; error code [1105]; other error for mpp stream: DB::Exception: Exchange receiver meet error : DB::Exception: Exchange receiver meet error : send async stream request fail; nested exception is java.sql.SQLException: other error for mpp stream: DB::Exception: Exchange receiver meet error : DB::Exception: Exchange receiver meet error : send async stream request fail
```

Error log on the tidb node:

```
2. [ERROR] 2023-08-23 ****** [http-nio-8080-exec-6] [com.*****.man.web.framework.exception.GlobalErrHandler] [7c25fb69df2800c5] - Generic exception info: [### Error querying database.  Cause: java.sql.SQLException: TiFlash server timeout
### The error may exist in URL [jar:file:/web/servers/tomcat/webapps/ROOT/WEB-INF/lib/***-man-dao-2.0.0-RELEASE.jar!/mapper/tidb/SheetListTidbMapper.xml]
### The error may involve *****.man.dao.tidb.SheetListTidbMapper.selectTidbPageByQuery-Inline
### The error occurred while setting parameters
### SQL: SELECT  ****** ;
### Cause: java.sql.SQLException: TiFlash server timeout; uncategorized SQLException; SQL state [HY000]; error code [9012]; TiFlash server timeout; nested exception is java.sql.SQLException: TiFlash server timeout]
org.springframework.jdbc.UncategorizedSQLException: ### Error querying database.  Cause: java.sql.SQLException: TiFlash server timeout; uncategorized SQLException; SQL state [HY000]; error code [9012]; TiFlash server timeout; nested exception is java.sql.SQLException: TiFlash server timeout
```


Its query plan showed that no suitable index could serve the read from TiKV; the only path was TiFlash.
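As a quick way to see this for yourself, EXPLAIN shows which engine each operator runs on. A minimal sketch follows; the table and columns are illustrative (borrowed from the add-index DDL in problem 3 below), since the actual query was redacted above:

```sql
-- Sketch: check which engine serves a query. With no usable index, the plan
-- shows a TableFullScan executed as mpp[tiflash] tasks; an indexed query
-- would instead show an IndexRangeScan under cop[tikv] tasks.
EXPLAIN
SELECT vender_id, COUNT(*)
FROM voucher_detail
WHERE created >= '2023-08-01'
GROUP BY vender_id;
```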



The first error corresponds to an MPP bug in older versions that had already been fixed, so in theory it should not have appeared.


We later confirmed that the TiFlash node had crashed after the upgrade and could not start normally. Although there were two TiFlash nodes, the data was in fact running with a single replica, so the errors kept coming.


Analysis and recommendations:


Upgrading TiFlash from v5.1 to v6.1.5, there is no guarantee that nothing else goes wrong along the way. On this machine, the restore step of the upgrade temporarily needed more memory than the machine's 16 GB, so the upgrade could not complete in place.


Two takeaways for future version upgrades: first, the hardware resources must keep up; second, the components being upgraded should run with multiple replicas.


With multiple replicas, you avoid waiting through the data migration caused by scaling TiFlash back out to resync its data during the upgrade.
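As a concrete pre-flight step, a check along these lines (a sketch against information_schema.tiflash_replica) confirms every TiFlash table really has two available replicas before you start:

```sql
-- Sketch of a pre-upgrade replica check: any row returned here means a single
-- TiFlash node failure during the upgrade will cost you availability.
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica
WHERE REPLICA_COUNT < 2 OR AVAILABLE = 0;

-- Raise the replica count where needed (replication runs asynchronously):
-- ALTER TABLE db_name.table_name SET TIFLASH REPLICA 2;
```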

Problem 2: upgrading the cluster with the --offline flag could not proceed: cluster is running and cannot be upgraded offline

```
will upgrade component           "tiflash" to "v6.1.5",
will upgrade component               "cdc" to "v6.1.5",
will upgrade component                "pd" to "v6.1.5",
will upgrade component              "tikv" to "v6.1.5",
will upgrade component              "tidb" to "v6.1.5",
will upgrade component        "prometheus" to "v6.1.5",
will upgrade component           "grafana" to "v6.1.5",
will upgrade component      "alertmanager" to "",
will upgrade component     "node-exporter" to "",
will upgrade component "blackbox-exporter" to "".
Do you want to continue? [y/N]:(default=N) y
Upgrading cluster...
Error: cluster is running and cannot be upgraded offline
```



We did not dig into why this happened. What we confirmed in the end: tiup had stopped the TiFlash component and showed its status as stopped, but on the server `systemctl status tiflash-9000` still showed running; the component had not actually stopped. This is easy to understand, since tiup shuts components down asynchronously.


The fix is simply to log in to the TiFlash server and stop the component there.
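A minimal sketch of that workaround (the systemd unit name tiflash-9000 follows the service port, as on our hosts; the cluster name is a placeholder):

```bash
# On the TiFlash host: tiup reports the component as Down, systemd disagrees.
systemctl status tiflash-9000

# Stop the unit for real, then re-run the offline upgrade.
sudo systemctl stop tiflash-9000
tiup cluster upgrade <cluster-name> v6.1.5 --offline
```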

Problem 3: a huge DDL was running during the upgrade, so after the upgrade the tidb compute nodes could not serve traffic


The service was simply unavailable. Let's look at each component's logs.


Our usual first step is to grep the logs for errors. The error logs were concentrated on PD, so at first we suspected a PD problem.


The relevant PD logs:


```
[2023/08/24 22:51:27.992 +08:00] [ERROR] [client.go:847] ["[pd] getTS error"] [dc-location=global] [stream-addr=http://10.16.57.62:2379] [error="[PD:client:ErrClientGetTSO]rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"]
[2023/08/24 22:51:27.993 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.106:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:27.993 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.106:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:27.994 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.62:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:27.994 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.62:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:27.995 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.82:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:27.995 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.82:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:27.995 +08:00] [ERROR] [base_client.go:144] ["[pd] failed updateMember"] [error="[PD:client:ErrClientGetLeader]get leader from [http://10.16.57.106:2379 http://10.16.57.62:2379 http://10.16.57.82:2379] error"]
[2023/08/24 22:51:28.265 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.62:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:28.266 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.82:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:28.266 +08:00] [ERROR] [base_client.go:144] ["[pd] failed updateMember"] [error="[PD:client:ErrClientGetLeader]get leader from [http://10.16.57.106:2379 http://10.16.57.62:2379 http://10.16.57.82:2379] error"]
[2023/08/24 22:51:28.266 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.82:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:28.476 +08:00] [ERROR] [client.go:847] ["[pd] getTS error"] [dc-location=global] [stream-addr=http://10.16.57.62:2379] [error="[PD:client:ErrClientGetTSO]rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"]
```


The gist: every request failed with an error saying there was no PD leader available to handle it.
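When every client reports "not leader of cluster", it is worth asking PD directly. A sketch using pd-ctl through tiup, with the PD addresses taken from the logs above:

```bash
# Sketch: query PD member/leader state and cluster health directly.
tiup ctl:v6.1.5 pd -u http://10.16.57.62:2379 member leader show
tiup ctl:v6.1.5 pd -u http://10.16.57.62:2379 health
```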


The TiDB logs:


```
[2023/08/24 23:16:29.989 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.106:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.106:2379 status:READY: error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.106:2379 status:READY"]
[2023/08/24 23:16:29.990 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.62:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.62:2379 status:READY: error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.62:2379 status:READY"]
[2023/08/24 23:16:29.990 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.82:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.82:2379 status:READY: error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.82:2379 status:READY"]
[2023/08/25 01:07:49.023 +08:00] [INFO] [printer.go:34] ["Welcome to TiDB."] ["Release Version"=v6.1.5] [Edition=Community] ["Git Commit Hash"=73d82e330b02a39f74073d98daefbadd7deab9b9] ["Git Branch"=heads/refs/tags/v6.1.5] ["UTC Build Time"="2023-02-22 06:21:24"] [GoVersion=go1.19.5] ["Race Enabled"=false] ["Check Table Before Drop"=false] ["TiKV Min Version"=v3.0.0-60965b006877ca7234adaced7890d7b029ed1306]
[2023/08/25 01:07:49.024 +08:00] [INFO] [printer.go:48] ["loaded config"] [config="{\"host\":\"0.0.0.0\",\"advertise-address\":\"10.16.57.82\",\"port\":4000,\"cors\":\"\",\"store\":\"tikv\",\"path\":\"10.16.57.82:2379,10.16.57.106:2379,10.16.57.62:2379\",\"socket\":\"/tmp/tidb-4000.sock\",\"lease\":\"45s\",\"run-ddl\":true,\"split-table\":true,\"token-limit\":1000,\"oom-use-tmp-storage\":true,\"tmp-storage-path\":\"/tmp/0_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage\",\"tmp-storage-quota\":-1,\"server-version\":\"\",\"version-comment\":\"\",\"tidb-edition\":\"\",\"tidb-release-version\":\"\",\"log\":{\"level\":\"info\",\"format\":\"text\",\"disable-timestamp\":null,\"enable-timestamp\":null,\"disable-error-stack\":null,\"enable-error-stack\":null,\"file\":{\"filename\":\"/data/tidb-deploy/tidb-4000/log/tidb.log\",\"max-size\":300,\"max-days\":0,\"max-backups\":0},\"enable-slow-log\":true,\"slow-query-file\":\"/data/tidb-deploy/tidb-4000/log/tidb_slow_query.log\",\"slow-threshold\":1000,\"expensive-threshold\":10000,\"record-plan-in-slow-log\":1,\"query-log-max-len\":4096},\"instance\":{\"tidb_general_log\":false,\"tidb_pprof_sql_cpu\":false,\"ddl_slow_threshold\":300,\"tidb_expensive_query_time_threshold\":60,\"tidb_enable_slow_log\":true,\"tidb_slow_log_threshold\":1000,\"tidb_record_plan_in_slow_log\":1,\"tidb_check_mb4_value_in_utf8\":true,\"tidb_force_priority\":\"NO_PRIORITY\",\"tidb_memory_usage_alarm_ratio\":0.8,\"tidb_enable_collect_execution_info\":true,\"plugin_dir\":\"/data/deploy/plugin\",\"plugin_load\":\"\"},\"security\":{\"skip-grant-table\":false,\"ssl-ca\":\"\",\"ssl-cert\":\"\",\"ssl-key\":\"\",\"cluster-ssl-ca\":\"\",\"cluster-ssl-cert\":\"\",\"cluster-ssl-key\":\"\",\"cluster-verify-cn\":null,\"spilled-file-encryption-method\":\"plaintext\",\"enable-sem\":false,\"auto-tls\":false,\"tls-version\":\"\",\"rsa-key-size\":4096,\"secure-bootstrap\":false},\"status\":{\"status-host\":\"0.0.0.0\",\"metrics-addr\":\"\",\"status-port\":10080,\"metrics-interval\":15,\"report-status\":true,\"record-db-qps\":false,\"grpc-keepalive-time\":10,\"grpc-keepalive-timeout\":3,\"grpc-concurrent-streams\":1024,\"grpc-initial-window-size\":2097152,\"grpc-max-send-msg-size\":2147483647},\"performance\":{\"max-procs\":0,\"max-memory\":0,\"server-memory-quota\":0,\"memory-usage-alarm-ratio\":0.8,\"stats-lease\":\"3s\",\"stmt-count-limit\":5000,\"feedback-probability\":0,\"query-feedback-limit\":512,\"pseudo-estimate-ratio\":0.8,\"force-priority\":\"NO_PRIORITY\",\"bind-info-lease\":\"3s\",\"txn-entry-size-limit\":6291456,\"txn-total-size-limit\":104857600,\"tcp-keep-alive\":true,\"tcp-no-delay\":true,\"cross-join\":true,\"distinct-agg-push-down\":false,\"projection-push-down\":false,\"max-txn-ttl\":3600000,\"index-usage-sync-lease\":\"0s\",\"plan-replayer-gc-lease\":\"10m\",\"gogc\":100,\"enforce-mpp\":false,\"stats-load-concurrency\":5,\"stats-load-queue-size\":1000,\"enable-stats-cache-mem-quota\":false,\"committer-concurrency\":128,\"run-auto-analyze\":true},\"prepared-plan-cache\":{\"enabled\":true,\"capacity\":100,\"memory-guard-ratio\":0.1},\"opentracing\":{\"enable\":false,\"rpc-metrics\":false,\"sampler\":{\"type\":\"const\",\"param\":1,\"sampling-server-url\":\"\",\"max-operations\":0,\"sampling-refresh-interval\":0},\"reporter\":{\"queue-size\":0,\"buffer-flush-interval\":0,\"log-spans\":false,\"local-agent-host-port\":\"\"}},\"proxy-protocol\":{\"networks\":\"\",\"header-timeout\":5},\"pd-client\":{\"pd-server-timeout\":3},\"tikv-client\":{\"grpc-connection-count\":4,\"grpc-keepalive-time\":10,\"grpc-keepalive-timeout\":3,\"grpc-compression-type\":\"none\",\"commit-timeout\":\"41s\",\"async-commit\":{\"keys-limit\":256,\"total-key-size-limit\":4096,\"safe-window\":2000000000,\"allowed-clock-drift\":500000000},\"max-batch-size\":128,\"overload-threshold\":200,\"max-batch-wait-time\":0,\"batch-wait-size\":8,\"enable-chunk-rpc\":true,\"region-cache-ttl\":600,\"store-limit\":0,\"store-liveness-timeout\":\"1s\",\"copr-cache\":{\"capacity-mb\":1000},\"ttl-refreshed-txn-size\":33554432,\"resolve-lock-lite-threshold\":16},\"binlog\":{\"enable\":false,\"ignore-error\":false,\"write-timeout\":\"15s\",\"binlog-socket\":\"\",\"strategy\":\"range\"},\"compatible-kill-query\":false,\"plugin\":{\"dir\":\"/data/deploy/plugin\",\"load\":\"\"},\"pessimistic-txn\":{\"max-retry-count\":256,\"deadlock-history-capacity\":10,\"deadlock-history-collect-retryable\":false,\"pessimistic-auto-commit\":false},\"check-mb4-value-in-utf8\":true,\"max-index-length\":3072,\"index-limit\":64,\"table-column-count-limit\":1017,\"graceful-wait-before-shutdown\":0,\"alter-primary-key\":false,\"treat-old-version-utf8-as-utf8mb4\":true,\"enable-table-lock\":false,\"delay-clean-table-lock\":0,\"split-region-max-num\":1000,\"top-sql\":{\"receiver-address\":\"\"},\"repair-mode\":false,\"repair-table-list\":[],\"isolation-read\":{\"engines\":[\"tikv\",\"tiflash\",\"tidb\"]},\"max-server-connections\":1000,\"new_collations_enabled_on_first_bootstrap\":true,\"experimental\":{\"allow-expression-index\":false},\"enable-collect-execution-info\":true,\"skip-register-to-dashboard\":false,\"enable-telemetry\":false,\"labels\":{},\"enable-global-index\":false,\"deprecate-integer-display-length\":false,\"enable-enum-length-limit\":true,\"stores-refresh-interval\":60,\"enable-tcp4-only\":false,\"enable-forwarding\":false,\"max-ballast-object-size\":0,\"ballast-object-size\":0,\"enable-global-kill\":true,\"enable-batch-dml\":false,\"mem-quota-query\":2147483648,\"oom-action\":\"cancel\"}"]
[2023/08/25 01:07:49.024 +08:00] [INFO] [main.go:343] ["disable Prometheus push client"]
```

The logs here show that the TiDB node loaded its configuration and started normally, but it never opened port 4000 to serve client traffic.

```
[2023/08/25 01:07:51.410 +08:00] [INFO] [session.go:3311] ["CRUCIAL OPERATION"] [conn=0] [schemaVersion=3242] [cur_db=] [sql="ALTER TABLE mysql.stats_meta ADD COLUMN snapshot BIGINT(64) UNSIGNED NOT NULL DEFAULT 0"] [user=]
[2023/08/25 01:07:51.442 +08:00] [INFO] [ddl_worker.go:339] ["[ddl] add DDL jobs"] ["batch count"=1] [jobs="ID:4591, Type:add column, State:queueing, SchemaState:none, SchemaID:3, TableID:21, RowCount:0, ArgLen:3, start time: 2023-08-25 01:07:51.419 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0; "]
[2023/08/25 01:07:51.444 +08:00] [INFO] [ddl.go:632] ["[ddl] start DDL job"] [job="ID:4591, Type:add column, State:queueing, SchemaState:none, SchemaID:3, TableID:21, RowCount:0, ArgLen:3, start time: 2023-08-25 01:07:51.419 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"] [query="ALTER TABLE mysql.stats_meta ADD COLUMN snapshot BIGINT(64) UNSIGNED NOT NULL DEFAULT 0"]
```

At this point we suspected that the upgrade had not executed properly.


Because this was a planned upgrade and we were not running any DDL on TiDB ourselves, we never suspected DDL as the crux of the problem. But the crux often hides exactly where you fail to look. We checked TiKV and it was healthy, which indirectly showed PD had to be healthy too; otherwise the multi-replica storage service could not have worked either.


Coming back to tidb itself, we suddenly noticed a physical add-index DDL that had been replicated into this TiDB cluster from MySQL.


The picture became clear. MySQL is sharded and keeps only the last six months of hot data, so the DDL completed quickly there. TiDB aggregates the full history into a single table, and the same physical add-index DDL was automatically replicated to TiDB, where it ran for a very long time.



With the overall picture in mind, we went back to the TiDB logs:


```
[2023/08/24 22:00:27.008 +08:00] [WARN] [session.go:1533] ["run statement failed"] [conn=5897899] [schemaVersion=3240] [error="[ddl:1061]index already exist idx_create; a background job is trying to add the same index, please check by `ADMIN SHOW DDL JOBS`"] [session="{\n  \"currDBName\": \"***_voucher\",\n  \"id\": 5897899,\n  \"status\": 2,\n  \"strictMode\": true,\n  \"user\": {\n    \"Username\": \"tidb_cold_w\",\n    \"Hostname\": \"10.16.56.178\",\n    \"CurrentUser\": false,\n    \"AuthUsername\": \"tidb_cold_w\",\n    \"AuthHostname\": \"10.%.%.%\"\n  }\n}"]
[2023/08/24 22:00:27.008 +08:00] [INFO] [conn.go:877] ["command dispatched failed"] [conn=5897899] [connInfo="id:5897899, addr:10.16.56.178:28898 status:10, collation:utf8_general_ci, user:tidb_cold_w"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="ALTER TABLE `voucher_detail` ADD INDEX idx_create(created, vender_id)"] [txn_mode=PESSIMISTIC] [err="[ddl:1061]index already exist idx_create; a background job is trying to add the same index, please check by `ADMIN SHOW DDL JOBS`\ngithub.com/pingcap/errors.AddStack\n\t/nfs/cache/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/

# cat tidb.log |grep "run reorg job wait timeout"
[2023/08/24 22:09:51.308 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7104213] [doneKey=6774522551193595979] []
[2023/08/24 22:09:56.356 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7183573] [doneKey=6774522551193595979] []
[2023/08/24 22:10:01.391 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7262165] [doneKey=6774522551193595979] []
[2023/08/24 22:10:06.481 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7336405] [doneKey=6774522551193595979] []
[2023/08/24 22:10:11.570 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7419143] [doneKey=6774522551193595979] []
[2023/08/24 22:10:16.640 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7464902] [doneKey=6774522551193595979] []
[2023/08/24 22:10:21.804 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7544549] [doneKey=6774692334375633788.next] []
[2023/08/24 22:11:43.519 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=8656519] [doneKey=6775048577556492528.next] []
[2023/08/24 22:11:48.593 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=8737159] [doneKey=6775048577556492528.next] []
[2023/08/24 22:11:53.648 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=8792281] [doneKey=6775048577556492528.next] []
[2023/08/24 22:11:58.740 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=8826499] [doneKey=6775048577556492528.next] []
[2023/08/25 04:37:44.221 +08:00] [INFO] [reorg.go:301] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=256292839] [doneKey=6858402545246070082.next] []
[2023/08/25 04:37:49.308 +08:00] [INFO] [reorg.go:301] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=256353399] [doneKey=6858402545246070082.next] []
[2023/08/25 04:37:54.366 +08:00] [INFO] [reorg.go:301] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=256386509] [doneKey=6858402545246070082.next] []
[2023/08/25 04:37:59.438 +08:00] [INFO] [reorg.go:301] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=256442291] [doneKey=6858463469667393591.next] []
```


Root cause analysis:

[2023/08/24 22:00:27.008 +08:00] [INFO] [conn.go:877] ["command dispatched failed"] [conn=5897899] [connInfo="id:5897899, addr:10.16.56.178:28898 status:10, collation:utf8_general_ci, user:tidb_cold_w"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="ALTER TABLE `voucher_detail` ADD INDEX idx_create(created, vender_id)"] [txn_mode=PESSIMISTIC] [err="[ddl:1061]index already exist idx_create; a background job is trying to add the same index, please check by `ADMIN SHOW DDL JOBS`"]


[2023/08/25 01:07:51.410 +08:00] [INFO] [session.go:3311] ["CRUCIAL OPERATION"] [conn=0] [schemaVersion=3242] [cur_db=] [sql="ALTER TABLE mysql.stats_meta ADD COLUMN snapshot BIGINT(64) UNSIGNED NOT NULL DEFAULT 0"] [user=]


[2023/08/25 01:07:51.444 +08:00] [INFO] [ddl.go:632] ["[ddl] start DDL job"] [job="ID:4591, Type:add column, State:queueing, SchemaState:none, SchemaID:3, TableID:21, RowCount:0, ArgLen:3, start time: 2023-08-25 01:07:51.419 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"] [query="ALTER TABLE mysql.stats_meta ADD COLUMN snapshot BIGINT(64) UNSIGNED NOT NULL DEFAULT 0"]


The upgrade internally issued "ALTER TABLE mysql.stats_meta ADD COLUMN snapshot BIGINT(64) UNSIGNED NOT NULL DEFAULT 0", which ended up queued behind the physical DDL "ALTER TABLE `voucher_detail` ADD INDEX idx_create(created, vender_id)" in the DDL queue. As a result the upgrade never truly finished, and the step that opens port 4000 was stuck behind it.

Improvement: when performing a major version upgrade, make sure no large DDL is running on the cluster; check in advance with ADMIN SHOW DDL JOBS.
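A sketch of that pre-flight check:

```sql
-- Sketch: run this right before `tiup cluster upgrade`. Any job still in the
-- "running" or "queueing" state should finish (or be cancelled) first.
ADMIN SHOW DDL JOBS;

-- If a huge reorg job (like the add-index above) must be aborted:
-- ADMIN CANCEL DDL JOBS <job_id>;
```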

The strangest part: after the physical DDL finished and the tidb compute nodes started serving again, we went into the database and found that the index had not actually been added.
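A quick way to verify what we saw (a sketch; run in the database that owns the table):

```sql
-- Did the index actually land?
SHOW INDEX FROM voucher_detail WHERE Key_name = 'idx_create';

-- Inspect the final state of the most recent DDL jobs.
ADMIN SHOW DDL JOBS 10;
```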

Benefits of the TiDB version upgrade

Here are some of the positive results, briefly:

We went from v5.1.x to v6.1.5. The Duration 99 (p99 latency) metric dropped by roughly half.

Before the upgrade:



After the upgrade:


Memory usage dropped by about half.

TSO wait time dropped noticeably from before to after the upgrade.


To sum up these problems: upgrading a TiDB cluster is a heavyweight operation. Test a lot, verify a lot.


Study the official upgrade documentation: https://docs.pingcap.com/zh/tidb/stable/smooth-upgrade-tidb


If your cluster runs TiFlash, also collect the relevant TiFlash operations docs: https://docs.pingcap.com/zh/tidb/stable/tiflash-upgrade-guide


Go into the upgrade knowing exactly what will happen. Stability, stability, stability!


That's all. Technical discussion is welcome.

