PD migration pitfall: all cdc tasks stopped

  • 2023-04-21
    Beijing

Author: weixiaobing. Original source: https://tidb.net/blog/489b279e


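Note: the test environment runs v4.0.15, which is a very old version as far as cdc is concerned and may have quite a few problems. In production, upgrade to a newer version where possible, such as v6.1.6 or v6.5.1; these versions bring large improvements in both performance and functionality. The problem described below did not occur when tested on v5.4.1, so a recent stable LTS version is recommended.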

Basic cdc commands

# Create a cdc task
tiup ctl:v4.0.15 cdc changefeed create --pd=http://10.2.103.115:32379 --sink-uri="tidb://root:tidb@10.2.103.116:34000/" --changefeed-id="simple-replication-task" --config=cdc.toml
# List the status of all cdc tasks
tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
# Query the status of a specific task
tiup ctl:v4.0.15 cdc changefeed query -c simple-replication-task --pd=http://10.2.103.115:32379
# Remove a task
tiup ctl:v4.0.15 cdc changefeed remove -c simple-replication-task --pd=http://10.2.103.115:32379

PD status

[tidb@vm115 ~]$ tiup cluster display tidb-dev
tiup is checking updates for component cluster ...
A new version of cluster is available:
   The latest version:         v1.12.0
   Local installed version:    v1.11.3
   Update current component:   tiup update cluster
   Update all components:      tiup update --all

Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster display tidb-dev
Cluster type:       tidb
Cluster name:       tidb-dev
Cluster version:    v4.0.15
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://10.2.103.115:32379/dashboard
Grafana URL:        http://10.2.103.115:7000
ID                  Role          Host          Ports        OS/Arch       Status   Data Dir                            Deploy Dir
--                  ----          ----          -----        -------       ------   --------                            ----------
10.2.103.115:9793   alertmanager  10.2.103.115  9793/9794    linux/x86_64  Up       /data1/tidb-data/alertmanager-9793  /data1/tidb-deploy/alertmanager-9793
10.2.103.115:9893   alertmanager  10.2.103.115  9893/9894    linux/x86_64  Up       /data1/tidb-data/alertmanager-9893  /data1/tidb-deploy/alertmanager-9893
10.2.103.115:8400   cdc           10.2.103.115  8400         linux/x86_64  Up       /data1/tidb-data/cdc-8400           /data1/tidb-deploy/cdc-8400
10.2.103.115:7000   grafana       10.2.103.115  7000         linux/x86_64  Up       -                                   /data1/tidb-deploy/grafana-7000
10.2.103.115:32379  pd            10.2.103.115  32379/3380   linux/x86_64  Up|L|UI  /data1/tidb-data/pd-32379           /data1/tidb-deploy/pd-32379
10.2.103.115:35379  pd            10.2.103.115  35379/3580   linux/x86_64  Up       /data1/tidb-data/pd-35379           /data1/tidb-deploy/pd-35379
10.2.103.115:36379  pd            10.2.103.115  36379/3680   linux/x86_64  Up       /data1/tidb-data/pd-36379           /data1/tidb-deploy/pd-36379
10.2.103.115:9590   prometheus    10.2.103.115  9590/35020   linux/x86_64  Up       /data1/tidb-data/prometheus-9590    /data1/tidb-deploy/prometheus-9590
10.2.103.115:43000  tidb          10.2.103.115  43000/20080  linux/x86_64  Up       -                                   /data1/tidb-deploy/tidb-34000
10.2.103.115:30160  tikv          10.2.103.115  30160/30180  linux/x86_64  Up       /data1/tidb-data/tikv-30160         /data1/tidb-deploy/tikv-30160
Total nodes: 10

cdc task status

[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
  {
    "id": "simple-replication-task2",
    "summary": {
      "state": "normal",
      "tso": 440757324149424131,
      "checkpoint": "2023-04-13 11:15:59.237",
      "error": null
    }
  }
]
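The "tso" and "checkpoint" fields in this output carry the same information. A TiDB TSO packs a physical timestamp in milliseconds into its high bits and an 18-bit logical counter into its low bits. The sketch below decodes the value shown above; the function names are illustrative only, and the +08:00 offset is an assumption taken from the timestamps in the cdc log.

```python
from datetime import datetime, timedelta, timezone

# A TiDB TSO reserves its low 18 bits for the logical counter.
LOGICAL_BITS = 18


def decode_tso(tso):
    """Split a TSO into (physical ms since the Unix epoch, logical counter)."""
    return tso >> LOGICAL_BITS, tso & ((1 << LOGICAL_BITS) - 1)


def tso_to_checkpoint(tso, utc_offset_hours=8):
    """Format a TSO the way the `checkpoint` field is printed.

    utc_offset_hours=8 is an assumption based on the +08:00 log timestamps.
    """
    physical_ms, _ = decode_tso(tso)
    tz = timezone(timedelta(hours=utc_offset_hours))
    # Integer arithmetic avoids float rounding on the millisecond part.
    dt = datetime.fromtimestamp(physical_ms // 1000, tz=tz)
    dt += timedelta(milliseconds=physical_ms % 1000)
    return dt.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]


if __name__ == "__main__":
    print(decode_tso(440757324149424131))         # (1681355759237, 3)
    print(tso_to_checkpoint(440757324149424131))  # 2023-04-13 11:15:59.237
```

Comparing two TSOs this way gives the replication progress between two checks in milliseconds, which is handy when the checkpoint strings come from machines in different time zones.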

Switch the PD leader

[tidb@vm115 ~]$ tiup ctl:v4.0.15 pd -u http://10.2.103.115:32379 member leader transfer pd-10.2.103.115-35379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl pd -u http://10.2.103.115:32379 member leader transfer pd-10.2.103.115-35379
Success!
[tidb@vm115 ~]$ tiup cluster display tidb-dev
tiup is checking updates for component cluster ...
A new version of cluster is available:
   The latest version:         v1.12.0
   Local installed version:    v1.11.3
   Update current component:   tiup update cluster
   Update all components:      tiup update --all

Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster display tidb-dev
Cluster type:       tidb
Cluster name:       tidb-dev
Cluster version:    v4.0.15
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://10.2.103.115:32379/dashboard
Grafana URL:        http://10.2.103.115:7000
ID                  Role          Host          Ports        OS/Arch       Status   Data Dir                            Deploy Dir
--                  ----          ----          -----        -------       ------   --------                            ----------
10.2.103.115:9793   alertmanager  10.2.103.115  9793/9794    linux/x86_64  Up       /data1/tidb-data/alertmanager-9793  /data1/tidb-deploy/alertmanager-9793
10.2.103.115:9893   alertmanager  10.2.103.115  9893/9894    linux/x86_64  Up       /data1/tidb-data/alertmanager-9893  /data1/tidb-deploy/alertmanager-9893
10.2.103.115:8400   cdc           10.2.103.115  8400         linux/x86_64  Up       /data1/tidb-data/cdc-8400           /data1/tidb-deploy/cdc-8400
10.2.103.115:7000   grafana       10.2.103.115  7000         linux/x86_64  Up       -                                   /data1/tidb-deploy/grafana-7000
10.2.103.115:32379  pd            10.2.103.115  32379/3380   linux/x86_64  Up|UI    /data1/tidb-data/pd-32379           /data1/tidb-deploy/pd-32379
10.2.103.115:35379  pd            10.2.103.115  35379/3580   linux/x86_64  Up|L     /data1/tidb-data/pd-35379           /data1/tidb-deploy/pd-35379
10.2.103.115:36379  pd            10.2.103.115  36379/3680   linux/x86_64  Up       /data1/tidb-data/pd-36379           /data1/tidb-deploy/pd-36379
10.2.103.115:9590   prometheus    10.2.103.115  9590/35020   linux/x86_64  Up       /data1/tidb-data/prometheus-9590    /data1/tidb-deploy/prometheus-9590
10.2.103.115:43000  tidb          10.2.103.115  43000/20080  linux/x86_64  Up       -                                   /data1/tidb-deploy/tidb-34000
10.2.103.115:30160  tikv          10.2.103.115  30160/30180  linux/x86_64  Up       /data1/tidb-data/tikv-30160         /data1/tidb-deploy/tikv-30160
Total nodes: 10


Switching the PD leader has no effect on cdc


[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
  {
    "id": "simple-replication-task2",
    "summary": {
      "state": "normal",
      "tso": 440757380830461955,
      "checkpoint": "2023-04-13 11:19:35.458",
      "error": null
    }
  }
]
[tidb@vm115 ~]$

Scale in a PD node

After this scale-in, the cdc task reports an error.


[tidb@vm115 ~]$ tiup cluster scale-in tidb-dev -N 10.2.103.115:35379
tiup is checking updates for component cluster ...
A new version of cluster is available:
   The latest version:         v1.12.1
   Local installed version:    v1.11.3
   Update current component:   tiup update cluster
   Update all components:      tiup update --all

Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster scale-in tidb-dev -N 10.2.103.115:35379
This operation will delete the 10.2.103.115:35379 nodes in `tidb-dev` and all their data.
Do you want to continue? [y/N]:(default=N) y
Scale-in nodes...
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-dev/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-dev/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [ Serial ] - ClusterOperate: operation=DestroyOperation, options={Roles:[] Nodes:[10.2.103.115:35379] Force:false SSHTimeout:5 OptTimeout:120 APITimeout:600 IgnoreConfigCheck:false NativeSSH:false SSHType: Concurrency:5 SSHProxyHost: SSHProxyPort:22 SSHProxyUser:tidb SSHProxyIdentity:/home/tidb/.ssh/id_rsa SSHProxyUsePassword:false SSHProxyTimeout:5 CleanupData:false CleanupLog:false CleanupAuditLog:false RetainDataRoles:[] RetainDataNodes:[] DisplayMode:default Operation:StartOperation}
Stopping component pd
  Stopping instance 10.2.103.115
  Stop pd 10.2.103.115:35379 success
Destroying component pd
  Destroying instance 10.2.103.115
Destroy 10.2.103.115 success
- Destroy pd paths: [/data1/tidb-data/pd-35379 /data1/tidb-deploy/pd-35379/log /data1/tidb-deploy/pd-35379 /etc/systemd/system/pd-35379.service]
+ [ Serial ] - UpdateMeta: cluster=tidb-dev, deleted=`'10.2.103.115:35379'`
+ [ Serial ] - UpdateTopology: cluster=tidb-dev
+ Refresh instance configs
  - Generate config pd -> 10.2.103.115:32379 ... Done
  - Generate config pd -> 10.2.103.115:36379 ... Done
  - Generate config tikv -> 10.2.103.115:30160 ... Done
  - Generate config tidb -> 10.2.103.115:43000 ... Done
  - Generate config cdc -> 10.2.103.115:8400 ... Done
  - Generate config prometheus -> 10.2.103.115:9590 ... Done
  - Generate config grafana -> 10.2.103.115:7000 ... Done
  - Generate config alertmanager -> 10.2.103.115:9793 ... Done
  - Generate config alertmanager -> 10.2.103.115:9893 ... Done
+ Reload prometheus and grafana
  - Reload prometheus -> 10.2.103.115:9590 ... Done
  - Reload grafana -> 10.2.103.115:7000 ... Done
Scaled cluster `tidb-dev` in successfully

cdc task error

[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[2023/04/14 09:18:52.714 +08:00] [WARN] [client_changefeed.go:170] ["query changefeed info failed"] [error="owner not found"]
[
  {
    "id": "simple-replication-task2",
    "summary": null
  }
]
[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[2023/04/14 09:19:00.720 +08:00] [WARN] [client_changefeed.go:170] ["query changefeed info failed"] [error="owner not found"]
[
  {
    "id": "simple-replication-task2",
    "summary": null
  }
]
[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
  {
    "id": "simple-replication-task2",
    "summary": {
      "state": "stopped",
      "tso": 440778127778512897,
      "checkpoint": "2023-04-14 09:18:38.784",
      "error": {
        "addr": "10.2.103.115:8400",
        "code": "CDC:ErrProcessorUnknown",
        "message": "failed to update info: [CDC:ErrReachMaxTry]reach maximum try: 3"
      }
    }
  }
]
[tidb@vm115 ~]$
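Spotting this state by eye is easy with one task but tedious with many. The following is a minimal sketch, not part of the original test, of how the captured output of `cdc changefeed list` could be scanned for tasks that are not "normal". It assumes the output has the shape shown above: a banner line, optional bracketed log lines, then a JSON array whose opening "[" sits on its own line. The function names are hypothetical.

```python
import json


def parse_changefeed_list(cli_output):
    """Extract the JSON array from captured `cdc changefeed list` output.

    Assumption: the array opens with "[" alone on a line, whereas log lines
    such as "[2023/04/14 ...] [WARN] ..." carry more text after the bracket.
    """
    lines = cli_output.splitlines()
    for i, line in enumerate(lines):
        if line.strip() in ("[", "[]"):
            text = "\n".join(lines[i:]).lstrip()
            # raw_decode tolerates trailing text such as a shell prompt.
            changefeeds, _ = json.JSONDecoder().raw_decode(text)
            return changefeeds
    raise ValueError("no JSON array found in output")


def unhealthy_changefeeds(changefeeds):
    """Return (id, state) for every changefeed whose state is not 'normal'.

    A null summary (seen above while no owner was found) yields state None.
    """
    bad = []
    for cf in changefeeds:
        summary = cf.get("summary")
        state = summary.get("state") if summary else None
        if state != "normal":
            bad.append((cf["id"], state))
    return bad
```

Feeding it the three outputs above would report the task with state None twice and then with state "stopped".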

cdc error log

[2023/04/14 09:18:40.348 +08:00] [ERROR] [processor.go:497] ["failed to flush task position"] [changefeed=simple-replication-task2] [error="[CDC:ErrPDEtcdAPIError]rpc error: code = Unknown desc = raft: stopped"] [errorVerbose="[CDC:ErrPDEtcdAPIError]rpc error: code = Unknown desc = raft: stopped\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByCause\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/normalize.go:279\ngithub.com/pingcap/ticdc/pkg/errors.WrapError\n\tgithub.com/pingcap/ticdc@/pkg/errors/helper.go:30\ngithub.com/pingcap/ticdc/cdc/kv.CDCEtcdClient.PutTaskPositionOnChange\n\tgithub.com/pingcap/ticdc@/cdc/kv/etcd.go:739\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).flushTaskPosition\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:494\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).flushTaskStatusAndPosition\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:560\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).positionWorker.func1.1\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:318\ngithub.com/pingcap/ticdc/pkg/retry.run\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry_with_opt.go:54\ngithub.com/pingcap/ticdc/pkg/retry.Do\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry_with_opt.go:32\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).positionWorker.func1\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:317\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).positionWorker.func2\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:349\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).positionWorker\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:418\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).Run.func1\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:251\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357"]
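The useful part of such a long log line is the bracketed error code, here `CDC:ErrPDEtcdAPIError` with the cause "raft: stopped", meaning the etcd member inside the removed PD node had already shut down. As a rough sketch, the codes could be pulled out of cdc log lines with a regular expression; the pattern below is an assumption derived only from the two codes visible in this article, not from a specification of the log format.

```python
import re

# Assumption: error codes appear as "[CDC:Err<Name>]" inside the message text.
CDC_ERROR_CODE = re.compile(r"\[(CDC:Err\w+)\]")


def cdc_error_codes(log_line):
    """Return the distinct CDC error codes in a log line, in order of appearance."""
    codes = []
    for code in CDC_ERROR_CODE.findall(log_line):
        if code not in codes:
            codes.append(code)
    return codes
```

Applied to the log line above it returns a single code even though the code is repeated in the errorVerbose field.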

Solution

Resume the cdc task


[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed resume -c simple-replication-task2 --pd=http://10.2.103.115:32379
[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
  {
    "id": "simple-replication-task2",
    "summary": {
      "state": "normal",
      "tso": 440778162669879297,
      "checkpoint": "2023-04-14 09:20:51.884",
      "error": null
    }
  }
]
[tidb@vm115 ~]$
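When every task on a cluster has stopped, as the title describes, resuming them one at a time by hand is error-prone. Below is a small sketch that only builds the resume commands for a list of stopped task IDs so that they can be reviewed before anything runs. It uses the same `changefeed resume -c ... --pd=...` form as the command above; the helper names and the default version are illustrative assumptions.

```python
import shlex


def resume_command(changefeed_id, pd_url, ctl_version="v4.0.15"):
    """Build the argument list for resuming one changefeed via tiup ctl."""
    return [
        "tiup", f"ctl:{ctl_version}", "cdc", "changefeed", "resume",
        "-c", changefeed_id, f"--pd={pd_url}",
    ]


def resume_plan(stopped_ids, pd_url, ctl_version="v4.0.15"):
    """Return one shell-quoted command line per stopped changefeed (dry run only)."""
    return [
        shlex.join(resume_command(cf_id, pd_url, ctl_version))
        for cf_id in stopped_ids
    ]


if __name__ == "__main__":
    for cmd in resume_plan(["simple-replication-task2"], "http://10.2.103.115:32379"):
        print(cmd)
```

Once the printed plan looks right, each argument list from `resume_command` could be passed to `subprocess.run`. Point `--pd` at a PD endpoint that is still part of the cluster, not the one that was scaled in.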

Test after upgrading to v5.4.1

Scale in PD

With the same operation, the cdc task does not report an error.


[tidb@vm115 ~]$ tiup cluster scale-in tidb-dev -N 10.2.103.115:35379
tiup is checking updates for component cluster ...
A new version of cluster is available:
   The latest version:         v1.12.1
   Local installed version:    v1.11.3
   Update current component:   tiup update cluster
   Update all components:      tiup update --all

Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster scale-in tidb-dev -N 10.2.103.115:35379
This operation will delete the 10.2.103.115:35379 nodes in `tidb-dev` and all their data.
Do you want to continue? [y/N]:(default=N) y
Scale-in nodes...
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-dev/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-dev/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [ Serial ] - ClusterOperate: operation=DestroyOperation, options={Roles:[] Nodes:[10.2.103.115:35379] Force:false SSHTimeout:5 OptTimeout:120 APITimeout:600 IgnoreConfigCheck:false NativeSSH:false SSHType: Concurrency:5 SSHProxyHost: SSHProxyPort:22 SSHProxyUser:tidb SSHProxyIdentity:/home/tidb/.ssh/id_rsa SSHProxyUsePassword:false SSHProxyTimeout:5 CleanupData:false CleanupLog:false CleanupAuditLog:false RetainDataRoles:[] RetainDataNodes:[] DisplayMode:default Operation:StartOperation}
Stopping component pd
  Stopping instance 10.2.103.115
  Stop pd 10.2.103.115:35379 success
Destroying component pd
  Destroying instance 10.2.103.115
Destroy 10.2.103.115 success
- Destroy pd paths: [/data1/tidb-data/pd-35379 /data1/tidb-deploy/pd-35379/log /data1/tidb-deploy/pd-35379 /etc/systemd/system/pd-35379.service]
+ [ Serial ] - UpdateMeta: cluster=tidb-dev, deleted=`'10.2.103.115:35379'`
+ [ Serial ] - UpdateTopology: cluster=tidb-dev
+ Refresh instance configs
  - Generate config pd -> 10.2.103.115:32379 ... Done
  - Generate config pd -> 10.2.103.115:36379 ... Done
  - Generate config tikv -> 10.2.103.115:30160 ... Done
  - Generate config tidb -> 10.2.103.115:43000 ... Done
  - Generate config cdc -> 10.2.103.115:8400 ... Done
  - Generate config prometheus -> 10.2.103.115:9590 ... Done
  - Generate config grafana -> 10.2.103.115:7000 ... Done
  - Generate config alertmanager -> 10.2.103.115:9793 ... Done
  - Generate config alertmanager -> 10.2.103.115:9893 ... Done
+ Reload prometheus and grafana
  - Reload prometheus -> 10.2.103.115:9590 ... Done
  - Reload grafana -> 10.2.103.115:7000 ... Done
Scaled cluster `tidb-dev` in successfully
[tidb@vm115 ~]$

cdc status is normal

[tidb@vm115 ~]$ tiup ctl:v5.4.1 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v5.4.1/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
  {
    "id": "simple-replication-task2",
    "summary": {
      "state": "normal",
      "tso": 440778284890324994,
      "checkpoint": "2023-04-14 09:28:38.118",
      "error": null
    }
  }
]
[tidb@vm115 ~]$

Summary:

1. Any change to a production environment should, where conditions permit, first be simulated and tested in a test environment.


2. Upgrade production clusters to mainstream, stable versions where possible; versions that are too old may contain bugs.


3. In the latest LTS versions, cdc has made a qualitative leap in both functionality and performance, so a recent LTS version is recommended.
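For point 1, the check performed by hand in this article (list the changefeeds before the change, list them again afterwards, compare) can be written down as a repeatable test. The sketch below is one possible shape for it, assuming both snapshots are already parsed into the list-of-dicts form that `cdc changefeed list` prints; the function names are hypothetical.

```python
def _state_and_tso(changefeeds):
    """Map changefeed id -> (state, tso); a null summary becomes (None, None)."""
    result = {}
    for cf in changefeeds:
        summary = cf.get("summary") or {}
        result[cf["id"]] = (summary.get("state"), summary.get("tso"))
    return result


def stalled_changefeeds(before, after):
    """Return IDs that existed before the change and are now unhealthy.

    A changefeed counts as unhealthy if it is missing, not in state
    'normal', or its checkpoint TSO has not moved forward.
    """
    old, new = _state_and_tso(before), _state_and_tso(after)
    stalled = []
    for cf_id, (_, old_tso) in old.items():
        state, new_tso = new.get(cf_id, (None, None))
        if state != "normal" or old_tso is None or new_tso is None or new_tso <= old_tso:
            stalled.append(cf_id)
    return stalled
```

With the values from this article, the snapshots taken before and after the PD leader switch yield an empty list, while the snapshot taken after the v4.0.15 scale-in flags simple-replication-task2.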


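TiDB 社区干货传送门 is a column organized on their own initiative by the evangelist committee of the TiDB community to publish the community's best content to a wider audience, with the aim of deepening exchange and learning among TiDBers. Let's build a caring, mutually supportive, co-created TiDB community together: https://tidb.net/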