Restoring Data Between TiDB Clusters on Different K8s Clusters with BR

  • July 11, 2022

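Author: pepezzzz. Original source: https://tidb.net/blog/6b18353b

Background

The BR tool backs up KV data and restores it quickly. When the test TiDB cluster is less powerful than production, BR makes migrating production data into the test environment more efficient. It can operate on a specified list of tables and supports rate limiting for both backup and restore. A single run is more than 3 times as efficient as a logical migration with dumpling / lightning.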

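1. Preparation

First configure an NFS directory for TiKV on both the source cluster and the target (restore) cluster

a. Taking the target cluster as an example, edit the tc (TidbCluster) and add an NFS-backed additionalVolumes definition for tikv.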


[root@x.x.x.163 ~]# kubectl edit tc uat-tidb -n tidb
spec:
  tikv:
    additionalVolumes:
    - name: baknfs
      nfs:
        server: x.x.x.180
        path: /data
    additionalVolumeMounts:
    - name: baknfs
      mountPath: /bk_data
[root@x.x.x.163 ~]# kubectl get pods -n tidb
NAME                                 READY   STATUS        RESTARTS   AGE
uat-tidb-discovery-8574b88df-zpfbg   1/1     Running       0          94d
uat-tidb-monitor-6bf8675f97-f2wpb    3/3     Running       0          95d
uat-tidb-pd-0                        1/1     Running       0          49d
uat-tidb-pd-1                        1/1     Running       0          94d
uat-tidb-pd-2                        1/1     Running       7          94d
uat-tidb-tidb-0                      2/2     Running       0          43m
uat-tidb-tikv-0                      1/1     Running       1          29d
uat-tidb-tikv-1                      1/1     Running       2          29d
uat-tidb-tikv-2                      1/1     Running       1          29d
uat-tidb-tikv-3                      1/1     Terminating   0          29d


b. After the rolling restart, the NFS mount is visible both inside the pod and on the worker node that hosts it.


[root@x.x.x.167 ~]# df -h |grep data
x.x.x.180:/data                                      992G  166G  826G  17% /var/lib/kubelet/pods/7ed60c2f-bd80-4017-9202-21f44b7e4eb3/volumes/kubernetes.io~nfs/baknfs
[root@x.x.x.167 ~]# cd /var/lib/kubelet/pods/7ed60c2f-bd80-4017-9202-21f44b7e4eb3/volumes/kubernetes.io~nfs/baknfs/bk_test
[root@x.x.x.167 /var/lib/kubelet/pods/7ed60c2f-bd80-4017-9202-21f44b7e4eb3/volumes/kubernetes.io~nfs/baknfs/bk_test]# ls
4_1542800_2634_8f570f541985f65eee0363cf943cd442addfecd0c33a578e32e007ed08f8a41f_1636698811852_write.sst
4_1542800_2634_aaf6e66052cb6b18a8d3629abaa4e1bb5009ea7c66b3af73030e88712c8aaaf8_1636698811858_write.sst
5_1542424_2562_0a18eb8d9a177225cadf7bf7a0ab11a8b58e5864ab1e066445f5f44e09f2de66_1636698811841_write.sst
5_1542424_2562_47baf93dda75bc96b63d86972390c0f34fcd930d720777ee8dfd7ba227095cb6_1636698811834_write.sst
5_1542424_2562_8797f837d2ae4942bc47238ec81ddf5f747c99446f814b1ac9fc17f6a6bd526d_1636698811826_write.sst
backup.lock
backupmeta
[root@x.x.x.163 ~]# kubectl exec -it uat-tidb-tikv-3 -n tidb /bin/sh
/ # df -h
Filesystem Size Used Available Use% Mounted on
x.x.x.180:/data 991.1G 333.5G 657.5G 34% /bk_data
/dev/sdb1 590.5G 185.2G 375.2G 33% /var/lib/tikv
...(omitted)...
/ # ls /bk_data
backup-nfs backup-nfs-old bak20211112 bk_test
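The mount check above can also be scripted. The following is only a sketch: the helper name `nfs_mounted_at` is hypothetical, and it assumes the `df` output layout shown in this article (source in the first column, mount point in the last).

```shell
# Hypothetical helper: read `df` output on stdin and succeed only if
# the export given as $1 is mounted at the mount point given as $2.
nfs_mounted_at() {
  awk -v src="$1" -v mnt="$2" '
    $1 == src && $NF == mnt { found = 1 }
    END { exit !found }
  '
}

# Example with the in-pod df output shown in this article:
if nfs_mounted_at "x.x.x.180:/data" "/bk_data" <<'EOF'
Filesystem Size Used Available Use% Mounted on
x.x.x.180:/data 991.1G 333.5G 657.5G 34% /bk_data
/dev/sdb1 590.5G 185.2G 375.2G 33% /var/lib/tikv
EOF
then
  echo "NFS backup volume is mounted"
else
  echo "NFS backup volume is missing"
fi
```

Against a live pod, the input would come from something like `kubectl exec uat-tidb-tikv-3 -n tidb -- df -h`, repeated for each TiKV pod.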

Upload the required images to Harbor in the test environment

[root@x.x.x.163 ~]# docker login x.x.x.162
Authenticating with existing credentials...
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
[root@x.x.x.163 ~]# docker tag pingcap/br:v4.0.12 x.x.x.162/pingcap/br:v4.0.12
[root@x.x.x.163 ~]# docker push x.x.x.162/pingcap/br:v4.0.12
[root@x.x.x.163 ~]# docker tag pingcap/tidb-backup-manager:v1.1.9 x.x.x.162/pingcap/tidb-backup-manager:v1.1.9
[root@x.x.x.163 ~]# docker push x.x.x.162/pingcap/tidb-backup-manager:v1.1.9
The push refers to repository [x.x.x.162/pingcap/tidb-backup-manager]

On the target cluster, drop the tables that are going to be restored

mysql> use account_db;
mysql> drop table table_account_collect_entry;
mysql> drop table table_account_counts_cst;

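2. Configure the backup

In the production environment, define an ad-hoc backup filtered by table name. The concurrency and rateLimit parameters can be set as needed to reduce the impact on the business workload.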


# vi br.yaml


apiVersion: pingcap.com/v1alpha1
kind: Backup
metadata:
  name: backup-111202
  namespace: tidb-cluster
spec:
  br:
    cluster: tidb-cluster
    clusterNamespace: tidb-cluster
    concurrency: 1
    rateLimit: 10
    checksum: true
  toolImage: 72.0.253.205/tidb/pingcap/br:v4.0.12
  cleanPolicy: Delete
  tableFilter:
  - "account_db.table_account_collect_entry"
  - "account_db.table_account_counts_cst"
  local:
    prefix: backup-nfs/20211112
    volume:
      name: nfs
      nfs:
        server: x.x.x.92
        path: /data/tidb_backup
    volumeMount:
      name: nfs
      mountPath: /tidb_backup


Run the backup on the production source cluster and watch the log of the BR job pod.


# kubectl apply -f br.yaml

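3. Configure the restore

Copy the backup folder /data/tidb_backup/backup-nfs/20211112 from the production backup storage x.x.x.92 into the /data directory on x.x.x.180. Because of the naming requirement on prefix (see section 4), rename it to bak20211112.


Define an ad-hoc restore that points at the correct directory and apply it on the target cluster. This example does not set the concurrency and rateLimit parameters.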


[root@x.x.x.163 ~]# vi restore.yaml
apiVersion: pingcap.com/v1alpha1
kind: Restore
metadata:
  name: restore-1113
  namespace: tidb
spec:
  br:
    cluster: uat-tidb
    clusterNamespace: tidb
    toolImage: x.x.x.162/pingcap/br:v4.0.12
  local:
    prefix: bak20211112
    volume:
      name: baknfs
      nfs:
        server: x.x.x.180
        path: /data
    volumeMount:
      name: baknfs
      mountPath: /bk_data


[root@x.x.x.163 ~]# kubectl apply -f restore.yaml
[root@x.x.x.163 ~]# kubectl get rt -n tidb
NAME           STATUS     STARTED                COMPLETED              COMMITTS             AGE
restore-1112   Complete   2021-11-12T11:37:27Z   2021-11-12T11:37:31Z   429050773156921347   15h
restore-1113   Running    <no value>             <no value>                                  23m
[root@x.x.x.163 ~]# kubectl get pods -n tidb
NAME                                 READY   STATUS      RESTARTS   AGE
restore-restore-1113-wqwsh           1/1     Running     0          33m
...(omitted)...


[root@x.x.x.163 ~]# kubectl logs restore-restore-1113-wqwsh  -n tidb |more
Create rclone.conf file.
/tidb-backup-manager restore --namespace=tidb --restoreName=restore-1113 --tikvVersion=v4.0.12
I1113 10:15:13.411502       1 restore.go:73] start to process restore tidb/restore-1113
I1113 10:15:13.447383       1 restore_status_updater.go:64] Restore: [tidb/restore-1113] updated successfully
I1113 10:15:14.299951       1 restore.go:66] Running br command with args: [restore full --pd=uat-tidb-pd.tidb:2379 --storage=local:///bk_data/bak20211112]
I1113 10:15:14.440232       1 restore.go:89] [2021/11/13 10:15:14.439 +08:00] [INFO] [info.go:40] ["Welcome to Backup & Restore (BR)"] [release-version=v4.0.12] [git-hash=14d55a7a3696a37e6e7f199f75c5dc405383c547] [git-branch=release-4.0] [go-version=go1.13] [utc-build-time="2021-04-02 10:41:29"] [race-enabled=false]
I1113 10:15:14.440356       1 restore.go:89] [2021/11/13 10:15:14.440 +08:00] [INFO] [common.go:458] [arguments] [__command="br restore full"] [pd="[uat-tidb-pd.tidb:2379]"] [storage=local:///bk_data/bak20211112]
...(omitted)...


Restore progress can be followed in the BR job pod's log, and the start and finish of the task can be seen in the tidb-controller-manager pod's log.


[root@x.x.x.163 ~]# kubectl logs restore-restore-1113-wqwsh  -n tidb -f |grep progress.go
I1113 16:47:16.201855       1 restore.go:89] [2021/11/13 16:47:16.201 +08:00] [INFO] [progress.go:134] [progress] [step="Full restore"] [progress=78.54%] [count="45467 / 57887"] [speed="1 p/s"] [elapsed=6h32m0s] [remaining=3h16m8s]
[root@x.x.x.163 ~]# kubectl logs tidb-controller-manager-5457786d5c-c9gjq -n tidb-admin
I1113 10:14:56.709066 1 tidbcluster_control.go:66] TidbCluster: [tidb/uat-tidb] updated successfully
I1113 10:14:58.483964 1 tidbcluster_control.go:66] TidbCluster: [tidb/uat-tidb] updated successfully
I1113 10:15:09.008373 1 event.go:255] Event(v1.ObjectReference{Kind:"Restore", Namespace:"tidb", Name:"restore-1113", UID:"5deccc07-c66d-4efd-94c4-aa94f7757450", APIVersion:"pingcap.com/v1alpha1", ResourceVersion:"297617481", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' create job tidb/restore-restore-1113 for cluster restore-1113 restore successful
I1113 10:15:09.018953 1 restore_status_updater.go:64] Restore: [tidb/restore-1113] updated successfully
I1113 10:15:26.987474 1 tidbcluster_control.go:66] TidbCluster: [tidb/uat-tidb] updated successfully
I1113 10:15:28.697291 1 tidbcluster_control.go:66] TidbCluster: [tidb/uat-tidb] updated successfully
...(omitted)...
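The progress line in the BR job log carries the percentage and the estimated remaining time as bracketed key=value fields. As a rough sketch, assuming the log format shown in this article stays the same, a small helper (the name `br_progress` is hypothetical) could extract those two fields for periodic monitoring:

```shell
# Hypothetical helper: read BR job log lines on stdin and print
# "<progress> <remaining>" for every progress line found.
# Assumes the [progress=NN.NN%] and [remaining=...] fields seen in this article.
br_progress() {
  sed -n 's/.*\[progress=\([0-9.]*%\)\].*\[remaining=\([^]]*\)\].*/\1 \2/p'
}

# Example with the log line from this article:
sample='[2021/11/13 16:47:16.201 +08:00] [INFO] [progress.go:134] [progress] [step="Full restore"] [progress=78.54%] [count="45467 / 57887"] [speed="1 p/s"] [elapsed=6h32m0s] [remaining=3h16m8s]'
printf '%s\n' "$sample" | br_progress
# prints: 78.54% 3h16m8s
```

In practice the input would be the output of `kubectl logs` for the BR job pod. Lines without a progress field produce no output.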

4. Troubleshooting

If the rt (Restore) resource never gets a STATUS value and no corresponding BR job pod is started, check the tidb-controller-manager log.


[root@x.x.x.163 ~]# kubectl get rt -n tidb
NAME           STATUS     STARTED                COMPLETED              COMMITTS             AGE
restore-1112   Complete   2021-11-12T11:37:27Z   2021-11-12T11:37:31Z   429050773156921347   14h
restore-1113                                                                                 8m48s
[root@x.x.x.163 ~]# kubectl get pods -n tidb-admin
NAME                                       READY   STATUS    RESTARTS   AGE
tidb-controller-manager-5457786d5c-c9gjq   1/1     Running   80         94d
tidb-scheduler-6c9c6bd7f7-7bh2v            2/2     Running   0          24h
...(omitted)...
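A Restore object stuck in this state can be spotted mechanically, because its row has a name and an age but nothing in between. A minimal sketch, assuming the column layout printed by `kubectl get rt` in this article (the helper name `restores_without_status` is made up for illustration):

```shell
# Hypothetical helper: read `kubectl get rt` output on stdin and print the
# names of Restore objects whose STATUS column is empty.
# Assumes such a row collapses to two whitespace-separated fields: NAME and AGE.
restores_without_status() {
  awk 'NR > 1 && NF == 2 { print $1 }'
}

# Example using the listing from this article:
restores_without_status <<'EOF'
NAME           STATUS     STARTED                COMPLETED              COMMITTS             AGE
restore-1112   Complete   2021-11-12T11:37:27Z   2021-11-12T11:37:31Z   429050773156921347   14h
restore-1113                                                                                 8m48s
EOF
# prints: restore-1113
```

Any name it prints is a candidate for the controller-manager log check described in this section.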


[root@x.x.x.163 ~]# kubectl logs tidb-controller-manager-5457786d5c-c9gjq -n tidb-admin
E1113 10:02:30.890073       1 reflector.go:123] k8s.io/client-go@v0.0.0/tools/cache/reflector.go:96: Failed to list *v1alpha1.Restore: v1alpha1.RestoreList.Items: []v1alpha1.Restore: v1alpha1.Restore.Spec: v1alpha1.RestoreSpec.StorageProvider: Local: v1alpha1.LocalStorageProvider.Prefix: ReadString: expects " or n, but found 2, error found in #10 byte of ...|"prefix":20211112,"v|..., bigger context ...|160.3.162/pingcap/br:v4.0.12"},"local":{"prefix":20211112,"volume":{"name":"baknfs","nfs":{"path":"/|...
E1113 10:02:31.893830       1 reflector.go:123] k8s.io/client-go@v0.0.0/tools/cache/reflector.go:96: Failed to list *v1alpha1.Restore: v1alpha1.RestoreList.Items: []v1alpha1.Restore: v1alpha1.Restore.Spec: v1alpha1.RestoreSpec.StorageProvider: Local: v1alpha1.LocalStorageProvider.Prefix: ReadString: expects " or n, but found 2, error found in #10 byte of ...|"prefix":20211112,"v|..., bigger context ...|160.3.162/pingcap/br:v4.0.12"},"local":{"prefix":20211112,"volume":{"name":"baknfs","nfs":{"path":"/|...


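This raises the question of whether .spec.local.prefix of the Backup/Restore object may not start with a digit, that is, whether the first-level subdirectory may not have a name that begins with a digit. The log points to a narrower cause: the prefix reached the controller as the bare JSON number 20211112 ("prefix":20211112) where a string was expected, which is how an unquoted, all-digit YAML value is read. A name that contains a letter, such as bak20211112, is read as a string and avoids the error.

Delete the failed Restore (rt) resource and rename the directory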

[root@x.x.x.163 ~]# kubectl delete rt restore-1113 -n tidb
[root@x.x.x.167 /var/lib/kubelet/pods/7ed60c2f-bd80-4017-9202-21f44b7e4eb3/volumes/kubernetes.io~nfs/baknfs]# mv 20211112 bak20211112
[root@x.x.x.167 /var/lib/kubelet/pods/7ed60c2f-bd80-4017-9202-21f44b7e4eb3/volumes/kubernetes.io~nfs/baknfs]# ls
backup-nfs  backup-nfs-old  bak20211112  bk_test


Update restore.yaml accordingly


  local:
    prefix: bak20211112
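The controller log in this section shows the prefix arriving as the bare number `20211112` rather than a quoted string. As a hedged sketch, a pre-flight check could flag such values before `kubectl apply`. The function name `prefix_is_all_digits` is hypothetical, and only the all-digit case from this article is covered:

```shell
# Hypothetical pre-flight check: succeed when a prefix consists only of digits,
# i.e. the case that appeared as "prefix":20211112 in the controller log.
# Other YAML number forms (floats, hex and so on) are deliberately not handled.
prefix_is_all_digits() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

for p in 20211112 bak20211112; do
  if prefix_is_all_digits "$p"; then
    echo "$p: all digits, rename the directory or quote the value"
  else
    echo "$p: ok"
  fi
done
```

Quoting the value in the YAML (`prefix: "20211112"`) should also make it a string, though this article only verifies the rename.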

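Notes

  • Mind the limitations of BR itself, for example that the new collation setting must be the same on both clusters;

  • Pay attention to how the backup folder is named;

  • If a task fails to start, check the tidb-controller-manager pod's log;

  • Check the BR pod's log at each step to make sure every step completed successfully;

  • Keep statistics up to date; in some cases the update has to be triggered manually.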
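For the last point, statistics can be refreshed with `ANALYZE TABLE`. A small sketch that generates the statements for the two tables restored in this article (the helper name `analyze_statements` is hypothetical, and feeding the output to a MySQL client connected to the target cluster is left to the reader):

```shell
# Hypothetical helper: print one ANALYZE TABLE statement per table name.
# Usage: analyze_statements <database> <table>...
analyze_statements() {
  db="$1"
  shift
  for t in "$@"; do
    printf 'ANALYZE TABLE %s.%s;\n' "$db" "$t"
  done
}

analyze_statements account_db table_account_collect_entry table_account_counts_cst
```

Generating the statements first, rather than running them directly, leaves room to review the list before it touches the cluster.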

