
TiDB Labs Cloud Environment Test: Zero Database Downtime During an Availability Zone Failure

  • 2025-02-28, Beijing

Author: EINTR | Original source: https://tidb.net/blog/b624ac58


Log in to TiDB Labs, select the "Zero Database Downtime During an Availability Zone Failure" lab, pay the corresponding credits, and start the lab environment. After waiting a few minutes for the environment to be created, you receive a private key and the IP addresses of the hosts.
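
With the key and IP in hand, the host can be reached over SSH. The snippet below is a minimal sketch; the key file name matches the one used later in this lab, and <host-public-ip> stands for whatever address the lab displays.

# Minimal SSH login sketch; <host-public-ip> is the address shown by the lab
chmod 400 pe-class-key-singapore.pem
ssh -i pe-class-key-singapore.pem ec2-user@<host-public-ip>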


1. Database Environment Configuration

After logging in to the host, copy the provided template and edit it into the configuration file used to deploy the database. The topology spreads the three PD nodes and three TiKV nodes across three subnets (each corresponding to an availability zone in this lab), plus two TiDB servers and a shared monitoring host, so that losing a single zone still leaves a majority of replicas available.


cp template-nine-nodes.yaml nine-nodes.yaml
[ec2-user@ip-10-90-3-53 ~]$ cat nine-nodes.yaml
global:
  user: "ec2-user"
  ssh_port: 22
  deploy_dir: "/tidb-deploy"
  data_dir: "/tidb-data"
  arch: "amd64"

server_configs:
  pd:
    replication.max-replicas: 3

pd_servers:
  - host: 10.90.3.53
    client_port: 2379
    peer_port: 2380
    advertise_client_addr: "http://13.212.46.63:2379"
  - host: 10.90.2.107
    client_port: 2379
    peer_port: 2380
    advertise_client_addr: "http://52.221.245.22:2379"
  - host: 10.90.1.143
    client_port: 2379
    peer_port: 2380
    advertise_client_addr: "http://47.129.246.237:2379"

tidb_servers:
  - host: 10.90.2.172
    port: 4000
    status_port: 10080
    deploy_dir: /tidb-deploy/tidb-4000
    log_dir: /tidb-deploy/tidb-4000/log
  - host: 10.90.1.216
    port: 4000
    status_port: 10080
    deploy_dir: /tidb-deploy/tidb-4000
    log_dir: /tidb-deploy/tidb-4000/log

tikv_servers:
  - host: 10.90.3.65
    port: 20160
    status_port: 20180
  - host: 10.90.2.144
    port: 20160
    status_port: 20180
  - host: 10.90.1.254
    port: 20160
    status_port: 20180

monitoring_servers:
  - host: 10.90.4.114

grafana_servers:
  - host: 10.90.4.114

alertmanager_servers:
  - host: 10.90.4.114


Check that the configuration file is valid and free of errors; the --apply flag also lets tiup fix the failed checks it can (here, vm.swappiness on each host).


tiup mirror set tidb-community-server-8.1.1-linux-amd64
tiup cluster check nine-nodes.yaml \
  --user ec2-user \
  -i /home/ec2-user/.ssh/pe-class-key-singapore.pem \
  --apply

// Output after the check completes
+ Try to apply changes to fix failed checks
  - Applying changes on 10.90.3.53 ... Done
  - Applying changes on 10.90.1.143 ... Done
  - Applying changes on 10.90.1.254 ... Done
  - Applying changes on 10.90.4.114 ... Done
  - Applying changes on 10.90.2.107 ... Done
  - Applying changes on 10.90.3.65 ... Done
  - Applying changes on 10.90.2.144 ... Done
  - Applying changes on 10.90.2.172 ... Done
  - Applying changes on 10.90.1.216 ... Done


After the check passes, deploy the database with the configuration file.


tiup cluster deploy tidb-demo 8.1.1 ./nine-nodes.yaml \
  --user ec2-user \
  -i /home/ec2-user/.ssh/pe-class-key-singapore.pem \
  --yes

// After a successful deployment, the following information is printed
Enabling component blackbox_exporter
        Enabling instance 10.90.4.114
        Enabling instance 10.90.2.107
        Enabling instance 10.90.1.254
        Enabling instance 10.90.1.143
        Enabling instance 10.90.2.172
        Enabling instance 10.90.2.144
        Enabling instance 10.90.3.53
        Enabling instance 10.90.1.216
        Enabling instance 10.90.3.65
        Enable 10.90.4.114 success
        Enable 10.90.1.254 success
        Enable 10.90.3.65 success
        Enable 10.90.2.107 success
        Enable 10.90.1.216 success
        Enable 10.90.2.172 success
        Enable 10.90.3.53 success
        Enable 10.90.2.144 success
        Enable 10.90.1.143 success
Cluster `tidb-demo` deployed successfully, you can start it with command: `tiup cluster start tidb-demo --init`


Start the database cluster.


[ec2-user@ip-10-90-3-53 ~]$ tiup cluster start tidb-demo

// Output after a successful start
+ [ Serial ] - UpdateTopology: cluster=tidb-demo
Started cluster `tidb-demo` successfully


Check the status information of the database components.


[ec2-user@ip-10-90-3-53 ~]$ tiup cluster display tidb-demo
Cluster type:       tidb
Cluster name:       tidb-demo
Cluster version:    v8.1.1
Deploy user:        ec2-user
SSH type:           builtin
Dashboard URL:      http://52.221.245.22:2379/dashboard
Grafana URL:        http://10.90.4.114:3000
ID                 Role          Host         Ports        OS/Arch       Status  Data Dir                      Deploy Dir
--                 ----          ----         -----        -------       ------  --------                      ----------
10.90.4.114:9093   alertmanager  10.90.4.114  9093/9094    linux/x86_64  Up      /tidb-data/alertmanager-9093  /tidb-deploy/alertmanager-9093
10.90.4.114:3000   grafana       10.90.4.114  3000         linux/x86_64  Up      -                             /tidb-deploy/grafana-3000
10.90.1.143:2379   pd            10.90.1.143  2379/2380    linux/x86_64  Up|L    /tidb-data/pd-2379            /tidb-deploy/pd-2379
10.90.2.107:2379   pd            10.90.2.107  2379/2380    linux/x86_64  Up      /tidb-data/pd-2379            /tidb-deploy/pd-2379
10.90.3.53:2379    pd            10.90.3.53   2379/2380    linux/x86_64  Up      /tidb-data/pd-2379            /tidb-deploy/pd-2379
10.90.4.114:9090   prometheus    10.90.4.114  9090/12020   linux/x86_64  Up      /tidb-data/prometheus-9090    /tidb-deploy/prometheus-9090
10.90.1.216:4000   tidb          10.90.1.216  4000/10080   linux/x86_64  Up      -                             /tidb-deploy/tidb-4000
10.90.2.172:4000   tidb          10.90.2.172  4000/10080   linux/x86_64  Up      -                             /tidb-deploy/tidb-4000
10.90.1.254:20160  tikv          10.90.1.254  20160/20180  linux/x86_64  Up      /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
10.90.2.144:20160  tikv          10.90.2.144  20160/20180  linux/x86_64  Up      /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
10.90.3.65:20160   tikv          10.90.3.65   20160/20180  linux/x86_64  Up      /tidb-data/tikv-20160         /tidb-deploy/tikv-20160

2. Deploy the Application

Set the IP environment variables that the Java program will use.


// The IPs are the reachable addresses of the two TiDB servers
export HOST_DB1_PRIVATE_IP=10.90.2.172
export HOST_DB2_PRIVATE_IP=10.90.1.216
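
Before starting the workload, it can be worth confirming that both TiDB endpoints accept connections. This is an optional sketch, assuming the mysql client is installed and the root account has no password (the cluster was started without --init):

# Optional sanity check: both TiDB servers should answer on port 4000
for ip in ${HOST_DB1_PRIVATE_IP} ${HOST_DB2_PRIVATE_IP}; do
  mysql -h ${ip} -P 4000 -uroot --connect-timeout 1 -e "SELECT VERSION();" \
    && echo "${ip}: reachable" || echo "${ip}: not reachable"
done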


Compile and run the Java program.


javac -cp .:misc/mysql-connector-java-5.1.36-bin.jar DemoJdbcEndlessInsertDummyCSP.java
java -cp .:misc/mysql-connector-java-5.1.36-bin.jar DemoJdbcEndlessInsertDummyCSP


After it starts, the program prints the work done by each of its two threads.
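
The Java source is provided by the lab environment and is not reproduced here. Conceptually the workload boils down to the pattern sketched below in shell form; this is not the lab's code, and it assumes test.dummy already exists with an insertable column named event (consistent with the COUNT(event) query used in the next step):

# Hypothetical shell equivalent of the endless-insert workload: each iteration
# tries the first TiDB server and falls back to the second one on failure.
SQL="INSERT INTO test.dummy (event) VALUES ('tick');"
while true; do
  mysql -h ${HOST_DB1_PRIVATE_IP} -P 4000 -uroot --connect-timeout 1 -e "${SQL}" 2>/dev/null \
    || mysql -h ${HOST_DB2_PRIVATE_IP} -P 4000 -uroot --connect-timeout 1 -e "${SQL}" 2>/dev/null
  sleep 1
done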



Open another terminal to monitor the data changes in the database table.


qdb1(){
mysql -h ${HOST_DB1_PRIVATE_IP} -P 4000 -uroot --connect-timeout 1 2>/dev/null << EOF
  SELECT COUNT(event) FROM test.dummy;
EOF
}

qdb2(){
mysql -h ${HOST_DB2_PRIVATE_IP} -P 4000 -uroot --connect-timeout 1 2>/dev/null << EOF
  SELECT COUNT(event) FROM test.dummy;
EOF
}

query1(){ echo; date; qdb1 || qdb2; sleep 2; }
query2(){ echo; date; qdb2 || qdb1; sleep 2; }

while true; do query1; query2; done


Once the loop is running, it prints a timestamp and the current row count of test.dummy every two seconds. Because each query falls back to the other TiDB server (qdb1 || qdb2), the count keeps growing even if one endpoint becomes unreachable.



Log in to the database and check the status of the PD nodes.
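
The exact query is not shown in the original post; one possible way to inspect the PD nodes from a SQL session, or alternatively from pd-ctl, is sketched below:

# List the PD instances as seen by the cluster
mysql -h ${HOST_DB1_PRIVATE_IP} -P 4000 -uroot -e \
  "SELECT TYPE, INSTANCE, STATUS_ADDRESS, VERSION FROM INFORMATION_SCHEMA.CLUSTER_INFO WHERE TYPE = 'pd';"

# Or ask PD directly which member currently holds the leader role
tiup ctl:v8.1.1 pd -u http://10.90.3.53:2379 member leader show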


3. Simulating an Availability Zone Failure

Use the commands provided by the lab to simulate an availability zone failure: the subnet demo-subnet-1 is re-associated with a pre-created "cage" network ACL that blocks all traffic, which effectively isolates every instance in that zone.


[ec2-user@ip-10-90-3-53 ~]$ VPC_ID=`aws ec2 describe-vpcs \
>   --filters "Name=tag:Name,Values=wqlixueyang-hotmail-com" \
>   --query "Vpcs[0].VpcId" \
>   --output text \
>   --region ap-southeast-1`
[ec2-user@ip-10-90-3-53 ~]$ SUBNET_ID=`aws ec2 describe-subnets \
>   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=tag:Name,Values=demo-subnet-1" \
>   --query "Subnets[0].SubnetId" \
>   --output text \
>   --region ap-southeast-1`
[ec2-user@ip-10-90-3-53 ~]$ CAGE_NACL_ID=`aws ec2 describe-network-acls \
>   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=default,Values=false" "Name=tag:Name,Values=cage" \
>   --query "NetworkAcls[0].NetworkAclId" \
>   --output text \
>   --region ap-southeast-1`
[ec2-user@ip-10-90-3-53 ~]$ ASSOC_ID=`aws ec2 describe-network-acls \
>   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=default,Values=true" \
>   --query "NetworkAcls[0].Associations" \
>   --output text \
>   --region ap-southeast-1 | grep ${SUBNET_ID} | awk -F" " '{print $1}'`
[ec2-user@ip-10-90-3-53 ~]$ aws ec2 replace-network-acl-association \
>   --association-id ${ASSOC_ID} \
>   --network-acl-id ${CAGE_NACL_ID} \
>   --region ap-southeast-1
{
    "NewAssociationId": "aclassoc-0956a8b18b9835948"
}


Check the cluster status.


[ec2-user@ip-10-90-3-53 ~]$ tiup cluster display tidb-demo
Cluster type:       tidb
Cluster name:       tidb-demo
Cluster version:    v8.1.1
Deploy user:        ec2-user
SSH type:           builtin
Dashboard URL:      http://52.221.245.22:2379/dashboard
Grafana URL:        http://10.90.4.114:3000
ID                 Role          Host         Ports        OS/Arch       Status        Data Dir                      Deploy Dir
--                 ----          ----         -----        -------       ------        --------                      ----------
10.90.4.114:9093   alertmanager  10.90.4.114  9093/9094    linux/x86_64  Up            /tidb-data/alertmanager-9093  /tidb-deploy/alertmanager-9093
10.90.4.114:3000   grafana       10.90.4.114  3000         linux/x86_64  Up            -                             /tidb-deploy/grafana-3000
10.90.1.143:2379   pd            10.90.1.143  2379/2380    linux/x86_64  Down          /tidb-data/pd-2379            /tidb-deploy/pd-2379
10.90.2.107:2379   pd            10.90.2.107  2379/2380    linux/x86_64  Up|L          /tidb-data/pd-2379            /tidb-deploy/pd-2379
10.90.3.53:2379    pd            10.90.3.53   2379/2380    linux/x86_64  Up            /tidb-data/pd-2379            /tidb-deploy/pd-2379
10.90.4.114:9090   prometheus    10.90.4.114  9090/12020   linux/x86_64  Up            /tidb-data/prometheus-9090    /tidb-deploy/prometheus-9090
10.90.1.216:4000   tidb          10.90.1.216  4000/10080   linux/x86_64  Down          -                             /tidb-deploy/tidb-4000
10.90.2.172:4000   tidb          10.90.2.172  4000/10080   linux/x86_64  Up            -                             /tidb-deploy/tidb-4000
10.90.1.254:20160  tikv          10.90.1.254  20160/20180  linux/x86_64  Disconnected  /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
10.90.2.144:20160  tikv          10.90.2.144  20160/20180  linux/x86_64  Up            /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
10.90.3.65:20160   tikv          10.90.3.65   20160/20180  linux/x86_64  Up            /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
Total nodes: 11
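
Besides tiup cluster display, PD's own view of the TiKV stores can be checked; one possible way, using pd-ctl through tiup, is shown below. The store in the isolated zone should appear as disconnected while the other two remain up, so two of the three Raft replicas (a majority) stay available.

# Inspect store states from PD's perspective (one possible check)
tiup ctl:v8.1.1 pd -u http://10.90.3.53:2379 store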


Once the zone is cut off, traffic to 10.90.1.216 is switched over to 10.90.2.172 without interruption.



Next, bring the availability zone back online with the corresponding script.


[ec2-user@ip-10-90-3-53 ~]$ VPC_ID=`aws ec2 describe-vpcs \
>   --filters "Name=tag:Name,Values=wqlixueyang-hotmail-com" \
>   --query "Vpcs[0].VpcId" \
>   --output text \
>   --region ap-southeast-1`
[ec2-user@ip-10-90-3-53 ~]$ SUBNET_ID=`aws ec2 describe-subnets \
>   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=tag:Name,Values=demo-subnet-1" \
>   --query "Subnets[0].SubnetId" \
>   --output text \
>   --region ap-southeast-1`
[ec2-user@ip-10-90-3-53 ~]$ DEFAULT_NACL_ID=`aws ec2 describe-network-acls \
>   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=default,Values=true" \
>   --query "NetworkAcls[0].NetworkAclId" \
>   --output text \
>   --region ap-southeast-1`
[ec2-user@ip-10-90-3-53 ~]$ ASSOC_ID=`aws ec2 describe-network-acls \
>   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=default,Values=false" \
>   --query "NetworkAcls[0].Associations" \
>   --output text \
>   --region ap-southeast-1 | grep ${SUBNET_ID} | awk -F" " '{print $1}'`
[ec2-user@ip-10-90-3-53 ~]$ aws ec2 replace-network-acl-association \
>   --association-id ${ASSOC_ID} \
>   --network-acl-id ${DEFAULT_NACL_ID} \
>   --region ap-southeast-1
{
    "NewAssociationId": "aclassoc-05f1983f8ad847144"
}


The cluster status is back to normal.


[ec2-user@ip-10-90-3-53 ~]$ tiup cluster display tidb-demo
Cluster type:       tidb
Cluster name:       tidb-demo
Cluster version:    v8.1.1
Deploy user:        ec2-user
SSH type:           builtin
Dashboard URL:      http://52.221.245.22:2379/dashboard
Grafana URL:        http://10.90.4.114:3000
ID                 Role          Host         Ports        OS/Arch       Status  Data Dir                      Deploy Dir
--                 ----          ----         -----        -------       ------  --------                      ----------
10.90.4.114:9093   alertmanager  10.90.4.114  9093/9094    linux/x86_64  Up      /tidb-data/alertmanager-9093  /tidb-deploy/alertmanager-9093
10.90.4.114:3000   grafana       10.90.4.114  3000         linux/x86_64  Up      -                             /tidb-deploy/grafana-3000
10.90.1.143:2379   pd            10.90.1.143  2379/2380    linux/x86_64  Up      /tidb-data/pd-2379            /tidb-deploy/pd-2379
10.90.2.107:2379   pd            10.90.2.107  2379/2380    linux/x86_64  Up|L    /tidb-data/pd-2379            /tidb-deploy/pd-2379
10.90.3.53:2379    pd            10.90.3.53   2379/2380    linux/x86_64  Up      /tidb-data/pd-2379            /tidb-deploy/pd-2379
10.90.4.114:9090   prometheus    10.90.4.114  9090/12020   linux/x86_64  Up      /tidb-data/prometheus-9090    /tidb-deploy/prometheus-9090
10.90.1.216:4000   tidb          10.90.1.216  4000/10080   linux/x86_64  Up      -                             /tidb-deploy/tidb-4000
10.90.2.172:4000   tidb          10.90.2.172  4000/10080   linux/x86_64  Up      -                             /tidb-deploy/tidb-4000
10.90.1.254:20160  tikv          10.90.1.254  20160/20180  linux/x86_64  Up      /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
10.90.2.144:20160  tikv          10.90.2.144  20160/20180  linux/x86_64  Up      /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
10.90.3.65:20160   tikv          10.90.3.65   20160/20180  linux/x86_64  Up      /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
Total nodes: 11


Check that the application traffic is rebalanced across both TiDB IPs.
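
One possible way to confirm from SQL that client connections are again spread across both TiDB servers is to group the cluster-wide process list by instance (an illustrative query, not part of the original lab steps):

# Count active sessions per TiDB instance
mysql -h ${HOST_DB1_PRIVATE_IP} -P 4000 -uroot -e \
  "SELECT INSTANCE, COUNT(*) AS sessions FROM INFORMATION_SCHEMA.CLUSTER_PROCESSLIST GROUP BY INSTANCE;"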


4. Summary

Simulating an availability zone failure in the cloud environment made some of the instances unavailable, yet the application traffic migrated smoothly to the other availability zones with no impact on the business.

