Zero Database Downtime During an Availability Zone Failure in the TiDB Labs Cloud Environment
2025-02-28, Beijing
Word count: 7,859
Estimated reading time: ~26 minutes
Author: EINTR. Original source: https://tidb.net/blog/b624ac58
Log in to TiDB Labs, select "Zero Database Downtime During an Availability Zone Failure", and pay the corresponding credits to launch the lab environment. After a few minutes, once the environment has been created, you receive a private key and the corresponding host IP addresses.

1. Database Environment Configuration
After logging in to the host, copy the configuration template and edit it into the deployment configuration file.
cp template-nine-nodes.yaml nine-nodes.yaml
[ec2-user@ip-10-90-3-53 ~]$ cat nine-nodes.yaml
global:
  user: "ec2-user"
  ssh_port: 22
  deploy_dir: "/tidb-deploy"
  data_dir: "/tidb-data"
  arch: "amd64"
server_configs:
  pd:
    replication.max-replicas: 3
pd_servers:
  - host: 10.90.3.53
    client_port: 2379
    peer_port: 2380
    advertise_client_addr: "http://13.212.46.63:2379"
  - host: 10.90.2.107
    client_port: 2379
    peer_port: 2380
    advertise_client_addr: "http://52.221.245.22:2379"
  - host: 10.90.1.143
    client_port: 2379
    peer_port: 2380
    advertise_client_addr: "http://47.129.246.237:2379"
tidb_servers:
  - host: 10.90.2.172
    port: 4000
    status_port: 10080
    deploy_dir: /tidb-deploy/tidb-4000
    log_dir: /tidb-deploy/tidb-4000/log
  - host: 10.90.1.216
    port: 4000
    status_port: 10080
    deploy_dir: /tidb-deploy/tidb-4000
    log_dir: /tidb-deploy/tidb-4000/log
tikv_servers:
  - host: 10.90.3.65
    port: 20160
    status_port: 20180
  - host: 10.90.2.144
    port: 20160
    status_port: 20180
  - host: 10.90.1.254
    port: 20160
    status_port: 20180
monitoring_servers:
  - host: 10.90.4.114
grafana_servers:
  - host: 10.90.4.114
alertmanager_servers:
  - host: 10.90.4.114
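The key setting in this topology is `replication.max-replicas: 3` combined with one TiKV node per subnet (one per availability zone), so each Region keeps one replica in every zone. Raft keeps serving as long as a majority of replicas survive. A quick arithmetic sketch of why losing one zone is survivable (this is just the generic majority rule, nothing TiDB-specific):

```shell
# Raft majority: a group of n replicas needs quorum(n) = n/2 + 1 live replicas.
quorum() { echo $(( $1 / 2 + 1 )); }

REPLICAS=3          # replication.max-replicas in the topology above
ZONES_LOST=1        # the simulated AZ failure takes out one replica per Region

SURVIVING=$(( REPLICAS - ZONES_LOST ))
NEEDED=$(quorum "$REPLICAS")

if [ "$SURVIVING" -ge "$NEEDED" ]; then
  echo "majority held: $SURVIVING of $REPLICAS replicas, quorum $NEEDED"
else
  echo "majority lost"
fi
```

With 3 replicas the quorum is 2, so one zone can fail safely; losing two zones at once would break the majority and stop writes.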
Validate the configuration file and make sure it has no errors:
tiup mirror set tidb-community-server-8.1.1-linux-amd64
tiup cluster check nine-nodes.yaml \
--user ec2-user \
-i /home/ec2-user/.ssh/pe-class-key-singapore.pem \
--apply
// Output after the checks complete
+ Try to apply changes to fix failed checks
- Applying changes on 10.90.3.53 ... Done
- Applying changes on 10.90.1.143 ... Done
- Applying changes on 10.90.1.254 ... Done
- Applying changes on 10.90.4.114 ... Done
- Applying changes on 10.90.2.107 ... Done
- Applying changes on 10.90.3.65 ... Done
- Applying changes on 10.90.2.144 ... Done
- Applying changes on 10.90.2.172 ... Done
- Applying changes on 10.90.1.216 ... Done
After the checks pass, deploy the cluster with the configuration file:
tiup cluster deploy tidb-demo 8.1.1 ./nine-nodes.yaml \
--user ec2-user \
-i /home/ec2-user/.ssh/pe-class-key-singapore.pem \
--yes
// Output after a successful deployment
Enabling component blackbox_exporter
Enabling instance 10.90.4.114
Enabling instance 10.90.2.107
Enabling instance 10.90.1.254
Enabling instance 10.90.1.143
Enabling instance 10.90.2.172
Enabling instance 10.90.2.144
Enabling instance 10.90.3.53
Enabling instance 10.90.1.216
Enabling instance 10.90.3.65
Enable 10.90.4.114 success
Enable 10.90.1.254 success
Enable 10.90.3.65 success
Enable 10.90.2.107 success
Enable 10.90.1.216 success
Enable 10.90.2.172 success
Enable 10.90.3.53 success
Enable 10.90.2.144 success
Enable 10.90.1.143 success
Cluster `tidb-demo` deployed successfully, you can start it with command: `tiup cluster start tidb-demo --init`
Start the database cluster. (The deploy message suggests starting with `--init`, which generates a random root password; starting without it, as below, leaves the root account with an empty password.)
[ec2-user@ip-10-90-3-53 ~]$ tiup cluster start tidb-demo
// Output after a successful start
+ [ Serial ] - UpdateTopology: cluster=tidb-demo
Started cluster `tidb-demo` successfully
Check the status of the cluster components:
[ec2-user@ip-10-90-3-53 ~]$ tiup cluster display tidb-demo
Cluster type: tidb
Cluster name: tidb-demo
Cluster version: v8.1.1
Deploy user: ec2-user
SSH type: builtin
Dashboard URL: http://52.221.245.22:2379/dashboard
Grafana URL: http://10.90.4.114:3000
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir
-- ---- ---- ----- ------- ------ -------- ----------
10.90.4.114:9093 alertmanager 10.90.4.114 9093/9094 linux/x86_64 Up /tidb-data/alertmanager-9093 /tidb-deploy/alertmanager-9093
10.90.4.114:3000 grafana 10.90.4.114 3000 linux/x86_64 Up - /tidb-deploy/grafana-3000
10.90.1.143:2379 pd 10.90.1.143 2379/2380 linux/x86_64 Up|L /tidb-data/pd-2379 /tidb-deploy/pd-2379
10.90.2.107:2379 pd 10.90.2.107 2379/2380 linux/x86_64 Up /tidb-data/pd-2379 /tidb-deploy/pd-2379
10.90.3.53:2379 pd 10.90.3.53 2379/2380 linux/x86_64 Up /tidb-data/pd-2379 /tidb-deploy/pd-2379
10.90.4.114:9090 prometheus 10.90.4.114 9090/12020 linux/x86_64 Up /tidb-data/prometheus-9090 /tidb-deploy/prometheus-9090
10.90.1.216:4000 tidb 10.90.1.216 4000/10080 linux/x86_64 Up - /tidb-deploy/tidb-4000
10.90.2.172:4000 tidb 10.90.2.172 4000/10080 linux/x86_64 Up - /tidb-deploy/tidb-4000
10.90.1.254:20160 tikv 10.90.1.254 20160/20180 linux/x86_64 Up /tidb-data/tikv-20160 /tidb-deploy/tikv-20160
10.90.2.144:20160 tikv 10.90.2.144 20160/20180 linux/x86_64 Up /tidb-data/tikv-20160 /tidb-deploy/tikv-20160
10.90.3.65:20160 tikv 10.90.3.65 20160/20180 linux/x86_64 Up /tidb-data/tikv-20160 /tidb-deploy/tikv-20160
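During the experiment it is handy to summarize this table at a glance. The helper below tallies the Status column with awk; it is a sketch that assumes the column layout shown above (Status is the sixth field) and counts the PD leader's `Up|L` as Up:

```shell
# Tally instance states from `tiup cluster display` output read on stdin.
# Assumes the table layout above: Status is field 6; "Up|L" marks the PD leader.
status_summary() {
  awk '$6 == "Up" || $6 == "Up|L" || $6 == "Down" || $6 == "Disconnected" {
         s = ($6 == "Up|L") ? "Up" : $6
         c[s]++
       }
       END { for (k in c) printf "%s %d\n", k, c[k] }'
}

# Usage (on the control host):
#   tiup cluster display tidb-demo | status_summary
```

In the healthy state above this prints `Up 11`; after the simulated zone failure you would expect a mix of Up, Down, and Disconnected counts.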
2. Deploying the Application
Set the IP environment variables used by the Java program.
// The IPs are the reachable addresses of the two TiDB servers
export HOST_DB1_PRIVATE_IP=10.90.2.172
export HOST_DB2_PRIVATE_IP=10.90.1.216
Compile and run the Java program:
javac -cp .:misc/mysql-connector-java-5.1.36-bin.jar DemoJdbcEndlessInsertDummyCSP.java
java -cp .:misc/mysql-connector-java-5.1.36-bin.jar DemoJdbcEndlessInsertDummyCSP
Once running, the program prints the activity of its two worker threads.

In another terminal, monitor the row count in the table as it changes:
qdb1(){
  mysql -h ${HOST_DB1_PRIVATE_IP} -P 4000 -uroot --connect-timeout 1 2>/dev/null << EOF
SELECT COUNT(event) FROM test.dummy;
EOF
}
qdb2(){
  mysql -h ${HOST_DB2_PRIVATE_IP} -P 4000 -uroot --connect-timeout 1 2>/dev/null << EOF
SELECT COUNT(event) FROM test.dummy;
EOF
}
query1(){
  echo
  date
  qdb1 || qdb2
  sleep 2
}
query2(){
  echo
  date
  qdb2 || qdb1
  sleep 2
}
while true; do
  query1
  query2
done
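The failover in query1/query2 hinges entirely on the shell's `||` operator: qdb1 exits nonzero when mysql cannot connect within `--connect-timeout 1`, so the query falls through to the other TiDB endpoint. A minimal self-contained demonstration of the same pattern, with hypothetical stub functions standing in for the mysql calls:

```shell
# Stand-ins for qdb1/qdb2: the unreachable endpoint returns nonzero,
# just as mysql does when the connection times out; the healthy one answers.
endpoint_down() { return 1; }
endpoint_up()   { echo "42"; }

# Same shape as query1: try the primary, fall back to the secondary on failure.
failover_query() { endpoint_down || endpoint_up; }

failover_query   # the primary fails, so the fallback answers
```

Because the mysql errors are discarded with `2>/dev/null`, the monitoring loop keeps printing a count throughout the failure as long as either endpoint is reachable.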
When running, it displays the following:

Log in to the database and check the PD status:

3. Simulating an Availability Zone Failure
Use the commands provided by the lab to simulate an availability zone failure. They look up the VPC, the target subnet, the "cage" network ACL (a non-default NACL that blocks traffic), and the subnet's current NACL association, then swap the association so the subnet is cut off:
VPC_ID=`aws ec2 describe-vpcs \
  --filters "Name=tag:Name,Values=wqlixueyang-hotmail-com" \
  --query "Vpcs[0].VpcId" \
  --output text \
  --region ap-southeast-1`
SUBNET_ID=`aws ec2 describe-subnets \
  --filters "Name=vpc-id,Values=${VPC_ID}" "Name=tag:Name,Values=demo-subnet-1" \
  --query "Subnets[0].SubnetId" \
  --output text \
  --region ap-southeast-1`
CAGE_NACL_ID=`aws ec2 describe-network-acls \
  --filters "Name=vpc-id,Values=${VPC_ID}" "Name=default,Values=false" "Name=tag:Name,Values=cage" \
  --query "NetworkAcls[0].NetworkAclId" \
  --output text \
  --region ap-southeast-1`
ASSOC_ID=`aws ec2 describe-network-acls \
  --filters "Name=vpc-id,Values=${VPC_ID}" "Name=default,Values=true" \
  --query "NetworkAcls[0].Associations" \
  --output text \
  --region ap-southeast-1 | grep ${SUBNET_ID} | awk -F" " '{print $1}'`
aws ec2 replace-network-acl-association \
  --association-id ${ASSOC_ID} \
  --network-acl-id ${CAGE_NACL_ID} \
  --region ap-southeast-1
// Output
{
    "NewAssociationId": "aclassoc-0956a8b18b9835948"
}
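The trickiest step above is extracting the association ID: `describe-network-acls --query "NetworkAcls[0].Associations" --output text` prints one whitespace-separated row per subnet association, and the script greps for the target subnet and takes the first field. A sketch of that extraction against a fabricated sample (the exact row shape is an assumption inferred from the commands above, not verified against the AWS CLI):

```shell
# Pull the association ID for one subnet out of text-format NACL association
# rows. Sample rows are fabricated: "<assoc-id> <nacl-id> <subnet-id>".
extract_assoc_id() {
  grep "$1" | awk '{print $1}'
}

SAMPLE='aclassoc-111aaa acl-0aaa subnet-aaa
aclassoc-222bbb acl-0bbb subnet-bbb'

echo "$SAMPLE" | extract_assoc_id subnet-bbb   # -> aclassoc-222bbb
```

Swapping the subnet's association from the default NACL to the "cage" NACL drops all traffic in and out of that subnet, which is what makes the instances in that zone appear Down or Disconnected without touching the instances themselves.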
Check the cluster status:
[ec2-user@ip-10-90-3-53 ~]$ tiup cluster display tidb-demo
Cluster type: tidb
Cluster name: tidb-demo
Cluster version: v8.1.1
Deploy user: ec2-user
SSH type: builtin
Dashboard URL: http://52.221.245.22:2379/dashboard
Grafana URL: http://10.90.4.114:3000
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir
-- ---- ---- ----- ------- ------ -------- ----------
10.90.4.114:9093 alertmanager 10.90.4.114 9093/9094 linux/x86_64 Up /tidb-data/alertmanager-9093 /tidb-deploy/alertmanager-9093
10.90.4.114:3000 grafana 10.90.4.114 3000 linux/x86_64 Up - /tidb-deploy/grafana-3000
10.90.1.143:2379 pd 10.90.1.143 2379/2380 linux/x86_64 Down /tidb-data/pd-2379 /tidb-deploy/pd-2379
10.90.2.107:2379 pd 10.90.2.107 2379/2380 linux/x86_64 Up|L /tidb-data/pd-2379 /tidb-deploy/pd-2379
10.90.3.53:2379 pd 10.90.3.53 2379/2380 linux/x86_64 Up /tidb-data/pd-2379 /tidb-deploy/pd-2379
10.90.4.114:9090 prometheus 10.90.4.114 9090/12020 linux/x86_64 Up /tidb-data/prometheus-9090 /tidb-deploy/prometheus-9090
10.90.1.216:4000 tidb 10.90.1.216 4000/10080 linux/x86_64 Down - /tidb-deploy/tidb-4000
10.90.2.172:4000 tidb 10.90.2.172 4000/10080 linux/x86_64 Up - /tidb-deploy/tidb-4000
10.90.1.254:20160 tikv 10.90.1.254 20160/20180 linux/x86_64 Disconnected /tidb-data/tikv-20160 /tidb-deploy/tikv-20160
10.90.2.144:20160 tikv 10.90.2.144 20160/20180 linux/x86_64 Up /tidb-data/tikv-20160 /tidb-deploy/tikv-20160
10.90.3.65:20160 tikv 10.90.3.65 20160/20180 linux/x86_64 Up /tidb-data/tikv-20160 /tidb-deploy/tikv-20160
Total nodes: 11
After the zone is isolated, traffic to the TiDB server at 10.90.1.216 fails over seamlessly to 10.90.2.172.

Next, restore the availability zone by swapping the subnet's NACL association back to the default NACL:
VPC_ID=`aws ec2 describe-vpcs \
  --filters "Name=tag:Name,Values=wqlixueyang-hotmail-com" \
  --query "Vpcs[0].VpcId" \
  --output text \
  --region ap-southeast-1`
SUBNET_ID=`aws ec2 describe-subnets \
  --filters "Name=vpc-id,Values=${VPC_ID}" "Name=tag:Name,Values=demo-subnet-1" \
  --query "Subnets[0].SubnetId" \
  --output text \
  --region ap-southeast-1`
DEFAULT_NACL_ID=`aws ec2 describe-network-acls \
  --filters "Name=vpc-id,Values=${VPC_ID}" "Name=default,Values=true" \
  --query "NetworkAcls[0].NetworkAclId" \
  --output text \
  --region ap-southeast-1`
ASSOC_ID=`aws ec2 describe-network-acls \
  --filters "Name=vpc-id,Values=${VPC_ID}" "Name=default,Values=false" \
  --query "NetworkAcls[0].Associations" \
  --output text \
  --region ap-southeast-1 | grep ${SUBNET_ID} | awk -F" " '{print $1}'`
aws ec2 replace-network-acl-association \
  --association-id ${ASSOC_ID} \
  --network-acl-id ${DEFAULT_NACL_ID} \
  --region ap-southeast-1
// Output
{
    "NewAssociationId": "aclassoc-05f1983f8ad847144"
}
The cluster status is back to normal.
[ec2-user@ip-10-90-3-53 ~]$ tiup cluster display tidb-demo
Cluster type: tidb
Cluster name: tidb-demo
Cluster version: v8.1.1
Deploy user: ec2-user
SSH type: builtin
Dashboard URL: http://52.221.245.22:2379/dashboard
Grafana URL: http://10.90.4.114:3000
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir
-- ---- ---- ----- ------- ------ -------- ----------
10.90.4.114:9093 alertmanager 10.90.4.114 9093/9094 linux/x86_64 Up /tidb-data/alertmanager-9093 /tidb-deploy/alertmanager-9093
10.90.4.114:3000 grafana 10.90.4.114 3000 linux/x86_64 Up - /tidb-deploy/grafana-3000
10.90.1.143:2379 pd 10.90.1.143 2379/2380 linux/x86_64 Up /tidb-data/pd-2379 /tidb-deploy/pd-2379
10.90.2.107:2379 pd 10.90.2.107 2379/2380 linux/x86_64 Up|L /tidb-data/pd-2379 /tidb-deploy/pd-2379
10.90.3.53:2379 pd 10.90.3.53 2379/2380 linux/x86_64 Up /tidb-data/pd-2379 /tidb-deploy/pd-2379
10.90.4.114:9090 prometheus 10.90.4.114 9090/12020 linux/x86_64 Up /tidb-data/prometheus-9090 /tidb-deploy/prometheus-9090
10.90.1.216:4000 tidb 10.90.1.216 4000/10080 linux/x86_64 Up - /tidb-deploy/tidb-4000
10.90.2.172:4000 tidb 10.90.2.172 4000/10080 linux/x86_64 Up - /tidb-deploy/tidb-4000
10.90.1.254:20160 tikv 10.90.1.254 20160/20180 linux/x86_64 Up /tidb-data/tikv-20160 /tidb-deploy/tikv-20160
10.90.2.144:20160 tikv 10.90.2.144 20160/20180 linux/x86_64 Up /tidb-data/tikv-20160 /tidb-deploy/tikv-20160
10.90.3.65:20160 tikv 10.90.3.65 20160/20180 linux/x86_64 Up /tidb-data/tikv-20160 /tidb-deploy/tikv-20160
Total nodes: 11
Observe that application traffic is rebalanced across both TiDB IPs.

4. Summary
By simulating an availability zone failure in the cloud environment and making some instances unavailable, we showed that traffic migrates smoothly to the surviving availability zones without impacting the application.
Copyright notice: This is an original article by InfoQ author 【TiDB 社区干货传送门】.
Original link: http://xie.infoq.cn/article/45b618b73de476472b8551192. Please contact the author before reprinting.
