写点什么

GreatSQL 一个关于主从复制的限制描述与规避

作者:GreatSQL
  • 2023-09-27
    福建
  • 本文字数:5688 字

    阅读完需:约 19 分钟

一、背景

分享一个在项目运维中遇到的一个主从复制限制的一个坑,项目的架构为主集群+灾备集群,每个集群为一主两从模式。主集群到灾备集群的同步为主从复制的方式,根据业务需求灾备集群需要忽略系统库跟某些配置表,所以才会触发此限制,而这个限制如果我们之前没有遇到过,那么排查起来也是相对不易的。

二、限制描述

1、主从同步出现报错

greatsql> show slave status\G*************************** 1. row ***************************               Slave_IO_State: Waiting for master to send event                  Master_Host: 192.168.xxx.xxx                  Master_User: repl                  Master_Port: 3306                Connect_Retry: 60              Master_Log_File: greatsql-bin.000990          Read_Master_Log_Pos: 92274290               Relay_Log_File: greatsql-relay.002963     -----                Relay_Log_Pos: 701548899        Relay_Master_Log_File: greatsql-bin.000988             Slave_IO_Running: Yes            Slave_SQL_Running: No              Replicate_Do_DB:          Replicate_Ignore_DB: mysql,dbscale,dbscale_tmp,information_schema,performance_schema,sys           Replicate_Do_Table:       Replicate_Ignore_Table:      Replicate_Wild_Do_Table:  Replicate_Wild_Ignore_Table: A.ab,B.bc                   Last_Errno: 1146                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '9e668a93-2618-11ee-93ee-bc16954181bb:47508257' at master log greatdb-bin.000988, end_log_pos 701570116. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.                 Skip_Counter: 0          Exec_Master_Log_Pos: 701548690              Relay_Log_Space: 2246320360              Until_Condition: None               Until_Log_File:                Until_Log_Pos: 0           Master_SSL_Allowed: No           Master_SSL_CA_File:           Master_SSL_CA_Path:              Master_SSL_Cert:            Master_SSL_Cipher:               Master_SSL_Key:        Seconds_Behind_Master: NULLMaster_SSL_Verify_Server_Cert: No                Last_IO_Errno: 0                Last_IO_Error:               Last_SQL_Errno: 1146               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '9e668a93-2618-11ee-93ee-bc16954181bb:47508257' at master log greatdb-bin.000988, end_log_pos 701570116. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.  Replicate_Ignore_Server_Ids:             Master_Server_Id: 1943306                  Master_UUID: 9e668a93-2618-11ee-93ee-bc16954181bb             Master_Info_File: mysql.slave_master_info                    SQL_Delay: 0          SQL_Remaining_Delay: NULL      Slave_SQL_Running_State:           Master_Retry_Count: 86400                  Master_Bind:      Last_IO_Error_Timestamp:     Last_SQL_Error_Timestamp: 230822 14:14:18               Master_SSL_Crl:           Master_SSL_Crlpath:           Retrieved_Gtid_Set: 9e668a93-2618-11ee-93ee-bc16954181bb:2-47565802            Executed_Gtid_Set: 30873cfe-8750-11ed-b56f-744aa4073024:1-270,9e668a93-2618-11ee-93ee-bc16954181bb:1-47508256                Auto_Position: 1         Replicate_Rewrite_DB:                   Channel_Name:           Master_TLS_Version:1 row in set (0.00 sec)
复制代码


根据 slave status 状态信息可以看出


  • 报错的 GTID 为:'9e668a93-2618-11ee-93ee-bc16954181bb:47508257'

  • 应用的主集群的 binlog 为:greatsql-bin.000988

  • 灾备集群的 relay log 为:greatsql-relay.002963


详细信息查看performance_schema.replication_applier_status_by_worker

2、查看错误的详细信息

greatsql> select * from performance_schema.replication_applier_status_by_worker\G*************************** 1. row ***************************         CHANNEL_NAME:            WORKER_ID: 1            THREAD_ID: NULL        SERVICE_STATE: OFFLAST_SEEN_TRANSACTION: 9e668a93-2618-11ee-93ee-bc16954181bb:47508257    LAST_ERROR_NUMBER: 1146   LAST_ERROR_MESSAGE: Worker 1 failed executing transaction'9e668a93-2618-11ee-93ee-bc16954181bb:47508257' at master log greatdb-bin.000988,end_log_pos 701570116; Error executing row event: 'Table 'abs_xxx.tmp_xxx_info' doesn't exist'LAST_ERROR_TIMESTAMP: 2023-08-22 14:14:18
复制代码


上述信息说明根据performance_schema.replication_applier_status_by_worker表中的详细错误信息可以发现为灾备集群abs_xxx.tmp_xxx_info表不存在,导致同步报错

3、问题分析

3.1、确认灾备集群中目标表是否存在


greatsql> show create table abs_xxx.tmp_xxx_info;ERROR 1146 (42S02): Table 'abs_xxx.tmp_xxx_info' doesn't existgreatsql> desc abs_xxx.tmp_xxx_info;ERROR 1146 (42S02): Table 'abs_xxx.tmp_xxx_info' doesn't exist
复制代码


**结论:**灾备集群中目标表的确不存在


3.2、根据主从报错信息解析主集群 binlog,报错的 SQL


解析主集群 binlog


SET @@SESSION.GTID_NEXT= '9e668a93-2618-11ee-93ee-bc16954181bb:47508257'/*!*/;……#230822 14:14:18 server id 1943306  end_log_pos 701570000         Table_map: `abs_xxx`.`tmp_xxx_info` mapped to number 1595# at 701570000#230822 14:14:18 server id 1943306  end_log_pos 701570116         Write_rows: table id 1595 flags: STMT_END_F### INSERT INTO `abs_xxx`.`tmp_xxx_info`### SET###   @1=2###   @2='自动化'###   @3='2300121212120000'###   @4='90000000'###   @5='1'###   @6='202001290231001'###   @7='2021-01-31 00:00:00'# at 701570116#230822 14:14:18 server id 1943306  end_log_pos 701570143         Xid = 800998400COMMIT/*!*/;# at 701570143#230822 14:14:18 server id 1943306  end_log_pos 701570204         GTID        last_committed=26491        sequence_number=26521        rbr_only=yes/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
复制代码


**结论:**根据复制的报错信息得知具体的 GTID 号以及主集群的 binlog 文件,解析 binlog 得知此事务为一条 INSERT 语句,语句中的目标表与performance_schema.replication_applier_status_by_worker表中信息一致


3.3、寻找主集群目标表 binlog 中是否有建表语句


在同一 binlog 日志中寻找建表语句


SET TIMESTAMP=1692684495/*!*/;CREATE DATABASE IF NOT EXISTS `abs_xxx` /*!40100 DEFAULT CHARACTER SET utf8 COLLATE utf8_bin *//*!*/;……use `information_schema`/*!*/;SET TIMESTAMP=1692684495/*!*/;CREATE TABLE `abs_xxx`.`tmp_xxx_info` (  `ID` int(64) NOT NULL AUTO_INCREMENT,  `CUSTOMER_NAME` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,  `CUSTOMER_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,  `PRODIST_SKU_NUM` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL,  `STATUS` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL,  `AGREEMENT_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,  `END_DATE` datetime DEFAULT NULL,  PRIMARY KEY (`ID`),  KEY `PK_PRODIST_SKU_NUM_AGREEMENT` (`PRODIST_SKU_NUM`) USING BTREE,  KEY `IDX_AGREEMENT_NUM` (`AGREEMENT_NUM`) USING BTREE) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin ROW_FORMAT=DYNAMIC/*!*/;# at 475864451
复制代码


**结论:**在主集群的 binlog 日志中找到了目标表的建表语句,说明主集群执行 DDL 时并没有关闭 binlog 日志,那么继续查看在灾备集群的中继日志中是否存在 DDL 语句


3.4、解析灾备集群的中继日志,确认是否拉取到灾备集群


#230822 14:08:15 server id 1943306  end_log_pos 475863662         GTID        last_committed=16341        sequence_number=16342        rbr_only=noSET @@SESSION.GTID_NEXT= '9e668a93-2618-11ee-93ee-bc16954181bb:47498079'/*!*/;……use `information_schema`/*!*/;SET TIMESTAMP=1692684495/*!*/;CREATE TABLE `abs_xxx`.`tmp_xxx_info` (  `ID` int(64) NOT NULL AUTO_INCREMENT,  `CUSTOMER_NAME` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,  `CUSTOMER_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,  `PRODIST_SKU_NUM` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL,  `STATUS` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL,  `AGREEMENT_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,  `END_DATE` datetime DEFAULT NULL,  PRIMARY KEY (`ID`),  KEY `PK_PRODIST_SKU_NUM_AGREEMENT` (`PRODIST_SKU_NUM`) USING BTREE,  KEY `IDX_AGREEMENT_NUM` (`AGREEMENT_NUM`) USING BTREE) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin ROW_FORMAT=DYNAMIC/*!*/;# at 475864660#230822 14:08:15 server id 1943306  end_log_pos 475864512         GTID        last_committed=16342        sequence_number=16343        rbr_only=yes/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/; 
复制代码


**结论:**灾备集群的中继日志中存在 DDL 建表语句,说明并不是 IO 线程出了问题


3.5、排查复制配置的忽略库表


Replicate_Ignore_DB: mysql,dbscale,dbscale_tmp,information_schema,performance_schema,sysReplicate_Wild_Ignore_Table: A.ab,B.bc
复制代码


**结论:**忽略库表中并不包含目标表,但是根据以上解析日志发现,在主集群 binlog 日志中建表语句之前有个use information_schema/!/; 的语句,此库为同步忽略的系统库,因此触发了 GreatSQL 的规范限制,在忽略库下对未忽略进行操作 Statement 模式下记录语句默认不起作用 (详情:https://dev.mysql.com/doc/refman/5.7/en/replication-options-replica.html#option_mysqld_replicate-do-db

4、解决同步报错

在灾备集群创建目标表


greatsql> CREATE TABLE `abs_xxx`.`tmp_xxx_info` (  `ID` int(64) NOT NULL AUTO_INCREMENT,  `CUSTOMER_NAME` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,  `CUSTOMER_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,  `PRODIST_SKU_NUM` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL,  `STATUS` varchar(64) COLLATE utf8mb4_bin DEFAULT NULL,  `AGREEMENT_NUM` varchar(256) COLLATE utf8mb4_bin DEFAULT NULL,  `END_DATE` datetime DEFAULT NULL,  PRIMARY KEY (`ID`),  KEY `PK_PRODIST_SKU_NUM_AGREEMENT` (`PRODIST_SKU_NUM`) USING BTREE,  KEY `IDX_AGREEMENT_NUM` (`AGREEMENT_NUM`) USING BTREE) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin ROW_FORMAT=DYNAMIC;
greatsql> stop slave;greatsql> start slave;
复制代码


**结论:**在灾备集群创建目标表后重启复制恢复成功

三、限制规避

1、第一种规避方式

执行 DDL 时进入目标库


greatsql> use abs_custgreatsql> DDL 语句(CREATE\DROP\ALTER)
复制代码


**说明:**在应用连接数据库时有可能默认就是information_schema库,而此环境将系统库全部忽略,所以为了规避类似的问题,请在执行 SQL 语句时请先 use 到目标表的目标库。

2、第二种规避方式

修改主从复制配置,以下步骤为测试环境


关闭灾备集群在复制同步


greatsql> stop slave;Query OK, 0 rows affected, 1 warning (0.03 sec)
复制代码


修改忽略库


greatsql> change replication filter Replicate_Ignore_DB=();
复制代码


修改忽略表


greatsql> change replication filter replicate_wild_ignore_table =('mysql.%','information_schema.%','sys.%','performance_schema.%');
复制代码


启动同步


greatsql> start slave;Query OK, 0 rows affected, 1 warning (0.37 sec)
复制代码


测试验证


主集群:


greatsql> use mysqlReading table information for completion of table and column namesYou can turn off this feature to get a quicker startup with -ADatabase changed
greatsql> create table test111.test111(id int primary key);Query OK, 0 rows affected (0.06 sec)
greatsql> show tables;+-------------------+| Tables_in_test111 |+-------------------+| test111 |+-------------------+1 row in set (0.00 sec)
复制代码


灾备集群:


greatsql> use test111Reading table information for completion of table and column namesYou can turn off this feature to get a quicker startup with -ADatabase changed
greatsql> show tables;+-------------------+| Tables_in_test111 |+-------------------+| test111 |+-------------------+1 row in set (0.00 sec)
复制代码


**说明:**复制配置中参数Replicate_Ignore_DB设置为空,将replicate_wild_ignore_table参数设置为shema_name.%的方式也可以规避类似的问题

四、特别说明

  • 在 MySQL 5.7 跟 8.0 版本也存在此限制

发布于: 刚刚阅读数: 3
用户头像

GreatSQL

关注

GreatSQL社区 2023-01-31 加入

GreatSQL是由万里数据库维护的MySQL分支,专注于提升MGR可靠性及性能,支持InnoDB并行查询特性,是适用于金融级应用的MySQL分支版本。 社区:https://greatsql.cn/ Gitee: https://gitee.com/GreatSQL/GreatSQL

评论

发布
暂无评论
GreatSQL一个关于主从复制的限制描述与规避_主从复制_GreatSQL_InfoQ写作社区