写点什么

5.7 与 8.0 对相同文件的 LOAD DATA 语句结果不同

作者:GreatSQL
  • 2024-11-14
    福建
  • 本文字数:7085 字

    阅读完需:约 23 分钟

5.7 与 8.0 对相同文件的 LOAD DATA 语句结果不同

问题描述

某客户现场支持,由 MySQL 5.7.21 升级 MySQL 8.0.25 后,通过LOAD DATA导入文件,当同一会话连续导入不同的编码(UTF8/GB18030)文件时会出现乱码。数据库版本未升级之前,相同的导入操作在 MySQL 5.7.21 未出现乱码。

问题分析

1)查看简化后的 LOAD DATA语句


greatsql> LOAD DATA LOCAL INFILE '/home/greatdb/TEST_UTF8_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET UTF8MB4 IGNORE 0 LINES  (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING UTF8MB4)),'');Query OK, 2 rows affected (0.01 sec)Records: 2  Deleted: 0  Skipped: 0  Warnings: 0
greatsql> LOAD DATA LOCAL INFILE '/home/greatdb/TEST_GB18030_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET GB18030 IGNORE 0 LINES (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING GB18030)),'');Query OK, 2 rows affected (0.01 sec)Records: 2 Deleted: 0 Skipped: 0 Warnings: 0
复制代码


2)查看表数据


+----------+------------------------------------------------------+| AUTO_INC | D_NAME                                               |+----------+------------------------------------------------------+|        1 | xxx社会保险xxx                                        ||        2 | xxx市路桥区xxx                                        ||        4 | 鍙板窞甯傝矾妗ュ尯绀句細淇濋櫓浜嬩笟绠$悊涓績             ||        5 | 鍙板窞甯傝矾妗ュ尯绀句細淇濋櫓浜嬩笟绠$悊涓績             |+----------+------------------------------------------------------+4 rows in set (0.00 sec)
复制代码


3)检查业务表的字符集与校验集,发现字符集为 utf8mb4 、校验集为 utf8mb4_bin


4)检查数据库的字符集与校验集


greatsql> SHOW GLOBAL VARIABLES LIKE '%char%';+--------------------------------------+--------------------------------+| Variable_name                        | Value                          |+--------------------------------------+--------------------------------+| character_set_client                 | utf8mb4                        || character_set_connection             | utf8mb4                        || character_set_database               | utf8mb4                        || character_set_filesystem             | binary                         || character_set_results                | utf8mb4                        || character_set_server                 | utf8mb4                        || character_set_system                 | utf8mb3                        || character_sets_dir                   | /opt/mysql3301/share/charsets/ || validate_password_special_char_count | 1                              |+--------------------------------------+--------------------------------+9 rows in set (0.01 sec)
greatsql> SHOW GLOBAL VARIABLES LIKE '%coll%';+-------------------------------+--------------------+| Variable_name | Value |+-------------------------------+--------------------+| collation_connection | utf8mb4_bin || collation_database | utf8mb4_bin || collation_server | utf8mb4_bin || default_collation_for_utf8mb4 | utf8mb4_general_ci |+-------------------------------+--------------------+4 rows in set (0.00 sec)
复制代码


程序在 MySQL 5.7.21 跑了很长时间,一直没有问题,把数据库升级 MySQL 8.0.25 后,新导入的数据出现部分乱码, 由此怀疑,MySQL 8.0 定长数据导入LOAD DATA @row 出现 BUG。


BUG 场景:同一个会话 LOAD DATA多种字符集文件,使用 @临时变量切割字段。将导致导入数据乱码,向 MySQL 官方提 BUG,已证实为 BUG(编号 115824)

问题复现

MySQL: 8.0.25


greatsql> SELECT VERSION();+-----------+| version() |+-----------+| 8.0.25    |+-----------+1 row in set (0.00 sec)

table ddl:CREATE TABLE `assp_sis_payres_imp_bak` ( `AUTO_INC` bigint unsigned NOT NULL AUTO_INCREMENT COMMENT '自增列', `D_NAME` varchar(210) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL, PRIMARY KEY (`AUTO_INC`)) ENGINE=InnoDB AUTO_INCREMENT=7 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;

greatsql> SHOW GLOBAL VARIABLES LIKE '%char%';+--------------------------------------+--------------------------------+| Variable_name | Value |+--------------------------------------+--------------------------------+| character_set_client | utf8mb4 || character_set_connection | utf8mb4 || character_set_database | utf8mb4 || character_set_filesystem | binary || character_set_results | utf8mb4 || character_set_server | utf8mb4 || character_set_system | utf8mb3 || character_sets_dir | /opt/mysql3301/share/charsets/ || validate_password_special_char_count | 1 |+--------------------------------------+--------------------------------+9 rows in set (0.01 sec)
greatsql> SHOW GLOBAL VARIABLES LIKE '%coll%';+-------------------------------+--------------------+| Variable_name | Value |+-------------------------------+--------------------+| collation_connection | utf8mb4_bin || collation_database | utf8mb4_bin || collation_server | utf8mb4_bin || default_collation_for_utf8mb4 | utf8mb4_general_ci |+-------------------------------+--------------------+4 rows in set (0.00 sec)
greatsql> TRUNCATE TABLE assp_sis_payres_imp_bak;Query OK, 0 rows affected (0.03 sec)
greatsql> SELECT charset(@row), @row;+---------------+------------+| charset(@row) | @row |+---------------+------------+| binary | NULL |+---------------+------------+1 row in set (0.00 sec)
greatsql> LOAD DATA LOCAL INFILE '/root/dba_zc/load/TEST_UTF8_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET UTF8MB4 IGNORE 0 LINES (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING UTF8MB4)),'');Query OK, 2 rows affected (0.01 sec)Records: 2 Deleted: 0 Skipped: 0 Warnings: 0
greatsql> SELECT charset(@row), @row; +---------------+------------------------+| charset(@row) | @row |+---------------+------------------------+| utf8mb4 | XXX路桥区社会保XXX |+---------------+------------------------+
greatsql> LOAD DATA LOCAL INFILE '/root/dba_zc/load/TEST_GB18030_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET GB18030 IGNORE 0 LINES (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING GB18030)),'');Query OK, 2 rows affected (0.01 sec)Records: 2 Deleted: 0 Skipped: 0 Warnings: 0
greatsql> SELECT charset(@row), @row; +---------------+-----------------------------------------+| charset(@row) | @row |+---------------+-----------------------------------------+| gb18030 | XXX路桥区社会保XXX |+---------------+-----------------------------------------+
greatsql> SELECT * FROM ASSP_SIS_PAYRES_IMP_BAK;+----------+---------------------------------------------------------+| AUTO_INC | D_NAME |+----------+---------------------------------------------------------+| 1 | XXX路桥区社会保XXX || 2 | XXX路桥区社会保XXX || 4 | 鍙板窞甯傝矾妗ュ尯绀句細淇濋櫓浜嬩笟绠$悊涓績 || 5 | 鍙板窞甯傝矾妗ュ尯绀句細淇濋櫓浜嬩笟绠$悊涓績 |+----------+---------------------------------------------------------+4 rows in set (0.00 sec)
复制代码


MySQL 5.7.21


greatsql> SELECT VERSION();+------------+| version()  |+------------+| 5.7.21-log |+------------+1 row in set (0.01 sec)
table ddl:CREATE TABLE `assp_sis_payres_imp_bak` ( `AUTO_INC` bigint unsigned NOT NULL AUTO_INCREMENT COMMENT '自增列', `D_NAME` varchar(210) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL, PRIMARY KEY (`AUTO_INC`)) ENGINE=InnoDB AUTO_INCREMENT=7 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
greatsql> SHOW GLOBAL VARIABLES LIKE '%char%';+--------------------------------------+--------------------------------+| Variable_name | Value |+--------------------------------------+--------------------------------+| character_set_client | utf8mb4 || character_set_connection | utf8mb4 || character_set_database | utf8mb4 || character_set_filesystem | binary || character_set_results | utf8mb4 || character_set_server | utf8mb4 || character_set_system | utf8 || character_sets_dir | /opt/mysql3305/share/charsets/ || validate_password_special_char_count | 1 |+--------------------------------------+--------------------------------+9 rows in set (0.00 sec)
greatsql> SHOW GLOBAL VARIABLES LIKE '%coll%';+----------------------+--------------------+| Variable_name | Value |+----------------------+--------------------+| collation_connection | utf8mb4_general_ci || collation_database | utf8mb4_general_ci || collation_server | utf8mb4_general_ci |+----------------------+--------------------+3 rows in set (0.00 sec)
greatsql> SELECT charset(@row), @row;+---------------+------------+| charset(@row) | @row |+---------------+------------+| binary | NULL |+---------------+------------+1 row in set (0.00 sec)
greatsql> LOAD DATA LOCAL INFILE '/root/dba_zc/load/TEST_UTF8_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET UTF8MB4 IGNORE 0 LINES (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING UTF8MB4)),'');Query OK, 2 rows affected (0.01 sec)Records: 2 Deleted: 0 Skipped: 0 Warnings: 0
greatsql> SELECT charset(@row), @row; +---------------+-----------------------+| charset(@row) | @row |+---------------+-----------------------+| utf8mb4 | XXX路桥区社会保XXX |+---------------+-----------------------+
greatsql> LOAD DATA LOCAL INFILE '/root/dba_zc/load/TEST_GB18030_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET GB18030 IGNORE 0 LINES (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING GB18030)),'');Query OK, 2 rows affected (0.01 sec)Records: 2 Deleted: 0 Skipped: 0 Warnings: 0
greatsql> SELECT charset(@row), @row; +---------------+-----------------------+| charset(@row) | @row |+---------------+-----------------------+| gb18030 | XXX路桥区社会保XXX |+---------------+-----------------------+
greatsql> SELECT * FROM ASSP_SIS_PAYRES_IMP_BAK; +---------------+-----------------------------+| AUTO_INC | D_NAME |+---------------+-----------------------------+| 1 | XXX路桥区社会保XXX || 2 | XXX路桥区社会保XXX || 4 | XXX路桥区社会保XXX || 5 | XXX路桥区社会保XXX |+---------------+-----------------------------+4 rows in set (0.00 sec)
复制代码


BUG 规避方案


通过SELECT``charset(@row), @row; 可以看到@row在执行LOAD DATA后在 5.7.21 和 8.0.25 是一样的,但最终的影响不一样。虽然 MySQL 官方确认此问题为 BUG,但没有提供规避方案或者解决方案。通过万里工程师研究后,发现一种可行的规避方案。每次执行LOAD DATA命令前执行 [set @row=_binary'';] 进行规避。


greatsql> SELECT VERSION();+-----------+| version() |+-----------+| 8.0.25    |+-----------+1 row in set (0.00 sec)
greatsql> SET @row=_binary'';Query OK, 0 rows affected (0.00 sec)
greatsql> LOAD DATA LOCAL INFILE '/home/greatdb/TEST_UTF8_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET UTF8MB4 IGNORE 0 LINES (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING UTF8MB4)),'');Query OK, 2 rows affected (0.01 sec)Records: 2 Deleted: 0 Skipped: 0 Warnings: 0
greatsql> SET @row=_binary'';Query OK, 0 rows affected (0.00 sec)
greatsql> LOAD DATA LOCAL INFILE '/home/greatdb/TEST_GB18030_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET GB18030 IGNORE 0 LINES (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING GB18030)),'');Query OK, 2 rows affected (0.01 sec)Records: 2 Deleted: 0 Skipped: 0 Warnings: 0
greatsql> SELECT * FROM assp_sis_payres_imp_bak;+----------+--------------------------------------------------+| AUTO_INC | D_NAME |+----------+--------------------------------------------------+| 1 | XXX路桥区社会保XXX || 2 | XXX路桥区社会保XXX || 4 | XXX路桥区社会保XXX || 5 | XXX路桥区社会保XXX |+----------+--------------------------------------------------+4 rows in set (0.00 sec)
复制代码

问题总结

1.BUG 原因


MySQL8.0 重构定长数据导入LOAD DATA @row 出现 BUG.同一个数据库会话,多次执行LOAD DATA @row命令,则第 n 次执行LOAD DATA @row 的字符集使用的是 n-1 次的字符集,当文件的字符集存在不同,例如先后处理 GB18030、UTF8 字符集的文件就会数据乱码。此问题 MySQL 官方已证实为 BUG(编号 115824)


2.BUG 触发条件


触发条件:需同时满足以下三个条件才会触发此 bug。


1)LOAD DATA命令使用类似 @row 临时变量 进行数据处理,例如对定长记录按字节切割出多个字段:


LINES (@row) SET COLUMN_NAME = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row),1,20)) USING GB18030))。
复制代码


2)在同一个连接中,多次执行LOAD DATA命令,且先后处理的文件字符集存在不同(例如 GB18030 和 UTF8)。


3)使用 MySQL 8.0。


3.BUG 规避办法


由万里工程师提出,与 MySQL 官方社区沟通证实,涉及到满足上述 BUG 触发条件的场景,通过在每次执行LOAD DATA命令前执行 [set @row=_binary'';] 进行规避。


参考:https://bugs.mysql.com/bug.php?id=115824


发布于: 刚刚阅读数: 4
用户头像

GreatSQL

关注

GreatSQL社区 2023-01-31 加入

GreatSQL是由万里数据库维护的MySQL分支,专注于提升MGR可靠性及性能,支持InnoDB并行查询特性,是适用于金融级应用的MySQL分支版本。 社区:https://greatsql.cn/ Gitee: https://gitee.com/GreatSQL/GreatSQL

评论

发布
暂无评论
5.7 与 8.0 对相同文件的 LOAD DATA 语句结果不同_数据库_GreatSQL_InfoQ写作社区