utf8mb4 默认排序规则引起的索引失效
- 2024-07-19 北京
本文字数:8083 字
阅读完需:约 27 分钟
作者: TNTT 原文来源:https://tidb.net/blog/5b8e4207
现象
SQL:
SELECT
GROUP_CONCAT(DISTINCT label_name)
FROM
db.`hfr` AS d
WHERE
d.source_id IN (
SELECT
CAST(b.id AS CHAR) AS source_id1
FROM
db.pcr a
INNER JOIN db.pqo b ON a.call_id = b.quality_data_id
WHERE
a.created_at > '2024-04-25'
AND b.quality_data_source IN ('gs', 'dy')
AND b.uid = '77778901-dddd-aaaa-xxxx-112233445566'
UNION
ALL
SELECT
CAST(b.id AS CHAR) AS source_id1
FROM
db.wpcr a
INNER JOIN db.pqo b ON a.external_call_id = b.quality_data_id
WHERE
a.created_at > '2024-04-25'
AND b.quality_data_source IN ('wP')
AND b.uid = '77778901-dddd-aaaa-xxxx-112233445566'
);
库结构:
CREATE DATABASE `xx` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci */
hfr 表结构
CREATE TABLE `hfr` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'xx',
`xx` bigint(20) NOT NULL COMMENT 'xx,
`source_id` varchar(50) COLLATE utf8mb4_general_ci NOT NULL COMMENT 'xx',
`xx` bigint(20) NOT NULL COMMENT 'xx',
....
PRIMARY KEY (`id`) /*T![clustered_index] CLUSTERED */,
KEY `idx_created_at` (`created_at`),
....
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci AUTO_INCREMENT=28906499 COMMENT='xx'
pqo 表结构
CREATE TABLE `pqo` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'xx',
`action_serial` varchar(50) COLLATE utf8mb4_general_ci NOT NULL COMMENT 'xx',
....
PRIMARY KEY (`id`) /*T![clustered_index] CLUSTERED */,
KEY `idx_created_at` (`created_at`),
KEY `idx_uid` (`uid`),
....
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci AUTO_INCREMENT=29489558 COMMENT='xx'
应用程序执行计划:
| id | estRows | estCost | actRows | task | access object |
| Limit_31 | 1.00 | 2121649317.57 | 1 | root | |
| └─HashAgg_32 | 1.00 | 2121649317.57 | 1 | root | |
| └─HashJoin_34 | 23108683.20 | 1660398473.28 | 19 | root | |
| ├─Union_37(Build) | 12.22 | 14440.97 | 31 | root | |
| │ ├─Projection_38 | 11.44 | 39233.90 | 31 | root | |
| │ │ └─IndexJoin_46 | 11.44 | 39118.62 | 31 | root | |
| │ │ ├─IndexLookUp_91(Build) | 11.44 | 31026.15 | 35 | root | |
| │ │ │ ├─Selection_89(Build) | 15.69 | 5592.81 | 50 | cop[tikv] | |
| │ │ │ │ └─IndexRangeScan_87 | 19.62 | 4614.01 | 50 | cop[tikv] | table:b, index:idx_uid(uid) |
| │ │ │ └─Selection_90(Probe) | 11.44 | 7577.80 | 35 | cop[tikv] | |
| │ │ │ └─TableRowIDScan_88 | 15.69 | 6794.76 | 50 | cop[tikv] | table:b |
| │ │ └─IndexLookUp_45(Probe) | 11.44 | 1969.09 | 31 | root | |
| │ │ ├─Selection_43(Build) | 11.44 | 281.86 | 35 | cop[tikv] | |
| │ │ │ └─IndexRangeScan_41 | 11.44 | 231.96 | 35 | cop[tikv] | table:a, index:uk_call_id(call_id) |
| │ │ └─Selection_44(Probe) | 11.44 | 473.82 | 31 | cop[tikv] | |
| │ │ └─TableRowIDScan_42 | 11.44 | 423.92 | 35 | cop[tikv] | table:a |
| │ └─Projection_103 | 0.79 | 32970.97 | 0 | root | |
| │ └─IndexJoin_111 | 0.79 | 32963.05 | 0 | root | |
| │ ├─IndexLookUp_156(Build) | 0.76 | 31026.15 | 2 | root | |
| │ │ ├─Selection_154(Build) | 15.69 | 5592.81 | 50 | cop[tikv] | |
| │ │ │ └─IndexRangeScan_152 | 19.62 | 4614.01 | 50 | cop[tikv] | table:b, index:idx_uid(uid) |
| │ │ └─Selection_155(Probe) | 0.76 | 7577.80 | 2 | cop[tikv] | |
| │ │ └─TableRowIDScan_153 | 15.69 | 6794.76 | 50 | cop[tikv] | table:b |
| │ └─IndexLookUp_110(Probe) | 0.76 | 1968.65 | 0 | root | |
| │ ├─Selection_108(Build) | 0.76 | 280.92 | 2 | cop[tikv] | |
| │ │ └─IndexRangeScan_106 | 0.76 | 231.02 | 2 | cop[tikv] | table:a, index:idx_external_call_id(external_call_id) |
| │ └─Selection_109(Probe) | 0.76 | 464.73 | 0 | cop[tikv] | |
| │ └─TableRowIDScan_107 | 0.76 | 414.83 | 2 | cop[tikv] | table:a |
| └─TableReader_36(Probe) | 28885854.00 | 1083819620.83 | 28864271 | root | |
| └─TableFullScan_35 | 28885854.00 | 11843062093.22 | 28864271 | cop[tikv] | table:d |
客户端执行计划:
+-----------------------------------------------+---------+---------+-----------+-------------------------------------------------------+
| id | estRows | actRows | task | access object |
+-----------------------------------------------+---------+---------+-----------+-------------------------------------------------------+
| StreamAgg_30 | 1.00 | 1 | root | |
| └─IndexHashJoin_186 | 36.86 | 19 | root | |
| ├─HashAgg_46(Build) | 12.44 | 31 | root | |
| │ └─Union_47 | 12.22 | 31 | root | |
| │ ├─Projection_48 | 11.44 | 31 | root | |
| │ │ └─IndexJoin_56 | 11.44 | 31 | root | |
| │ │ ├─IndexLookUp_101(Build) | 11.44 | 35 | root | |
| │ │ │ ├─Selection_99(Build) | 15.69 | 50 | cop[tikv] | |
| │ │ │ │ └─IndexRangeScan_97 | 19.62 | 50 | cop[tikv] | table:b, index:idx_uid(uid) |
| │ │ │ └─Selection_100(Probe) | 11.44 | 35 | cop[tikv] | |
| │ │ │ └─TableRowIDScan_98 | 15.69 | 50 | cop[tikv] | table:b |
| │ │ └─IndexLookUp_55(Probe) | 11.44 | 31 | root | |
| │ │ ├─Selection_53(Build) | 11.44 | 35 | cop[tikv] | |
| │ │ │ └─IndexRangeScan_51 | 11.44 | 35 | cop[tikv] | table:a, index:uk_call_id(call_id) |
| │ │ └─Selection_54(Probe) | 11.44 | 31 | cop[tikv] | |
| │ │ └─TableRowIDScan_52 | 11.44 | 35 | cop[tikv] | table:a |
| │ └─Projection_113 | 0.79 | 0 | root | |
| │ └─IndexJoin_121 | 0.79 | 0 | root | |
| │ ├─IndexLookUp_166(Build) | 0.76 | 2 | root | |
| │ │ ├─Selection_164(Build) | 15.69 | 50 | cop[tikv] | |
| │ │ │ └─IndexRangeScan_162 | 19.62 | 50 | cop[tikv] | table:b, index:idx_uid(uid) |
| │ │ └─Selection_165(Probe) | 0.76 | 2 | cop[tikv] | |
| │ │ └─TableRowIDScan_163 | 15.69 | 50 | cop[tikv] | table:b |
| │ └─IndexLookUp_120(Probe) | 0.76 | 0 | root | |
| │ ├─Selection_118(Build) | 0.76 | 2 | cop[tikv] | |
| │ │ └─IndexRangeScan_116 | 0.76 | 2 | cop[tikv] | table:a, index:idx_external_call_id(external_call_id) |
| │ └─Selection_119(Probe) | 0.76 | 0 | cop[tikv] | |
| │ └─TableRowIDScan_117 | 0.76 | 2 | cop[tikv] | table:a |
| └─IndexLookUp_183(Probe) | 36.86 | 19 | root | |
| ├─IndexRangeScan_181(Build) | 36.86 | 19 | cop[tikv] | table:d, index:idx_source_id(source_id) |
| └─TableRowIDScan_182(Probe) | 36.86 | 19 | cop[tikv] | table:d |
执行计划显示,客户端走的是 index hash join,d 表可以走上 source_id 字段的索引。客户实际应用里走的是 hash join,d 表走的是全表扫描。通过 plan replayer 收集到客户端和应用的 plan。
发现两者的字符集和排序规则不一致
客户端
character_set_client = "utf8"
character_set_connection = "utf8"
character_set_database = "utf8mb4"
character_set_filesystem = "binary"
character_set_results = "utf8"
character_set_server = "utf8mb4"
character_set_system = "utf8"
character_sets_dir = "/usr/local/mysql-5.6.25-osx10.8-x86_64/share/charsets/"
check_proxy_users = "OFF"
collation_connection = "utf8_general_ci"
collation_database = "utf8mb4_general_ci"
collation_server = "utf8mb4_general_ci"
应用
character_set_client = "utf8mb4"
character_set_connection = "utf8mb4"
character_set_database = "utf8mb4"
character_set_filesystem = "binary"
character_set_results = ""
character_set_server = "utf8mb4"
character_set_system = "utf8"
character_sets_dir = "/usr/local/mysql-5.6.25-osx10.8-x86_64/share/charsets/"
check_proxy_users = "OFF"
collation_connection = "utf8mb4_bin"
collation_database = "utf8mb4_general_ci"
collation_server = "utf8mb4_general_ci"
实际环境里的配置:
可以看见全局的 collation_connection=utf8mb4_general_ci,和应用、客户端的都不一致,基本可以判定是 client 侧的行为导致的不一致。
结论
从 MySQL Connector/J session 可以看到如果 characterEncoding 和 connectionCollation 没有指定,8.0.26 之前的 jdbc 版本使用服务端的默认字符集和这个字符集的默认排序规则,客户使用 TiDB 7.1.3,也就是 utf8mb4 和 utf8mb4_bin ( 但是在 MySQL 5.7 之前默认是 utf8mb4_general_ci,8.0 之后默认是 utf8mb4_0900_ai_ci)
characterEncoding
Instructs the server to set session system variables 'character_set_client' and 'character_set_connection' to the default character set supported by MySQL for the specified Java character encoding and set 'collation_connection' to the default collation for this character set. If neither this property nor the property 'connectionCollation' is set:
For Connector/J 8.0.25 and earlier, the driver will try to use the server's default character set;
For Connector/J 8.0.26 and later, the driver will use "utf8mb4".
connectionCollation
Instructs the server to set session system variable 'collation_connection' to the specified collation name and set 'character_set_client' and 'character_set_connection' to a corresponding character set. This property overrides the value of 'characterEncoding' with the default character set this collation belongs to, if and only if 'characterEncoding' is not configured or is configured with a character set that is incompatible with the collation. That means 'connectionCollation' may not always correct a mismatch of character sets. For example, if 'connectionCollation' is set to "latin1_swedish_ci", the corresponding character set is "latin1" for MySQL, which maps it to the Java character set "windows-1252"; so if 'characterEncoding' is not set,"windows-1252" is the character set that will be used; but if 'characterEncoding' has been set to, e.g. "ISO-8859-1", that is compatible with "latin1_swedish_ci", so the character encoding setting is left unchanged; and if client is actually using "windows-1252" (which is similar but different from "ISO-8859-1"), errors would occur for some characters. If neither this property nor the property 'characterEncoding' is set:
For Connector/J 8.0.25 and earlier, the driver will try to use the server's default character set;
For Connector/J 8.0.26 and later, the driver will use utf8mb4's default collation.
至于为什么不同的排序规则导致的执行计划异常,原因是 hfr 的 source_id varchar(50) 在和 pqo 的 cast(id bigint as char) 关联时候,两者的排序不一致:source_id 是 utf8mb4_general_ci(字段定义),cast(id as char) 是字符串 utf8mb4 排序是默认的 utf8mb4_bin,
因为排序规则不一样导致无法走索引,所以用了 hash join。
在 7.4 版本之前有两种办法解决:
设置 session collation_connection=utf8mb4_general_ci,通过 jdbc 里配置 sessionVariables=collation_connection=utf8mb4_general_ci
修改 SQL 将 cast (id as char) 指定 collation: CAST(b.id AS CHAR) COLLATE utf8mb4_general_ci AS source_id1
在 7.4 版本之后提供了参数 default_collation_for_utf8mb4 来指定 utf8mb4 的默认排序规则
作用域:GLOBAL | SESSION
是否持久化到集群:是
是否受 Hint SET_VAR 控制:否
类型:字符串
默认值:
utf8mb4_bin
可选值:
utf8mb4_bin
、utf8mb4_general_ci
、utf8mb4_0900_ai_ci
该变量用于设置 utf8mb4 字符集的默认排序规则。它会影响以下语句的行为:
SHOW COLLATION
和SHOW CHARACTER SET
语句显示的默认排序规则。CREATE TABLE
和ALTER TABLE
语句中对表或列使用CHARACTER SET
语法明确指定 utf8mb4 字符集而未指定排序规则时,将使用该变量指定的排序规则。不影响未使用CHARACTER SET
语法时的行为。CREATE DATABASE
和ALTER DATABASE
语句中使用CHARACTER SET
语法明确指定 utf8mb4 字符集而未指定排序规则时,将使用该变量指定的排序规则。不影响未使用CHARACTER SET
语法时的行为。任何使用
_utf8mb4'string'
形式的字面量在未使用COLLATE
语法指定排序规则时,将使用该变量指定的排序规则。
版权声明: 本文为 InfoQ 作者【TiDB 社区干货传送门】的原创文章。
原文链接:【http://xie.infoq.cn/article/4a12a2c9a5ad2f46e7b15df6b】。文章转载请联系作者。
TiDB 社区干货传送门
TiDB 社区官网:https://tidb.net/ 2021-12-15 加入
TiDB 社区干货传送门是由 TiDB 社区中布道师组委会自发组织的 TiDB 社区优质内容对外宣布的栏目,旨在加深 TiDBer 之间的交流和学习。一起构建有爱、互助、共创共建的 TiDB 社区 https://tidb.net/
评论