写点什么

utf8mb4 默认排序规则引起的索引失效

  • 2024-07-19
    北京
  • 本文字数:8083 字

    阅读完需:约 27 分钟

作者: TNTT 原文来源:https://tidb.net/blog/5b8e4207

现象

SQL:


SELECT  GROUP_CONCAT(DISTINCT label_name)FROM  db.`hfr` AS dWHERE  d.source_id IN (    SELECT      CAST(b.id AS CHAR) AS source_id1    FROM      db.pcr a      INNER JOIN db.pqo b ON a.call_id = b.quality_data_id    WHERE      a.created_at > '2024-04-25'      AND b.quality_data_source IN ('gs', 'dy')      AND b.uid = '77778901-dddd-aaaa-xxxx-112233445566'    UNION    ALL    SELECT      CAST(b.id AS CHAR) AS source_id1    FROM      db.wpcr a      INNER JOIN db.pqo b ON a.external_call_id = b.quality_data_id    WHERE      a.created_at > '2024-04-25'      AND b.quality_data_source IN ('wP')      AND b.uid = '77778901-dddd-aaaa-xxxx-112233445566'  );
复制代码


库结构:


 CREATE DATABASE `xx` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci */
复制代码

hfr 表结构

CREATE TABLE `hfr` (  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'xx',  `xx` bigint(20) NOT NULL COMMENT 'xx,  `source_id` varchar(50) COLLATE utf8mb4_general_ci NOT NULL COMMENT 'xx',  `xx` bigint(20) NOT NULL COMMENT 'xx',  ....  PRIMARY KEY (`id`) /*T![clustered_index] CLUSTERED */,  KEY `idx_created_at` (`created_at`),  ....) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci AUTO_INCREMENT=28906499 COMMENT='xx'
复制代码

pqo 表结构

CREATE TABLE `pqo` (  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'xx',  `action_serial` varchar(50) COLLATE utf8mb4_general_ci NOT NULL COMMENT 'xx',  ....  PRIMARY KEY (`id`) /*T![clustered_index] CLUSTERED */,  KEY `idx_created_at` (`created_at`),  KEY `idx_uid` (`uid`),  ....) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci AUTO_INCREMENT=29489558 COMMENT='xx'
复制代码


应用程序执行计划:


| id                                 | estRows     | estCost        | actRows  | task      | access object                                         || Limit_31                           | 1.00        | 2121649317.57  | 1        | root      |                                                       || └─HashAgg_32                       | 1.00        | 2121649317.57  | 1        | root      |                                                       ||   └─HashJoin_34                    | 23108683.20 | 1660398473.28  | 19       | root      |                                                       ||     ├─Union_37(Build)              | 12.22       | 14440.97       | 31       | root      |                                                       ||     │ ├─Projection_38              | 11.44       | 39233.90       | 31       | root      |                                                       ||     │ │ └─IndexJoin_46             | 11.44       | 39118.62       | 31       | root      |                                                       ||     │ │   ├─IndexLookUp_91(Build)  | 11.44       | 31026.15       | 35       | root      |                                                       ||     │ │   │ ├─Selection_89(Build)  | 15.69       | 5592.81        | 50       | cop[tikv] |                                                       ||     │ │   │ │ └─IndexRangeScan_87  | 19.62       | 4614.01        | 50       | cop[tikv] | table:b, index:idx_uid(uid)                           ||     │ │   │ └─Selection_90(Probe)  | 11.44       | 7577.80        | 35       | cop[tikv] |                                                       ||     │ │   │   └─TableRowIDScan_88  | 15.69       | 6794.76        | 50       | cop[tikv] | table:b                                               ||     │ │   └─IndexLookUp_45(Probe)  | 11.44       | 1969.09        | 31       | root      |                                                       ||     │ │     ├─Selection_43(Build)  | 11.44       | 281.86         | 35       | cop[tikv] |                                                       ||     │ │     │ └─IndexRangeScan_41  | 11.44       | 231.96         | 35       | cop[tikv] | table:a, index:uk_call_id(call_id)                    ||     │ │     └─Selection_44(Probe)  | 11.44       | 473.82         | 31       | cop[tikv] |                                                       ||     │ │       └─TableRowIDScan_42  | 11.44       | 423.92         | 35       | cop[tikv] | table:a                                               ||     │ └─Projection_103             | 0.79        | 32970.97       | 0        | root      |                                                       ||     │   └─IndexJoin_111            | 0.79        | 32963.05       | 0        | root      |                                                       ||     │     ├─IndexLookUp_156(Build) | 0.76        | 31026.15       | 2        | root      |                                                       ||     │     │ ├─Selection_154(Build) | 15.69       | 5592.81        | 50       | cop[tikv] |                                                       ||     │     │ │ └─IndexRangeScan_152 | 19.62       | 4614.01        | 50       | cop[tikv] | table:b, index:idx_uid(uid)                           ||     │     │ └─Selection_155(Probe) | 0.76        | 7577.80        | 2        | cop[tikv] |                                                       ||     │     │   └─TableRowIDScan_153 | 15.69       | 6794.76        | 50       | cop[tikv] | table:b                                               ||     │     └─IndexLookUp_110(Probe) | 0.76        | 1968.65        | 0        | root      |                                                       ||     │       ├─Selection_108(Build) | 0.76        | 280.92         | 2        | cop[tikv] |                                                       ||     │       │ └─IndexRangeScan_106 | 0.76        | 231.02         | 2        | cop[tikv] | table:a, index:idx_external_call_id(external_call_id) ||     │       └─Selection_109(Probe) | 0.76        | 464.73         | 0        | cop[tikv] |                                                       ||     │         └─TableRowIDScan_107 | 0.76        | 414.83         | 2        | cop[tikv] | table:a                                               ||     └─TableReader_36(Probe)        | 28885854.00 | 1083819620.83  | 28864271 | root      |                                                       ||       └─TableFullScan_35           | 28885854.00 | 11843062093.22 | 28864271 | cop[tikv] | table:d                                               |
复制代码


客户端执行计划:


+-----------------------------------------------+---------+---------+-----------+-------------------------------------------------------+| id                                            | estRows | actRows | task      | access object                                         |+-----------------------------------------------+---------+---------+-----------+-------------------------------------------------------+| StreamAgg_30                                  | 1.00    | 1       | root      |                                                       || └─IndexHashJoin_186                           | 36.86   | 19      | root      |                                                       ||   ├─HashAgg_46(Build)                         | 12.44   | 31      | root      |                                                       ||   │ └─Union_47                                | 12.22   | 31      | root      |                                                       ||   │   ├─Projection_48                         | 11.44   | 31      | root      |                                                       ||   │   │ └─IndexJoin_56                        | 11.44   | 31      | root      |                                                       ||   │   │   ├─IndexLookUp_101(Build)            | 11.44   | 35      | root      |                                                       ||   │   │   │ ├─Selection_99(Build)             | 15.69   | 50      | cop[tikv] |                                                       ||   │   │   │ │ └─IndexRangeScan_97             | 19.62   | 50      | cop[tikv] | table:b, index:idx_uid(uid)                           ||   │   │   │ └─Selection_100(Probe)            | 11.44   | 35      | cop[tikv] |                                                       ||   │   │   │   └─TableRowIDScan_98             | 15.69   | 50      | cop[tikv] | table:b                                               ||   │   │   └─IndexLookUp_55(Probe)             | 11.44   | 31      | root      |                                                       ||   │   │     ├─Selection_53(Build)             | 11.44   | 35      | cop[tikv] |                                                       ||   │   │     │ └─IndexRangeScan_51             | 11.44   | 35      | cop[tikv] | table:a, index:uk_call_id(call_id)                    ||   │   │     └─Selection_54(Probe)             | 11.44   | 31      | cop[tikv] |                                                       ||   │   │       └─TableRowIDScan_52             | 11.44   | 35      | cop[tikv] | table:a                                               ||   │   └─Projection_113                        | 0.79    | 0       | root      |                                                       ||   │     └─IndexJoin_121                       | 0.79    | 0       | root      |                                                       ||   │       ├─IndexLookUp_166(Build)            | 0.76    | 2       | root      |                                                       ||   │       │ ├─Selection_164(Build)            | 15.69   | 50      | cop[tikv] |                                                       ||   │       │ │ └─IndexRangeScan_162            | 19.62   | 50      | cop[tikv] | table:b, index:idx_uid(uid)                           ||   │       │ └─Selection_165(Probe)            | 0.76    | 2       | cop[tikv] |                                                       ||   │       │   └─TableRowIDScan_163            | 15.69   | 50      | cop[tikv] | table:b                                               ||   │       └─IndexLookUp_120(Probe)            | 0.76    | 0       | root      |                                                       ||   │         ├─Selection_118(Build)            | 0.76    | 2       | cop[tikv] |                                                       ||   │         │ └─IndexRangeScan_116            | 0.76    | 2       | cop[tikv] | table:a, index:idx_external_call_id(external_call_id) ||   │         └─Selection_119(Probe)            | 0.76    | 0       | cop[tikv] |                                                       ||   │           └─TableRowIDScan_117            | 0.76    | 2       | cop[tikv] | table:a                                               ||   └─IndexLookUp_183(Probe)                    | 36.86   | 19      | root      |                                                       ||     ├─IndexRangeScan_181(Build)               | 36.86   | 19      | cop[tikv] | table:d, index:idx_source_id(source_id)               ||     └─TableRowIDScan_182(Probe)               | 36.86   | 19      | cop[tikv] | table:d                                               |
复制代码


执行计划显示,客户端走的是 index hash join,d 表可以走上 source_id 字段的索引。客户实际应用里走的是 hash join,d 表走的是全表扫描。通过 plan replayer 收集到客户端和应用的 plan。


发现两者的字符集和排序规则不一致


客户端


character_set_client = "utf8"character_set_connection = "utf8"character_set_database = "utf8mb4"character_set_filesystem = "binary"character_set_results = "utf8"character_set_server = "utf8mb4"character_set_system = "utf8"character_sets_dir = "/usr/local/mysql-5.6.25-osx10.8-x86_64/share/charsets/"check_proxy_users = "OFF"collation_connection = "utf8_general_ci"collation_database = "utf8mb4_general_ci"collation_server = "utf8mb4_general_ci"
复制代码


应用


character_set_client = "utf8mb4"character_set_connection = "utf8mb4"character_set_database = "utf8mb4"character_set_filesystem = "binary"character_set_results = ""character_set_server = "utf8mb4"character_set_system = "utf8"character_sets_dir = "/usr/local/mysql-5.6.25-osx10.8-x86_64/share/charsets/"check_proxy_users = "OFF"collation_connection = "utf8mb4_bin"collation_database = "utf8mb4_general_ci"collation_server = "utf8mb4_general_ci"
复制代码


实际环境里的配置:



可以看见全局的 collation_connection=utf8mb4_general_ci,和应用、客户端的都不一致,基本可以判定是 client 侧的行为导致的不一致。

结论

MySQL Connector/J session 可以看到如果 characterEncoding 和 connectionCollation 没有指定,8.0.26 之前的 jdbc 版本使用服务端的默认字符集和这个字符集的默认排序规则,客户使用 TiDB 7.1.3,也就是 utf8mb4 和 utf8mb4_bin ( 但是在 MySQL 5.7 之前默认是 utf8mb4_general_ci,8.0 之后默认是 utf8mb4_0900_ai_ci)


characterEncodingInstructs the server to set session system variables 'character_set_client' and 'character_set_connection' to the default character set supported by MySQL for the specified Java character encoding and set 'collation_connection' to the default collation for this character set. If neither this property nor the property 'connectionCollation' is set:For Connector/J 8.0.25 and earlier, the driver will try to use the server's default character set;For Connector/J 8.0.26 and later, the driver will use "utf8mb4".connectionCollationInstructs the server to set session system variable 'collation_connection' to the specified collation name and set 'character_set_client' and 'character_set_connection' to a corresponding character set. This property overrides the value of 'characterEncoding' with the default character set this collation belongs to, if and only if 'characterEncoding' is not configured or is configured with a character set that is incompatible with the collation. That means 'connectionCollation' may not always correct a mismatch of character sets. For example, if 'connectionCollation' is set to "latin1_swedish_ci", the corresponding character set is "latin1" for MySQL, which maps it to the Java character set "windows-1252"; so if 'characterEncoding' is not set,"windows-1252" is the character set that will be used; but if 'characterEncoding' has been set to, e.g. "ISO-8859-1", that is compatible with "latin1_swedish_ci", so the character encoding setting is left unchanged; and if client is actually using "windows-1252" (which is similar but different from "ISO-8859-1"), errors would occur for some characters. If neither this property nor the property 'characterEncoding' is set:For Connector/J 8.0.25 and earlier, the driver will try to use the server's default character set;For Connector/J 8.0.26 and later, the driver will use utf8mb4's default collation.
复制代码


至于为什么不同的排序规则导致的执行计划异常,原因是 hfr 的 source_id varchar(50) 在和 pqo 的 cast(id bigint as char) 关联时候,两者的排序不一致:source_id 是 utf8mb4_general_ci(字段定义),cast(id as char) 是字符串 utf8mb4 排序是默认的 utf8mb4_bin,


因为排序规则不一样导致无法走索引,所以用了 hash join。


在 7.4 版本之前有两种办法解决:


  1. 设置 session collation_connection=utf8mb4_general_ci,通过 jdbc 里配置 sessionVariables=collation_connection=utf8mb4_general_ci

  2. 修改 SQL 将 cast (id as char) 指定 collation: CAST(b.id AS CHAR) COLLATE utf8mb4_general_ci AS source_id1


在 7.4 版本之后提供了参数 default_collation_for_utf8mb4 来指定 utf8mb4 的默认排序规则


  • 作用域:GLOBAL | SESSION

  • 是否持久化到集群:是

  • 是否受 Hint SET_VAR 控制:否

  • 类型:字符串

  • 默认值:utf8mb4_bin

  • 可选值:utf8mb4_binutf8mb4_general_ciutf8mb4_0900_ai_ci

  • 该变量用于设置 utf8mb4 字符集的默认排序规则。它会影响以下语句的行为:

  • SHOW COLLATIONSHOW CHARACTER SET 语句显示的默认排序规则。

  • CREATE TABLEALTER TABLE 语句中对表或列使用 CHARACTER SET 语法明确指定 utf8mb4 字符集而未指定排序规则时,将使用该变量指定的排序规则。不影响未使用 CHARACTER SET 语法时的行为。

  • CREATE DATABASEALTER DATABASE 语句中使用 CHARACTER SET 语法明确指定 utf8mb4 字符集而未指定排序规则时,将使用该变量指定的排序规则。不影响未使用 CHARACTER SET 语法时的行为。

  • 任何使用 _utf8mb4'string' 形式的字面量在未使用 COLLATE 语法指定排序规则时,将使用该变量指定的排序规则。


发布于: 刚刚阅读数: 3
用户头像

TiDB 社区官网:https://tidb.net/ 2021-12-15 加入

TiDB 社区干货传送门是由 TiDB 社区中布道师组委会自发组织的 TiDB 社区优质内容对外宣布的栏目,旨在加深 TiDBer 之间的交流和学习。一起构建有爱、互助、共创共建的 TiDB 社区 https://tidb.net/

评论

发布
暂无评论
utf8mb4 默认排序规则引起的索引失效_应用适配_TiDB 社区干货传送门_InfoQ写作社区