写点什么

【GreatSQL 优化器 -14】直方图应用

作者:GreatSQL
  • 2025-02-19
    福建
  • 本文字数:7098 字

    阅读完需:约 23 分钟

【GreatSQL 优化器-14】直方图应用

一、直方图介绍

GreatSQL 的优化器负责将 SQL 查询转换为尽可能高效的执行计划,但因为数据环境不断变化有可能导致优化器对查询数据了解不够充足,可能无法生成最优的执行计划进而影响查询效率,因此推出了直方图(histogram)功能来解决该问题。


直方图用于统计字段值的分布情况,向优化器提供统计信息。利用直方图,可以对一张表的一列数据做分布统计,估算 where 条件中过滤字段的选择率,从而帮助优化器更准确地估计查询过程中的行数,选择更高效的查询计划。


下面用一个简单的例子来说明直方图怎么应用在优化器。


greatsql> CREATE TABLE t1 (c1 INT PRIMARY KEY, c2 INT,date1 DATETIME);greatsql> INSERT INTO t1 VALUES (1,10,'2021-03-25 16:44:00.123456'),(2,1,'2022-03-26 16:44:00.123456'),(3,4,'2023-03-27 16:44:00.123456'),(5,5,'2024-03-25 16:44:00.123456');greatsql> CREATE TABLE t2 (cc1 INT PRIMARY KEY, cc2 INT);greatsql> INSERT INTO t2 VALUES (1,3),(2,1),(3,2),(4,3),(5,15);greatsql> CREATE TABLE t3 (ccc1 INT, ccc2 varchar(100));greatsql> INSERT INTO t3 VALUES (1,'aa1'),(2,'bb1'),(3,'cc1'),(4,'dd1'),(null,'ee');greatsql> CREATE INDEX idx1 ON t1(c2);greatsql> CREATE INDEX idx2 ON t1(c2,date1);greatsql> CREATE INDEX idx2_1 ON t2(cc2);greatsql> CREATE INDEX idx3_1 ON t3(ccc1);
系统自动创建buckets:greatsql> ANALYZE TABLE t1 UPDATE HISTOGRAM ON c2 WITH 3 BUCKETS;greatsql> SELECT json_pretty(histogram)result FROM information_schema.column_statistics WHERE table_name = 't1';| { "buckets": [ [ 1, 5, 0.42857142857142855, 3 ], [ 10, 10, 0.7142857142857143, 1 ], [ 16, 16, 0.8571428571428571, 1 ] ], "data-type": "int", "null-values": 0.14285714285714285, "collation-id": 8, "last-updated": "2024-10-22 08:38:48.858099", "sampling-rate": 1.0, "histogram-type": "equi-height", "number-of-buckets-specified": 3}
greatsql> EXPLAIN SELECT * FROM t1 join t3 ON t1.c1=t3.ccc1 or t1.c2<5;+----+-------------+-------+------------+------+-------------------+------+---------+------+------+----------+------------------------------------------------+| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |+----+-------------+-------+------------+------+-------------------+------+---------+------+------+----------+------------------------------------------------+| 1 | SIMPLE | t3 | NULL | ALL | idx3_1 | NULL | NULL | NULL | 5 | 100.00 | NULL || 1 | SIMPLE | t1 | NULL | ALL | PRIMARY,idx1,idx2 | NULL | NULL | NULL | 7 | 43.67 | Range checked for each record (index map: 0x7) |+----+-------------+-------+------------+------+-------------------+------+---------+------+------+----------+------------------------------------------------+
"plan_prefix": [ ], "table": "`t1`", "best_access_path": { "considered_access_paths": [ { "rows_to_scan": 7, "filtering_effect": [ { "condition": "(`t1`.`c2` < 5)", 对t1.c2的过滤系数估计用到了直方图 "histogram_selectivity": 0.342857 这里过滤系数算出来为0.342857,即直方图第一个桶小于5的数据占的百分比 } ], "final_filtering_effect": 1, "access_type": "scan", "resulting_rows": 7, "cost": 0.95, "chosen": true } ] },
复制代码

二、get_histogram_selectivity 代码解释

直方图用在优化器计算过滤系数的时候,算出来的频率数据更精确。


// Item在计算get_filtering_effect的时候会用到直方图估计过滤系数static bool get_histogram_selectivity(THD *thd, const Field *field, Item **args,                                      size_t arg_count,                                      histograms::enum_operator op,                                      Item_func *item_func,                                      const TABLE_SHARE *table_share,                                      double *selectivity) {  const histograms::Histogram *histogram =      table_share->find_histogram(field->field_index());  if (histogram != nullptr) {    // 计算过滤系数    if (!histogram->get_selectivity(args, arg_count, op, selectivity)) {      return false;    }  }  return true;}
bool Histogram::get_selectivity(Item **items, size_t item_count, enum_operator op, double *selectivity) const { // 找该表的列对应的直方图,取出对应的范围的数据频率。计算方式见表一和表二 if (get_raw_selectivity(items, item_count, op, selectivity)) return true;}
复制代码


表一:等高直方图数据分布频率计算方式



表二:等宽直方图数据分布频率计算方式



表三:等高直方图 get_distance_from_upper 计算公式



表四:等高直方图 get_distance_from_lower 计算公式


三、实际例子说明

接下来看几个例子来说明上面的代码。


首先创建等高直方图,看看结果。


greatsql> ANALYZE TABLE t1 UPDATE HISTOGRAM ON c2 WITH 3 BUCKETS;greatsql> EXPLAIN SELECT * FROM t1 join t3 ON t1.c1=t3.ccc1 or t1.c2<5;+----+-------------+-------+------------+------+-------------------+------+---------+------+------+----------+------------------------------------------------+| id | select_type | table | partitions | type | possible_keys     | key  | key_len | ref  | rows | filtered | Extra                                          |+----+-------------+-------+------------+------+-------------------+------+---------+------+------+----------+------------------------------------------------+|  1 | SIMPLE      | t3    | NULL       | ALL  | idx3_1            | NULL | NULL    | NULL |    5 |   100.00 | NULL                                           ||  1 | SIMPLE      | t1    | NULL       | ALL  | PRIMARY,idx1,idx2 | NULL | NULL    | NULL |    7 |    43.67 | Range checked for each record (index map: 0x7) |+----+-------------+-------+------------+------+-------------------+------+---------+------+------+----------+------------------------------------------------+                "plan_prefix": [                ],                "table": "`t1`",                "best_access_path": {                  "considered_access_paths": [                    {                      "rows_to_scan": 7,                      "filtering_effect": [                        {                          "condition": "(`t1`.`c2` < 5)", 对t1.c2的过滤系数估计用到了直方图                          "histogram_selectivity": 0.342857 这里过滤系数计算公式见下面                        }                      ],                      "final_filtering_effect": 1,                      "access_type": "scan",                      "resulting_rows": 7,                      "cost": 0.95,                      "chosen": true                    }                  ]                },计算公式:这里用的是get_less_than_selectivity方法,因为t1.c2<5满足第一个桶的范围,因此数据见下面previous_bucket_cumulative_frequency = 0.0found_bucket_frequency = 0.428571428结果 = previous_bucket_cumulative_frequency + (found_bucket_frequency * get_distance_from_lower(value))    = 0.0 + 0.428571428 * (value - lower_inclusive) / (get_upper_inclusive() - lower_inclusive + 1.0)    = 0.428571428 * (5-1) / (5-1+1)    = 0.428571428 * 0.8    = 0.342857
看一下实际的数据分布,可以看到小于5的数据实际只有1和4两条,实际在第一个桶的占比应该是2/3=67%,而上面计算出来的占比是80%,也就是比实际的占比会偏大。greatsql> SELECT c2,count(*) FROM t1 GROUP BY c2; +------+----------+| c2 | count(*) |+------+----------+| NULL | 1 || 1 | 1 || 4 | 1 || 5 | 1 || 10 | 2 || 16 | 1 |+------+----------+6 rows in set (0.00 sec)
复制代码


接着创建等宽直方图,看看结果。


greatsql> ANALYZE TABLE t1 UPDATE HISTOGRAM ON c2 WITH 5 BUCKETS;greatsql> select json_pretty(histogram)result from information_schema.column_statistics where table_name = 't1';| {  "buckets": [    [      1,      0.14285714285714285    ],    [      4,      0.2857142857142857    ],    [      5,      0.42857142857142855    ],    [      10,      0.7142857142857143    ],    [      16,      0.8571428571428571    ]  ],  "data-type": "int",  "null-values": 0.14285714285714285,  "collation-id": 8,  "last-updated": "2024-10-25 02:18:36.107382",  "sampling-rate": 1.0,  "histogram-type": "singleton",  "number-of-buckets-specified": 5} 
greatsql> EXPLAIN SELECT * FROM t1 join t3 ON t1.c1=t3.ccc1 or t1.c2<5;+----+-------------+-------+------------+------+-------------------+------+---------+------+------+----------+------------------------------------------------+| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |+----+-------------+-------+------------+------+-------------------+------+---------+------+------+----------+------------------------------------------------+| 1 | SIMPLE | t3 | NULL | ALL | idx3_1 | NULL | NULL | NULL | 5 | 100.00 | NULL || 1 | SIMPLE | t1 | NULL | ALL | PRIMARY,idx1,idx2 | NULL | NULL | NULL | 7 | 38.78 | Range checked for each record (index map: 0x7) | 过滤系数比等高的43.67低+----+-------------+-------+------------+------+-------------------+------+---------+------+------+----------+------------------------------------------------+ "plan_prefix": [ ], "table": "`t1`", "best_access_path": { "considered_access_paths": [ { "rows_to_scan": 7, "filtering_effect": [ { "condition": "(`t1`.`c2` < 5)", "histogram_selectivity": 0.285714 这个数据就是上面第二桶的频率值,对比上面等高直方图的0.342857更精确 } ], "final_filtering_effect": 1, "access_type": "scan", "resulting_rows": 7, "cost": 0.95, "chosen": true } ] },
复制代码


现在看一下之前在【GreatSQL 优化器-06】条件过滤导致选择非最佳用过的那个例子,看看在直方图的影响下结果是什么


greatsql>CREATE TABLE t3 (ccc1 INT, ccc2 int,ccc3 datetime(6));greatsql>INSERT INTO t3 VALUES (1,2,'2021-03-25 16:44:00.123456'),(2,10,'2021-03-25 16:44:00.123456'),(3,4,'2022-03-25 16:44:00.123456'),(4,6,'2023-03-25 16:44:00.123456'),(null,7,'2024-03-25 16:44:00.123456'),(4,3,'2024-04-25 16:44:00.123456'),(null,8,'2025-03-25 16:44:00.123456'),(3,4,'2022-06-25 16:44:00.123456'),(5,4,'2021-11-25 16:44:00.123456');greatsql>CREATE TABLE t4 (d1 INT, d2 int, d3 varchar(100));greatsql>INSERT INTO t4 VALUES (1,2,'aa1'),(2,1,'bb1'),(2,3,'cc1'),(3,3,'cc1'),(4,2,'ff1'),(4,4,'ert'),(4,2,'f5fg'),(null,2,'ee'),(5,30,'cc1'),(5,4,'fcc1'),(4,10,'cc1'),(6,4,'ccd1'),(null,1,'fee'),(1,2,'aa1'),(2,1,'bb1'),(2,3,'cc1'),(3,3,'cc1'),(4,2,'ff1'),(4,4,'ert'),(4,2,'f5fg'),(null,2,'ee'),(5,30,'cc1'),(5,4,'fcc1'),(4,10,'cc1'),(6,4,'ccd1'),(null,1,'fee'),(1,2,'aa1'),(2,1,'bb1'),(2,3,'cc1'),(3,3,'cc1'),(4,2,'ff1'),(4,4,'ert'),(4,2,'f5fg'),(null,2,'ee'),(5,30,'cc1'),(5,4,'fcc1'),(4,10,'cc1'),(6,4,'ccd1'),(null,1,'fee');greatsql>CREATE INDEX idx3_2 ON t3(ccc2);greatsql>CREATE INDEX idx3_3 ON t3(ccc3);greatsql>CREATE INDEX idx4_2 ON t4(d2);
首先看一下没有建立直方图之前的结果,这里t4用了全表扫描。greatsql> EXPLAIN SELECT * FROM t4 join t3 ON t4.d1=t3.ccc1 and t4.d2=t3.ccc2 where t4.d1<5 and t3.ccc3 < '2023-11-15';+----+-------------+-------+------------+------+---------------+--------+---------+-----------+------+----------+-------------+| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |+----+-------------+-------+------------+------+---------------+--------+---------+-----------+------+----------+-------------+| 1 | SIMPLE | t4 | NULL | ALL | idx4_2 | NULL | NULL | NULL | 39 | 33.33 | Using where || 1 | SIMPLE | t3 | NULL | ref | idx3_2,idx3_3 | idx3_2 | 5 | db1.t4.d2 | 1 | 11.11 | Using where |+----+-------------+-------+------------+------+---------------+--------+---------+-----------+------+----------+-------------+
-- 对t4没有索引的列d1创建直方图greatsql> ANALYZE TABLE t4 UPDATE HISTOGRAM ON d1 WITH 5 BUCKETS;
-- 可以看到结果已经变为更优的t3作为驱动表了,这里看出直方图对于估计结果更为精确。greatsql> EXPLAIN SELECT * FROM t4 join t3 ON t4.d1=t3.ccc1 and t4.d2=t3.ccc2 where t4.d1<5 and t3.ccc3 < '2023-11-15';+----+-------------+-------+------------+-------+---------------+--------+---------+-------------+------+----------+------------------------------------+| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |+----+-------------+-------+------------+-------+---------------+--------+---------+-------------+------+----------+------------------------------------+| 1 | SIMPLE | t3 | NULL | range | idx3_2,idx3_3 | idx3_3 | 9 | NULL | 6 | 100.00 | Using index condition; Using where || 1 | SIMPLE | t4 | NULL | ref | idx4_2 | idx4_2 | 5 | db1.t3.ccc2 | 6 | 6.15 | Using where |+----+-------------+-------+------------+-------+---------------+--------+---------+-------------+------+----------+------------------------------------+
复制代码

四、总结

从上面直方图的应用例子我们认识了直方图的使用场合和好处,也知道了等宽直方图比等高直方图的估算结果更精确,当然这个只适用于小表。在实际使用中,如果有表的列需要频繁用于查询条件过滤的话可以对该列创建直方图以得到更优的执行计划。

发布于: 刚刚阅读数: 3
用户头像

GreatSQL

关注

GreatSQL社区 2023-01-31 加入

GreatSQL是由万里数据库维护的MySQL分支,专注于提升MGR可靠性及性能,支持InnoDB并行查询特性,是适用于金融级应用的MySQL分支版本。 社区:https://greatsql.cn/ Gitee: https://gitee.com/GreatSQL/GreatSQL

评论

发布
暂无评论
【GreatSQL优化器-14】直方图应用_优化器_GreatSQL_InfoQ写作社区