When a SQL statement we wrote has performance problems, we are often at a loss and turn to a DBA for help. Today, let's analyze an interesting SQL performance case from three different perspectives, the application developer, the DBA, and the kernel developer, and work our way progressively deeper into the PostgreSQL optimizer.

Problem description

Colleague A came to me, the resident "fake DBA", with a question about the performance of one SQL statement:

How the two statements behave in production
limit 10
```sql
select xxx from user_gift where user_id=11695667 and user_type = 'default' order by id desc limit 10;
```
Execution Time: 1.307 ms
limit 1
```sql
select xxx from user_gift where user_id=11695667 and user_type = 'default' order by id desc limit 1;
```
Execution Time: 144.098 ms
Analysis

Execution plans

Since caching has been ruled out, let's first see how the two execution plans differ.
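For completeness, a quick way to verify the caching claim yourself is to repeat the query with buffer statistics enabled:

```sql
-- Run several times: if "shared hit" dominates "read" and the timing barely
-- changes between runs, a cold cache is not what makes limit 1 slow.
explain (analyze, buffers)
select xxx from user_gift
where user_id = 11695667 and user_type = 'default'
order by id desc limit 1;
```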
limit 1
```
# explain analyze verbose select xxx from user_gift where user_id=11695667 and user_type = 'default' order by id desc limit 1;
                                                                                QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..416.25 rows=1 width=73) (actual time=135.213..135.214 rows=1 loops=1)
   Output: xxx
   ->  Index Scan Backward using user_gift_pkey on yay.user_gift  (cost=0.43..368000.44 rows=885 width=73) (actual time=135.212..135.212 rows=1 loops=1)
         Output: xxx
         Filter: ((user_gift.user_id = 11695667) AND (user_gift.user_type = 'default'::user_type))
         Rows Removed by Filter: 330192
 Planning Time: 0.102 ms
 Execution Time: 135.235 ms
(8 rows)

Time: 135.691 ms
```
limit 10
```
# explain analyze verbose select xxx from user_gift where user_id=11695667 and user_type = 'default' order by id desc limit 10;
                                                                        QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=868.20..868.22 rows=10 width=73) (actual time=1.543..1.545 rows=10 loops=1)
   Output: xxx
   ->  Sort  (cost=868.20..870.41 rows=885 width=73) (actual time=1.543..1.543 rows=10 loops=1)
         Output: xxx
         Sort Key: user_gift.id DESC
         Sort Method: top-N heapsort  Memory: 27kB
         ->  Index Scan using idx_user_type on yay.user_gift  (cost=0.56..849.07 rows=885 width=73) (actual time=0.020..1.366 rows=775 loops=1)
               Output: xxx
               Index Cond: (user_gift.user_id = 11695667)
               Filter: (user_gift.user_type = 'default'::user_type)
 Planning Time: 0.079 ms
 Execution Time: 1.564 ms
(12 rows)

Time: 1.871 ms
```
As we can see, the two statements use different plans:
- The limit 1 statement scans the primary key backward: Index Scan Backward using user_gift_pkey on yay.user_gift
- The limit 10 statement looks up the user's rows directly through the composite (user_id, user_type) index: Index Scan using idx_user_type on yay.user_gift
Why do the plans differ?
total cost
Actually, PostgreSQL's plan choice is not "wrong". The total cost of limit 1, Limit (cost=0.43..416.25 rows=1 width=73), is 416.25, with a run cost of 416.25 − 0.43 = 415.82. The total cost of limit 10, Limit (cost=868.20..868.22 rows=10 width=73), is 868.22.

If we extrapolate the Index Scan Backward using user_gift_pkey plan, limit 1 costs about 416, limit 2 about 416 × 2 ≈ 832, and limit 3 about 1248, which already exceeds 868. So from limit 3 onward the planner switches to the Index Scan using idx_user_type plan.
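Concretely, the cost of a Limit node is interpolated from its child plan in proportion to the fraction of rows it expects to pull. A hedged reconstruction of the arithmetic (the limit 1 numbers come from the plan above; the limit 2 numbers from the verification plan below):

$$
\begin{aligned}
\text{cost}_{\text{limit } k} &\approx \text{startup} + \frac{k}{\text{rows}} \times (\text{total} - \text{startup}) \\
\text{cost}_{\text{limit } 1} &\approx 0.43 + \frac{1}{885} \times (368000.44 - 0.43) \approx 416.25 \\
\text{cost}_{\text{limit } 2} &\approx 0.43 + \frac{2}{884} \times (367528.67 - 0.43) \approx 831.95
\end{aligned}
$$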
Verification
```
# explain select xxx from user_gift where user_id=11695667 and user_type = 'default' order by id desc limit 2;
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..831.95 rows=2 width=73)
   ->  Index Scan Backward using user_gift_pkey on user_gift  (cost=0.43..367528.67 rows=884 width=73)
         Filter: ((user_id = 11695667) AND (user_type = 'default'::user_type))
(3 rows)

Time: 0.341 ms

# explain select xxx from user_gift where user_id=11695667 and user_type = 'default' order by id desc limit 3;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Limit  (cost=866.19..866.20 rows=3 width=73)
   ->  Sort  (cost=866.19..868.40 rows=884 width=73)
         Sort Key: id DESC
         ->  Index Scan using idx_user_type on user_gift  (cost=0.56..854.76 rows=884 width=73)
               Index Cond: (user_id = 11695667)
               Filter: (user_type = 'default'::user_type)
(6 rows)

Time: 0.352 ms
```
The results confirm it: limit 2 still uses the backward primary-key scan, while limit 3 switches to idx_user_type.
Actual execution time

The estimated cost of limit 1 is 416.25, which should be cheaper than limit 10's 868.22.

In reality, however, limit 1 takes 135.691 ms while limit 10 takes 1.871 ms: limit 1 is about 70 times slower!

Let's rerun explain with the buffers option:
```
# explain (analyze, buffers, verbose) select xxx from user_gift where user_id=11695667 and user_type = 'default' order by id desc limit 1;
                                                                                QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..416.29 rows=1 width=73) (actual time=451.542..451.544 rows=1 loops=1)
   Output: xxx
   Buffers: shared hit=214402 read=5280 dirtied=2302
   I/O Timings: read=205.027
   ->  Index Scan Backward using user_gift_pkey on yay.user_gift  (cost=0.43..368032.94 rows=885 width=73) (actual time=451.540..451.540 rows=1 loops=1)
         Output: xxx
         Filter: ((user_gift.user_id = 11695667) AND (user_gift.user_type = 'default'::user_type))
         Rows Removed by Filter: 333462
         Buffers: shared hit=214402 read=5280 dirtied=2302
         I/O Timings: read=205.027
 Planning Time: 1.106 ms
 Execution Time: 451.594 ms
(12 rows)
```
```
# explain (analyze, buffers, verbose) select xxx from user_gift where user_id=11695667 and user_type = 'default' order by id desc limit 3;
                                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=860.51..860.52 rows=3 width=73) (actual time=14.633..14.634 rows=3 loops=1)
   Output: xxx
   Buffers: shared hit=467 read=321
   I/O Timings: read=10.112
   ->  Sort  (cost=860.51..862.72 rows=885 width=73) (actual time=14.632..14.632 rows=3 loops=1)
         Output: xxx
         Sort Key: user_gift.id DESC
         Sort Method: top-N heapsort  Memory: 25kB
         Buffers: shared hit=467 read=321
         I/O Timings: read=10.112
         ->  Index Scan using idx_user_type on yay.user_gift  (cost=0.56..849.07 rows=885 width=73) (actual time=0.192..14.424 rows=775 loops=1)
               Output: xxx
               Index Cond: (user_gift.user_id = 11695667)
               Filter: (user_gift.user_type = 'default'::user_type)
               Buffers: shared hit=464 read=321
               I/O Timings: read=10.112
 Planning Time: 0.111 ms
 Execution Time: 14.658 ms
(18 rows)
```
From the line Buffers: shared hit=214402 read=5280 dirtied=2302 we can see that the limit 1 plan read 5280 blocks (pages) from disk before finding the first record that satisfies the where clause.
Why does it need to read so much data? Let's look at the statistics:
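The output below is pg_stats in psql's expanded display; a hedged reconstruction of the query behind it:

```sql
-- In psql with expanded display (\x on):
SELECT *
FROM pg_stats
WHERE schemaname = 'yay'
  AND tablename = 'user_gift'
  AND attname IN ('id', 'user_id', 'user_type');
```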
```
schemaname             | yay
tablename              | user_gift
attname                | id
inherited              | f
null_frac              | 0
avg_width              | 8
n_distinct             | -1
most_common_vals       |
most_common_freqs      |
histogram_bounds       | {93,9817,19893,28177,.......}
correlation            | 0.788011
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |

schemaname             | yay
tablename              | user_gift
attname                | user_id
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | -0.175761
most_common_vals       | {11576819,10299480,14020501,.......,11695667,......}
most_common_freqs      | {0.000353333,0.000326667,0.000246667,......,9.33333e-05,......}
histogram_bounds       | {3,10002181,10005599,10009672,......,11693300,11698290,......}
correlation            | 0.53375
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |

schemaname             | yay
tablename              | user_gift
attname                | user_type
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | 3
most_common_vals       | {default, invalid, deleted}
most_common_freqs      | {0.997923,0.00194,0.000136667}
histogram_bounds       |
correlation            | 0.99763
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |
```
From the statistics we can see:
- The user_id column's most_common_vals contains the value 11695667, so its selectivity can be read directly from the corresponding most_common_freqs entry: 9.33333e-05.
- The selectivity of user_type = 'default' is 0.997923.
- So the selectivity of where user_id=11695667 and user_type='default' is 9.33333e-05 × 0.997923 ≈ 0.0000931394.
From that, the estimated number of matching rows is 0.0000931394 × 9499740 (total rows) ≈ 884.8, which agrees with the 884 rows in the plan (cost=0.43..367528.67 rows=884 width=73).
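The same arithmetic as a minimal SQL sketch (the two selectivities are copied from the pg_stats output above):

```sql
-- estimated rows = selectivity(user_id) * selectivity(user_type) * reltuples
SELECT 9.33333e-05 * 0.997923 * reltuples AS estimated_rows
FROM pg_class
WHERE relname = 'user_gift';  -- reltuples = 9.49974e+06, giving ~884.8
```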
But the optimizer's estimate is built on the assumption that the data is uniformly distributed across the table. Is it really? Let's look at the actual distribution:
```
# select max(ctid) from user_gift;
     max
-------------
 (128709,29)
(1 row)
```
```
# select max(ctid), min(ctid) from user_gift where user_id=11695667;
     max     |    min
-------------+-----------
 (124329,22) | (3951,64)
(1 row)
```
```
# SELECT relpages, reltuples FROM pg_class WHERE relname = 'user_gift';
 relpages |  reltuples
----------+-------------
   128875 | 9.49974e+06
(1 row)
```
Tuples stored per page: 9.49974e+06 tuples / 128875 pages ≈ 73.7 tuples/page.
Now the arithmetic: the table's highest heap page is 128709, while the highest page holding rows of user 11695667 is 124329. A backward scan must therefore pass 128709 − 124329 = 4380 pages before it even reaches a page containing a matching record, filtering roughly 4380 pages × 73.7 tuples/page ≈ 322,862 rows along the way.

The actual limit 1 run read 5280 pages from disk (including primary-key index pages) and filtered out 333,462 rows, close to the estimate:
```
Rows Removed by Filter: 333462
Buffers: shared hit=214402 read=5280 dirtied=2302
I/O Timings: read=205.027
```
So this user's data is indeed skewed: it sits toward the front of the table.
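If you want to see the skew directly, here is a minimal sketch (not from the original analysis) that buckets this user's rows by heap block number, extracting the block number from ctid via a text-to-point cast:

```sql
-- Count the user's rows per band of 10,000 heap blocks; for a skewed user,
-- all buckets fall well below the table's maximum block (128709 here).
SELECT ((ctid::text)::point)[0]::int / 10000 AS block_bucket,
       count(*) AS rows_in_bucket
FROM user_gift
WHERE user_id = 11695667
GROUP BY block_bucket
ORDER BY block_bucket;
```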
So how long does scanning 5280 pages take?

Data volume to read: 5280 pages × 8 KB/page ≈ 41.2 MB.
```
[root]$ fio -name iops -rw=randread -bs=8k -runtime=10 -iodepth=1 -filename /dev/sdb -ioengine mmap -buffered=1
...
Run status group 0 (all jobs):
   READ: bw=965KiB/s (988kB/s), 965KiB/s-965KiB/s (988kB/s-988kB/s), io=9656KiB (9888kB), run=10005-10005msec

[root]$ fio -name iops -rw=read -bs=8k -runtime=10 -iodepth=1 -filename /dev/sdb -ioengine mmap -direct=1
...
Run status group 0 (all jobs):
   READ: bw=513MiB/s (538MB/s), 513MiB/s-513MiB/s (538MB/s-538MB/s), io=5132MiB (5381MB), run=10001-10001msec
```
The fio results show that this machine's disk sustains about 513 MiB/s for sequential reads, so if the data were laid out purely sequentially, scanning 41.2 MB would take about 80 ms. If every read were random (about 965 KiB/s), it would take over 40 seconds. Real access is a mix of the two; note also that the benchmark ran on a non-production machine with no competing I/O.
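The back-of-the-envelope numbers, as a hedged check against the fio figures above:

$$
t_{\text{seq}} \approx \frac{41.2\ \text{MB}}{513\ \text{MiB/s}} \approx 80\ \text{ms},
\qquad
t_{\text{rand}} \approx \frac{41.2\ \text{MB}}{965\ \text{KiB/s}} \approx 43\ \text{s}
$$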
At this point the problem is essentially nailed down:

The PostgreSQL optimizer assumes the data is uniformly distributed, under which a backward scan would find a matching record almost immediately. In reality this user's rows all sit near the front of the table, which is why the actual start-up time is so long.
Analysis from the kernel's perspective

Let's continue the analysis from inside the PostgreSQL kernel, looking at two questions:

- How the optimizer estimates cost
- How actual time is collected

Table information
```
# \d user_gift;
                                         Table "yay.user_gift"
    Column    |           Type           | Collation | Nullable |                    Default
--------------+--------------------------+-----------+----------+------------------------------------------------
 id           | bigint                   |           | not null | nextval('user_gift_id_seq'::regclass)
 user_id      | integer                  |           | not null |
 ug_name      | character varying(100)   |           | not null |
 expired_time | timestamp with time zone |           |          | now()
 created_time | timestamp with time zone |           | not null | now()
 updated_time | timestamp with time zone |           | not null | now()
 user_type    | user_type                |           | not null | 'default'::user_type
Indexes:
    "user_gift_pkey" PRIMARY KEY, btree (id)
    "idx_user_type" btree (user_id, ug_name)
    "user_gift_ug_name_idx" btree (ug_name)
```
```
# SELECT relpages, reltuples FROM pg_class WHERE relname = 'user_gift_pkey';
 relpages |  reltuples
----------+-------------
    40035 | 9.49974e+06
(1 row)
```
```
# SELECT relpages, reltuples FROM pg_class WHERE relname = 'idx_user_type';
 relpages |  reltuples
----------+-------------
   113572 | 9.49974e+06
(1 row)
```
```
# SELECT relpages, reltuples FROM pg_class WHERE relname = 'user_gift';
 relpages |  reltuples
----------+-------------
   128875 | 9.49974e+06
(1 row)
```
```
=# select count(1) from user_gift where user_id=11695667;
 count
-------
   775
(1 row)

=# select count(1) from user_gift where user_id=11695667 and user_type = 'default';
 count
-------
   775
(1 row)
```
```
# primary key height
# select * from bt_metap('user_gift_pkey');
 magic  | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples
--------+---------+------+-------+----------+-----------+-------------+-------------------------
 340322 |       3 |  412 |     2 |      412 |         2 |           0 |             9.31928e+06
(1 row)

# idx_user_type height
# select * from bt_metap('idx_user_type');
 magic  | version | root  | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples
--------+---------+-------+-------+----------+-----------+-------------+-------------------------
 340322 |       3 | 15094 |     3 |    15094 |         3 |           0 |             9.49974e+06
(1 row)
```
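A side note for anyone reproducing this: bt_metap() is not built into core PostgreSQL; it comes from the pageinspect contrib extension (typically superuser-only):

```sql
-- pageinspect ships with PostgreSQL as a contrib extension
CREATE EXTENSION IF NOT EXISTS pageinspect;
-- level is the height of the B+tree root (the leaf level is 0)
SELECT level FROM bt_metap('user_gift_pkey');
```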
Estimating cost
start-up cost
PostgreSQL estimates cost differently for each index type. Let's look at how the start-up cost of a B+tree index is estimated:
```c
// selfuncs.c
void
btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
               Cost *indexStartupCost, Cost *indexTotalCost,
               Selectivity *indexSelectivity, double *indexCorrelation,
               double *indexPages)
{
    ......
    // descending the B+tree compares roughly log2(N) tuples
    descentCost = ceil(log(index->tuples) / log(2.0)) * cpu_operator_cost;
    costs.indexStartupCost += descentCost;

    ......
    // This cost is somewhat arbitrarily set at 50x cpu_operator_cost per page touched
    descentCost = (index->tree_height + 1) * 50.0 * cpu_operator_cost;
    costs.indexStartupCost += descentCost;

    ......
}
```
The start-up cost estimate is actually quite simple: it only covers descending from the B+tree's root page to a leaf page, plus fetching the first tuple from that page.

The start-up cost formula is:
$$\left\{ \lceil \log_2(N_{index,tuple}) \rceil + (Height_{index} + 1) \times 50 \right\} \times cpu\_operator\_cost$$
- N(index,tuple): the number of tuples (records) in the index
- Height(index): the height of the index's B+tree
- cpu_operator_cost: defaults to 0.0025
Start-up cost of the user_gift_pkey plan

From the table information above, N(index,tuple) = 9.49974e+06 and the tree height (bt_metap's level) is 2, so:

$$\left\{ \lceil \log_2(9499740) \rceil + (2 + 1) \times 50 \right\} \times 0.0025 = 174 \times 0.0025 = 0.435$$

which matches PostgreSQL's estimated start-up cost of 0.43.
Start-up cost of the idx_user_type plan

- N(index,tuple): 9.49974e+06
- Height(index): 3

$$\left\{ \lceil \log_2(9499740) \rceil + (3 + 1) \times 50 \right\} \times 0.0025 = 224 \times 0.0025 = 0.56$$

which matches PostgreSQL's estimated start-up cost of 0.56.
run cost
Estimating the run cost is considerably more complicated: there are many branching conditions and no single closed-form formula, so we only sketch it here. If you are interested, read the cost_index function in the PostgreSQL source, src/backend/optimizer/path/costsize.c; for a case like this one, the cost can usually be simulated with the linked script.
run cost:

$$\text{run cost} = \text{index cost} + \text{table cost}$$

Index cost:

$$\text{index cost} = \text{cost of randomly reading the index pages} + \text{cost of processing the matching tuples}$$

Table (main table) cost:

$$\text{table cost} = \text{max\_io\_cost} + \text{indexCorrelation}^2 \times (\text{min\_io\_cost} - \text{max\_io\_cost})$$

max_io_cost (worst-case IO cost): every page is read randomly.

$$\text{max\_io\_cost} = \text{pages\_fetched} \times \text{random\_page\_cost}$$

min_io_cost (best-case IO cost): the first page is read randomly, the rest sequentially.

$$\text{min\_io\_cost} = 1 \times \text{random\_page\_cost} + (\text{pages\_fetched} - 1) \times \text{seq\_page\_cost}$$
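As a hedged aside, the pg_stats output above already gives the number this interpolation needs: the id column's correlation is 0.788011, so for the primary-key plan

$$\text{indexCorrelation}^2 = 0.788011^2 \approx 0.62$$

meaning the estimated table I/O cost sits about 62% of the way from the worst case (max_io_cost) toward the best case (min_io_cost), with the default settings random_page_cost = 4 and seq_page_cost = 1 assumed.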
actual start-up time vs estimated start-up cost
One question was glossed over in the analysis so far: the estimated start-up cost covers the work from starting the plan to reading the first tuple from the table (cost is an arbitrary unit), while the actual start-up time runs from starting the plan to reading the first tuple that satisfies the where clause. Why the difference?

SQL processing flow: PostgreSQL parses the SQL into an AST, optimizes it, and then translates the AST into executors that carry out the concrete operations. Without optimization, the executor tree looks like this:
```
┌──────────────┐
│  projection  │
└──────┬───────┘
       │
       │
┌──────▼──────┐
│    limit    │
└──────┬──────┘
       │
       │
┌──────▼──────┐
│  selection  │
└──────┬──────┘
       │
       │
┌──────▼──────┐
│ index scan  │
└─────────────┘
```
The simplified execution flow:

- index scan executor: returns each scanned tuple to the selection executor
- selection executor: filters the tuple; if it matches, passes it up to the limit executor, otherwise calls the index scan executor again
- limit executor: once the limit count is reached, returns the rows to the projection executor
- projection executor: drops the columns that are not in the select list

With optimization, the selection executor and projection executor are usually merged into the index scan executor to cut down on tuple passing between executors:
```
┌─────────────┐
│    limit    │
└──────┬──────┘
       │
       │
┌──────▼──────┐
│ index scan  │
│             │
│ + selection │
│ + projection│
└─────────────┘
```
After optimization, filtering happens inside the scan node itself. And as the code below shows, PostgreSQL collects its timing statistics per executor node:
```c
// src/backend/executor/execProcnode.c
static TupleTableSlot *
ExecProcNodeInstr(PlanState *node)
{
    TupleTableSlot *result;

    InstrStartNode(node->instrument);

    result = node->ExecProcNodeReal(node);

    // collect execution metrics for this node
    InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);

    return result;
}
```
So the start-up portion of actual time is measured from starting the executor until it returns the first row matching the where clause.

Looking at a real call stack, the user_id=xxx filter has indeed been pushed down into the index scan executor:
```
--->    int4eq(FunctionCallInfo fcinfo) (/home/ken/cpp/postgres/src/backend/utils/adt/int.c:379)
        ExecInterpExpr(ExprState * state, ExprContext * econtext, _Bool * isnull) (/home/ken/cpp/postgres/src/backend/executor/execExprInterp.c:704)
        ExecInterpExprStillValid(ExprState * state, ExprContext * econtext, _Bool * isNull) (/home/ken/cpp/postgres/src/backend/executor/execExprInterp.c:1807)
        ExecEvalExprSwitchContext(ExprState * state, ExprContext * econtext, _Bool * isNull) (/home/ken/cpp/postgres/src/include/executor/executor.h:322)
--->    ExecQual(ExprState * state, ExprContext * econtext) (/home/ken/cpp/postgres/src/include/executor/executor.h:391)
        ExecScan(ScanState * node, ExecScanAccessMtd accessMtd, ExecScanRecheckMtd recheckMtd) (/home/ken/cpp/postgres/src/backend/executor/execScan.c:227)
--->    ExecIndexScan(PlanState * pstate) (/home/ken/cpp/postgres/src/backend/executor/nodeIndexscan.c:537)
        ExecProcNodeInstr(PlanState * node) (/home/ken/cpp/postgres/src/backend/executor/execProcnode.c:466)
        ExecProcNodeFirst(PlanState * node) (/home/ken/cpp/postgres/src/backend/executor/execProcnode.c:450)
        ExecProcNode(PlanState * node) (/home/ken/cpp/postgres/src/include/executor/executor.h:248)
--->    ExecLimit(PlanState * pstate) (/home/ken/cpp/postgres/src/backend/executor/nodeLimit.c:96)
        ExecProcNodeInstr(PlanState * node) (/home/ken/cpp/postgres/src/backend/executor/execProcnode.c:466)
        ExecProcNodeFirst(PlanState * node) (/home/ken/cpp/postgres/src/backend/executor/execProcnode.c:450)
        ExecProcNode(PlanState * node) (/home/ken/cpp/postgres/src/include/executor/executor.h:248)
        ExecutePlan(EState * estate, PlanState * planstate, _Bool use_parallel_mode, CmdType operation, _Bool sendTuples, uint64 numberTuples, ScanDirection direction, DestReceiver * dest, _Bool execute_once) (/home/ken/cpp/postgres/src/backend/executor/execMain.c:1632)
        standard_ExecutorRun(QueryDesc * queryDesc, ScanDirection direction, uint64 count, _Bool execute_once) (/home/ken/cpp/postgres/src/backend/executor/execMain.c:350)
        ExecutorRun(QueryDesc * queryDesc, ScanDirection direction, uint64 count, _Bool execute_once) (/home/ken/cpp/postgres/src/backend/executor/execMain.c:294)
        ExplainOnePlan(PlannedStmt * plannedstmt, IntoClause * into, ExplainState * es, const char * queryString, ParamListInfo params, QueryEnvironment * queryEnv, const instr_time * planduration, const BufferUsage * bufusage) (/home/ken/cpp/postgres/src/backend/commands/explain.c:571)
        ExplainOneQuery(Query * query, int cursorOptions, IntoClause * into, ExplainState * es, const char * queryString, ParamListInfo params, QueryEnvironment * queryEnv) (/home/ken/cpp/postgres/src/backend/commands/explain.c:404)
        ExplainQuery(ParseState * pstate, ExplainStmt * stmt, ParamListInfo params, DestReceiver * dest) (/home/ken/cpp/postgres/src/backend/commands/explain.c:275)
```
The code below is the scan implementation. ExecQual(qual, econtext) performs the filtering on each tuple, since selection has been merged into the scan:
```c
TupleTableSlot *
ExecScan(ScanState *node, ExecScanAccessMtd accessMtd, ExecScanRecheckMtd recheckMtd)
{
    ......
    for (;;)
    {
        TupleTableSlot *slot;

        slot = ExecScanFetch(node, accessMtd, recheckMtd);
        ......
        econtext->ecxt_scantuple = slot;

        // Note: this is the selection check
        if (qual == NULL || ExecQual(qual, econtext))
        {
            if (projInfo)
            {
                return ExecProject(projInfo);
            }
            else
            {
                return slot;
            }
        }
        else
            InstrCountFiltered1(node, 1);
    }
}
```
Solutions
Prevent the primary-key scan

Since the plan uses a backward scan of user_gift_pkey, we can manually stop the optimizer from choosing that index.
```sql
# explain analyze verbose select xxx from user_gift where user_id=11695667 and user_type = 'default' order by id+0 desc limit 1;
```
Changing order by id to order by id+0 does the trick: id+0 is an expression, so the optimizer can no longer use the user_gift_pkey index to produce the required ordering.

This workaround does not fit every scenario, though. If the data were evenly distributed, the user_gift_pkey scan could well be the better plan in some cases.
Add a (user_id, id) index
```sql
create index idx_user_id on user_gift(user_id, id);
```
Adding a composite index on the where column plus the sort key avoids the primary-key scan altogether: the planner can descend the (user_id, id) index directly to user_id = 11695667 and read that user's rows already sorted by id.
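As a hedged sanity check, the plan shape below is what one would expect after adding the index, not captured output:

```sql
-- Both the filter and ORDER BY ... DESC can now be satisfied by a single
-- backward scan of idx_user_id, so start-up time no longer depends on where
-- this user's rows happen to sit in the heap. Expected shape:
--   Index Scan Backward using idx_user_id on user_gift
--     Index Cond: (user_id = 11695667)
explain select xxx from user_gift
where user_id = 11695667 and user_type = 'default'
order by id desc limit 1;
```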
Closing thoughts

We ruled out caching, examined the execution plans, pinpointed the data-distribution skew, and finally stepped through the kernel source to confirm the root cause and fix the performance problem. Hopefully this little SQL optimization journey gives you something useful to take away.