Data Warehouse in Practice | Performance Bottleneck Caused by Filtering Too Many Rows During a Table Scan

  • 2023-11-08
    Guangdong

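This article is shared from the Huawei Cloud community post "GaussDB(DWS) Performance Tuning: A Case of a Performance Bottleneck Caused by Filtering Too Many Rows During a Table Scan", by O 泡果奶~.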

1. [Problem Description]


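While the SQL statement executes, it scans a large table of 1.2 billion rows. The scan filters out more than 99% of the data and keeps only 617 rows, and the performance bottleneck is this table scan.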

2. [Original Statement]


set search_path = 'bi_dashboard';
WITH F_SRV_DB_DIM_PRD_D AS (
  SELECT EXTERNAL_NAME
  FROM (
    SELECT MKT_NAME EXTERNAL_NAME
    FROM BI_DASHBOARD.DM_MSS_ITEM_PRODUCT_D PRD
    WHERE PRD.COMPANY_BRAND = any(array[string_to_array('HUAWEI',',')])
      AND PRD.MKT_NAME = any(array[string_to_array('畅享 60,畅享 50,畅享 60X,畅享 60 Pro,畅享 50 Pro,畅享 50z,nova 10z,畅享 20e,畅享20 Pro,畅享 10e,畅享10 Plus,畅享20 SE,畅享10,nova 11i,畅享20 Plus,畅享9 Plus,畅享20 5G,nova Y90,畅享 10S,nova Y70,畅享Z,畅享 9S,nova 8 SE 活力版,麦芒9 5G,Y9s,麦芒9 5G',',')])
  )
  WHERE EXTERNAL_NAME <> 'SNULL'
  GROUP BY EXTERNAL_NAME
),
V_PERIOD AS (
  SELECT PERIOD_ID AS PERIOD_ID_M,
         LEAST(TO_CHAR(PERIOD_END_DATE, 'YYYYMMDD'), '20230630') AS PERIOD_ID,
         PERIOD_ID AS DATES
  FROM BI_DASHBOARD.RPT_TML_ACCOUNT_PERIOD_D
  WHERE PERIOD_TYPE = 'M'
    AND PERIOD_ID BETWEEN 202207 AND 202306
),
V_DATA_BASE AS (
  SELECT A.PERIOD_ID,
         IFNULL(A.CHANNEL_NAME, 'SNULL') AS DISTRIBUTOR_CHANNEL_NAME,
         SUM(A.SO_QTY_MTD) AS SO_QTY,
         SUM(DECODE(A.PERIOD_ID, 20230630, A.SO_QTY_MTD)) AS SO_QTY_ORDER
  FROM DM_MSS_CN_PC_REP_RP_ST_D_F A
  INNER JOIN F_SRV_DB_DIM_PRD_D PRD ON A.EXTERNAL_NAME = PRD.EXTERNAL_NAME
  WHERE 1 = 1
    AND A.CHANNEL_ID IN ('100013388802')
    AND A.ORG_KEY IN (10000651)
    AND A.SALES_FLAG IN ('1', '0')
    AND A.PERIOD_ID IN (20220731,20221031,20220930,20220831,20221130,20221231,20230131,20230228,20230430,20230331,20230531,20230630)
    AND (A.SO_QTY_MTD <> 0) -- filter out rows whose SO_QTY is 0 for every date
  GROUP BY A.PERIOD_ID, IFNULL(A.CHANNEL_NAME, 'SNULL')
),
V_DATA AS (
  SELECT PERIOD_ID,
         NVL(DISTRIBUTOR_CHANNEL_NAME, 'Total') AS DISTRIBUTOR_CHANNEL_NAME,
         SUM(SO_QTY) AS SO_QTY,
         SUM(SO_QTY_ORDER) AS SO_QTY_ORDER
  FROM V_DATA_BASE A
  GROUP BY GROUPING SETS ((PERIOD_ID), (PERIOD_ID, DISTRIBUTOR_CHANNEL_NAME))
)
SELECT STRING_AGG(P.DATES, ',' ORDER BY P.PERIOD_ID_M) AS PERIOD_LIST,
       B.DISTRIBUTOR_CHANNEL_NAME,
       STRING_AGG(NVL(TO_CHAR(ROUND(A.SO_QTY)), '0'), ',' ORDER BY P.PERIOD_ID_M) AS SO_QTY
FROM V_PERIOD P
FULL JOIN (SELECT DISTINCT DISTRIBUTOR_CHANNEL_NAME FROM V_DATA) B ON 1 = 1
LEFT JOIN V_DATA A ON A.PERIOD_ID = P.PERIOD_ID AND A.DISTRIBUTOR_CHANNEL_NAME = B.DISTRIBUTOR_CHANNEL_NAME
GROUP BY B.DISTRIBUTOR_CHANNEL_NAME
ORDER BY DECODE(B.DISTRIBUTOR_CHANNEL_NAME, 'Total', 0, 'SOURCE IS NULL', 2, '源为空', 3, 'SNULL', 4, 1),
         SUM(A.SO_QTY_ORDER) DESC NULLS LAST
LIMIT 50 OFFSET 0

3. [Performance Analysis]




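As the performance execution plan shows (the complete plan is in Attachment 1), the SQL statement is slow at the scan of table a (bi_dashboard.dm_mss_cn_pc_rep_rp_st_d_f_test). The scan filters on sales_flag, so_qty_mtd, channel_id, org_key, and period_id. The table's original partial cluster key (PCK) contained only period_id and none of the other four filter columns, so the PCK can be adjusted to reduce the time spent scanning table a.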

Note: partial cluster key


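A partial cluster key (PCK) is an indexing technique for column-store tables that speeds up base-table scans by means of a min/max sparse index. A PCK can specify multiple columns, but more than 2 columns is generally not recommended. PCK is suited to accelerating point queries on large column-store tables.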
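To make the min/max idea concrete, here is a minimal Python simulation. It is only a sketch under simplifying assumptions, not how DWS implements PCK: the fixed-size "blocks", the helper names `build_blocks` and `scan_equal`, the synthetic data, and the use of a full sort to stand in for clustering are all hypothetical. It counts how many blocks a point query has to read when each block carries its minimum and maximum value.

```python
import random


def build_blocks(values, block_size):
    """Split values into fixed-size blocks and record each block's (min, max, rows)."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(b), max(b), b) for b in blocks]


def scan_equal(blocks, target):
    """Return (blocks_read, matching_rows) for the predicate value == target."""
    blocks_read = 0
    matches = 0
    for lo, hi, rows in blocks:
        # Min/max check: a block is read only if target can lie inside it.
        if lo <= target <= hi:
            blocks_read += 1
            matches += sum(1 for v in rows if v == target)
    return blocks_read, matches


random.seed(0)
# Hypothetical data: 100,000 rows, 1,000 distinct key values, 100 rows per key.
keys = [k for k in range(1000) for _ in range(100)]
shuffled = keys[:]
random.shuffle(shuffled)

# Unclustered layout: nearly every block's [min, max] range covers the target.
unclustered_result = scan_equal(build_blocks(shuffled, 1000), 42)
# Clustered layout: only the one block whose range covers 42 is read.
clustered_result = scan_equal(build_blocks(sorted(shuffled), 1000), 42)

print("unclustered (blocks_read, rows):", unclustered_result)
print("clustered   (blocks_read, rows):", clustered_result)
```

Both layouts return the same 100 matching rows; the difference is only in how many blocks must be read. This is why a filter column that is missing from the PCK cannot help the scan skip data.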


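In addition, the WHERE clause contains an IN list with many values (12). In DWS, by default an IN list is pushed down as a filter only when it holds at most 5 values; with 6 or more values the filter is not pushed down. In that case, the 12 values can be split into several IN lists of at most 5 values each, joined with OR: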


-- parenthesized so the OR group combines correctly with the surrounding AND conditions
(A.PERIOD_ID IN (20220731,20221031,20220930,20220831,20221130)
 OR A.PERIOD_ID IN (20221231,20230131,20230228,20230430,20230331)
 OR A.PERIOD_ID IN (20230531,20230630))
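When the SQL text is generated by an application, the same rewrite can be automated. The following is a minimal sketch: the function name `split_in_list` is hypothetical, and the default limit of 5 is taken from the statement above rather than from a documented constant. It handles integer values only and should not be fed untrusted input.

```python
def split_in_list(column, values, max_per_list=5):
    """Build an OR-joined group of IN lists, each holding at most max_per_list values."""
    if max_per_list < 1:
        raise ValueError("max_per_list must be at least 1")
    if not values:
        raise ValueError("values must not be empty")
    chunks = [values[i:i + max_per_list] for i in range(0, len(values), max_per_list)]
    parts = [
        # int() rejects anything that is not an integer-like value.
        f"{column} IN ({', '.join(str(int(v)) for v in chunk)})"
        for chunk in chunks
    ]
    # The outer parentheses keep the OR group intact among AND conditions.
    return "(" + " OR ".join(parts) + ")"


period_ids = [
    20220731, 20221031, 20220930, 20220831, 20221130,
    20221231, 20230131, 20230228, 20230430, 20230331,
    20230531, 20230630,
]
predicate = split_in_list("A.PERIOD_ID", period_ids)
print(predicate)
```

For the 12 period IDs in this case, the result is three IN lists of 5, 5, and 2 values, the same grouping as the manual rewrite.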



At this point, the SQL statement's execution time drops to 487 ms. The complete performance plan is shown in Attachment 2.


