写点什么

ByConity ELT 初体验

作者:阿泽🧸
  • 2024-12-02
    北京
  • 本文字数:3337 字

    阅读完需:约 11 分钟

ByConity ELT 初体验

ByConity 是一款开源云原生数据仓库,可以满足用户的多种数据分析场景。ByConity 增加了 bsp 模式:可以进行 task 级别的容错;更细粒度的调度;基于资源感知的调度。希望通过 bsp 能力,把数据加工(T)的过程转移到 ByConity 内部,能够一站式完成数据接入、加工和分析。


体验步骤如下:


1、使用 SSH 工具登录到测试环境的 ECS 上。


2、运行 tmux new -s user001。

3、执行 clickhouse client --port 9010 -mn 进入 clickhouse。


4、使用测试用数据库 test_elt:use test_elt


5、设置数据库会话的方言类型为 ANSI:set dialect_type = 'ANSI'

6、查询如下 SQL,会因为内存限制而执行失败。

with ws as        (select d_year AS ws_sold_year, ws_item_sk,        ws_bill_customer_sk ws_customer_sk,        sum(ws_quantity) ws_qty,        sum(ws_wholesale_cost) ws_wc,        sum(ws_sales_price) ws_sp        from web_sales        left join web_returns on wr_order_number=ws_order_number and ws_item_sk=wr_item_sk        join date_dim on ws_sold_date_sk = d_date_sk        where wr_order_number is null        group by d_year, ws_item_sk, ws_bill_customer_sk        ),        cs as        (select d_year AS cs_sold_year, cs_item_sk,        cs_bill_customer_sk cs_customer_sk,        sum(cs_quantity) cs_qty,        sum(cs_wholesale_cost) cs_wc,        sum(cs_sales_price) cs_sp        from catalog_sales        left join catalog_returns on cr_order_number=cs_order_number and cs_item_sk=cr_item_sk        join date_dim on cs_sold_date_sk = d_date_sk        where cr_order_number is null        group by d_year, cs_item_sk, cs_bill_customer_sk        ),        ss as        (select d_year AS ss_sold_year, ss_item_sk,        ss_customer_sk,        sum(ss_quantity) ss_qty,        sum(ss_wholesale_cost) ss_wc,        sum(ss_sales_price) ss_sp        from store_sales        left join store_returns on sr_ticket_number=ss_ticket_number and ss_item_sk=sr_item_sk        join date_dim on ss_sold_date_sk = d_date_sk        where sr_ticket_number is null        group by d_year, ss_item_sk, ss_customer_sk        )        select        ss_sold_year, ss_item_sk, ss_customer_sk,        round(ss_qty/(coalesce(ws_qty,0)+coalesce(cs_qty,0)),2) ratio,        ss_qty store_qty, ss_wc store_wholesale_cost, ss_sp store_sales_price,        coalesce(ws_qty,0)+coalesce(cs_qty,0) other_chan_qty,        coalesce(ws_wc,0)+coalesce(cs_wc,0) other_chan_wholesale_cost,        coalesce(ws_sp,0)+coalesce(cs_sp,0) other_chan_sales_price        from ss        left join ws on (ws_sold_year=ss_sold_year and ws_item_sk=ss_item_sk and ws_customer_sk=ss_customer_sk)        left join cs on (cs_sold_year=ss_sold_year and cs_item_sk=ss_item_sk and cs_customer_sk=ss_customer_sk)        where (coalesce(ws_qty,0)>0 or coalesce(cs_qty, 0)>0) and ss_sold_year=2000        order by        ss_sold_year, ss_item_sk, ss_customer_sk,        ss_qty desc, ss_wc desc, ss_sp desc,        other_chan_qty,        other_chan_wholesale_cost,        other_chan_sales_price,        ratio        LIMIT 100        SETTINGS bsp_mode = 1,distributed_max_parallel_size = 12;
复制代码



7、查询失败后,在失败的 SQL 最后加上设置后再次执行:

SETTINGSbsp_mode = 1,distributed_max_parallel_size = 12;

其中参数distributed_max_parallel_size可以设置为 4 的其他整数倍(因为 Worker 的数量为 4)。

添加参数后执行成功。


8、执行如下查询,在执行成功的查询中,添加参数限制查询的最大内存使用量,如:

SETTINGSmax_memory_usage=40000000000;

(单位为 B,当前约合 37.25 GB)将内存限制为合适的值,引发 oom。

WITH all_sales AS (        SELECT d_year        ,i_brand_id        ,i_class_id        ,i_category_id        ,i_manufact_id        ,SUM(sales_cnt) AS sales_cnt        ,SUM(sales_amt) AS sales_amt        FROM (SELECT d_year        ,i_brand_id        ,i_class_id        ,i_category_id        ,i_manufact_id        ,cs_quantity - COALESCE(cr_return_quantity,0) AS sales_cnt        ,cs_ext_sales_price - COALESCE(cr_return_amount,0.0) AS sales_amt        FROM catalog_sales JOIN item ON i_item_sk=cs_item_sk        JOIN date_dim ON d_date_sk=cs_sold_date_sk        LEFT JOIN catalog_returns ON (cs_order_number=cr_order_number        AND cs_item_sk=cr_item_sk)        WHERE i_category='Books'        UNION        SELECT d_year        ,i_brand_id        ,i_class_id        ,i_category_id        ,i_manufact_id        ,ss_quantity - COALESCE(sr_return_quantity,0) AS sales_cnt        ,ss_ext_sales_price - COALESCE(sr_return_amt,0.0) AS sales_amt        FROM store_sales JOIN item ON i_item_sk=ss_item_sk        JOIN date_dim ON d_date_sk=ss_sold_date_sk        LEFT JOIN store_returns ON (ss_ticket_number=sr_ticket_number        AND ss_item_sk=sr_item_sk)        WHERE i_category='Books'        UNION        SELECT d_year        ,i_brand_id        ,i_class_id        ,i_category_id        ,i_manufact_id        ,ws_quantity - COALESCE(wr_return_quantity,0) AS sales_cnt        ,ws_ext_sales_price - COALESCE(wr_return_amt,0.0) AS sales_amt        FROM web_sales JOIN item ON i_item_sk=ws_item_sk        JOIN date_dim ON d_date_sk=ws_sold_date_sk        LEFT JOIN web_returns ON (ws_order_number=wr_order_number        AND ws_item_sk=wr_item_sk)        WHERE i_category='Books') sales_detail        GROUP BY d_year, i_brand_id, i_class_id, i_category_id, i_manufact_id)        SELECT prev_yr.d_year AS prev_year        ,curr_yr.d_year AS year        ,curr_yr.i_brand_id        ,curr_yr.i_class_id        ,curr_yr.i_category_id        ,curr_yr.i_manufact_id        ,prev_yr.sales_cnt AS prev_yr_cnt        ,curr_yr.sales_cnt AS curr_yr_cnt        ,curr_yr.sales_cnt-prev_yr.sales_cnt AS sales_cnt_diff        ,curr_yr.sales_amt-prev_yr.sales_amt AS sales_amt_diff        FROM all_sales curr_yr, all_sales prev_yr        WHERE curr_yr.i_brand_id=prev_yr.i_brand_id        AND curr_yr.i_class_id=prev_yr.i_class_id        AND curr_yr.i_category_id=prev_yr.i_category_id        AND curr_yr.i_manufact_id=prev_yr.i_manufact_id        AND curr_yr.d_year=2002        AND prev_yr.d_year=2002-1        AND CAST(curr_yr.sales_cnt AS DECIMAL(17,2))/CAST(prev_yr.sales_cnt AS DECIMAL(17,2))<0.9        ORDER BY sales_cnt_diff,sales_amt_diff        limit 100        SETTINGS max_memory_usage=20000000000        SETTINGS bsp_mode = 1,distributed_max_parallel_size = 12;
复制代码


执行成功,再将内存调小为 30000000000。还是可以成功



执行成功,再将内存调小为 20000000000。还是可以。


内存调小为 2000000000,报错 OOM。

在失败的 SQL 最后加上设置后再次执行:

SETTINGSbsp_mode = 1,distributed_max_parallel_size = 12;执行成功。

9、体验感受。


在 BSP 模式下,ByConity 支持对 TableScan 算子的并行度进行扩展,这有助于在资源有限的情况下实现对大表的处理。用户可以通过设置distributed_max_parallel_size参数来控制 TableScan 的并行度,实现资源平铺的功能。

发布于: 2024-12-02阅读数: 4408
用户头像

阿泽🧸

关注

还未添加个人签名 2020-11-12 加入

还未添加个人简介

评论

发布
暂无评论
ByConity ELT 初体验_ByConity_阿泽🧸_InfoQ写作社区