Preface
Hello everyone, Baili here. In today's era of rapid technological progress, we have been moving step by step from producing data to putting it to work. How to extract more value from data is a question every company has to face, and being able to obtain that data faster, better, and more accurately to support business decisions is a key to staying competitive.
In my past projects I have worked with data warehouse stacks such as Hadoop and Spark, and both serve the purpose well. Today's test subject is ByConity, the cloud-native data warehouse open-sourced by ByteDance. How is it used, and how well does it perform? Let's find out step by step.
Background on ByConity
ByConity is a cloud-native data warehouse open-sourced by ByteDance, built to serve users' diverse data analysis needs. In August 2024, ByConity introduced BSP (Bulk Synchronous Parallel) mode. This addition brings task-level fault tolerance as well as finer-grained, resource-aware scheduling. With BSP mode, ByConity aims to fold the transformation (T) stage of data processing into the platform itself, so that data ingestion, processing, and analysis are seamlessly integrated into a one-stop solution.
ByteHouse has also heavily optimized the traditional data warehouse path, supporting large-scale parallel writes that are both more stable and faster.
In addition, real-time data warehouses see frequent updates, and batch ETL over overlapping windows produces a large volume of data updates; ByteHouse has been heavily optimized for this scenario as well.
Hands-on experience
Test environment
Logging in
I used an SSH client (Xshell 8) to log in; a Linux terminal or the shell that ships with Windows works just as well.
Enter the account and password to get onto the server.
To avoid the connection timing out and dropping mid-test, run:
tmux new -s user0001
clickhouse client --port 9010
These commands create a new tmux session and start the client.
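Once inside the client, a quick sanity check confirms the connection before running anything heavy. A minimal sketch, assuming the usual ClickHouse-style built-ins are available; the tpcds database name is only an illustration, so substitute whichever database holds your TPC-DS tables:
-- quick sanity checks after connecting (the tpcds database name is hypothetical)
SELECT version();
SHOW DATABASES;
USE tpcds;
SHOW TABLES;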
Query commands
Step one: connect to the database and run the setup below.
Since the TPC-DS queries are written in standard SQL, set the session's dialect type to ANSI:
set dialect_type = 'ANSI'
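To confirm the dialect actually took effect, you can query the session settings. A minimal sketch, assuming ByConity exposes a ClickHouse-style system.settings table:
-- verify the dialect switch (assumes a ClickHouse-style system.settings table)
SELECT name, value
FROM system.settings
WHERE name = 'dialect_type';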
Query test on a small data set
Here we run:
WITH customer_total_return
AS (SELECT wr_returning_customer_sk AS ctr_customer_sk,
ca_state AS ctr_state,
Sum(wr_return_amt) AS ctr_total_return
FROM web_returns,
date_dim,
customer_address
WHERE wr_returned_date_sk = d_date_sk
AND d_year = 2000
AND wr_returning_addr_sk = ca_address_sk
GROUP BY wr_returning_customer_sk,
ca_state)
SELECT c_customer_id,
c_salutation,
c_first_name,
c_last_name,
c_preferred_cust_flag,
c_birth_day,
c_birth_month,
c_birth_year,
c_birth_country,
c_login,
c_email_address,
c_last_review_date,
ctr_total_return
FROM customer_total_return ctr1,
customer_address,
customer
WHERE ctr1.ctr_total_return > (SELECT Avg(ctr_total_return) * 1.2
FROM customer_total_return ctr2
WHERE ctr1.ctr_state = ctr2.ctr_state)
AND ca_address_sk = c_current_addr_sk
AND ca_state = 'IN'
AND ctr1.ctr_customer_sk = c_customer_sk
ORDER BY c_customer_id,
c_salutation,
c_first_name,
c_last_name,
c_preferred_cust_flag,
c_birth_day,
c_birth_month,
c_birth_year,
c_birth_country,
c_login,
c_email_address,
c_last_review_date,
ctr_total_return
LIMIT 100;
The query succeeds, and it is fast:
Elapsed: 2.086 sec. Processed 12.07 million rows, 72.20 MB (5.79 million rows/s., 34.60 MB/s.)
Roughly 2 seconds to process 12.07 million rows, about 72 MB of data.
For a relatively small amount of data like this, there is no need to enable BSP; the query returns directly.
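If you want to look these numbers up again later rather than copying them off the client output, the query log can be consulted. A minimal sketch, assuming ByConity keeps the ClickHouse-style system.query_log table enabled:
-- recent finished queries with duration and data volume
-- (assumes a ClickHouse-style system.query_log table is enabled)
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes) AS read_size
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT 5;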
Query test on a large data set
with ws as
(select d_year AS ws_sold_year, ws_item_sk,
ws_bill_customer_sk ws_customer_sk,
sum(ws_quantity) ws_qty,
sum(ws_wholesale_cost) ws_wc,
sum(ws_sales_price) ws_sp
from web_sales
left join web_returns on wr_order_number=ws_order_number and ws_item_sk=wr_item_sk
join date_dim on ws_sold_date_sk = d_date_sk
where wr_order_number is null
group by d_year, ws_item_sk, ws_bill_customer_sk
),
cs as
(select d_year AS cs_sold_year, cs_item_sk,
cs_bill_customer_sk cs_customer_sk,
sum(cs_quantity) cs_qty,
sum(cs_wholesale_cost) cs_wc,
sum(cs_sales_price) cs_sp
from catalog_sales
left join catalog_returns on cr_order_number=cs_order_number and cs_item_sk=cr_item_sk
join date_dim on cs_sold_date_sk = d_date_sk
where cr_order_number is null
group by d_year, cs_item_sk, cs_bill_customer_sk
),
ss as
(select d_year AS ss_sold_year, ss_item_sk,
ss_customer_sk,
sum(ss_quantity) ss_qty,
sum(ss_wholesale_cost) ss_wc,
sum(ss_sales_price) ss_sp
from store_sales
left join store_returns on sr_ticket_number=ss_ticket_number and ss_item_sk=sr_item_sk
join date_dim on ss_sold_date_sk = d_date_sk
where sr_ticket_number is null
group by d_year, ss_item_sk, ss_customer_sk
)
select
ss_sold_year, ss_item_sk, ss_customer_sk,
round(ss_qty/(coalesce(ws_qty,0)+coalesce(cs_qty,0)),2) ratio,
ss_qty store_qty, ss_wc store_wholesale_cost, ss_sp store_sales_price,
coalesce(ws_qty,0)+coalesce(cs_qty,0) other_chan_qty,
coalesce(ws_wc,0)+coalesce(cs_wc,0) other_chan_wholesale_cost,
coalesce(ws_sp,0)+coalesce(cs_sp,0) other_chan_sales_price
from ss
left join ws on (ws_sold_year=ss_sold_year and ws_item_sk=ss_item_sk and ws_customer_sk=ss_customer_sk)
left join cs on (cs_sold_year=ss_sold_year and cs_item_sk=ss_item_sk and cs_customer_sk=ss_customer_sk)
where (coalesce(ws_qty,0)>0 or coalesce(cs_qty, 0)>0) and ss_sold_year=2000
order by
ss_sold_year, ss_item_sk, ss_customer_sk,
ss_qty desc, ss_wc desc, ss_sp desc,
other_chan_qty,
other_chan_wholesale_cost,
other_chan_sales_price,
ratio
LIMIT 100;
This time the query fails because it hits the memory limit. After the failure, append the settings below to the end of the SQL and run it again:
(Note: when appending, remove the trailing ';' from the original statement.)
SETTINGS
bsp_mode = 1,
distributed_max_parallel_size = 4;
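To make the placement clear, here is a toy query (not part of the benchmark) with the clause appended; the SETTINGS block goes after LIMIT, where the original trailing semicolon used to be:
-- toy illustration of where the appended SETTINGS clause goes
SELECT ws_item_sk, COUNT(*) AS cnt
FROM web_sales
GROUP BY ws_item_sk
ORDER BY cnt DESC
LIMIT 100
SETTINGS
bsp_mode = 1,
distributed_max_parallel_size = 4;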
We set it to 4 here (the value is kept a multiple of 4), but the query still fails.
Raising it to 12:
SETTINGS
bsp_mode = 1,
distributed_max_parallel_size = 12;
Elapsed: 58.904 sec. Processed 504.21 million rows, 6.05 GB (8.56 million rows/s., 102.68 MB/s.)
About 59 seconds to process 504 million rows, roughly 6 GB of data.
So when facing a large data set we switch on the BSP execution model by flipping bsp_mode from 0 to 1, and cap the number of tasks that can run in parallel across the cluster at 12. This lets the engine work through large data sets more efficiently, balances the load, and shortens processing time.
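If you would rather not append the clause to every statement, the same knobs can be set for the whole session instead. A minimal sketch, assuming these ByConity settings follow the usual ClickHouse SET semantics:
-- session-level alternative to the per-query SETTINGS clause
-- (assumption: standard SET semantics apply to these settings)
SET bsp_mode = 1;
SET distributed_max_parallel_size = 12;
-- queries run after this point use BSP mode without any extra clause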
Out-of-memory (OOM) test
By capping the query's memory with a setting, we can deliberately test the out-of-memory behavior:
WITH all_sales AS (
SELECT d_year
,i_brand_id
,i_class_id
,i_category_id
,i_manufact_id
,SUM(sales_cnt) AS sales_cnt
,SUM(sales_amt) AS sales_amt
FROM (SELECT d_year
,i_brand_id
,i_class_id
,i_category_id
,i_manufact_id
,cs_quantity - COALESCE(cr_return_quantity,0) AS sales_cnt
,cs_ext_sales_price - COALESCE(cr_return_amount,0.0) AS sales_amt
FROM catalog_sales JOIN item ON i_item_sk=cs_item_sk
JOIN date_dim ON d_date_sk=cs_sold_date_sk
LEFT JOIN catalog_returns ON (cs_order_number=cr_order_number
AND cs_item_sk=cr_item_sk)
WHERE i_category='Books'
UNION
SELECT d_year
,i_brand_id
,i_class_id
,i_category_id
,i_manufact_id
,ss_quantity - COALESCE(sr_return_quantity,0) AS sales_cnt
,ss_ext_sales_price - COALESCE(sr_return_amt,0.0) AS sales_amt
FROM store_sales JOIN item ON i_item_sk=ss_item_sk
JOIN date_dim ON d_date_sk=ss_sold_date_sk
LEFT JOIN store_returns ON (ss_ticket_number=sr_ticket_number
AND ss_item_sk=sr_item_sk)
WHERE i_category='Books'
UNION
SELECT d_year
,i_brand_id
,i_class_id
,i_category_id
,i_manufact_id
,ws_quantity - COALESCE(wr_return_quantity,0) AS sales_cnt
,ws_ext_sales_price - COALESCE(wr_return_amt,0.0) AS sales_amt
FROM web_sales JOIN item ON i_item_sk=ws_item_sk
JOIN date_dim ON d_date_sk=ws_sold_date_sk
LEFT JOIN web_returns ON (ws_order_number=wr_order_number
AND ws_item_sk=wr_item_sk)
WHERE i_category='Books') sales_detail
GROUP BY d_year, i_brand_id, i_class_id, i_category_id, i_manufact_id)
SELECT prev_yr.d_year AS prev_year
,curr_yr.d_year AS year
,curr_yr.i_brand_id
,curr_yr.i_class_id
,curr_yr.i_category_id
,curr_yr.i_manufact_id
,prev_yr.sales_cnt AS prev_yr_cnt
,curr_yr.sales_cnt AS curr_yr_cnt
,curr_yr.sales_cnt-prev_yr.sales_cnt AS sales_cnt_diff
,curr_yr.sales_amt-prev_yr.sales_amt AS sales_amt_diff
FROM all_sales curr_yr, all_sales prev_yr
WHERE curr_yr.i_brand_id=prev_yr.i_brand_id
AND curr_yr.i_class_id=prev_yr.i_class_id
AND curr_yr.i_category_id=prev_yr.i_category_id
AND curr_yr.i_manufact_id=prev_yr.i_manufact_id
AND curr_yr.d_year=2002
AND prev_yr.d_year=2002-1
AND CAST(curr_yr.sales_cnt AS DECIMAL(17,2))/CAST(prev_yr.sales_cnt AS DECIMAL(17,2))<0.9
ORDER BY sales_cnt_diff,sales_amt_diff
limit 100
SETTINGS max_memory_usage=5000000000;
This triggers an out-of-memory error:
would use 4.05 GB, attempt to allocate chunk of 503383360 bytes, maximum 4.66 GB
At this point we can either raise max_memory_usage (to 10000000000), or set bsp_mode = 1 and distributed_max_parallel_size = 12; either approach lets the query complete.
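Concretely, only the tail of the statement changes. A toy query (not part of the benchmark) shows the two variants:
-- Option 1: raise the per-query memory limit
SELECT i_category, COUNT(*) AS cnt
FROM item
GROUP BY i_category
ORDER BY cnt DESC
LIMIT 100
SETTINGS max_memory_usage = 10000000000;
-- Option 2: keep the 5 GB limit but let BSP mode spread the work across tasks
SELECT i_category, COUNT(*) AS cnt
FROM item
GROUP BY i_category
ORDER BY cnt DESC
LIMIT 100
SETTINGS
max_memory_usage = 5000000000,
bsp_mode = 1,
distributed_max_parallel_size = 12;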
Explaining the SQL with an LLM
We can use a large language model (Doubao, in my case) to explain some of these SQL statements, which makes it much easier to understand the business scenario behind a query while exploring the data.
Summary
Comparison
Compared with a typical Spark setup, ByConity supports multi-level resource isolation: different businesses and scenarios can create Virtual Warehouses on demand, giving physical resource isolation and read/write separation while keeping reads and writes strongly consistent, so the data is always up to date. That is a real step forward. On top of that, the storage-compute separation architecture allows scaling out in real time, which makes better use of resources and lowers cost.
Overall impressions
ByConity is relatively easy to pick up: the official documentation is thorough and the community is active enough to get users productive quickly. Overall the learning curve is gentle, and ByConity comes across as an efficient, low-cost data warehouse with solid resource isolation and elastic scaling.