解密数据仓库 LLVM 技术神奇之处

2022 年 3 月 03 日
本文字数：3483 字
阅读完需：约 11 分钟

本文分享自华为云社区《GaussDB(DWS) 性能黑科技之LLVM技术解密》，作者：清道夫。

LLVM 这个名字最早源于底层虚拟机（Low Level VirTual Machine）的首字母缩写，但随着 LLVM 项目的演进，底层虚拟机的含义已经不再适用于 LLVM。现在谈到 LLVM，广义上讲就是指 LLVM 本身，它是一套用于开发编译前端与后端的工具套件，狭义上讲 LLVM 就是指整个编译套件的优化器及后端，而 CLANG 可以认为是 C/C++的前端。

LLVM 的优势？

传统编译器最常见的三阶段设计：

前端：解析源代码生成抽象语法树

优化器：根据优化规则对代码进行改善，相当于规则重写，例如消除冗余计算等

后端：将代码映射到目标指令集上，包括指令选择、寄存器分配和指令调度等。

GCC 是一个完整的可执行文件，没有为其他语言的开发者提供代码重用的接口，灵活性不足。

LLVM 也采用经典的三段式设计，但与传统编译器最大的不同就是针对不同语言都提供了同样一种中间表示 IR 以及模块化的后端（MCJIT 模块可以支持 JIT 编译）。

DWS 为什么要使用 LLVM？

为具体的查询生成定制化的机器码代替通用的函数实现，并尽可能的将数据存储在 CPU 的寄存器中：

（1）解决条件逻辑冗余的问题，需要用到 JIT（jiust-in-time）编译技术，LLVM 天然支持 JIT 技术。

（2）减少大量虚函数调用

（3）改善数据调用，将数据尽可能的从内存加载到 cache 上

（4）发挥通用硬件平台的扩展指令集功能，例如 SSE4.2

一个简单的例子，可以参照下面的优化前伪代码片段：

（1）首先是优化前的代码片段（a）：可以看到，在物化 tuple 的过程中，需要根据不同的列属性判断其偏移量，并调用相应的解析函数。

（2）借助 LLVM 使用 JIT 编译技术后，可以生成如下优化后的伪代码片段（b）：其中偏移量已经被提前计算出来，且无需判断列属性既可以调用对应的解析函数。减少了偏移量的重复计算及类型的重复判断。

优化前的代码（a）：

void MaterializeTuple(char *tuple){    for (int i = 0; i < num_slots; ++i) {        char *slot = tuple + offset[i];        switch(types[i]) {            case BOOLEAN:                *slot = ParseBoolean();                break;            case INT:                *slot = ParseInt();                break;            case FLOAT:...            case STRING:...            // etc.                            }    }}

复制代码

优化后的代码（b）：

void MaterializeTuple(char *tuple){    *(tuple + 0) = ParseInt();    *(tuple + 4) = ParseBoolean();    *(tuple + 5) = ParseInt();}

复制代码

如何使用 LLVM

在 DWS 中，涉及 LLVM 的 GUC 参数有两个：

（1）enable_codegen：总开关，用于控制是否开启 codegen，默认为 on

（2）codegen_cost_threshold：使用处理行数控制是否开启 codegen，默认门槛值为 10000。

目前 DWS 中并非是通过计划代价去控制是否开启 codegen，而是通过处理行数来控制的。此处 10000 是通过实验验证得出的优化值，不建议将此门槛值设置的过低，因为代码执行过程中的即时编译是有代价的。

简单示例：

test=# create table llvm_test(a int,b int)with(orientation=column);NOTICE:  The 'DISTRIBUTE BY' clause is not specified. Using 'a' as the distribution column by default.HINT:  Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column.CREATE TABLEtest=# insert into llvm_test values(generate_series(1,10),generate_series(1,10));INSERT 0 10test=# show enable_codegen; enable_codegen---------------- on(1 row)
test=# show codegen_cost_threshold; codegen_cost_threshold------------------------ 10000(1 row)
test=# set codegen_cost_threshold=0; --为了简化，将门槛值设置为0SET
test=# set enable_fast_query_shipping=off; --关闭FQS，便于打印DN执行信息SETtest=# explain performance select * from llvm_test where b<10; --简单表达式会使用codegen

复制代码

                                     Datanode Information (identified by plan id) ---------------------------------------------------------------------------------------------------------------------   1 --Row Adapter         (actual time=14.929..15.393 rows=9 loops=1)         (CPU: ex c/r=5783, ex row=9, ex cyc=52048, inc cyc=46174324)   2 --Vector Streaming (type: GATHER)         (actual time=14.920..15.372 rows=9 loops=1)         (Buffers: 0)         (CPU: ex c/r=5124697, ex row=9, ex cyc=46122276, inc cyc=46122276)   3 --CStore Scan on public.llvm_test         LLVM Optimized         datanode1 (actual time=0.094..0.141 rows=9 loops=1) (filter time=0.002) (RoughCheck CU: CUNone: 1, CUSome: 9)         datanode1 (Buffers: shared hit=26 dirtied=1)         datanode1 (CPU: ex c/r=41430, ex row=10, ex cyc=414306, inc cyc=414306)

复制代码

此处仅截取部分执行信息，在Datanode Information中的扫描算子中有LLVM Optimized信息，代表已经使用了 JIT 编译。

如果想查看 LLVM JIT 编译的时间耗时，可以借助 GUC 参数analysis_options进行设置后，执行对应查询语句，在User Define Profiling中就可以看到 LLVM 的编译时间。

test=# set analysis_options='on(LLVM_COMPILE)';SETtest=# explain performance select * from llvm_test where b<10;

复制代码

此处仍使用之前用例的设置，不再赘述。

 User Define Profiling ---------------------------------------------------------------- Plan Node id: 2  Track name: coordinator get datanode connection        coordinator1: (time=0.015 total_calls=1 loops=1) Plan Node id: 3  Track name: load CU description        datanode1 (time=0.061 total_calls=10 loops=1) Plan Node id: 3  Track name: min/max check        datanode1 (time=0.008 total_calls=10 loops=1) Plan Node id: 3  Track name: fill vector batch        datanode1 (time=0.032 total_calls=9 loops=1) Plan Node id: 3  Track name: apply projection and filter        datanode1 (time=0.005 total_calls=9 loops=1) Plan Node id: 3  Track name: LLVM Compilation        datanode1 (time=11.024 total_calls=1 loops=1) Plan Node id: 3  Track name: fill later vector batch        datanode1 (time=0.000 total_calls=9 loops=1)

复制代码

其中LLVM Compilation即为 LLVM 的即时编译的时间代价。此处编译的时间代价通常会与 SQL 执行流程的复杂程度成正比关系，在实际的调优实践中可以结合此数据对处理数据行数的门槛值做进一步的调整。

LLVM 适用场景

目前 LLVM 仅支持 DN 上且是列存向量化执行路径的查询作业，其支持的表达式及算子如下：

支持 LLVM 的表达式
Case…when… 表达式
In 表达式
Bool 表达式 (And/Or/Not)
BooleanTest 表达式 (IS_NOT_KNOWN/IS_UNKNOWN/IS_TRUE/IS_NOT_TRUE/IS_FALSE/IS_NOT_FALSE)
NullTest 表达式 (IS_NOT_NULL/IS_NULL)
Operator 表达式
Function 表达式 (lpad, substring, btrim, rtrim, length)
Nullif 表达式

表达式计算支持的数据类型包括 bool, tinyint, smallint, int, bigint, float4, float8, numeric, date, time, timetz, timestamp, timestamptz, interval, bpchar, varchar, text, oid。

仅当表达式出现在向量化执行引擎中 Scan 节点的 filter、Hash Join 节点中的 complicate hash condition、hash join filter、hash join target, Nested Loop 节点中的 filter、join filter, Merge Join 节点的 merge join filter, merge join target, Group 节点中的 filter 表达式时，才会考虑是否使用 LLVM 动态编译优化。

支持 LLVM 的算子：
Join ：HashJoin/SonicHashJoin
Agg ：HashAgg
Sort

其中 HashJoin 算子仅支持 Hash Inner Join，对应的 hash cond 仅支持 int4、bigint、bpchar 类型的比较；HashAgg 算子仅支持针对 bigint、numeric 类型的 sum 及 avg 操作，且 group by 语句仅支持 int4、bigint、bpchar，text，varchar，timestamp 类型操作，同时支持 count(*)聚集操作。Sort 算子仅支持对 int4，bigint，numeric，bpchar，text，varchar 数据类型的比较操作。除此之外，无法使用 LLVM 动态编译优化，具体可通过 explain performance 工具进行显示（如上用例所示）。

想了解 GuassDB(DWS)更多信息，欢迎微信搜索“GaussDB DWS”关注微信公众号，和您分享最新最全的 PB 级数仓黑科技，后台还可获取众多学习资料哦~

点击关注，第一时间了解华为云新鲜技术~

发布于: 刚刚阅读数: 2

原文链接:【http://xie.infoq.cn/article/d17a6fbcc8c18d556ab7899d3】。文章转载请联系作者。