Data Lake (10): Integrating Hive with Iceberg

Author: Lansonli

2022-10-25

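Integrating Hive with Iceberg

Iceberg is a table format, and Hive can read and write Iceberg tables, although this places requirements on the Hive version.

The walkthrough below uses Hive 3.1.2 to operate on Iceberg tables.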

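I. Enabling Iceberg Support in Hive

1. Download iceberg-hive-runtime.jar

To query Iceberg tables from Hive, first download "iceberg-hive-runtime.jar". Hive uses this jar to load and update Iceberg table metadata. Download address: https://iceberg.apache.org/#releases/

After downloading the jar, upload it to the lib directory of both the Hive server and the Hive client. Inserting data into an Iceberg-format table from Hive also requires "libfb303-0.9.3.jar", so upload that jar to the same lib directories as well. The examples in this article use iceberg-hive-runtime-0.12.1.jar.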

2. Configure hive-site.xml

Add the following configuration to $HIVE_HOME/conf/hive-site.xml on the Hive client:

<property>
    <name>iceberg.engine.hive.enabled</name>
    <value>true</value>
</property>

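If the property has to be applied to several client machines, the edit can be scripted. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name ensure_property is hypothetical, and the sketch assumes the file has the usual <configuration> root with <property> children.

```python
import xml.etree.ElementTree as ET


def ensure_property(hive_site_xml: str, name: str, value: str) -> str:
    """Return hive-site.xml text with property `name` set to `value`.

    Hypothetical helper: updates the property if present, appends it otherwise.
    """
    root = ET.fromstring(hive_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            # Property already exists: overwrite its value in place.
            value_node = prop.find("value")
            if value_node is None:
                value_node = ET.SubElement(prop, "value")
            value_node.text = value
            break
    else:
        # Property is missing: append a new <property> element.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = "<configuration></configuration>"
    print(ensure_property(sample, "iceberg.engine.hive.enabled", "true"))
```

Note that ElementTree discards XML comments by default when parsing, so a hand-maintained hive-site.xml may be better edited manually.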

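II. Working with Iceberg-Format Tables in Hive

From the point of view of the Hive engine, the runtime environment has the notion of a catalog (a catalog mainly describes where datasets are located, that is, their metadata). Iceberg supports several catalog types when integrated with Hive, for example Hive, Hadoop, the third-party AWS Glue, and custom catalogs. In practice Hive may use any of these catalogs, and may even join data across different catalog types. To support this, the org.apache.iceberg.mr.hive.HiveIcebergStorageHandler class (shipped in iceberg-hive-runtime.jar) handles reading and writing Iceberg tables, and the "iceberg.catalog.<catalog_name>.type" property set in Hive determines how an Iceberg table is loaded. This property can be set to hive or hadoop. "<catalog_name>" is a name of your own choosing, which is then referenced through the iceberg.catalog table property when creating an Iceberg-format table in Hive.

When creating an Iceberg-format table in Hive, whether and how the iceberg.catalog property is set determines how the table is loaded (and where its data is stored). There are three cases.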
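The decision rule behind these three cases can be sketched in a few lines of Python. This is only an illustration of the rule as described in this article, not Iceberg's actual implementation: the function name resolve_loading_strategy, the return labels, and the choice to raise KeyError for an unregistered catalog name are all assumptions.

```python
from typing import Mapping

# Special table-property value named in the text.
LOCATION_BASED = "location_based_table"


def resolve_loading_strategy(
    table_props: Mapping[str, str],
    session_conf: Mapping[str, str],
) -> str:
    """Return 'hive', 'hadoop', or 'location' for a table (illustrative only)."""
    catalog_name = table_props.get("iceberg.catalog")
    if catalog_name is None:
        # Case 1: no iceberg.catalog property, so HiveCatalog is the default.
        return "hive"
    if catalog_name == LOCATION_BASED:
        # Case 3: load the table directly from its LOCATION path.
        return "location"
    # Case 2: look up the type registered for this catalog name.
    type_key = f"iceberg.catalog.{catalog_name}.type"
    if type_key not in session_conf:
        # Assumption: treat an unregistered catalog name as an error.
        raise KeyError(f"no catalog type registered under {type_key}")
    return session_conf[type_key]


if __name__ == "__main__":
    conf = {
        "iceberg.catalog.another_hive.type": "hive",
        "iceberg.catalog.hadoop.type": "hadoop",
    }
    print(resolve_loading_strategy({}, conf))
    print(resolve_loading_strategy({"iceberg.catalog": "hadoop"}, conf))
    print(resolve_loading_strategy({"iceberg.catalog": LOCATION_BASED}, conf))
```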

1. If iceberg.catalog is not set, HiveCatalog is used by default

In other words, if you create an Iceberg-format table in Hive without specifying the iceberg.catalog property, the data is stored under the corresponding Hive warehouse path.

On the Hive client node node3, start Hive and run the following:

-- Create an Iceberg-format table in Hive
create table test_iceberg_tbl1 (id int, name string, age int)
partitioned by (dt string)
stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
-- Load the following two jars; the MapReduce job launched by the insert needs them
add jar /software/hive-3.1.2/lib/iceberg-hive-runtime-0.12.1.jar;
add jar /software/hive-3.1.2/lib/libfb303-0.9.3.jar;
-- Insert a row into the table
insert into test_iceberg_tbl1 values (1, "zs", 18, "20211212");
-- Query the table
select * from test_iceberg_tbl1;
-- OK
-- 1  zs  18  20211212


The table directory created above appears under Hive's default warehouse directory.

2. If iceberg.catalog is set to a catalog name, the catalog of the corresponding type is used

In other words, if you specify a value for the iceberg.catalog property when creating an Iceberg-format table in Hive, the data is stored in the directory configured for the catalog with that name.

On the Hive client node node3, start Hive and run the following:

-- Register a HiveCatalog named another_hive
set iceberg.catalog.another_hive.type=hive;
-- Create an Iceberg-format table in Hive
create table test_iceberg_tbl2 (id int, name string, age int)
partitioned by (dt string)
stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
tblproperties ('iceberg.catalog'='another_hive');
-- Load the following two jars; the MapReduce job launched by the insert needs them
add jar /software/hive-3.1.2/lib/iceberg-hive-runtime-0.12.1.jar;
add jar /software/hive-3.1.2/lib/libfb303-0.9.3.jar;
-- Insert a row, then query the table
insert into test_iceberg_tbl2 values (2, "ls", 20, "20211212");
select * from test_iceberg_tbl2;
-- OK
-- 2  ls  20  20211212


Setting "iceberg.catalog.another_hive.type=hive" as above means the Hive catalog is used, so the result is the same as in the first case where nothing is set: the table is stored under Hive's default warehouse directory. You can also specify a location when creating the table to store the data at a custom path.

Besides hive, the catalog type can also be set to hadoop. In that case, the Iceberg-format table created in Hive must specify a location that gives the exact storage path of the Iceberg data, and this custom path has to follow a fixed layout. On the Hive client node node3, start Hive and run the following:

-- Register a HadoopCatalog named hadoop
set iceberg.catalog.hadoop.type=hadoop;
-- With a HadoopCatalog, "iceberg.catalog.<catalog_name>.warehouse" must be set to the warehouse path
set iceberg.catalog.hadoop.warehouse=hdfs://mycluster/iceberg_data;
-- Create an Iceberg-format table in Hive, here as an external table
create external table test_iceberg_tbl3 (id int, name string, age int)
partitioned by (dt string)
stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
location 'hdfs://mycluster/iceberg_data/default/test_iceberg_tbl3'
tblproperties ('iceberg.catalog'='hadoop');
-- Note: the location above must be a sub-path of the path set in "iceberg.catalog.hadoop.warehouse",
-- in the form ${iceberg.catalog.hadoop.warehouse}/${current Hive database}/${Iceberg table name}
-- Load the following two jars; the MapReduce job launched by the insert needs them
add jar /software/hive-3.1.2/lib/iceberg-hive-runtime-0.12.1.jar;
add jar /software/hive-3.1.2/lib/libfb303-0.9.3.jar;
-- Insert a row, then query the table
insert into test_iceberg_tbl3 values (3, "ww", 20, "20211213");
select * from test_iceberg_tbl3;
-- OK
-- 3  ww  20  20211213
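Because the location for a HadoopCatalog table must follow the warehouse/database/table layout, it can help to build or check the path programmatically before running the DDL. The sketch below is a small Python illustration of that rule; the function names hadoop_table_location and location_matches are hypothetical, and ignoring a trailing slash is an assumption of this sketch.

```python
def hadoop_table_location(warehouse: str, database: str, table: str) -> str:
    """Build <warehouse>/<database>/<table>, ignoring a trailing slash on warehouse."""
    return "/".join([warehouse.rstrip("/"), database, table])


def location_matches(location: str, warehouse: str, database: str, table: str) -> bool:
    """Check that a LOCATION value follows the required layout."""
    return location.rstrip("/") == hadoop_table_location(warehouse, database, table)


if __name__ == "__main__":
    # Values taken from the example in this section.
    warehouse = "hdfs://mycluster/iceberg_data"
    print(hadoop_table_location(warehouse, "default", "test_iceberg_tbl3"))
    print(location_matches(
        "hdfs://mycluster/iceberg_data/default/test_iceberg_tbl3",
        warehouse, "default", "test_iceberg_tbl3",
    ))
```

With the values from the example above, the first call reproduces the location used in the create statement for test_iceberg_tbl3.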


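The table directory created above appears under the path given by "iceberg.catalog.hadoop.warehouse".

3. If iceberg.catalog is set to "location_based_table", the Iceberg table is loaded from the specified root path

In this case, if an Iceberg-format table already exists in HDFS, you can create an Iceberg-format table in Hive and point it at the corresponding location to map the existing data. In the Hive client, run the following:

CREATE TABLE test_iceberg_tbl4 (
  id int,
  name string,
  age int,
  dt string
)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
LOCATION 'hdfs://mycluster/spark/person'
TBLPROPERTIES ('iceberg.catalog'='location_based_table');
-- Note: the location must contain Iceberg-format table data, including its metadata directory.
-- Other kinds of data cannot be mapped to a Hive Iceberg-format table.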


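Note: because of the limits of the "partitioned by" clause in Hive's create-table syntax, an Iceberg-format table created from Hive can currently only be partitioned using Hive syntax, which is converted to Iceberg identity partitions under the hood. Iceberg partition transforms such as days(timestamp) cannot be used this way. To partition an Iceberg-format table with partition transforms, create the table with the Spark or Flink engine instead.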


Original link: http://xie.infoq.cn/article/631a1a89b651786116d067b70 (please contact the author before reposting this article).
