Operating on Partitioned and Bucketed Tables in Hive

Published: May 25, 2021

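Working with partitioned tables

Divide and conquer is one of the most common ideas in big data: a large file is split into many small files, so that each operation only has to deal with one small file. Hive supports the same idea. A large data set can be split by day or by hour into partitions, and working on one partition is much easier than working on the whole table.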

Syntax for creating a partitioned table:

hive (myhive)> create table score(s_id string,c_id string, s_score int) partitioned by (month string) row format delimited fields terminated by '\t';

Create a table with multiple partition columns:

hive (myhive)> create table score2 (s_id string,c_id string, s_score int) partitioned by (year string,month string,day string) row format delimited fields terminated by '\t';

Load data into a partitioned table:

hive (myhive)> load data local inpath '/export/servers/hivedatas/score.csv' into table score partition (month='201806');

Load data into a table with multiple partition columns:

hive (myhive)> load data local inpath '/export/servers/hivedatas/score.csv' into table score2 partition(year='2018',month='06',day='01');
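
Partitioning pays off at query time: a filter on the partition columns lets Hive read only the matching directories. A minimal sketch against the score and score2 tables above, assuming the two loads succeeded:

```sql
-- Reads only the month=201806 partition of score
select * from score where month = '201806';

-- Filters on all three partition columns of score2
select s_id, c_id, s_score
from score2
where year = '2018' and month = '06' and day = '01';
```

The partition columns can be selected and filtered like ordinary columns, although their values come from the directory names rather than from the data file.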

List the partitions:

hive (myhive)> show  partitions  score;
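
For a table with several partition columns, the listing can be narrowed with a partial partition specification. A small sketch using score2 and score from above:

```sql
-- List only the partitions of score2 that fall under year=2018
show partitions score2 partition(year='2018');

-- Show the storage location and other metadata of one partition of score
describe formatted score partition(month='201806');
```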

Add a single partition:

hive (myhive)> alter table score add partition(month='201805');

Add several partitions at once:

hive (myhive)> alter table score add partition(month='201804') partition(month = '201803');

Note: after a partition is added, a new directory appears under the table's directory in HDFS.
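
To check this from the Hive prompt, the table directory can be listed with the built-in dfs command. The path below assumes the default warehouse location /user/hive/warehouse and the myhive database; neither path comes from the article, so adjust it if hive.metastore.warehouse.dir points elsewhere:

```sql
-- "if not exists" avoids an error when the partition is already registered
alter table score add if not exists partition(month='201805');

-- Assumed default warehouse path; each partition appears as a month=... subdirectory
dfs -ls /user/hive/warehouse/myhive.db/score;
```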

Drop a partition:

hive (myhive)> alter table score drop partition(month = '201806');
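
A drop can also be made tolerant of missing partitions, and several partitions can be removed in one statement. A sketch using the partitions added earlier; note that drop separates the partition specifications with commas, whereas add separates them with spaces:

```sql
-- No error if the partition is absent
alter table score drop if exists partition(month='201806');

-- Drop two partitions in one statement (comma-separated)
alter table score drop if exists partition(month='201804'), partition(month='201803');
```

Because score is a managed (internal) table, dropping a partition also removes that partition's data from the table directory in HDFS.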

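Exercise: an external partitioned table

Requirements: a file named score.csv is stored on the cluster under a directory such as /scoredatas/day=20180607. A new file is generated every day and placed in the directory for that date. The files are shared with other users and must not be moved. Create a matching Hive table, load the data into it for statistical analysis, and make sure that dropping the table does not delete the data.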

Prepare the data:

hdfs dfs -mkdir -p /scoredatas/day=20180607
hdfs dfs -mkdir -p /scoredatas/day=20180608
hdfs dfs -put score.csv /scoredatas/day=20180607/
hdfs dfs -put score.csv /scoredatas/day=20180608/

Implementation steps:

hive (myhive)> create external table score4(s_id string, c_id string,s_score int) partitioned by (day string) row format delimited fields terminated by '\t' location '/scoredatas';

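Repair the table. This registers the partition directories that already exist under the table location in the metastore, which is what maps the table to the data files: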

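hive (myhive)> msck repair table score4;

Once the repair succeeds, the data in all partition directories is visible through the table. Instead of running a repair, the mapping can also be created by adding each partition manually:

hive (myhive)> alter table score4 add partition(day='20180607');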
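
To confirm that the requirement is met, the partitions and row counts can be checked, and the table can then be dropped to see that the files stay in place. A sketch, assuming both day directories were populated as in the data preparation step; the backticks around day are a precaution in case the name is treated as a keyword in a given Hive version:

```sql
-- Both day=... directories should now be registered as partitions
show partitions score4;

-- Row count per day
select `day`, count(*) as cnt from score4 group by `day`;

-- score4 is external, so dropping it removes only the metadata
drop table score4;

-- The data files should still be listed after the drop
dfs -ls /scoredatas/day=20180607;
```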

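Working with bucketed tables

Bucketing splits a table's data into a fixed number of buckets according to the value of a chosen column: each row is assigned to a bucket by hashing that column and taking the result modulo the number of buckets. In practice, the rows end up divided across multiple files according to that column.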

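Enable bucketing enforcement in Hive (this setting applies to Hive 1.x; from Hive 2.0 the property was removed and bucketing is always enforced):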

hive (myhive)> set hive.enforce.bucketing=true;

Set the number of reducers to match the number of buckets:

hive (myhive)> set mapreduce.job.reduces=3;

Create the bucketed table:

hive (myhive)> create table course (c_id string,c_name string,t_id string) clustered by(c_id) into 3 buckets row format delimited fields terminated by '\t';

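Loading data into a bucketed table: neither hdfs dfs -put nor load data distributes the rows into buckets, so a bucketed table has to be populated with insert overwrite.

Create a regular table, then use insert overwrite with a query to load the regular table's data into the bucketed table.

Create the regular table: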

hive (myhive)> create table course_common (c_id string,c_name string,t_id string) row format delimited fields terminated by '\t';

Load data into the regular table:

hive (myhive)> load data local inpath '/export/servers/hivedatas/course.csv' into table course_common;

Load data into the bucketed table with insert overwrite:

hive (myhive)> insert overwrite table course select * from course_common cluster by(c_id);
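
Once the insert finishes, the bucketing can be inspected. A sketch, again assuming the default warehouse path /user/hive/warehouse/myhive.db (an assumption, not taken from the article):

```sql
-- The table directory should normally contain three files, one per bucket
dfs -ls /user/hive/warehouse/myhive.db/course;

-- Read only the first of the three buckets
select * from course tablesample(bucket 1 out of 3 on c_id);
```

Sampling on the bucketing column lets Hive read a single bucket file instead of scanning the whole table.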

Original article: http://xie.infoq.cn/article/e51c445ef9d0a6e6cb67cdddc

Author: 五分钟学大数据

