Hi everyone, I'm 怀瑾握瑜, a big-data newbie and full-time dad of two little money-burners, 嘉与嘉, who can code by day and teach by night.
If you like my articles, feel free to [Follow ⭐] + [Like 👍] + [Comment 📃]. Your support keeps me going, and I look forward to growing together with you~
1. Importing directly from another table at table-creation time
This method reads the columns and data of an existing table while creating the new one. The most common use case is quickly extracting a slice of data for testing. For example:
0: jdbc:hive2://xxx.hadoop.com:2181,xxx2.h> CREATE TABLE if not exists testA(
. . . . . . . . . . . . . . . . . . . . . . .> id string comment '',
. . . . . . . . . . . . . . . . . . . . . . .> name string COMMENT ''
. . . . . . . . . . . . . . . . . . . . . . .> )partitioned by (pdt int)
. . . . . . . . . . . . . . . . . . . . . . .> STORED AS PARQUET
. . . . . . . . . . . . . . . . . . . . . . .> TBLPROPERTIES ("parquet.compression"="SNAPPY");
0: jdbc:hive2://xxx.hadoop.com:2181,xxx2.h> create table testB as select id, name from testA;
INFO : Compiling command(queryId=hive_20211123142151_32d40208-47c8-4f73-a9ec-de5aad51807c): create table testB as select id, name from testA
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:name, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20211123142151_32d40208-47c8-4f73-a9ec-de5aad51807c); Time taken: 0.414 seconds
INFO : Executing command(queryId=hive_20211123142151_32d40208-47c8-4f73-a9ec-de5aad51807c): create table testB as select id, name from testA
INFO : Query ID = hive_20211123142151_32d40208-47c8-4f73-a9ec-de5aad51807c
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Subscribed to counters: [] for queryId: hive_20211123142151_32d40208-47c8-4f73-a9ec-de5aad51807c
INFO : Tez session hasn't been created yet. Opening session
INFO : Dag name: create table testB as select id, nam...testA (Stage-1)
INFO : Status: Running (Executing on YARN cluster with App id application_1637046596410_10638)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 container SUCCEEDED 0 0 0 0 0 0
Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 01/02 [==========================>>] 100% ELAPSED TIME: 8.63 s
----------------------------------------------------------------------------------------------
INFO : Status: DAG finished successfully in 7.96 seconds
INFO :
INFO : Query Execution Summary
INFO : ----------------------------------------------------------------------------------------------
INFO : OPERATION DURATION
INFO : ----------------------------------------------------------------------------------------------
INFO : Compile Query 0.41s
INFO : Prepare Plan 9.99s
INFO : Get Query Coordinator (AM) 0.00s
INFO : Submit Plan 0.42s
INFO : Start DAG 1.16s
INFO : Run DAG 7.96s
INFO : ----------------------------------------------------------------------------------------------
INFO :
INFO : Task Execution Summary
INFO : ----------------------------------------------------------------------------------------------
INFO : VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms) INPUT_RECORDS OUTPUT_RECORDS
INFO : ----------------------------------------------------------------------------------------------
INFO : Map 1 0.00 0 0 0 0
INFO : Reducer 2 2306.00 4,180 89 0 0
INFO : ----------------------------------------------------------------------------------------------
INFO : OK
No rows affected (20.638 seconds)
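Stripping away the log output, the core statement is just CREATE TABLE ... AS SELECT (CTAS). A minimal sketch of the "quick extract for testing" use case (the table name testA_sample, the pdt filter, and the LIMIT are illustrative):
-- CTAS creates the table and fills it with the selected rows in one step
CREATE TABLE testA_sample AS
SELECT id, name FROM testA WHERE pdt = 20210101 LIMIT 1000;
Note that CTAS only carries over the selected columns and rows; the partitioning and storage properties of testA are not copied, so declare them explicitly if the test table needs them.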
2. Importing from one Hive table into another Hive table
Load table A's data into table B:
INSERT INTO TABLE testB select id, name from testA where id = 1;
2.1 Tip 1: Inserting data
Keyword: insert into table
This method is mostly used to append new rows, for example when landing data on a schedule or in real time, or when migrating historical data in several batches.
INSERT INTO TABLE testB select id, name from testA where id = 1;
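Because INSERT INTO appends rather than replaces, it works well for migrating history in batches; a minimal sketch, assuming you move one pdt partition of testA at a time (the partition value is illustrative):
-- Append one day of history; re-run with the next pdt value for the next batch
INSERT INTO TABLE testB SELECT id, name FROM testA WHERE pdt = 20210101;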
2.2 Tip 2: Overwriting data
Keyword: insert overwrite table
This method is mostly used for daily tables and similar jobs that run on a fixed schedule and need to overwrite the existing data.
INSERT OVERWRITE TABLE testB select id, name from testA where id = 1;
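Without a partition spec, INSERT OVERWRITE replaces everything already in testB, not just the matching rows. A minimal sketch of a daily refresh (the pdt value is made up):
-- Replaces the entire contents of testB with the latest day's rows
INSERT OVERWRITE TABLE testB SELECT id, name FROM testA WHERE pdt = 20211123;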
2.3 Tip 3: Partitioned data
Keyword: insert overwrite table partition(xxx=xxx)
This method is mostly used when the partition of the inserted data comes from a field in the query, possibly after some processing such as a modulo operation (see the dynamic-partition sketch below). With a static partition spec, the SELECT list contains only the non-partition columns:
INSERT OVERWRITE TABLE testB partition(pdt=20210101) select id, name from testA where id = 1;
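When the partition value is computed from a query field (the modulo case mentioned above), use a dynamic partition spec instead; a sketch, assuming testB is partitioned by pdt and that the modulo bucketing is purely illustrative (the settings are the same ones used in tip 4 below):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- The last column of the SELECT feeds the dynamic partition column pdt
INSERT OVERWRITE TABLE testB PARTITION (pdt)
SELECT id, name, cast(id as int) % 10 AS pdt FROM testA;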
2.4 Tip 4: Importing data into a compressed table (important)
In practice, most Hive tables are compressed, so an external file cannot be loaded into them directly; a workaround is needed.
Key steps:
1. Create a temporary table
2. Turn the partition into a regular column
3. Load the temporary table's data into the real table, specifying the partition column
Here is a detailed example.
Suppose the production table is testA, partitioned by pdt and compressed with Snappy:
CREATE TABLE if not exists testA(
id string comment '',
name string COMMENT ''
)partitioned by (pdt int)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY");
The data to import comes from Excel, or has to be generated by hand; inserting it row by row with INSERT is far too slow, especially for large volumes.
So we use a temporary table, testA_tmp, in which the partition column pdt becomes an ordinary column. Fields are delimited by "," and the table is stored as a plain text file, which makes loading easy:
CREATE TABLE if not exists testA_tmp(
id string comment '',
name string COMMENT '',
pdt int COMMENT ''
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Prepare a text file, tmp.txt, containing the data:
tmp.txt:
1,2,20210101
3,4,20210101
Upload it to HDFS and load it into the temporary table:
hadoop fs -put tmp.txt /dataTmp/
load data inpath '/dataTmp/tmp.txt' into table testA_tmp;
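If the file sits on the machine where the Hive client runs, you can skip the hadoop fs -put step and load straight from the local filesystem; a sketch (the local path is illustrative):
-- LOCAL makes Hive read the file from the client's local filesystem instead of HDFS
load data local inpath '/home/user/tmp.txt' into table testA_tmp;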
In the Hive CLI, enable dynamic partitioning and run the insert statement:
-- Allow the partition value to come from a query column rather than a static spec
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- Move the data from the temporary table into the compressed, partitioned table;
-- the last column of the SELECT (pdt) feeds the dynamic partition
INSERT OVERWRITE TABLE testA PARTITION (pdt) SELECT * FROM testA_tmp;
And that's it: the data lands in the target table, complete with its partitions.
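To double-check the result, you can list the partitions and sample a few rows (the queries are illustrative):
SHOW PARTITIONS testA;
SELECT * FROM testA WHERE pdt = 20210101 LIMIT 10;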
Conclusion
If you like my articles, feel free to [Follow ⭐] + [Like 👍] + [Comment 📃]. Your support keeps me going, and I look forward to growing together with you~
You can follow the WeChat official account 【怀瑾握瑜的嘉与嘉】 for resource downloads.