1. 单表查询
语法格式:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY| ORDER BY col_list]
]
[LIMIT number]
复制代码
注:
1、order by 会对输入做全局排序,因此只有一个 reducer,会导致当输入规模较大时,需要较长的计算时间。
2、sort by 不是全局排序,其在数据进入 reducer 前完成排序。因此,如果用 sort by 进行排序,并且设置 mapred.reduce.tasks>1,则 sort by 只保证每个 reducer 的输出有序,不保证全局有序。
3、distribute by(字段)根据指定的字段将数据分到不同的 reducer,且分发算法是 hash 散列。
4、Cluster by(字段) 除了具有 Distribute by 的功能外,还会对该字段进行排序。
因此,如果分桶和 sort 字段是同一个时,此时,cluster by = distribute by + sort by
分桶表的作用:最大的作用是用来提高 join 操作的效率;
(思考这个问题:select a.id,a.name,b.addr from a join b on a.id = b.id;如果 a 表和 b 表已经是分桶表,而且分桶的字段是 id 字段做这个 join 操作时,还需要全表做笛卡尔积吗?)
hive (myhive)> select * from score;
复制代码
hive (myhive)> select s_id ,c_id from score;
复制代码
hive (myhive)> select s_id as myid ,c_id from score;
复制代码
1)求总行数(count)
hive (myhive)> select count(1) from score;
2)求分数的最大值(max)
hive (myhive)> select max(s_score) from score;
3)求分数的最小值(min)
hive (myhive)> select min(s_score) from score;
4)求分数的总和(sum)
hive (myhive)> select sum(s_score) from score;
5)求分数的平均值(avg)
hive (myhive)> select avg(s_score) from score;
复制代码
hive (myhive)> select * from score limit 3;
复制代码
hive (myhive)> select * from score where s_score > 60;
复制代码
2. 多表 join 查询
查询分数对应的姓名
hive (myhive)> SELECT s.s_id,s.s_score,stu.s_name,stu.s_birth FROM score s LEFT JOIN student stu ON s.s_id = stu.s_id;
同时也可以为表起别名
hive (myhive)> select * from techer t join course c on t.t_id = c.t_id;
复制代码
hive (myhive)> select * from techer t [inner] join course c on t.t_id = c.t_id;
复制代码
hive (myhive)> select * from techer t left join course c on t.t_id = c.t_id;
复制代码
hive (myhive)> select * from techer t right join course c on t.t_id = c.t_id;
复制代码
hive (myhive)> SELECT * FROM techer t FULL JOIN course c ON t.t_id = c.t_id ;
复制代码
多表连接查询,查询老师对应的课程,以及对应的分数,对应的学生
hive (myhive)> select * from techer t
> left join course c
> on t.t_id = c.t_id
> left join score s
> on s.c_id = c.c_id
> left join student stu
> on s.s_id = stu.s_id;
复制代码
大多数情况下,Hive 会对每对 JOIN 连接对象启动一个 MapReduce 任务。本例中会首先启动一个 MapReduce job 对表 techer 和表 course 进行连接操作,然后会再启动一个 MapReduce job 将第一个 MapReduce job 的输出和表 score;进行连接操作
3 排序操作 Order By
hive (myhive)> SELECT * FROM student s LEFT JOIN score sco ON s.s_id = sco.s_id ORDER BY sco.s_score DESC;
复制代码
hive (myhive)> SELECT * FROM student s LEFT JOIN score sco ON s.s_id = sco.s_id ORDER BY sco.s_score asc;
复制代码
hive (myhive)> select s_id ,avg(s_score) avg from score group by s_id order by avg;
复制代码
hive (myhive)> select s_id ,avg(s_score) avg from score group by s_id order by s_id,avg;
复制代码
设置reduce个数
hive (myhive)> set mapreduce.job.reduces=3;
查看设置reduce个数
hive (myhive)> set mapreduce.job.reduces;
查询成绩按照成绩降序排列
hive (myhive)> select * from score sort by s_score;
将查询结果导入到文件中(按照成绩降序排列)
hive (myhive)> insert overwrite local directory '/export/servers/hivedatas/sort' select * from score sort by s_score;
复制代码
评论