写点什么

最强最全面的大数据 SQL 面试系列

  • 2021 年 12 月 28 日
  • 本文字数:16818 字

    阅读完需:约 55 分钟

本套 SQL 题的答案是由许多小伙伴共同贡献的,1+1 的力量是远远大于 2 的,有不少题目都采用了非常巧妙的解法,也有不少题目有多种解法。本套大数据 SQL 题不仅题目丰富多样,答案更是精彩绝伦!


注:以下参考答案都经过简单数据场景进行测试通过,但并未测试其他复杂情况。本文档的 SQL 主要使用 Hive SQL


因内容较多,带目录的 PDF 查看是比较方便的


最强最全面的大数据SQL经典面试题完整PDF版

一、行列转换

描述:表中记录了各年份各部门的平均绩效考核成绩。


表名:t1


表结构:


a -- 年份b -- 部门c -- 绩效得分
复制代码


表内容


 a   b  c2014  B  92015  A  82014  A  102015  B  7
复制代码

问题一:多行转多列

问题描述:将上述表内容转为如下输出结果所示:


 a  col_A col_B2014  10   92015  8    7
复制代码


参考答案


select     a,    max(case when b="A" then c end) col_A,    max(case when b="B" then c end) col_Bfrom t1group by a;
复制代码

问题二:如何将结果转成源表?(多列转多行)

问题描述:将问题一的结果转成源表,问题一结果表名为t1_2


参考答案


select     a,    b,    cfrom (    select a,"A" as b,col_a as c from t1_2     union all     select a,"B" as b,col_b as c from t1_2  )tmp; 
复制代码

问题三:同一部门会有多个绩效,求多行转多列结果

问题描述:2014 年公司组织架构调整,导致部门出现多个绩效,业务及人员不同,无法合并算绩效,源表内容如下:


2014  B  92015  A  82014  A  102015  B  72014  B  6
复制代码


输出结果如下所示


 a    col_A  col_B2014   10    6,92015   8     7
复制代码


参考答案:


select     a,    max(case when b="A" then c end) col_A,    max(case when b="B" then c end) col_Bfrom (    select         a,        b,        concat_ws(",",collect_set(cast(c as string))) as c    from t1    group by a,b)tmpgroup by a;
复制代码

二、排名中取他值

表名t2


表字段及内容


a    b   c2014  A   32014  B   12014  C   22015  A   42015  D   3
复制代码

问题一:按 a 分组取 b 字段最小时对应的 c 字段

输出结果如下所示


a   min_c2014  32015  4
复制代码


参考答案:


select  a,  c as min_cfrom(      select        a,        b,        c,        row_number() over(partition by a order by b) as rn       from t2 )awhere rn = 1;
复制代码

问题二:按 a 分组取 b 字段排第二时对应的 c 字段

输出结果如下所示


 a  second_c2014  12015  3
复制代码


参考答案


select  a,  c as second_cfrom(      select        a,        b,        c,        row_number() over(partition by a order by b) as rn       from t2 )awhere rn = 2;
复制代码

问题三:按 a 分组取 b 字段最小和最大时对应的 c 字段

输出结果如下所示


a    min_c  max_c2014  3      22015  4      3
复制代码


参考答案:


select  a,  min(if(asc_rn = 1, c, null)) as min_c,  max(if(desc_rn = 1, c, null)) as max_cfrom(      select        a,        b,        c,        row_number() over(partition by a order by b) as asc_rn,        row_number() over(partition by a order by b desc) as desc_rn       from t2 )awhere asc_rn = 1 or desc_rn = 1group by a; 
复制代码

问题四:按 a 分组取 b 字段第二小和第二大时对应的 c 字段

输出结果如下所示


a    min_c  max_c2014  1      12015  3      4
复制代码


参考答案


select    ret.a    ,max(case when ret.rn_min = 2 then ret.c else null end) as min_c    ,max(case when ret.rn_max = 2 then ret.c else null end) as max_cfrom (    select        *        ,row_number() over(partition by t2.a order by t2.b) as rn_min        ,row_number() over(partition by t2.a order by t2.b desc) as rn_max    from t2) as retwhere ret.rn_min = 2or ret.rn_max = 2group by ret.a;
复制代码

问题五:按 a 分组取 b 字段前两小和前两大时对应的 c 字段

注意:需保持 b 字段最小、最大排首位


输出结果如下所示


a    min_c  max_c2014  3,1     2,12015  4,3     3,4
复制代码


参考答案


select  tmp1.a as a,  min_c,  max_cfrom (  select     a,    concat_ws(',', collect_list(c)) as min_c  from    (     select       a,       b,       c,       row_number() over(partition by a order by b) as asc_rn     from t2     )a    where asc_rn <= 2     group by a )tmp1 join (  select     a,    concat_ws(',', collect_list(c)) as max_c  from    (     select        a,        b,        c,        row_number() over(partition by a order by b desc) as desc_rn      from t2    )a    where desc_rn <= 2    group by a )tmp2 on tmp1.a = tmp2.a; 
复制代码

三、累计求值

表名t3


表字段及内容


a    b   c2014  A   32014  B   12014  C   22015  A   42015  D   3
复制代码

问题一:按 a 分组按 b 字段排序,对 c 累计求和

输出结果如下所示


a    b   sum_c2014  A   32014  B   42014  C   62015  A   42015  D   7
复制代码


参考答案


select   a,   b,   c,   sum(c) over(partition by a order by b) as sum_cfrom t3; 
复制代码

问题二:按 a 分组按 b 字段排序,对 c 取累计平均值

输出结果如下所示


a    b   avg_c2014  A   32014  B   22014  C   22015  A   42015  D   3.5
复制代码


参考答案


select   a,   b,   c,   avg(c) over(partition by a order by b) as avg_cfrom t3;
复制代码

问题三:按 a 分组按 b 字段排序,对 b 取累计排名比例

输出结果如下所示


a    b   ratio_c2014  A   0.332014  B   0.672014  C   1.002015  A   0.502015  D   1.00
复制代码


参考答案


select   a,   b,   c,   round(row_number() over(partition by a order by b) / (count(c) over(partition by a)),2) as ratio_cfrom t3 order by a,b;
复制代码

问题四:按 a 分组按 b 字段排序,对 b 取累计求和比例

输出结果如下所示


a    b   ratio_c2014  A   0.502014  B   0.672014  C   1.002015  A   0.572015  D   1.00
复制代码


参考答案


select   a,   b,   c,   round(sum(c) over(partition by a order by b) / (sum(c) over(partition by a)),2) as ratio_cfrom t3 order by a,b;
复制代码

四、窗口大小控制

表名t4


表字段及内容


a    b   c2014  A   32014  B   12014  C   22015  A   42015  D   3
复制代码

问题一:按 a 分组按 b 字段排序,对 c 取前后各一行的和

输出结果如下所示


a    b   sum_c2014  A   12014  B   52014  C   12015  A   32015  D   4
复制代码


参考答案


select   a,  b,  lag(c,1,0) over(partition by a order by b)+lead(c,1,0) over(partition by a order by b) as sum_cfrom t4;
复制代码

问题二:按 a 分组按 b 字段排序,对 c 取平均值

问题描述:前一行与当前行的均值!


输出结果如下所示


a    b   avg_c2014  A   32014  B   22014  C   1.52015  A   42015  D   3.5
复制代码


参考答案


此处给出两种解法,其一:


select    a,    b,    avg(c) over(partition by a order by b rows between 1 preceding and current row )fromt4;
复制代码


其二:


select  a,  b,  case when lag_c is null then c  else (c+lag_c)/2 end as avg_cfrom ( select   a,   b,   c,   lag(c,1) over(partition by a order by b) as lag_c  from t4 )temp;
复制代码

五、产生连续数值

输出结果如下所示


12345...100
复制代码


参考答案


不借助其他任何外表,实现产生连续数值


此处给出两种解法,其一:


selectid_start+pos as idfrom(    select    1 as id_start,    1000000 as id_end) m  lateral view posexplode(split(space(id_end-id_start), '')) t as pos, val
复制代码


其二:


select  row_number() over() as idfrom    (select split(space(99), ' ') as x) tlateral viewexplode(x) ex;
复制代码


那如何产生 1 至 1000000 连续数值?


参考答案


select  row_number() over() as idfrom    (select split(space(999999), ' ') as x) tlateral viewexplode(x) ex;
复制代码

六、数据扩充与收缩

表名t6


表字段及内容


a324
复制代码

问题一:数据扩充

输出结果如下所示


a     b3   3、2、12   2、14   4、3、2、1
复制代码


参考答案


select    t.a,  concat_ws('、',collect_set(cast(t.rn as string))) as bfrom(    select      t6.a,    b.rn  from t6  left join  (    select     row_number() over() as rn   from     (select split(space(5), ' ') as x) t -- space(5)可根据t6表的最大值灵活调整   lateral view   explode(x) pe  ) b  on 1 = 1  where t6.a >= b.rn  order by t6.a, b.rn desc ) tgroup by  t.a;
复制代码

问题二:数据扩充,排除偶数

输出结果如下所示


a     b3   3、12   14   3、1
复制代码


参考答案


select    t.a,  concat_ws('、',collect_set(cast(t.rn as string))) as bfrom(    select      t6.a,    b.rn  from t6  left join  (    select     row_number() over() as rn   from     (select split(space(5), ' ') as x) t   lateral view   explode(x) pe  ) b  on 1 = 1  where t6.a >= b.rn and b.rn % 2 = 1  order by t6.a, b.rn desc ) tgroup by  t.a;
复制代码

问题三:如何处理字符串累计拼接

问题描述:将小于等于 a 字段的值聚合拼接起来


输出结果如下所示


a     b3     2、32     24     2、3、4
复制代码


参考答案


select    t.a,  concat_ws('、',collect_set(cast(t.a1 as string))) as bfrom(     select      t6.a,    b.a1  from t6  left join  (      select  a as a1    from t6  ) b  on 1 = 1  where t6.a >= b.a1  order by t6.a, b.a1 ) tgroup by  t.a;
复制代码

问题四:如果 a 字段有重复,如何实现字符串累计拼接

输出结果如下所示


a     b2     23     2、33     2、3、34     2、3、3、4
复制代码


参考答案


select   a,  bfrom ( select     t.a,   t.rn,   concat_ws('、',collect_list(cast(t.a1 as string))) as b from  (       select       a.a,     a.rn,     b.a1    from    (     select         a,       row_number() over(order by a ) as rn      from t6    ) a    left join    (        select  a as a1,     row_number() over(order by a ) as rn       from t6    ) b    on 1 = 1    where a.a >= b.a1 and a.rn >= b.rn     order by a.a, b.a1   ) t  group by  t.a,t.rn  order by t.a,t.rn) tt; 
复制代码

问题五:数据展开

问题描述:如何将字符串"1-5,16,11-13,9"扩展成"1,2,3,4,5,16,11,12,13,9"?注意顺序不变。


参考答案


select    concat_ws(',',collect_list(cast(rn as string)))from(  select     a.rn,   b.num,   b.pos  from   (    select     row_number() over() as rn    from (select split(space(20), ' ') as x) t -- space(20)可灵活调整    lateral view    explode(x) pe   ) a lateral view outer    posexplode(split('1-5,16,11-13,9', ',')) b as pos, num   where a.rn between cast(split(num, '-')[0] as int) and cast(split(num, '-')[1] as int) or a.rn = num   order by pos, rn ) t;
复制代码

七、合并与拆分

表名t7


表字段及内容


a    b2014  A2014  B2015  B2015  D
复制代码

问题一:合并

输出结果如下所示


2014  A、B2015  B、D
复制代码


参考答案:


select  a,  concat_ws('、', collect_set(t.b)) bfrom t7group by a;
复制代码

问题二:拆分

问题描述:将分组合并的结果拆分出来


参考答案


select  t.a,  dfrom( select  a,  concat_ws('、', collect_set(t7.b)) b from t7 group by a)tlateral view explode(split(t.b, '、')) table_tmp as d;
复制代码

八、模拟循环操作

表名t8


表字段及内容


a10110101
复制代码
问题一:如何将字符'1'的位置提取出来

输出结果如下所示:


1,3,42,4
复制代码


参考答案


select     a,    concat_ws(",",collect_list(cast(index as string))) as resfrom (    select         a,        index+1 as index,        chr    from (        select             a,            concat_ws(",",substr(a,1,1),substr(a,2,1),substr(a,3,1),substr(a,-1)) str        from t8    ) tmp1    lateral view posexplode(split(str,",")) t as index,chr    where chr = "1") tmp2group by a;
复制代码

九、不使用 distinct 或 group by 去重

表名t9


表字段及内容


a     b     c    d2014  2016  2014   A2014  2015  2015   B
复制代码

问题一:不使用 distinct 或 group by 去重

输出结果如下所示


2014  A2016  A2014  B2015  B
复制代码


参考答案


select  t2.year  ,t2.numfrom (  select    *    ,row_number() over (partition by t1.year,t1.num) as rank_1  from   (    select       a as year,      d as num    from t9    union all    select       b as year,      d as num    from t9    union all    select       c as year,      d as num    from t9   )t1)t2where rank_1=1order by num;
复制代码

十、容器--反转内容

表名t10


表字段及内容


aAB,CA,BADBD,EA
复制代码

问题一:反转逗号分隔的数据:改变顺序,内容不变

输出结果如下所示


BAD,CA,ABEA,BD
复制代码


参考答案


select   a,  concat_ws(",",collect_list(reverse(str)))from (  select     a,    str  from t10  lateral view explode(split(reverse(a),",")) t as str) tmp1group by a;
复制代码

问题二:反转逗号分隔的数据:改变内容,顺序不变

输出结果如下所示


BA,AC,DABDB,AE
复制代码


参考答案


select   a,  concat_ws(",",collect_list(reverse(str)))from (  select      a,     str  from t10  lateral view explode(split(a,",")) t as str) tmp1group by a;
复制代码

十一、多容器--成对提取数据

表名t11


表字段及内容


a       bA/B     1/3B/C/D   4/5/2
复制代码

问题一:成对提取数据,字段一一对应

输出结果如下所示


a       bA       1B       3B       4C       5D       2
复制代码


参考答案:


select   a_inx,  b_inxfrom (  select      a,     b,     a_id,     a_inx,     b_id,     b_inx  from t11  lateral view posexplode(split(a,'/')) t as a_id,a_inx  lateral view posexplode(split(b,'/')) t as b_id,b_inx) tmpwhere a_id=b_id;
复制代码

十二、多容器--转多行

表名t12


表字段及内容


a        b      c001     A/B     1/3/5002     B/C/D   4/5
复制代码

问题一:转多行

输出结果如下所示


a        d       e001     type_b    A001     type_b    B001     type_c    1001     type_c    3001     type_c    5002     type_b    B002     type_b    C002     type_b    D002     type_c    4002     type_c    5
复制代码


参考答案:


select   a,  d,  efrom (  select    a,    "type_b" as d,    str as e  from t12  lateral view explode(split(b,"/")) t as str  union all   select    a,    "type_c" as d,    str as e  from t12  lateral view explode(split(c,"/")) t as str) tmporder by a,d;
复制代码

十三、抽象分组--断点排序

表名t13


表字段及内容


a    b2014  12015  12016  12017  02018  02019  -12020  -12021  -12022  12023  1
复制代码

问题一:断点排序

输出结果如下所示


a    b    c 2014  1    12015  1    22016  1    32017  0    12018  0    22019  -1   12020  -1   22021  -1   32022  1    12023  1    2
复制代码


参考答案:


select    a,  b,  row_number() over( partition by b,repair_a order by a asc) as c--按照b列和[b的组首]分组,排序from (  select      a,    b,    a-b_rn as repair_a--根据b列值出现的次序,修复a列值为b首次出现的a列值,称为b的[组首]  from   (   select      a,     b,     row_number() over( partition by b order by  a  asc ) as b_rn--按b列分组,按a列排序,得到b列各值出现的次序   from t13   )tmp1)tmp2--注意,如果不同的b列值,可能出现同样的组首值,但组首值需要和a列值 一并参与分组,故并不影响排序。order by a asc; 
复制代码

十四、业务逻辑的分类与抽象--时效

日期表d_date


表字段及内容


date_id      is_work2017-04-13       12017-04-14       12017-04-15       02017-04-16       02017-04-17       1
复制代码


工作日:周一至周五 09:30-18:30


客户申请表t14


表字段及内容


a      b       c1     申请   2017-04-14 18:03:001     通过   2017-04-17 09:43:002     申请   2017-04-13 17:02:002     通过   2017-04-15 09:42:00
复制代码

问题一:计算上表中从申请到通过占用的工作时长

输出结果如下所示


a         d1        0.67h2       10.67h 
复制代码


参考答案:


select     a,    round(sum(diff)/3600,2) as dfrom (    select         a,        apply_time,        pass_time,        dates,        rn,        ct,        is_work,        case when is_work=1 and rn=1 then unix_timestamp(concat(dates,' 18:30:00'),'yyyy-MM-dd HH:mm:ss')-unix_timestamp(apply_time,'yyyy-MM-dd HH:mm:ss')            when is_work=0 then 0            when is_work=1 and rn=ct then unix_timestamp(pass_time,'yyyy-MM-dd HH:mm:ss')-unix_timestamp(concat(dates,' 09:30:00'),'yyyy-MM-dd HH:mm:ss')            when is_work=1 and rn!=ct then 9*3600        end diff    from (        select             a,            apply_time,            pass_time,            time_diff,            day_diff,            rn,            ct,            date_add(start,rn-1) dates        from (            select                 a,                apply_time,                pass_time,                time_diff,                day_diff,                strs,                start,                row_number() over(partition by a) as rn,                count(*) over(partition by a) as ct            from (                select                     a,                    apply_time,                    pass_time,                    time_diff,                    day_diff,                    substr(repeat(concat(substr(apply_time,1,10),','),day_diff+1),1,11*(day_diff+1)-1) strs                from (                    select                         a,                        apply_time,                        pass_time,                        unix_timestamp(pass_time,'yyyy-MM-dd HH:mm:ss')-unix_timestamp(apply_time,'yyyy-MM-dd HH:mm:ss') time_diff,                        datediff(substr(pass_time,1,10),substr(apply_time,1,10)) day_diff                    from (                        select                             a,                            max(case when b='申请' then c end) apply_time,                            max(case when b='通过' then c end) pass_time                        from t14                        group by a                    ) tmp1                ) tmp2            ) tmp3             lateral view explode(split(strs,",")) t as start        ) tmp4    ) tmp5    join d_date     on tmp5.dates = d_date.date_id) tmp6group by a;
复制代码

十五、时间序列--进度及剩余

表名t15


表字段及内容


date_id      is_work2017-07-30      02017-07-31      12017-08-01      12017-08-02      12017-08-03      12017-08-04      12017-08-05      02017-08-06      02017-08-07      1
复制代码

问题一:求每天的累计周工作日,剩余周工作日

输出结果如下所示


date_id      week_to_work  week_left_work2017-07-31      1             42017-08-01      2             32017-08-02      3             22017-08-03      4             12017-08-04      5             02017-08-05      5             02017-08-06      5             0
复制代码


参考答案:


此处给出两种解法,其一:


select  date_id,case date_format(date_id,'u')    when 1 then 1    when 2 then 2     when 3 then 3     when 4 then 4    when 5 then 5     when 6 then 5     when 7 then 5  end as week_to_work,case date_format(date_id,'u')    when 1 then 4    when 2 then 3      when 3 then 2     when 4 then 1    when 5 then 0     when 6 then 0     when 7 then 0  end as week_to_workfrom t15
复制代码


其二:


selectdate_id,week_to_work,week_sum_work-week_to_work as week_left_workfrom(    select    date_id,    sum(is_work) over(partition by year,week order by date_id) as week_to_work,    sum(is_work) over(partition by year,week) as week_sum_work    from(        select        date_id,        is_work,        year(date_id) as year,        weekofyear(date_id) as week        from t15    ) ta) tb order by date_id;
复制代码

十六、时间序列--构造日期

问题一:直接使用 SQL 实现一张日期维度表,包含以下字段:

date                  string                日期d_week                string                年内第几周weeks                 int                   周几w_start               string                周开始日w_end                 string                周结束日d_month              int                   第几月m_start              string                月开始日m_end                string                月结束日d_quarter            int                    第几季q_start              string                季开始日q_end                string                季结束日d_year               int                    年份y_start              string                年开始日y_end                string                年结束日
复制代码


参考答案


drop table if exists dim_date;create table if not exists dim_date(    `date` string comment '日期',    d_week string comment '年内第几周',    weeks string comment '周几',    w_start string comment '周开始日',    w_end string comment '周结束日',    d_month string comment '第几月',    m_start string comment '月开始日',    m_end string comment '月结束日',    d_quarter int comment '第几季',    q_start string comment '季开始日',    q_end string comment '季结束日',    d_year int comment '年份',    y_start string comment '年开始日',    y_end string comment '年结束日');--自然月: 指每月的1号到那个月的月底,它是按照阳历来计算的。就是从每月1号到月底,不管这个月有30天,31天,29天或者28天,都算是一个自然月。
insert overwrite table dim_dateselect `date` , d_week --年内第几周 , case weekid when 0 then '周日' when 1 then '周一' when 2 then '周二' when 3 then '周三' when 4 then '周四' when 5 then '周五' when 6 then '周六' end as weeks -- 周 , date_add(next_day(`date`,'MO'),-7) as w_start --周一 , date_add(next_day(`date`,'MO'),-1) as w_end -- 周日_end -- 月份日期 , concat('第', monthid, '月') as d_month , m_start , m_end
-- 季节 , quarterid as d_quart , concat(d_year, '-', substr(concat('0', (quarterid - 1) * 3 + 1), -2), '-01') as q_start --季开始日 , date_sub(concat(d_year, '-', substr(concat('0', (quarterid) * 3 + 1), -2), '-01'), 1) as q_end --季结束日 -- 年 , d_year , y_start , y_end

from ( select `date` , pmod(datediff(`date`, '2012-01-01'), 7) as weekid --获取周几 , cast(substr(`date`, 6, 2) as int) as monthid --获取月份 , case when cast(substr(`date`, 6, 2) as int) <= 3 then 1 when cast(substr(`date`, 6, 2) as int) <= 6 then 2 when cast(substr(`date`, 6, 2) as int) <= 9 then 3 when cast(substr(`date`, 6, 2) as int) <= 12 then 4 end as quarterid --获取季节 可以直接使用 quarter(`date`) , substr(`date`, 1, 4) as d_year -- 获取年份 , trunc(`date`, 'YYYY') as y_start --年开始日 , date_sub(trunc(add_months(`date`, 12), 'YYYY'), 1) as y_end --年结束日 , date_sub(`date`, dayofmonth(`date`) - 1) as m_start --当月第一天 , last_day(date_sub(`date`, dayofmonth(`date`) - 1)) m_end --当月最后一天 , weekofyear(`date`) as d_week --年内第几周 from ( -- '2021-04-01'是开始日期, '2022-03-31'是截止日期 select date_add('2021-04-01', t0.pos) as `date` from ( select posexplode( split( repeat('o', datediff( from_unixtime(unix_timestamp('2022-03-31', 'yyyy-mm-dd'), 'yyyy-mm-dd'), '2021-04-01')), 'o' ) ) ) t0 ) t1 ) t2;
复制代码

十七、时间序列--构造累积日期

表名t17


表字段及内容


date_id2017-08-012017-08-022017-08-03
复制代码

问题一:每一日期,都扩展成月初至当天

输出结果如下所示


date_id    date_to_day2017-08-01   2017-08-012017-08-02   2017-08-012017-08-02   2017-08-022017-08-03   2017-08-012017-08-03   2017-08-022017-08-03   2017-08-03
复制代码


这种累积相关的表,常做桥接表。


参考答案:


select  date_id,  date_add(date_start_id,pos) as date_to_dayfrom(  select    date_id,    date_sub(date_id,dayofmonth(date_id)-1) as date_start_id  from t17) m  lateral view posexplode(split(space(datediff(from_unixtime(unix_timestamp(date_id,'yyyy-MM-dd')),from_unixtime(unix_timestamp(date_start_id,'yyyy-MM-dd')))), '')) t as pos, val;
复制代码

十八、时间序列--构造连续日期

表名t18


表字段及内容


a             b         c101        2018-01-01     10101        2018-01-03     20101        2018-01-06     40102        2018-01-02     20102        2018-01-04     30102        2018-01-07     60
复制代码

问题一:构造连续日期

问题描述:将表中数据的 b 字段扩充至范围[2018-01-01, 2018-01-07],并累积对 c 求和。


b 字段的值是较稀疏的。


输出结果如下所示


a             b          c      d101        2018-01-01     10     10101        2018-01-02      0     10101        2018-01-03     20     30101        2018-01-04      0     30101        2018-01-05      0     30101        2018-01-06     40     70101        2018-01-07      0     70102        2018-01-01      0      0102        2018-01-02     20     20102        2018-01-03      0     20102        2018-01-04     30     50102        2018-01-05      0     50102        2018-01-06      0     50102        2018-01-07     60    110
复制代码


参考答案:


select  a,  b,  c,  sum(c) over(partition by a order by b) as dfrom(  select  t1.a,  t1.b,  case    when t18.b is not null then t18.c    else 0  end as c  from  (    select    a,    date_add(s,pos) as b    from    (      select        a,        '2018-01-01' as s,        '2018-01-07' as r      from (select a from t18 group by a) ta    ) m  lateral view       posexplode(split(space(datediff(from_unixtime(unix_timestamp(r,'yyyy-MM-dd')),from_unixtime(unix_timestamp(s,'yyyy-MM-dd')))), '')) t as pos, val  ) t1    left join t18    on  t1.a = t18.a and t1.b = t18.b) ts;
复制代码

十九、时间序列--取多个字段最新的值

表名t19


表字段及内容


date_id   a   b    c2014     AB  12    bc2015         23    2016               d2017     BC 
复制代码

问题一:如何一并取出最新日期

输出结果如下所示


date_a   a    date_b    b    date_c   c2017    BC    2015     23    2016    d
复制代码


参考答案:


此处给出三种解法,其一:


SELECT  max(CASE WHEN rn_a = 1 THEN date_id else 0 END) AS date_a        ,max(CASE WHEN rn_a = 1 THEN a else null END) AS a        ,max(CASE WHEN rn_b = 1 THEN date_id else 0 END) AS date_b        ,max(CASE WHEN rn_b = 1 THEN b else NULL  END) AS b        ,max(CASE WHEN rn_c = 1 THEN date_id  else 0 END) AS date_c        ,max(CASE WHEN rn_c = 1 THEN c else null END) AS cFROM    (            SELECT  date_id                    ,a                    ,b                    ,c                    --对每列上不为null的值  的 日期 进行排序                    ,row_number()OVER( PARTITION BY 1 ORDER BY CASE WHEN a IS NULL THEN 0 ELSE date_id END DESC) AS rn_a                    ,row_number()OVER(PARTITION BY 1 ORDER BY CASE WHEN b IS NULL THEN 0 ELSE date_id END DESC) AS rn_b                    ,row_number()OVER(PARTITION BY 1 ORDER BY CASE WHEN c IS NULL THEN 0 ELSE date_id END DESC) AS rn_c            FROM    t19        ) tWHERE   t.rn_a = 1OR      t.rn_b = 1OR      t.rn_c = 1;
复制代码


其二:


SELECT     a.date_id  ,a.a  ,b.date_id  ,b.b  ,c.date_id  ,c.cFROM(   SELECT        t.date_id,      t.a   FROM     (     SELECT         t.date_id       ,t.a       ,t.b       ,t.c     FROM t19 t INNER JOIN    t19 t1 ON t.date_id = t1.date_id AND t.a IS NOT NULL   ) t   ORDER BY t.date_id DESC   LIMIT 1) aLEFT JOIN (  SELECT      t.date_id    ,t.b  FROM      (    SELECT        t.date_id      ,t.b    FROM t19 t INNER JOIN t19 t1 ON t.date_id = t1.date_id AND t.b IS NOT NULL  ) t  ORDER BY t.date_id DESC  LIMIT 1) b ON 1 = 1 LEFT JOIN(  SELECT      t.date_id    ,t.c  FROM      (    SELECT        t.date_id      ,t.c    FROM t19 t INNER JOIN t19 t1 ON t.date_id = t1.date_id AND t.c IS NOT NULL  ) t  ORDER BY t.date_id DESC  LIMIT   1) cON 1 = 1;
复制代码


其三:


select   * from  (  select t1.date_id as date_a,t1.a from (select t1.date_id,t1.a  from t19 t1 where t1.a is not null) t1  inner join (select max(t1.date_id) as date_id   from t19 t1 where t1.a is not null) t2  on t1.date_id=t2.date_id) t1cross join(  select t1.date_b,t1.b from (select t1.date_id as date_b,t1.b  from t19 t1 where t1.b is not null) t1  inner join (select max(t1.date_id) as date_id   from t19 t1 where t1.b is not null)t2  on t1.date_b=t2.date_id) t2cross join (  select t1.date_c,t1.c from (select t1.date_id as date_c,t1.c  from t19 t1 where t1.c is not null) t1  inner join (select max(t1.date_id) as date_id   from t19 t1 where t1.c is not null)t2  on t1.date_c=t2.date_id) t3;
复制代码

二十、时间序列--补全数据

表名t20


表字段及内容


date_id   a   b    c2014     AB  12    bc2015         23    2016               d2017     BC 
复制代码

问题一:如何使用最新数据补全表格

输出结果如下所示


date_id   a   b    c2014     AB  12    bc2015     AB  23    bc2016     AB  23    d2017     BC  23    d
复制代码


参考答案:


select   date_id,   first_value(a) over(partition by aa order by date_id) as a,  first_value(b) over(partition by bb order by date_id) as b,  first_value(c) over(partition by cc order by date_id) as cfrom(  select     date_id,    a,    b,    c,    count(a) over(order by date_id) as aa,    count(b) over(order by date_id) as bb,    count(c) over(order by date_id) as cc  from t20)tmp1;
复制代码

二十一、时间序列--取最新完成状态的前一个状态

表名t21


表字段及内容


date_id   a    b2014     1    A2015     1    B2016     1    A2017     1    B2013     2    A2014     2    B2015     2    A2014     3    A2015     3    A2016     3    B2017     3    A
复制代码


上表中 B 为完成状态

问题一:取最新完成状态的前一个状态

输出结果如下所示


date_id  a    b2016     1    A2013     2    A2015     3    A
复制代码


参考答案:


此处给出两种解法,其一:


select    t21.date_id,    t21.a,    t21.bfrom    (        select            max(date_id) date_id,            a        from            t21        where            b = 'B'        group by            a    ) t1    inner join t21 on t1.date_id -1 = t21.date_idand t1.a = t21.a;
复制代码


其二:


select  next_date_id as date_id  ,a  ,next_b as bfrom(  select    *,min(nk) over(partition by a,b) as minb  from(    select      *,row_number() over(partition by a order by date_id desc) nk      ,lead(date_id) over(partition by a order by date_id desc) next_date_id      ,lead(b) over(partition by a order by date_id desc) next_b    from(      select * from t21    ) t  ) t) twhere minb = nk and b = 'B';
复制代码

问题二:如何将完成状态的过程合并

输出结果如下所示:


a   b_merge1   A、B、A、B2   A、B3   A、A、B
复制代码


参考答案


select  a  ,collect_list(b) as bfrom(  select    *    ,min(if(b = 'B',nk,null)) over(partition by a) as minb  from(    select      *,row_number() over(partition by a order by date_id desc) nk    from(      select * from t21    ) t  ) t) twhere nk >= minbgroup by a;
复制代码

二十二、非等值连接--范围匹配

表 f 是事实表,表 d 是匹配表,在 hive 中如何将匹配表中的值关联到事实表中?


表 d 相当于拉链过的变化维,但日期范围可能是不全的。


表 f


date_id  p_id 2017    C 2018    B 2019    A 2013    C
复制代码


表 d


d_start    d_end    p_id   p_value 2016     2018     A       1 2016     2018     B       2 2008     2009     C       4 2010     2015     C       3
复制代码

问题一:范围匹配

输出结果如下所示


date_id  p_id   p_value 2017    C      null 2018    B      2 2019    A      null 2013    C      3
复制代码


**参考答案


此处给出两种解法,其一:


select   f.date_id,  f.p_id,  A.p_valuefrom f left join (  select     date_id,    p_id,    p_value  from   (    select       f.date_id,      f.p_id,      d.p_value    from f     left join d on f.p_id = d.p_id    where f.date_id >= d.d_start and f.date_id <= d.d_end  )A)AON f.date_id = A.date_id;
复制代码


其二:


select     date_id,    p_id,    flag as p_valuefrom (    select         f.date_id,        f.p_id,        d.d_start,        d.d_end,        d.p_value,        if(f.date_id between d.d_start and d.d_end,d.p_value,null) flag,        max(d.d_end) over(partition by date_id) max_end    from f    left join d    on f.p_id = d.p_id) tmpwhere d_end = max_end;
复制代码

二十三、非等值连接--最近匹配

表 t23_1 和表 t23_2 通过 a 和 b 关联时,有相等的取相等的值匹配,不相等时每一个 a 的值在 b 中找差值最小的来匹配。


t23_1 和 t23_2 为两个班的成绩单,t23_1 班的每个学生成绩在 t23_2 班中找出成绩最接近的成绩。


表 t23_1:a 中无重复值


a1245810
复制代码


表 t23_2:b 中无重复值


 b2371113
复制代码

问题一:单向最近匹配

输出结果如下所示


注意:b 的值可能会被丢弃


a    b1    22    24    35    35    78    710   11
复制代码


参考答案


select   * from(  select     ttt1.a,    ttt1.b   from  (    select       tt1.a,      t23_2.b,      dense_rank() over(partition by tt1.a order by abs(tt1.a-t23_2.b)) as dr     from     (      select         t23_1.a       from t23_1       left join t23_2 on t23_1.a=t23_2.b       where t23_2.b is null    ) tt1     cross join t23_2  ) ttt1   where ttt1.dr=1   union all  select     t23_1.a,    t23_2.b   from t23_1   inner join t23_2 on t23_1.a=t23_2.b) result_t order by result_t.a;
复制代码

二十四、N 指标--累计去重

假设表 A 为事件流水表,客户当天有一条记录则视为当天活跃。


表 A


   time_id          user_id2018-01-01 10:00:00    0012018-01-01 11:03:00    0022018-01-01 13:18:00    0012018-01-02 08:34:00    0042018-01-02 10:08:00    0022018-01-02 10:40:00    0032018-01-02 14:21:00    0022018-01-02 15:39:00    0042018-01-03 08:34:00    0052018-01-03 10:08:00    0032018-01-03 10:40:00    0012018-01-03 14:21:00    005
复制代码


假设客户活跃非常,一天产生的事件记录平均达千条。

问题一:累计去重

输出结果如下所示


日期       当日活跃人数     月累计活跃人数_截至当日date_id   user_cnt_act    user_cnt_act_month2018-01-01      2                22018-01-02      3                42018-01-03      3                5
复制代码


参考答案


SELECT  tt1.date_id       ,tt2.user_cnt_act       ,tt1.user_cnt_act_monthFROM(   -- ④ 按照t.date_id分组求出user_cnt_act_month,得到tt1  SELECT  t.date_id         ,COUNT(user_id) AS user_cnt_act_month  FROM  (   -- ③ 表a和表b进行笛卡尔积,按照a.date_id,b.user_id分组,保证截止到当日的用户唯一,得出表t。    SELECT  a.date_id           ,b.user_id    FROM    (   -- ① 按照日期分组,取出date_id字段当主表的维度字段 得出表a      SELECT  from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') AS date_id      FROM test.temp_tanhaidi_20211213_1      GROUP BY  from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd')    ) a    INNER JOIN    (   -- ② 按照date_id、user_id分组,保证每天每个用户只有一条记录,得出表b      SELECT  from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') AS date_id             ,user_id      FROM test.temp_tanhaidi_20211213_1      GROUP BY  from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd')               ,user_id    ) b    ON 1 = 1    WHERE a.date_id >= b.date_id    GROUP BY  a.date_id             ,b.user_id  ) t  GROUP BY  t.date_id) tt1LEFT JOIN(   -- ⑥ 按照date_id分组求出user_cnt_act,得到tt2  SELECT  date_id         ,COUNT(user_id) AS user_cnt_act  FROM  (   -- ⑤ 按照日期分组,取出date_id字段当主表的维度字段 得出表a    SELECT  from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') AS date_id           ,user_id    FROM test.temp_tanhaidi_20211213_1    GROUP BY  from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd')             ,user_id  ) a  GROUP BY date_id) tt2ON tt2.date_id = tt1.date_id
复制代码




参考


最强最全面的大数据SQL经典面试题完整PDF版

发布于: 刚刚
用户头像

InfoQ签约作者 2020.11.10 加入

文章首发于公众号:五分钟学大数据。大数据领域原创技术号,深入大数据技术

评论

发布
暂无评论
最强最全面的大数据SQL面试系列