最强最全面的大数据 SQL 面试系列

作者：五分钟学大数据

2021 年 12 月 28 日
本文字数：16818 字
阅读完需：约 55 分钟

本套 SQL 题的答案是由许多小伙伴共同贡献的，1+1 的力量是远远大于 2 的，有不少题目都采用了非常巧妙的解法，也有不少题目有多种解法。本套大数据 SQL 题不仅题目丰富多样，答案更是精彩绝伦！

注：以下参考答案都经过简单数据场景进行测试通过，但并未测试其他复杂情况。本文档的 SQL 主要使用 Hive SQL。

因内容较多，带目录的 PDF 查看是比较方便的：

最强最全面的大数据SQL经典面试题完整PDF版

一、行列转换

描述：表中记录了各年份各部门的平均绩效考核成绩。

表名：t1

表结构：

a -- 年份b -- 部门c -- 绩效得分

复制代码

表内容：

 a   b  c2014  B  92015  A  82014  A  102015  B  7

复制代码

问题一：多行转多列

问题描述：将上述表内容转为如下输出结果所示：

 a  col_A col_B2014  10   92015  8    7

复制代码

参考答案：

select     a,    max(case when b="A" then c end) col_A,    max(case when b="B" then c end) col_Bfrom t1group by a;

复制代码

问题二：如何将结果转成源表？（多列转多行）

问题描述：将问题一的结果转成源表，问题一结果表名为t1_2。

参考答案：

select     a,    b,    cfrom (    select a,"A" as b,col_a as c from t1_2     union all     select a,"B" as b,col_b as c from t1_2  )tmp;

复制代码

问题三：同一部门会有多个绩效，求多行转多列结果

问题描述：2014 年公司组织架构调整，导致部门出现多个绩效，业务及人员不同，无法合并算绩效，源表内容如下：

2014  B  92015  A  82014  A  102015  B  72014  B  6

复制代码

输出结果如下所示：

 a    col_A  col_B2014   10    6,92015   8     7

复制代码

参考答案:

select     a,    max(case when b="A" then c end) col_A,    max(case when b="B" then c end) col_Bfrom (    select         a,        b,        concat_ws(",",collect_set(cast(c as string))) as c    from t1    group by a,b)tmpgroup by a;

复制代码

二、排名中取他值

表名：t2

表字段及内容：

a    b   c2014  A   32014  B   12014  C   22015  A   42015  D   3

复制代码

问题一：按 a 分组取 b 字段最小时对应的 c 字段

输出结果如下所示：

a   min_c2014  32015  4

复制代码

参考答案:

select  a,  c as min_cfrom(      select        a,        b,        c,        row_number() over(partition by a order by b) as rn       from t2 )awhere rn = 1;

复制代码

问题二：按 a 分组取 b 字段排第二时对应的 c 字段

输出结果如下所示：

 a  second_c2014  12015  3

复制代码

参考答案：

select  a,  c as second_cfrom(      select        a,        b,        c,        row_number() over(partition by a order by b) as rn       from t2 )awhere rn = 2;

复制代码

问题三：按 a 分组取 b 字段最小和最大时对应的 c 字段

输出结果如下所示：

a    min_c  max_c2014  3      22015  4      3

复制代码

参考答案:

select  a,  min(if(asc_rn = 1, c, null)) as min_c,  max(if(desc_rn = 1, c, null)) as max_cfrom(      select        a,        b,        c,        row_number() over(partition by a order by b) as asc_rn,        row_number() over(partition by a order by b desc) as desc_rn       from t2 )awhere asc_rn = 1 or desc_rn = 1group by a;

复制代码

问题四：按 a 分组取 b 字段第二小和第二大时对应的 c 字段

输出结果如下所示：

a    min_c  max_c2014  1      12015  3      4

复制代码

参考答案：

select    ret.a    ,max(case when ret.rn_min = 2 then ret.c else null end) as min_c    ,max(case when ret.rn_max = 2 then ret.c else null end) as max_cfrom (    select        *        ,row_number() over(partition by t2.a order by t2.b) as rn_min        ,row_number() over(partition by t2.a order by t2.b desc) as rn_max    from t2) as retwhere ret.rn_min = 2or ret.rn_max = 2group by ret.a;

复制代码

问题五：按 a 分组取 b 字段前两小和前两大时对应的 c 字段

注意：需保持 b 字段最小、最大排首位

输出结果如下所示：

a    min_c  max_c2014  3,1     2,12015  4,3     3,4

复制代码

参考答案：

select  tmp1.a as a,  min_c,  max_cfrom (  select     a,    concat_ws(',', collect_list(c)) as min_c  from    (     select       a,       b,       c,       row_number() over(partition by a order by b) as asc_rn     from t2     )a    where asc_rn <= 2     group by a )tmp1 join (  select     a,    concat_ws(',', collect_list(c)) as max_c  from    (     select        a,        b,        c,        row_number() over(partition by a order by b desc) as desc_rn      from t2    )a    where desc_rn <= 2    group by a )tmp2 on tmp1.a = tmp2.a;

复制代码

三、累计求值

表名：t3

表字段及内容：

a    b   c2014  A   32014  B   12014  C   22015  A   42015  D   3

复制代码

问题一：按 a 分组按 b 字段排序，对 c 累计求和

输出结果如下所示：

a    b   sum_c2014  A   32014  B   42014  C   62015  A   42015  D   7

复制代码

参考答案：

select   a,   b,   c,   sum(c) over(partition by a order by b) as sum_cfrom t3;

复制代码

问题二：按 a 分组按 b 字段排序，对 c 取累计平均值

输出结果如下所示：

a    b   avg_c2014  A   32014  B   22014  C   22015  A   42015  D   3.5

复制代码

参考答案：

select   a,   b,   c,   avg(c) over(partition by a order by b) as avg_cfrom t3;

复制代码

问题三：按 a 分组按 b 字段排序，对 b 取累计排名比例

输出结果如下所示：

a    b   ratio_c2014  A   0.332014  B   0.672014  C   1.002015  A   0.502015  D   1.00

复制代码

参考答案：

select   a,   b,   c,   round(row_number() over(partition by a order by b) / (count(c) over(partition by a)),2) as ratio_cfrom t3 order by a,b;

复制代码

问题四：按 a 分组按 b 字段排序，对 b 取累计求和比例

输出结果如下所示：

a    b   ratio_c2014  A   0.502014  B   0.672014  C   1.002015  A   0.572015  D   1.00

复制代码

参考答案：

select   a,   b,   c,   round(sum(c) over(partition by a order by b) / (sum(c) over(partition by a)),2) as ratio_cfrom t3 order by a,b;

复制代码

四、窗口大小控制

表名：t4

表字段及内容：

a    b   c2014  A   32014  B   12014  C   22015  A   42015  D   3

复制代码

问题一：按 a 分组按 b 字段排序，对 c 取前后各一行的和

输出结果如下所示：

a    b   sum_c2014  A   12014  B   52014  C   12015  A   32015  D   4

复制代码

参考答案：

select   a,  b,  lag(c,1,0) over(partition by a order by b)+lead(c,1,0) over(partition by a order by b) as sum_cfrom t4;

复制代码

问题二：按 a 分组按 b 字段排序，对 c 取平均值

问题描述：前一行与当前行的均值！

输出结果如下所示：

a    b   avg_c2014  A   32014  B   22014  C   1.52015  A   42015  D   3.5

复制代码

参考答案：

此处给出两种解法，其一：

select    a,    b,    avg(c) over(partition by a order by b rows between 1 preceding and current row )fromt4;

复制代码

其二：

select  a,  b,  case when lag_c is null then c  else (c+lag_c)/2 end as avg_cfrom ( select   a,   b,   c,   lag(c,1) over(partition by a order by b) as lag_c  from t4 )temp;

复制代码

五、产生连续数值

输出结果如下所示：

12345...100

复制代码

参考答案：

不借助其他任何外表，实现产生连续数值

此处给出两种解法，其一：

selectid_start+pos as idfrom(    select    1 as id_start,    1000000 as id_end) m  lateral view posexplode(split(space(id_end-id_start), '')) t as pos, val

复制代码

其二：

select  row_number() over() as idfrom    (select split(space(99), ' ') as x) tlateral viewexplode(x) ex;

复制代码

那如何产生 1 至 1000000 连续数值？

参考答案：

select  row_number() over() as idfrom    (select split(space(999999), ' ') as x) tlateral viewexplode(x) ex;

复制代码

六、数据扩充与收缩

表名：t6

表字段及内容：

a324

复制代码

问题一：数据扩充

输出结果如下所示：

a     b3   3、2、12   2、14   4、3、2、1

复制代码

参考答案：

select    t.a,  concat_ws('、',collect_set(cast(t.rn as string))) as bfrom(    select      t6.a,    b.rn  from t6  left join  (    select     row_number() over() as rn   from     (select split(space(5), ' ') as x) t -- space(5)可根据t6表的最大值灵活调整   lateral view   explode(x) pe  ) b  on 1 = 1  where t6.a >= b.rn  order by t6.a, b.rn desc ) tgroup by  t.a;

复制代码

问题二：数据扩充，排除偶数

输出结果如下所示：

a     b3   3、12   14   3、1

复制代码

参考答案：

select    t.a,  concat_ws('、',collect_set(cast(t.rn as string))) as bfrom(    select      t6.a,    b.rn  from t6  left join  (    select     row_number() over() as rn   from     (select split(space(5), ' ') as x) t   lateral view   explode(x) pe  ) b  on 1 = 1  where t6.a >= b.rn and b.rn % 2 = 1  order by t6.a, b.rn desc ) tgroup by  t.a;

复制代码

问题三：如何处理字符串累计拼接

问题描述：将小于等于 a 字段的值聚合拼接起来

输出结果如下所示：

a     b3     2、32     24     2、3、4

复制代码

参考答案：

select    t.a,  concat_ws('、',collect_set(cast(t.a1 as string))) as bfrom(     select      t6.a,    b.a1  from t6  left join  (      select  a as a1    from t6  ) b  on 1 = 1  where t6.a >= b.a1  order by t6.a, b.a1 ) tgroup by  t.a;

复制代码

问题四：如果 a 字段有重复，如何实现字符串累计拼接

输出结果如下所示：

a     b2     23     2、33     2、3、34     2、3、3、4

复制代码

参考答案：

select   a,  bfrom ( select     t.a,   t.rn,   concat_ws('、',collect_list(cast(t.a1 as string))) as b from  (       select       a.a,     a.rn,     b.a1    from    (     select         a,       row_number() over(order by a ) as rn      from t6    ) a    left join    (        select  a as a1,     row_number() over(order by a ) as rn       from t6    ) b    on 1 = 1    where a.a >= b.a1 and a.rn >= b.rn     order by a.a, b.a1   ) t  group by  t.a,t.rn  order by t.a,t.rn) tt;

复制代码

问题五：数据展开

问题描述：如何将字符串"1-5,16,11-13,9"扩展成"1,2,3,4,5,16,11,12,13,9"？注意顺序不变。

参考答案：

select    concat_ws(',',collect_list(cast(rn as string)))from(  select     a.rn,   b.num,   b.pos  from   (    select     row_number() over() as rn    from (select split(space(20), ' ') as x) t -- space(20)可灵活调整    lateral view    explode(x) pe   ) a lateral view outer    posexplode(split('1-5,16,11-13,9', ',')) b as pos, num   where a.rn between cast(split(num, '-')[0] as int) and cast(split(num, '-')[1] as int) or a.rn = num   order by pos, rn ) t;

复制代码

七、合并与拆分

表名：t7

表字段及内容：

a    b2014  A2014  B2015  B2015  D

复制代码

问题一：合并

输出结果如下所示：

2014  A、B2015  B、D

复制代码

参考答案:

select  a,  concat_ws('、', collect_set(t.b)) bfrom t7group by a;

复制代码

问题二：拆分

问题描述：将分组合并的结果拆分出来

参考答案：

select  t.a,  dfrom( select  a,  concat_ws('、', collect_set(t7.b)) b from t7 group by a)tlateral view explode(split(t.b, '、')) table_tmp as d;

复制代码

八、模拟循环操作

表名：t8

表字段及内容：

a10110101

复制代码

问题一：如何将字符'1'的位置提取出来

输出结果如下所示:

1,3,42,4

复制代码

参考答案：

select     a,    concat_ws(",",collect_list(cast(index as string))) as resfrom (    select         a,        index+1 as index,        chr    from (        select             a,            concat_ws(",",substr(a,1,1),substr(a,2,1),substr(a,3,1),substr(a,-1)) str        from t8    ) tmp1    lateral view posexplode(split(str,",")) t as index,chr    where chr = "1") tmp2group by a;

复制代码

九、不使用 distinct 或 group by 去重

表名：t9

表字段及内容：

a     b     c    d2014  2016  2014   A2014  2015  2015   B

复制代码

问题一：不使用 distinct 或 group by 去重

输出结果如下所示：

2014  A2016  A2014  B2015  B

复制代码

参考答案：

select  t2.year  ,t2.numfrom (  select    *    ,row_number() over (partition by t1.year,t1.num) as rank_1  from   (    select       a as year,      d as num    from t9    union all    select       b as year,      d as num    from t9    union all    select       c as year,      d as num    from t9   )t1)t2where rank_1=1order by num;

复制代码

十、容器--反转内容

表名：t10

表字段及内容：

aAB,CA,BADBD,EA

复制代码

问题一：反转逗号分隔的数据：改变顺序，内容不变

输出结果如下所示：

BAD,CA,ABEA,BD

复制代码

参考答案：

select   a,  concat_ws(",",collect_list(reverse(str)))from (  select     a,    str  from t10  lateral view explode(split(reverse(a),",")) t as str) tmp1group by a;

复制代码

问题二：反转逗号分隔的数据：改变内容，顺序不变

输出结果如下所示：

BA,AC,DABDB,AE

复制代码

参考答案：

select   a,  concat_ws(",",collect_list(reverse(str)))from (  select      a,     str  from t10  lateral view explode(split(a,",")) t as str) tmp1group by a;

复制代码

十一、多容器--成对提取数据

表名：t11

表字段及内容：

a       bA/B     1/3B/C/D   4/5/2

复制代码

问题一：成对提取数据，字段一一对应

输出结果如下所示：

a       bA       1B       3B       4C       5D       2

复制代码

参考答案:

select   a_inx,  b_inxfrom (  select      a,     b,     a_id,     a_inx,     b_id,     b_inx  from t11  lateral view posexplode(split(a,'/')) t as a_id,a_inx  lateral view posexplode(split(b,'/')) t as b_id,b_inx) tmpwhere a_id=b_id;

复制代码

十二、多容器--转多行

表名：t12

表字段及内容：

a        b      c001     A/B     1/3/5002     B/C/D   4/5

复制代码

问题一：转多行

输出结果如下所示：

a        d       e001     type_b    A001     type_b    B001     type_c    1001     type_c    3001     type_c    5002     type_b    B002     type_b    C002     type_b    D002     type_c    4002     type_c    5

复制代码

参考答案:

select   a,  d,  efrom (  select    a,    "type_b" as d,    str as e  from t12  lateral view explode(split(b,"/")) t as str  union all   select    a,    "type_c" as d,    str as e  from t12  lateral view explode(split(c,"/")) t as str) tmporder by a,d;

复制代码

十三、抽象分组--断点排序

表名：t13

表字段及内容：

a    b2014  12015  12016  12017  02018  02019  -12020  -12021  -12022  12023  1

复制代码

问题一：断点排序

输出结果如下所示：

a    b    c 2014  1    12015  1    22016  1    32017  0    12018  0    22019  -1   12020  -1   22021  -1   32022  1    12023  1    2

复制代码

参考答案:

select    a,  b,  row_number() over( partition by b,repair_a order by a asc) as c--按照b列和[b的组首]分组，排序from (  select      a,    b,    a-b_rn as repair_a--根据b列值出现的次序,修复a列值为b首次出现的a列值,称为b的[组首]  from   (   select      a,     b,     row_number() over( partition by b order by  a  asc ) as b_rn--按b列分组,按a列排序,得到b列各值出现的次序   from t13   )tmp1)tmp2--注意，如果不同的b列值，可能出现同样的组首值，但组首值需要和a列值 一并参与分组，故并不影响排序。order by a asc;

复制代码

十四、业务逻辑的分类与抽象--时效

日期表：d_date

表字段及内容：

date_id      is_work2017-04-13       12017-04-14       12017-04-15       02017-04-16       02017-04-17       1

复制代码

工作日：周一至周五 09:30-18:30

客户申请表：t14

表字段及内容：

a      b       c1     申请   2017-04-14 18:03:001     通过   2017-04-17 09:43:002     申请   2017-04-13 17:02:002     通过   2017-04-15 09:42:00

复制代码

问题一：计算上表中从申请到通过占用的工作时长

输出结果如下所示：

a         d1        0.67h2       10.67h

复制代码

参考答案:

select     a,    round(sum(diff)/3600,2) as dfrom (    select         a,        apply_time,        pass_time,        dates,        rn,        ct,        is_work,        case when is_work=1 and rn=1 then unix_timestamp(concat(dates,' 18:30:00'),'yyyy-MM-dd HH:mm:ss')-unix_timestamp(apply_time,'yyyy-MM-dd HH:mm:ss')            when is_work=0 then 0            when is_work=1 and rn=ct then unix_timestamp(pass_time,'yyyy-MM-dd HH:mm:ss')-unix_timestamp(concat(dates,' 09:30:00'),'yyyy-MM-dd HH:mm:ss')            when is_work=1 and rn!=ct then 9*3600        end diff    from (        select             a,            apply_time,            pass_time,            time_diff,            day_diff,            rn,            ct,            date_add(start,rn-1) dates        from (            select                 a,                apply_time,                pass_time,                time_diff,                day_diff,                strs,                start,                row_number() over(partition by a) as rn,                count(*) over(partition by a) as ct            from (                select                     a,                    apply_time,                    pass_time,                    time_diff,                    day_diff,                    substr(repeat(concat(substr(apply_time,1,10),','),day_diff+1),1,11*(day_diff+1)-1) strs                from (                    select                         a,                        apply_time,                        pass_time,                        unix_timestamp(pass_time,'yyyy-MM-dd HH:mm:ss')-unix_timestamp(apply_time,'yyyy-MM-dd HH:mm:ss') time_diff,                        datediff(substr(pass_time,1,10),substr(apply_time,1,10)) day_diff                    from (                        select                             a,                            max(case when b='申请' then c end) apply_time,                            max(case when b='通过' then c end) pass_time                        from t14                        group by a                    ) tmp1                ) tmp2            ) tmp3             lateral view explode(split(strs,",")) t as start        ) tmp4    ) tmp5    join d_date     on tmp5.dates = d_date.date_id) tmp6group by a;

复制代码

十五、时间序列--进度及剩余

表名：t15

表字段及内容：

date_id      is_work2017-07-30      02017-07-31      12017-08-01      12017-08-02      12017-08-03      12017-08-04      12017-08-05      02017-08-06      02017-08-07      1

复制代码

问题一：求每天的累计周工作日，剩余周工作日

输出结果如下所示：

date_id      week_to_work  week_left_work2017-07-31      1             42017-08-01      2             32017-08-02      3             22017-08-03      4             12017-08-04      5             02017-08-05      5             02017-08-06      5             0

复制代码

参考答案:

此处给出两种解法，其一：

select  date_id,case date_format(date_id,'u')    when 1 then 1    when 2 then 2     when 3 then 3     when 4 then 4    when 5 then 5     when 6 then 5     when 7 then 5  end as week_to_work,case date_format(date_id,'u')    when 1 then 4    when 2 then 3      when 3 then 2     when 4 then 1    when 5 then 0     when 6 then 0     when 7 then 0  end as week_to_workfrom t15

复制代码

其二：

selectdate_id,week_to_work,week_sum_work-week_to_work as week_left_workfrom(    select    date_id,    sum(is_work) over(partition by year,week order by date_id) as week_to_work,    sum(is_work) over(partition by year,week) as week_sum_work    from(        select        date_id,        is_work,        year(date_id) as year,        weekofyear(date_id) as week        from t15    ) ta) tb order by date_id;

复制代码

十六、时间序列--构造日期

问题一：直接使用 SQL 实现一张日期维度表，包含以下字段：

date                  string                日期d_week                string                年内第几周weeks                 int                   周几w_start               string                周开始日w_end                 string                周结束日d_month              int                   第几月m_start              string                月开始日m_end                string                月结束日d_quarter            int                    第几季q_start              string                季开始日q_end                string                季结束日d_year               int                    年份y_start              string                年开始日y_end                string                年结束日

复制代码

参考答案：

drop table if exists dim_date;create table if not exists dim_date(    `date` string comment '日期',    d_week string comment '年内第几周',    weeks string comment '周几',    w_start string comment '周开始日',    w_end string comment '周结束日',    d_month string comment '第几月',    m_start string comment '月开始日',    m_end string comment '月结束日',    d_quarter int comment '第几季',    q_start string comment '季开始日',    q_end string comment '季结束日',    d_year int comment '年份',    y_start string comment '年开始日',    y_end string comment '年结束日');--自然月: 指每月的1号到那个月的月底，它是按照阳历来计算的。就是从每月1号到月底，不管这个月有30天，31天，29天或者28天，都算是一个自然月。
insert overwrite table dim_dateselect `date`     , d_week --年内第几周     , case weekid           when 0 then '周日'           when 1 then '周一'           when 2 then '周二'           when 3 then '周三'           when 4 then '周四'           when 5 then '周五'           when 6 then '周六'    end  as weeks -- 周     , date_add(next_day(`date`,'MO'),-7) as w_start --周一     , date_add(next_day(`date`,'MO'),-1) as w_end   -- 周日_end     -- 月份日期     , concat('第', monthid, '月')  as d_month     , m_start     , m_end
     -- 季节     , quarterid as d_quart     , concat(d_year, '-', substr(concat('0', (quarterid - 1) * 3 + 1), -2), '-01') as q_start --季开始日     , date_sub(concat(d_year, '-', substr(concat('0', (quarterid) * 3 + 1), -2), '-01'), 1) as q_end   --季结束日     -- 年     , d_year     , y_start     , y_end

from (         select `date`              , pmod(datediff(`date`, '2012-01-01'), 7)                  as weekid    --获取周几              , cast(substr(`date`, 6, 2) as int)                        as monthid   --获取月份              , case                    when cast(substr(`date`, 6, 2) as int) <= 3 then 1                    when cast(substr(`date`, 6, 2) as int) <= 6 then 2                    when cast(substr(`date`, 6, 2) as int) <= 9 then 3                    when cast(substr(`date`, 6, 2) as int) <= 12 then 4             end                                                       as quarterid --获取季节 可以直接使用 quarter(`date`)              , substr(`date`, 1, 4)                                     as d_year    -- 获取年份              , trunc(`date`, 'YYYY')                                    as y_start   --年开始日              , date_sub(trunc(add_months(`date`, 12), 'YYYY'), 1) as y_end     --年结束日              , date_sub(`date`, dayofmonth(`date`) - 1)                 as m_start   --当月第一天              , last_day(date_sub(`date`, dayofmonth(`date`) - 1))          m_end     --当月最后一天              , weekofyear(`date`)                                       as d_week    --年内第几周         from (                    -- '2021-04-01'是开始日期, '2022-03-31'是截止日期                  select date_add('2021-04-01', t0.pos) as `date`                  from (                           select posexplode(                                          split(                                                  repeat('o', datediff(                                                          from_unixtime(unix_timestamp('2022-03-31', 'yyyy-mm-dd'),                                                                        'yyyy-mm-dd'),                                                          '2021-04-01')), 'o'                                              )                                      )                       ) t0              ) t1     ) t2;

复制代码

十七、时间序列--构造累积日期

表名：t17

表字段及内容：

date_id2017-08-012017-08-022017-08-03

复制代码

问题一：每一日期，都扩展成月初至当天

输出结果如下所示：

date_id    date_to_day2017-08-01   2017-08-012017-08-02   2017-08-012017-08-02   2017-08-022017-08-03   2017-08-012017-08-03   2017-08-022017-08-03   2017-08-03

复制代码

这种累积相关的表，常做桥接表。

参考答案:

select  date_id,  date_add(date_start_id,pos) as date_to_dayfrom(  select    date_id,    date_sub(date_id,dayofmonth(date_id)-1) as date_start_id  from t17) m  lateral view posexplode(split(space(datediff(from_unixtime(unix_timestamp(date_id,'yyyy-MM-dd')),from_unixtime(unix_timestamp(date_start_id,'yyyy-MM-dd')))), '')) t as pos, val;

复制代码

十八、时间序列--构造连续日期

表名：t18

表字段及内容：

a             b         c101        2018-01-01     10101        2018-01-03     20101        2018-01-06     40102        2018-01-02     20102        2018-01-04     30102        2018-01-07     60

复制代码

问题一：构造连续日期

问题描述：将表中数据的 b 字段扩充至范围[2018-01-01, 2018-01-07]，并累积对 c 求和。

b 字段的值是较稀疏的。

输出结果如下所示：

a             b          c      d101        2018-01-01     10     10101        2018-01-02      0     10101        2018-01-03     20     30101        2018-01-04      0     30101        2018-01-05      0     30101        2018-01-06     40     70101        2018-01-07      0     70102        2018-01-01      0      0102        2018-01-02     20     20102        2018-01-03      0     20102        2018-01-04     30     50102        2018-01-05      0     50102        2018-01-06      0     50102        2018-01-07     60    110

复制代码

参考答案:

select  a,  b,  c,  sum(c) over(partition by a order by b) as dfrom(  select  t1.a,  t1.b,  case    when t18.b is not null then t18.c    else 0  end as c  from  (    select    a,    date_add(s,pos) as b    from    (      select        a,        '2018-01-01' as s,        '2018-01-07' as r      from (select a from t18 group by a) ta    ) m  lateral view       posexplode(split(space(datediff(from_unixtime(unix_timestamp(r,'yyyy-MM-dd')),from_unixtime(unix_timestamp(s,'yyyy-MM-dd')))), '')) t as pos, val  ) t1    left join t18    on  t1.a = t18.a and t1.b = t18.b) ts;

复制代码

十九、时间序列--取多个字段最新的值

表名：t19

表字段及内容：

date_id   a   b    c2014     AB  12    bc2015         23    2016               d2017     BC

复制代码

问题一：如何一并取出最新日期

输出结果如下所示：

date_a   a    date_b    b    date_c   c2017    BC    2015     23    2016    d

复制代码

参考答案:

此处给出三种解法，其一：

SELECT  max(CASE WHEN rn_a = 1 THEN date_id else 0 END) AS date_a        ,max(CASE WHEN rn_a = 1 THEN a else null END) AS a        ,max(CASE WHEN rn_b = 1 THEN date_id else 0 END) AS date_b        ,max(CASE WHEN rn_b = 1 THEN b else NULL  END) AS b        ,max(CASE WHEN rn_c = 1 THEN date_id  else 0 END) AS date_c        ,max(CASE WHEN rn_c = 1 THEN c else null END) AS cFROM    (            SELECT  date_id                    ,a                    ,b                    ,c                    --对每列上不为null的值  的 日期 进行排序                    ,row_number()OVER( PARTITION BY 1 ORDER BY CASE WHEN a IS NULL THEN 0 ELSE date_id END DESC) AS rn_a                    ,row_number()OVER(PARTITION BY 1 ORDER BY CASE WHEN b IS NULL THEN 0 ELSE date_id END DESC) AS rn_b                    ,row_number()OVER(PARTITION BY 1 ORDER BY CASE WHEN c IS NULL THEN 0 ELSE date_id END DESC) AS rn_c            FROM    t19        ) tWHERE   t.rn_a = 1OR      t.rn_b = 1OR      t.rn_c = 1;

复制代码

其二：

SELECT     a.date_id  ,a.a  ,b.date_id  ,b.b  ,c.date_id  ,c.cFROM(   SELECT        t.date_id,      t.a   FROM     (     SELECT         t.date_id       ,t.a       ,t.b       ,t.c     FROM t19 t INNER JOIN    t19 t1 ON t.date_id = t1.date_id AND t.a IS NOT NULL   ) t   ORDER BY t.date_id DESC   LIMIT 1) aLEFT JOIN (  SELECT      t.date_id    ,t.b  FROM      (    SELECT        t.date_id      ,t.b    FROM t19 t INNER JOIN t19 t1 ON t.date_id = t1.date_id AND t.b IS NOT NULL  ) t  ORDER BY t.date_id DESC  LIMIT 1) b ON 1 = 1 LEFT JOIN(  SELECT      t.date_id    ,t.c  FROM      (    SELECT        t.date_id      ,t.c    FROM t19 t INNER JOIN t19 t1 ON t.date_id = t1.date_id AND t.c IS NOT NULL  ) t  ORDER BY t.date_id DESC  LIMIT   1) cON 1 = 1;

复制代码

其三：

select   * from  (  select t1.date_id as date_a,t1.a from (select t1.date_id,t1.a  from t19 t1 where t1.a is not null) t1  inner join (select max(t1.date_id) as date_id   from t19 t1 where t1.a is not null) t2  on t1.date_id=t2.date_id) t1cross join(  select t1.date_b,t1.b from (select t1.date_id as date_b,t1.b  from t19 t1 where t1.b is not null) t1  inner join (select max(t1.date_id) as date_id   from t19 t1 where t1.b is not null)t2  on t1.date_b=t2.date_id) t2cross join (  select t1.date_c,t1.c from (select t1.date_id as date_c,t1.c  from t19 t1 where t1.c is not null) t1  inner join (select max(t1.date_id) as date_id   from t19 t1 where t1.c is not null)t2  on t1.date_c=t2.date_id) t3;

复制代码

二十、时间序列--补全数据

表名：t20

表字段及内容：

date_id   a   b    c2014     AB  12    bc2015         23    2016               d2017     BC

复制代码

问题一：如何使用最新数据补全表格

输出结果如下所示：

date_id   a   b    c2014     AB  12    bc2015     AB  23    bc2016     AB  23    d2017     BC  23    d

复制代码

参考答案:

select   date_id,   first_value(a) over(partition by aa order by date_id) as a,  first_value(b) over(partition by bb order by date_id) as b,  first_value(c) over(partition by cc order by date_id) as cfrom(  select     date_id,    a,    b,    c,    count(a) over(order by date_id) as aa,    count(b) over(order by date_id) as bb,    count(c) over(order by date_id) as cc  from t20)tmp1;

复制代码

二十一、时间序列--取最新完成状态的前一个状态

表名：t21

表字段及内容：

date_id   a    b2014     1    A2015     1    B2016     1    A2017     1    B2013     2    A2014     2    B2015     2    A2014     3    A2015     3    A2016     3    B2017     3    A

复制代码

上表中 B 为完成状态。

问题一：取最新完成状态的前一个状态

输出结果如下所示：

date_id  a    b2016     1    A2013     2    A2015     3    A

复制代码

参考答案:

此处给出两种解法，其一：

select    t21.date_id,    t21.a,    t21.bfrom    (        select            max(date_id) date_id,            a        from            t21        where            b = 'B'        group by            a    ) t1    inner join t21 on t1.date_id -1 = t21.date_idand t1.a = t21.a;

复制代码

其二：

select  next_date_id as date_id  ,a  ,next_b as bfrom(  select    *,min(nk) over(partition by a,b) as minb  from(    select      *,row_number() over(partition by a order by date_id desc) nk      ,lead(date_id) over(partition by a order by date_id desc) next_date_id      ,lead(b) over(partition by a order by date_id desc) next_b    from(      select * from t21    ) t  ) t) twhere minb = nk and b = 'B';

复制代码

问题二：如何将完成状态的过程合并

输出结果如下所示:

a   b_merge1   A、B、A、B2   A、B3   A、A、B

复制代码

参考答案：

select  a  ,collect_list(b) as bfrom(  select    *    ,min(if(b = 'B',nk,null)) over(partition by a) as minb  from(    select      *,row_number() over(partition by a order by date_id desc) nk    from(      select * from t21    ) t  ) t) twhere nk >= minbgroup by a;

复制代码

二十二、非等值连接--范围匹配

表 f 是事实表，表 d 是匹配表，在 hive 中如何将匹配表中的值关联到事实表中？

表 d 相当于拉链过的变化维，但日期范围可能是不全的。

表 f：

date_id  p_id 2017    C 2018    B 2019    A 2013    C

复制代码

表 d：

d_start    d_end    p_id   p_value 2016     2018     A       1 2016     2018     B       2 2008     2009     C       4 2010     2015     C       3

复制代码

问题一：范围匹配

输出结果如下所示：

date_id  p_id   p_value 2017    C      null 2018    B      2 2019    A      null 2013    C      3

复制代码

**参考答案：

此处给出两种解法，其一：

select   f.date_id,  f.p_id,  A.p_valuefrom f left join (  select     date_id,    p_id,    p_value  from   (    select       f.date_id,      f.p_id,      d.p_value    from f     left join d on f.p_id = d.p_id    where f.date_id >= d.d_start and f.date_id <= d.d_end  )A)AON f.date_id = A.date_id;

复制代码

其二：

select     date_id,    p_id,    flag as p_valuefrom (    select         f.date_id,        f.p_id,        d.d_start,        d.d_end,        d.p_value,        if(f.date_id between d.d_start and d.d_end,d.p_value,null) flag,        max(d.d_end) over(partition by date_id) max_end    from f    left join d    on f.p_id = d.p_id) tmpwhere d_end = max_end;

复制代码

二十三、非等值连接--最近匹配

表 t23_1 和表 t23_2 通过 a 和 b 关联时，有相等的取相等的值匹配，不相等时每一个 a 的值在 b 中找差值最小的来匹配。

t23_1 和 t23_2 为两个班的成绩单，t23_1 班的每个学生成绩在 t23_2 班中找出成绩最接近的成绩。

表 t23_1：a 中无重复值

a1245810

复制代码

表 t23_2：b 中无重复值

b2371113

复制代码

问题一：单向最近匹配

输出结果如下所示：

注意：b 的值可能会被丢弃

a    b1    22    24    35    35    78    710   11

复制代码

参考答案：

select   * from(  select     ttt1.a,    ttt1.b   from  (    select       tt1.a,      t23_2.b,      dense_rank() over(partition by tt1.a order by abs(tt1.a-t23_2.b)) as dr     from     (      select         t23_1.a       from t23_1       left join t23_2 on t23_1.a=t23_2.b       where t23_2.b is null    ) tt1     cross join t23_2  ) ttt1   where ttt1.dr=1   union all  select     t23_1.a,    t23_2.b   from t23_1   inner join t23_2 on t23_1.a=t23_2.b) result_t order by result_t.a;

复制代码

二十四、N 指标--累计去重

假设表 A 为事件流水表，客户当天有一条记录则视为当天活跃。

表 A：

   time_id          user_id2018-01-01 10:00:00    0012018-01-01 11:03:00    0022018-01-01 13:18:00    0012018-01-02 08:34:00    0042018-01-02 10:08:00    0022018-01-02 10:40:00    0032018-01-02 14:21:00    0022018-01-02 15:39:00    0042018-01-03 08:34:00    0052018-01-03 10:08:00    0032018-01-03 10:40:00    0012018-01-03 14:21:00    005

复制代码

假设客户活跃非常，一天产生的事件记录平均达千条。

问题一：累计去重

输出结果如下所示：

日期       当日活跃人数     月累计活跃人数_截至当日date_id   user_cnt_act    user_cnt_act_month2018-01-01      2                22018-01-02      3                42018-01-03      3                5

复制代码

参考答案：

SELECT  tt1.date_id       ,tt2.user_cnt_act       ,tt1.user_cnt_act_monthFROM(   -- ④ 按照t.date_id分组求出user_cnt_act_month，得到tt1  SELECT  t.date_id         ,COUNT(user_id) AS user_cnt_act_month  FROM  (   -- ③ 表a和表b进行笛卡尔积，按照a.date_id,b.user_id分组，保证截止到当日的用户唯一，得出表t。    SELECT  a.date_id           ,b.user_id    FROM    (   -- ① 按照日期分组，取出date_id字段当主表的维度字段 得出表a      SELECT  from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') AS date_id      FROM test.temp_tanhaidi_20211213_1      GROUP BY  from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd')    ) a    INNER JOIN    (   -- ② 按照date_id、user_id分组，保证每天每个用户只有一条记录，得出表b      SELECT  from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') AS date_id             ,user_id      FROM test.temp_tanhaidi_20211213_1      GROUP BY  from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd')               ,user_id    ) b    ON 1 = 1    WHERE a.date_id >= b.date_id    GROUP BY  a.date_id             ,b.user_id  ) t  GROUP BY  t.date_id) tt1LEFT JOIN(   -- ⑥ 按照date_id分组求出user_cnt_act，得到tt2  SELECT  date_id         ,COUNT(user_id) AS user_cnt_act  FROM  (   -- ⑤ 按照日期分组，取出date_id字段当主表的维度字段 得出表a    SELECT  from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') AS date_id           ,user_id    FROM test.temp_tanhaidi_20211213_1    GROUP BY  from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd')             ,user_id  ) a  GROUP BY date_id) tt2ON tt2.date_id = tt1.date_id

复制代码

参考：

最强最全面的大数据SQL经典面试题完整PDF版

发布于: 刚刚

原文链接:【http://xie.infoq.cn/article/6408c32b4451f2bc8615d75cc】。未经作者许可，禁止转载。

五分钟学大数据

关注

InfoQ签约作者 2020.11.10 加入

文章首发于公众号：五分钟学大数据。大数据领域原创技术号，深入大数据技术

发布

暂无评论

创作场景

最强最全面的大数据 SQL 面试系列

一、行列转换

问题一：多行转多列

问题二：如何将结果转成源表？（多列转多行）

问题三：同一部门会有多个绩效，求多行转多列结果

二、排名中取他值

问题一：按 a 分组取 b 字段最小时对应的 c 字段

问题二：按 a 分组取 b 字段排第二时对应的 c 字段

问题三：按 a 分组取 b 字段最小和最大时对应的 c 字段

问题四：按 a 分组取 b 字段第二小和第二大时对应的 c 字段

问题五：按 a 分组取 b 字段前两小和前两大时对应的 c 字段

三、累计求值

问题一：按 a 分组按 b 字段排序，对 c 累计求和

问题二：按 a 分组按 b 字段排序，对 c 取累计平均值

问题三：按 a 分组按 b 字段排序，对 b 取累计排名比例

问题四：按 a 分组按 b 字段排序，对 b 取累计求和比例

四、窗口大小控制

问题一：按 a 分组按 b 字段排序，对 c 取前后各一行的和

问题二：按 a 分组按 b 字段排序，对 c 取平均值

五、产生连续数值

六、数据扩充与收缩

问题一：数据扩充

问题二：数据扩充，排除偶数

问题三：如何处理字符串累计拼接

问题四：如果 a 字段有重复，如何实现字符串累计拼接

问题五：数据展开

七、合并与拆分

问题一：合并

问题二：拆分

八、模拟循环操作

问题一：如何将字符'1'的位置提取出来

九、不使用 distinct 或 group by 去重

问题一：不使用 distinct 或 group by 去重

十、容器--反转内容

问题一：反转逗号分隔的数据：改变顺序，内容不变

问题二：反转逗号分隔的数据：改变内容，顺序不变

十一、多容器--成对提取数据

问题一：成对提取数据，字段一一对应

十二、多容器--转多行

问题一：转多行

十三、抽象分组--断点排序

问题一：断点排序

十四、业务逻辑的分类与抽象--时效

问题一：计算上表中从申请到通过占用的工作时长

十五、时间序列--进度及剩余

问题一：求每天的累计周工作日，剩余周工作日

十六、时间序列--构造日期

问题一：直接使用 SQL 实现一张日期维度表，包含以下字段：

十七、时间序列--构造累积日期

问题一：每一日期，都扩展成月初至当天

十八、时间序列--构造连续日期

问题一：构造连续日期

十九、时间序列--取多个字段最新的值

问题一：如何一并取出最新日期

二十、时间序列--补全数据

问题一：如何使用最新数据补全表格

二十一、时间序列--取最新完成状态的前一个状态

问题一：取最新完成状态的前一个状态

问题二：如何将完成状态的过程合并

二十二、非等值连接--范围匹配

问题一：范围匹配

二十三、非等值连接--最近匹配

问题一：单向最近匹配

二十四、N 指标--累计去重

问题一：累计去重

五分钟学大数据

评论