
Spatio-temporal Collision Optimization Series · Part 1

Published: September 22, 2020

Optimization is born from pain points (┬_┬)



Whether a pain point exists depends on what the business scenario demands; how painful it is depends on how well the current solution fits that business.



Let's start from the business scenario ① and the current solution ②, then combine ① and ② to derive the current pain points ③!



Without further ado, let's start the analysis.



① Business scenario:



1. We need to find pairs of data points that are close to each other in both time and space. The data may come from several source types: people, vehicles, codes, and so on.



2. The computation must run in parallel; each run processes one day of data, roughly hundreds of millions of records.



② Current solution:



First partition the data by time, following the co-travel approach from the previous article, then partition it spatially by geohash. Boundary points are handled by precomputing a geohash mapping for all of the data and broadcasting it to every node.



③ Current pain points:



1. When the spatial range is large, the broadcast mapping table blows up in size.



2. The computation involves a great deal of duplication; the deduplication logic is cumbersome and a lot of compute is wasted.



So the question is: is there a better approach that solves these pain points?



I believe there is.



From the pain points ③, work backwards to the expected goals ④;



from the goals ④, try to derive an optimization approach ⑤;



implement approach ⑤ to arrive at the final optimization plan ⑥.



④ Expected goals



1. No serial computation anywhere; fully parallel from start to finish.



2. Avoid duplicate computation; deduplicate in a single step.



⑤ Optimization approach



1. Do the partitioning directly in SQL.



2. Solve the duplicate-computation problem through mathematical reasoning, extending the one-dimensional co-travel logic to three dimensions.



⑥ Optimization plan



test: the original table, storing the data to be computed.



Time threshold: 10 seconds



Distance threshold: 150 meters



① Assuming we take one day of data, we can partition it along three dimensions: time, longitude, and latitude.



with event as (
select
aid,
data_type,
lat,
lon,
time,
time_part,
lat_part,
lon_part
from (
select
aid,
data_type,
lat,
lon,
time,
floor(unix_timestamp(time)/10) as time_part, -- 10-second time bucket
floor(lat/0.001) as lat_part, -- 0.001-degree latitude bucket
floor(lon/0.001) as lon_part, -- 0.001-degree longitude bucket
row_number() over(partition by aid,time,lat,lon order by time) as rn -- dedup identical records
from test where data_type in ('man','woman')
) total
where rn=1
)
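

To make the bucketing concrete, here is a minimal sketch of how a single record lands in its (time_part, lat_part, lon_part) cell; the timestamp and coordinates below are made-up sample values, not data from this article. A 0.001° step in latitude is roughly 111 meters, so each cell is on the order of a hundred meters across.

-- Hypothetical sample point, just to illustrate the bucket arithmetic
select
floor(unix_timestamp('2020-09-22 08:00:07')/10) as time_part, -- 10-second bucket (value depends on the session time zone)
floor(30.657123/0.001) as lat_part, -- 30657
floor(104.065987/0.001) as lon_part -- 104065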



② Following the co-travel logic from the previous article: when pairing records of the same type, each additional dimension expands half of the data two-fold and leaves the other half unchanged; when pairing records of two different types, each additional dimension expands one of the types three-fold and leaves the other unchanged. Here we compute across three dimensions (time, longitude, latitude), so the data is expanded 27-fold.
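

To see where the factor of 3 per dimension (and hence 3 × 3 × 3 = 27) comes from, here is a minimal single-dimension sketch of the lateral view trick, using a made-up bucket value:

-- A made-up lat_part value (30657) expands into itself plus its two neighbors: '30656', '30657', '30658'
-- Applying the same expansion to lon_part and time_part multiplies the row count by 27
select expanded_lat_part
from (select 30657 as lat_part) demo
lateral view explode(split(concat(lat_part-1,',',lat_part,',',lat_part+1),',')) lat_ad as expanded_lat_part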



source as(
select
aid,
data_type,
lat,
lon,
time,
time_ad.time_part as time_part,
lat_ad.lat_part as lat_part,
lon_ad.lon_part as lon_part
from (
select
aid,
data_type,
lat,
lon,
time,
time_part,
lat_part,
lon_part
from event
where data_type='man'
) source_original
-- expand every 'man' record into its own bucket plus both neighboring buckets in each dimension: 3*3*3 = 27 rows
lateral view explode(split(concat(lat_part-1,',',lat_part,',',lat_part+1),',')) lat_ad as lat_part
lateral view explode(split(concat(lon_part-1,',',lon_part,',',lon_part+1),',')) lon_ad as lon_part
lateral view explode(split(concat(time_part-1,',',time_part,',',time_part+1),',')) time_ad as time_part
),
target as (
select
aid,
data_type,
lat,
lon,
time,
cast(time_part as string) as time_part, -- cast to string to match the exploded (string) buckets on the source side
cast(lat_part as string) as lat_part,
cast(lon_part as string) as lon_part
from event where data_type='woman'
)



③ Pair the records, compute the differences, and keep the pairs that satisfy both thresholds.



select
source_aid,
target_aid
from (
select
abs(unix_timestamp(source.time)-unix_timestamp(target.time)) as time_diff,
-- haversine great-circle distance in meters (WGS84 equatorial radius 6378137 m)
round(6378137*2*asin(sqrt(pow(sin((source.lat*pi()/180-target.lat*pi()/180)/2),2)+cos(source.lat*pi()/180)*cos(target.lat*pi()/180)*pow(sin((source.lon*pi()/180-target.lon*pi()/180)/2),2)))) as site_diff,
source.aid as source_aid,
target.aid as target_aid
from source join target
on source.lon_part=target.lon_part and source.lat_part=target.lat_part and source.time_part=target.time_part
) diff
where time_diff<=10 and site_diff<=150
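

For reference, site_diff above is the haversine great-circle distance in meters. A minimal sanity check with two made-up points roughly 150 meters apart along a meridian (hypothetical coordinates, not data from this article):

-- Two hypothetical points separated by 0.00135 degrees of latitude, i.e. about 150 m
select round(6378137*2*asin(sqrt(
pow(sin((30.65700*pi()/180-30.65835*pi()/180)/2),2)
+ cos(30.65700*pi()/180)*cos(30.65835*pi()/180)*pow(sin((104.06600*pi()/180-104.06600*pi()/180)/2),2)
))) as site_diff -- ≈ 150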



Final code



with event as (
select
aid,
data_type,
lat,
lon,
time,
time_part,
lat_part,
lon_part
from (
select
aid,
data_type,
lat,
lon,
time,
floor(unix_timestamp(time)/10) as time_part,
floor(lat/0.001) as lat_part,
floor(lon/0.001) as lon_part,
row_number() over(partition by aid,time,lat,lon order by time) as rn
from test where data_type in ('man','woman')
) total
where rn=1
),
source as(
select
aid,
data_type,
lat,
lon,
time,
time_ad.time_part as time_part,
lat_ad.lat_part as lat_part,
lon_ad.lon_part as lon_part
from (
select
aid,
data_type,
lat,
lon,
time,
time_part,
lat_part,
lon_part
from event
where data_type='man'
) source_original
lateral view explode(split(concat(lat_part-1,',',lat_part,',',lat_part+1),',')) lat_ad as lat_part
lateral view explode(split(concat(lon_part-1,',',lon_part,',',lon_part+1),',')) lon_ad as lon_part
lateral view explode(split(concat(time_part-1,',',time_part,',',time_part+1),',')) time_ad as time_part
),
target as (
select
aid,
data_type,
lat,
lon,
time,
cast(time_part as string) as time_part,
cast(lat_part as string) as lat_part,
cast(lon_part as string) as lon_part
from event where data_type='woman'
)
select
source_aid,
target_aid
from (
select
abs(unix_timestamp(source.time)-unix_timestamp(target.time)) as time_diff,
round(6378137*2*asin(sqrt(pow(sin((source.lat*pi()/180-target.lat*pi()/180)/2),2)+cos(source.lat*pi()/180)*cos(target.lat*pi()/180)*pow(sin((source.lon*pi()/180-target.lon*pi()/180)/2),2)))) as site_diff,
source.aid as source_aid,
target.aid as target_aid
from source join target
on source.lon_part=target.lon_part and source.lat_part=target.lat_part and source.time_part=target.time_part
) diff
where time_diff<=10 and site_diff<=150



That is my optimization plan. All of the SQL runs through spark.sql, and it has the following advantages:



1. Fully parallel computation from start to finish.



2. The boundary-point problem is solved cleanly: nothing is missed and nothing is computed twice.



That's the whole journey of this optimization, from idea to implementation. I hope you like it (≧▽≦)


