Hive tip: add a redundant partition to simplify SQL and increase parallelism

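## Background
When several data sources have to be merged into one table, using `union all` makes the SQL more complex to execute. Take this scenario: every day we store the full history of new users, so the job appends the users who are new today to the history as of yesterday.
Suppose the history table is defined as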
```sql
-- ds is the date, formatted as yyyy-MM-dd
create table new_user_history(userid string)
partitioned by(ds string);
```
The cleaning SQL is then usually
```sql
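insert overwrite table new_user_history partition(ds='2018-03-22')
select * from (
-- users who are new today
select * from ...
union all
-- historical new users up to and including yesterday
-- (select userid explicitly: select * would also return the partition column ds)
select userid from new_user_history where ds=date_sub('2018-03-22',1)
) t;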
```
To show how complex this SQL becomes when it actually runs, here is an excerpt of the execution plan from production:
![](/api/file/getImage?fileId=5aba0e17418f8a54f600008a)

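(This plan was produced with Hive on MR. With Hive on Tez the plan is much shorter, because Tez simplifies the merge by writing the output into an extra level of subdirectories. That has side effects, though: Presto, for example, does not recognize the extra directory level and so reads no data. For details, see the separate article "Why Hive can return data that Presto cannot".)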

## Optimization
Add a redundant partition level when creating the table
```sql
-- ds is the date, formatted as yyyy-MM-dd
-- part is the redundant partition: it records which source the rows came from
create table new_user_history(userid string)
partitioned by(ds string, part string);
```
The cleaning SQL can then become
```sql
insert overwrite table new_user_history partition(ds='2018-03-22', part='today')
-- users who are new today
select * from ...;

insert overwrite table new_user_history partition(ds='2018-03-22', part='former')
-- carry forward yesterday's full history (select userid only: select * would also return ds and part)
select userid from new_user_history where ds=date_sub('2018-03-22',1);
```
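In a scheduled job the date would normally be passed in rather than hard-coded. The sketch below shows only the carry-forward statement, using Hive's `hivevar` substitution. The variable name `ds` and the script file name are assumptions made for illustration, not part of the original job.

```sql
-- Hypothetical invocation: hive --hivevar ds=2018-03-22 -f carry_forward.sql
-- ${hivevar:ds} is replaced textually before the statement is compiled.
insert overwrite table new_user_history partition(ds='${hivevar:ds}', part='former')
-- yesterday's ds holds both of its part sub-partitions, so this copies the full history
select userid from new_user_history where ds=date_sub('${hivevar:ds}',1);
```

Because this statement reads only yesterday's `ds` and writes only `part='former'`, it does not depend on the `part='today'` insert for the same day.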
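Splitting the write this way reduces the complexity of the SQL and raises execution parallelism, since the two inserts are independent of each other and can run as separate jobs. It also avoids the side effects seen when running on Tez.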
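Since `part` is nested under `ds`, a filter on `ds` alone still reads both sub-partitions, so downstream queries that only filter on the date should keep returning the full history. A minimal sketch, assuming the two-level table and the partition values used above:

```sql
-- Full history as of 2018-03-22: ds prunes to that day,
-- and both part='today' and part='former' are read.
select userid from new_user_history where ds='2018-03-22';

-- Only the users who were new on 2018-03-22.
select userid from new_user_history where ds='2018-03-22' and part='today';

-- List the sub-partitions that have been written for that day.
show partitions new_user_history partition(ds='2018-03-22');
```

The second query shows a side benefit of the layout: the daily new users can be read directly from the history table.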

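This article was in fact inspired by how Tez handles the problem: rather than letting Tez implicitly add a level of directories for you, it is better to add a level of partitions yourself.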

阿川CH