Hive Parameter Reference

## General (hive-site.xml)
### JOIN
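* `hive.auto.convert.join.noconditionaltask = true`
Controls whether map-join conversion is enabled. When it is on, a join is converted directly into a map join, without a conditional task, as long as the small tables fit within the size threshold below.
* `hive.auto.convert.join.noconditionaltask.size=10000000`
The maximum size, in bytes, of the tables that can be loaded for a map join. The default is 10000000 (about 10 MB). A recommended value is 1/3 of `hive.tez.container.size`.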
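
The 1/3 rule of thumb mixes units: `hive.tez.container.size` is given in MB, while `hive.auto.convert.join.noconditionaltask.size` is given in bytes. A minimal sketch of the conversion, assuming a hypothetical 4096 MB container (the class and method names are illustrative and not part of Hive):

```java
// Illustrative helper, not part of Hive.
class MapJoinSizing {
    /** Returns one third of the container size, converted from MB to bytes. */
    static long noConditionalTaskSizeBytes(long containerSizeMb) {
        return containerSizeMb * 1024L * 1024L / 3;
    }

    public static void main(String[] args) {
        long containerSizeMb = 4096; // assumed value of hive.tez.container.size
        System.out.println("hive.auto.convert.join.noconditionaltask.size="
                + noConditionalTaskSizeBytes(containerSizeMb));
    }
}
```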

### Compile
* `hive.driver.parallel.compilation` Whether queries may be compiled concurrently. Defaults to false.

### session
* `hive.server2.close.session.on.disconnect` When a client disconnects, automatically close the session and the tasks started by that client.

### tez
* `hive.orc.compute.splits.num.threads`
Number of threads used to compute the splits of ORC files. Defaults to 10.

### split
* `mapreduce.input.fileinputformat.split.maxsize`
Upper threshold for the size of a single split within one file.
* `mapreduce.input.fileinputformat.split.minsize`
Lower threshold for the size of a single split within one file.

To see how these two parameters take effect, here is the split-computation code from OrcInputFormat.SplitGenerator. It operates on the stripes of **a single ORC file**.
```
// current holds the state of the stripes merged so far
// context.minSize comes from mapreduce.input.fileinputformat.split.minsize
// context.maxSize comes from mapreduce.input.fileinputformat.split.maxsize
private OffsetAndLength generateOrUpdateSplit(
        List<OrcSplit> splits, OffsetAndLength current, long offset,
        long length, OrcTail orcTail) throws IOException {
      // if we are working on a stripe, over the min stripe size, and
      // crossed a block boundary, cut the input split here.
      // If current already holds merged stripes, their total length exceeds the minimum split size,
      // and the new stripe starts in a different block than current does, emit current as a split.
      if (current.offset != -1 && current.length > context.minSize &&
          (current.offset / blockSize != offset / blockSize)) {
        splits.add(createSplit(current.offset, current.length, orcTail));
        current.offset = -1;
      }
      // if we aren't building a split, start a new one.
      if (current.offset == -1) {
        current.offset = offset;
        current.length = length;
      } else {
        current.length = (offset + length) - current.offset;
      }
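      // Once the upper bound is reached, emit the split. The stripe that crosses the bound
      // is still included, so a split may exceed maxSize.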
      if (current.length >= context.maxSize) {
        splits.add(createSplit(current.offset, current.length, orcTail));
        current.offset = -1;
      }
      return current;
    }
```
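
The effect of the two thresholds is easier to see on concrete numbers. The following is a simplified, self-contained re-implementation of the logic above for experimentation, not Hive code: stripes and splits are plain `{offset, length}` pairs, and the class and method names are hypothetical. It also assumes that whatever is still pending after the last stripe is emitted as a final split.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the stripe-merging logic; not the Hive implementation.
class SplitSimulation {
    /** Takes stripes as {offset, length} pairs and returns splits as {offset, length} pairs. */
    static List<long[]> generateSplits(long[][] stripes, long minSize, long maxSize, long blockSize) {
        List<long[]> splits = new ArrayList<>();
        long curOffset = -1;
        long curLength = 0;
        for (long[] stripe : stripes) {
            long offset = stripe[0];
            long length = stripe[1];
            // Cut when the pending split exceeds minSize and the new stripe starts in another block.
            if (curOffset != -1 && curLength > minSize
                    && curOffset / blockSize != offset / blockSize) {
                splits.add(new long[] {curOffset, curLength});
                curOffset = -1;
            }
            if (curOffset == -1) {
                curOffset = offset;
                curLength = length;
            } else {
                curLength = (offset + length) - curOffset;
            }
            // Cut as soon as the pending split reaches maxSize.
            if (curLength >= maxSize) {
                splits.add(new long[] {curOffset, curLength});
                curOffset = -1;
            }
        }
        // Assumption: the remaining stripes form one last split.
        if (curOffset != -1) {
            splits.add(new long[] {curOffset, curLength});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Five contiguous 64 MB stripes.
        long[][] stripes = {
            {0, 64 * mb}, {64 * mb, 64 * mb}, {128 * mb, 64 * mb},
            {192 * mb, 64 * mb}, {256 * mb, 64 * mb}
        };
        // minSize = 16 MB, maxSize = 100 MB, blockSize = 256 MB
        for (long[] s : generateSplits(stripes, 16 * mb, 100 * mb, 256 * mb)) {
            System.out.println("offset=" + s[0] / mb + "MB length=" + s[1] / mb + "MB");
        }
    }
}
```

With a 100 MB maximum, each split in this run ends up holding two 64 MB stripes (128 MB), which illustrates that maxSize is a trigger for cutting rather than a hard cap.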

## TEZ(tez-site.xml)
### AM or Task
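* `tez.am.resource.memory.mb=1024`
Size, in MB, of the container requested to launch TezAppMaster.
* `hive.tez.container.size`
Size, in MB, of the container requested to launch TezChild.
* `tez.container.max.java.heap.fraction=0.8`
Determines the -Xmx of TezAppMaster and TezChild when neither has a heap size specified explicitly. The value must be between 0 and 1.
* `tez.am.java.opts`
JVM options used to launch TezAppMaster, including its heap size, e.g. tez.am.java.opts=-Xmx512m. If no -Xmx is configured, the heap size equals tez.am.resource.memory.mb * tez.container.max.java.heap.fraction.
* `hive.tez.java.opts`
JVM options used to launch TezChild, including its heap size, e.g. hive.tez.java.opts=-Xmx512m. If no -Xmx is configured, the heap size equals hive.tez.container.size * tez.container.max.java.heap.fraction.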
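
The fallback heap size can be worked out by hand. A minimal sketch of the container size * fraction rule, assuming the result is truncated to whole megabytes (the class and method names are illustrative, not Tez code):

```java
// Illustrative helper, not part of Tez or Hive.
class HeapSizing {
    /** Builds the -Xmx flag derived from a container size in MB and a heap fraction. */
    static String defaultXmx(int containerSizeMb, double heapFraction) {
        if (!(heapFraction > 0.0 && heapFraction < 1.0)) {
            throw new IllegalArgumentException("heap fraction must be between 0 and 1");
        }
        int heapMb = (int) (containerSizeMb * heapFraction); // truncation is an assumption
        return "-Xmx" + heapMb + "m";
    }

    public static void main(String[] args) {
        // AM container of 1024 MB with the fraction 0.8 used above.
        System.out.println(defaultXmx(1024, 0.8));
        // Hypothetical task container of 4096 MB.
        System.out.println(defaultXmx(4096, 0.8));
    }
}
```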

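### Controlling the Number of Mappers
* `set tez.grouping.min-size=16777216;`
Minimum amount of input per mapper. The default is 50 MB; the statement above sets it to 16 MB.
* `set tez.grouping.max-size=1073741824;`
Maximum amount of input per mapper. The default is 1 GB.
* `tez.grouping.by-length`
Defaults to true. While computing grouped splits, Tez derives an average size per grouped split. When this parameter is true, the total size of each grouped split may not exceed that average. When it is false, exceeding the average is allowed.
* `tez.grouping.by-count`
Defaults to false. While computing grouped splits, Tez derives an average number of original splits per grouped split. When this parameter is true, the number of original splits in each grouped split may not exceed that average. Setting it to true may increase the total number of mappers, but it also keeps a single mapper from being handed so many small files that it prolongs the total run time. With the default of false, exceeding the average is allowed.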
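
How the two size bounds constrain the mapper count can be sketched as follows. This is a simplified model, assuming the grouper starts from a desired number of groups and then adjusts it so that the average group size falls within [min-size, max-size]. The real Tez grouper also takes locality and the by-length / by-count flags into account, and the names below are hypothetical.

```java
// Simplified model of split grouping; not the Tez implementation.
class GroupingEstimate {
    /** Estimates the number of grouped splits (mappers) after applying the size bounds. */
    static long estimateGroups(long totalBytes, long desiredGroups, long minSize, long maxSize) {
        if (desiredGroups <= 0) {
            throw new IllegalArgumentException("desiredGroups must be positive");
        }
        long perGroup = totalBytes / desiredGroups;
        if (perGroup > maxSize) {
            // Groups would be too large: use more groups.
            return totalBytes / maxSize + 1;
        }
        if (perGroup < minSize) {
            // Groups would be too small: use fewer groups.
            return totalBytes / minSize + 1;
        }
        return desiredGroups;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long gb = 1024L * mb;
        // 10 GB of input, 4 desired groups, bounds of 16 MB and 1 GB.
        System.out.println(estimateGroups(10 * gb, 4, 16 * mb, gb));
    }
}
```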

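### Output
* `tez.runtime.io.sort.mb=512`
Buffer size used when the output is sorted, set to 512 MB here (the stock Tez default is 100 MB). A recommended value is 40% of hive.tez.container.size.
* `tez.runtime.unordered.output.buffer.size-mb=100`
Buffer size used when the output is not sorted. The default is 100 MB. A recommended value is 10% of hive.tez.container.size.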
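
Both recommendations are fractions of the task container. A small worked example, assuming a hypothetical `hive.tez.container.size` of 4096 MB and rounding down to whole megabytes (the helper is illustrative only):

```java
// Illustrative helper, not part of Tez or Hive.
class BufferSizing {
    /** Returns the given percentage of the container size, in whole MB. */
    static long percentOf(long containerSizeMb, int percent) {
        return containerSizeMb * percent / 100;
    }

    public static void main(String[] args) {
        long containerSizeMb = 4096; // assumed value of hive.tez.container.size
        System.out.println("tez.runtime.io.sort.mb=" + percentOf(containerSizeMb, 40));
        System.out.println("tez.runtime.unordered.output.buffer.size-mb="
                + percentOf(containerSizeMb, 10));
    }
}
```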

### session
* `tez.session.am.dag.submit.timeout.secs` How long a session may stay idle before it is closed automatically. Defaults to 5 minutes (300 seconds).

阿川CH