Tez Hive Execution Engine Configurations while Running Hive queries on AWS EMR

If you set "hive.blobstore.optimizations.enabled=true", the Hive job writes data directly to S3 using the Tez engine, so it does not use Hadoop DistCp. If you set "hive.blobstore.optimizations.enabled=false", the Hive job writes data to HDFS using the Tez engine, and a Hadoop DistCp job is then launched to copy the data from the HDFS staging area to S3. You can see in the YARN web UI, or with the "yarn application -list" command, that one MapReduce job is started while the Hive job runs with "hive.blobstore.optimizations.enabled=false". The virtual memory error happened while that DistCp job was executing, which is why you did not get the error with "hive.blobstore.optimizations.enabled=true".
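As a rough illustration of spotting that extra MapReduce job programmatically, the sketch below filters a parsed response from the YARN ResourceManager REST endpoint `/ws/v1/cluster/apps`. The helper name `mapreduce_apps` and the sample data are made up for this example; fetching the JSON (typically from the ResourceManager web port, 8088 by default) is left out and depends on your cluster setup.

```python
def mapreduce_apps(rm_response):
    """List (id, name, state) for MAPREDUCE apps in a parsed
    ResourceManager /ws/v1/cluster/apps JSON response."""
    # "apps" may be null when nothing matches, hence the fallbacks.
    apps = (rm_response.get("apps") or {}).get("app") or []
    return [
        (app["id"], app["name"], app["state"])
        for app in apps
        if app.get("applicationType") == "MAPREDUCE"
    ]


# Hand-written sample shaped like the REST response; ids and names are made up.
sample = {
    "apps": {
        "app": [
            {"id": "application_0000000000000_0001", "name": "HIVE-example",
             "applicationType": "TEZ", "state": "RUNNING"},
            {"id": "application_0000000000000_0002", "name": "distcp-example",
             "applicationType": "MAPREDUCE", "state": "RUNNING"},
        ]
    }
}

for app_id, name, state in mapreduce_apps(sample):
    print(app_id, name, state)
```

If the list is non-empty while the Hive query is running, the DistCp copy step is most likely in play.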

Optimization settings in case of memory failures:

<<<<------- Setting these at the session level, while executing Hive queries ------->>>>

    set tez.grouping.min-size=167772;
    set hive.tez.container.size=10752;
    set tez.am.resource.memory.mb=15360;

<<<<------- Setting these at the EMR cluster level ------->>>>

These configurations have to be set while provisioning a new EMR cluster.

/etc/tez/conf/tez-site.xml

            <property>
              <name>tez.am.resource.memory.mb</name>
              <value>15360</value>
            </property>
            <property>
              <name>tez.grouping.min-size</name>
              <value>167772</value>
            </property>

/etc/hive/conf/hive-site.xml

            <property>
              <name>hive.tez.container.size</name>
              <value>10752</value>
            </property>
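On EMR these site files are usually populated at launch through configuration classifications rather than edited by hand. The sketch below builds such a configuration list for the `tez-site` and `hive-site` classifications with the values from this page; the helper name `emr_configurations` is made up for illustration, and how you pass the resulting JSON (console, CLI `--configurations`, or an SDK) is up to you.

```python
import json


def emr_configurations(tez_site, hive_site):
    """Build an EMR configuration list for the tez-site and hive-site classifications."""
    return [
        # EMR expects property values as strings.
        {"Classification": "tez-site",
         "Properties": {key: str(value) for key, value in tez_site.items()}},
        {"Classification": "hive-site",
         "Properties": {key: str(value) for key, value in hive_site.items()}},
    ]


configs = emr_configurations(
    tez_site={"tez.am.resource.memory.mb": 15360, "tez.grouping.min-size": 167772},
    hive_site={"hive.tez.container.size": 10752},
)
print(json.dumps(configs, indent=2))
```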

Yes, these settings are required for the specific applications you are running with the Tez execution engine.

That said, you do not have to set these configurations at the EMR cluster level. My strong recommendation is to provide them inside the .hql file using SET statements, as shown below, which restricts them to the application (query) level. If you are running these queries through beeline or a third-party BI tool, you can provide them with SET statements as well, for example:

    set tez.grouping.min-size=167772;
    set hive.tez.container.size=10752;
    set hive.tez.java.opts=-Xmx8600m;
    set tez.am.resource.memory.mb=15360;
    set tez.am.launch.cmd-opts=-Xmx12288m;
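The -Xmx values in these statements work out to roughly 80% of the corresponding container sizes (8600 of 10752 MB, 12288 of 15360 MB). A minimal sketch of deriving them that way is shown here; the 80% figure is inferred from the numbers on this page rather than a fixed rule, and the function name `session_settings` is made up for illustration.

```python
def session_settings(container_mb, am_mb, grouping_min_bytes, heap_percent=80):
    """Return Hive SET statements with JVM heaps sized as a share of each container."""
    # Integer arithmetic avoids floating-point rounding surprises.
    task_heap_mb = container_mb * heap_percent // 100
    am_heap_mb = am_mb * heap_percent // 100
    return [
        f"set tez.grouping.min-size={grouping_min_bytes};",
        f"set hive.tez.container.size={container_mb};",
        f"set hive.tez.java.opts=-Xmx{task_heap_mb}m;",
        f"set tez.am.resource.memory.mb={am_mb};",
        f"set tez.am.launch.cmd-opts=-Xmx{am_heap_mb}m;",
    ]


statements = session_settings(container_mb=10752, am_mb=15360, grouping_min_bytes=167772)
print("\n".join(statements))
```

With these inputs the task heap comes out as 8601 MB, which this page rounds down to 8600. Leaving some headroom between the heap and the container size gives the JVM room for non-heap memory.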

Again, by setting these properties we have configured the Tez Application Master and container sizes, and we control the number of mappers for splittable formats with Tez. If you would like to learn more about how the data is distributed between the Tez tasks and how files are passed to the tasks within the mappers, this article provides a really good explanation:

https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works
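To get a feel for how tez.grouping.min-size influences the mapper count, the sketch below computes a rough upper bound: since a grouped split is not supposed to be smaller than the minimum size, the number of groups cannot exceed the total input divided by that size. This is a simplification that ignores the wave factor, available slots, locality, and tez.grouping.max-size; the function name and the 10 GiB input are made up for illustration.

```python
def max_grouped_tasks(total_input_bytes, grouping_min_bytes):
    """Rough upper bound on grouped splits: ceil(total input / minimum group size)."""
    # Ceiling division with integers only.
    return -(-total_input_bytes // grouping_min_bytes)


ten_gib = 10 * 1024 ** 3

# The value used on this page versus a 16 MiB minimum, for comparison.
print(max_grouped_tasks(ten_gib, 167772))
print(max_grouped_tasks(ten_gib, 16 * 1024 ** 2))
```

A smaller minimum size allows many more (and smaller) tasks, which is worth keeping in mind when sizing the Application Master.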

Lastly, out-of-memory errors may occur when an exceptionally large number of tasks run in parallel or when too many files are involved in the split computation. Managing the Application Master configuration helps prevent these issues; the Application Master memory is controlled with tez.am.resource.memory.mb. All of these configurations differ from application to application. Hence I always recommend setting these properties at the application level rather than at the EMR cluster level, so that each application can use resources according to its needs and run efficiently.

There are a couple of other Tez configurations available, but whether they help depends on the case. You should experiment and test before using any of them in a production environment. Here are some really nice articles on Tez optimization that I would like to recommend: [1], [2] and [3].

Reference documentation:

[1] https://community.cloudera.com/t5/Community-Articles/Demystify-Apache-Tez-Memory-Tuning-Step-by-Step/ta-p/245279
[2] http://www.openkb.info/2017/05/hive-on-tez-how-to-control-number-of.html
[3] https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works

Tez Hive Execution Engine Configurations while Running Hive queries on AWS EMR - isgaur/AWS-BigData-Solutions GitHub Wiki
