Enabling Spark for Hive on AWS EMR - isgaur/AWS-BigData-Solutions GitHub Wiki

HIVE ON SPARK

On EMR, Hive currently ships with two execution engines: MapReduce (mr) and Tez. Hive can, however, also use Spark as its execution engine. The steps below walk through enabling the Spark execution engine for Hive on an EMR cluster:

EMR Version Used: 5.28.0 (also tested with EMR 5.24.1, 5.17.0)

Applications Used: Hive and Spark

======== Steps to set up Hive on Spark

  1. SSH to the Master Node.

Create an EMR cluster with the Hive, Spark, Tez, and Livy applications (Hive 2.3.6, Spark 2.4.4, Tez 0.9.2, and Livy 0.6.0 on EMR 5.28.0), then SSH to its master node.
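For reference, a cluster like this can be created from the AWS CLI. The sketch below only prints the command rather than executing it, and the cluster name, key pair, and instance settings are placeholders, not values from this guide:

```shell
# Hypothetical aws emr create-cluster invocation (name, key pair, and
# instance type/count are placeholders). The command is stored in a
# variable and printed for review; run it directly once the placeholders
# are filled in for your account.
cmd="aws emr create-cluster \
  --name hive-on-spark-demo \
  --release-label emr-5.28.0 \
  --applications Name=Hive Name=Spark Name=Tez Name=Livy \
  --use-default-roles \
  --instance-type m5.xlarge --instance-count 3 \
  --ec2-attributes KeyName=my-key-pair"
echo "$cmd"
```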

Step 1: Add Spark's jars to Hive's classpath:

    sudo vi /usr/lib/hive/bin/hive

    Add the following loop after the HIVE_LIB jars are appended to CLASSPATH:

    for f in ${SPARK_HOME}/jars/*.jar; do
         CLASSPATH=${CLASSPATH}:$f;
    done
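That loop can be sanity-checked outside the cluster. In the sandboxed sketch below, the temp directory and jar names are illustrative stand-ins, not the real EMR layout:

```shell
# Sandbox sketch of the Step 1 classpath loop: build a fake
# ${SPARK_HOME}/jars directory and confirm every jar in it is appended
# to CLASSPATH, colon-separated.
SPARK_HOME=$(mktemp -d)
mkdir -p "${SPARK_HOME}/jars"
touch "${SPARK_HOME}/jars/scala-library.jar" "${SPARK_HOME}/jars/spark-core.jar"

CLASSPATH=/usr/lib/hive/lib/hive-exec.jar   # stand-in for the HIVE_LIB entries
for f in ${SPARK_HOME}/jars/*.jar; do
     CLASSPATH=${CLASSPATH}:$f
done
echo "$CLASSPATH"
```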

Step 2: Create symbolic links so Hive can load the Spark and Scala jars. Check that the version numbers match the jars actually present in /usr/lib/spark/jars on your cluster:

    sudo ln -s /usr/lib/spark/jars/scala-library-2.11.12.jar /usr/lib/hive/lib/scala-library.jar
    sudo ln -s /usr/lib/spark/jars/spark-core_2.11-2.4.3.jar /usr/lib/hive/lib/spark-core.jar
    sudo ln -s /usr/lib/spark/jars/spark-network-common_2.11-2.4.3.jar /usr/lib/hive/lib/spark-network-common.jar
    sudo ln -s /usr/lib/spark/jars/spark-unsafe_2.11-2.4.3.jar /usr/lib/hive/lib/spark-unsafe.jar
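Because the exact jar versions differ between EMR releases (note the 2.4.3 jars above versus the Spark 2.4.4 that EMR 5.28.0 ships), a version-agnostic variant is to resolve the filenames at run time. The sketch below demonstrates the idea in temporary directories; on the cluster the source is /usr/lib/spark/jars, the target is /usr/lib/hive/lib, and the ln commands need sudo:

```shell
# Version-agnostic symlinking, demonstrated in temp dirs. The jar file
# names are examples of what an EMR 5.28.0 cluster might carry.
SPARK_JARS=$(mktemp -d)   # stands in for /usr/lib/spark/jars
HIVE_LIB=$(mktemp -d)     # stands in for /usr/lib/hive/lib
touch "$SPARK_JARS/scala-library-2.11.12.jar" \
      "$SPARK_JARS/spark-core_2.11-2.4.4.jar" \
      "$SPARK_JARS/spark-network-common_2.11-2.4.4.jar" \
      "$SPARK_JARS/spark-unsafe_2.11-2.4.4.jar"

for name in scala-library spark-core spark-network-common spark-unsafe; do
    # match either "<name>-<version>.jar" or "<name>_<scala>-<version>.jar"
    src=$(ls "$SPARK_JARS/$name"[-_]*.jar | head -n 1)
    ln -sf "$src" "$HIVE_LIB/$name.jar"
done
ls -l "$HIVE_LIB"
```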

Step 3: Create an HDFS directory for the Spark jars. This lets YARN cache the Spark dependency jars on the nodes so they do not have to be distributed each time an application runs. For Hive 2.2.0 and later, upload all jars in $SPARK_HOME/jars to an HDFS folder:

    hadoop fs -mkdir /spark-jars

    hadoop fs -put /usr/lib/spark/jars/*.jar /spark-jars/

    hadoop fs -rm /spark-jars/*hive*1.2.1*
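The rm at the end matters: Spark 2.x bundles Hive 1.2.1 client jars in its jars directory, and leaving them in /spark-jars would clash with the cluster's Hive 2.x classes. A local sketch of the same exclusion glob (jar names are illustrative):

```shell
# Demonstrate the *hive*1.2.1* exclusion glob on a throwaway directory;
# the file names are stand-ins for what lands in /spark-jars.
d=$(mktemp -d)
touch "$d/spark-core_2.11-2.4.4.jar" \
      "$d/hive-exec-1.2.1-spark2.jar" \
      "$d/hive-metastore-1.2.1-spark2.jar"
rm "$d"/*hive*1.2.1*
ls "$d"
```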

Step 4: Add the Spark jar location to hive-site.xml and set Spark as the execution engine. Replace xxxx below with the master node's NameNode hostname; hdfs getconf -confKey fs.defaultFS prints the full hdfs:// URI.

sudo vi /usr/lib/hive/conf/hive-site.xml

  <property>
    <name>spark.yarn.jars</name>
    <value>hdfs://xxxx:8020/spark-jars/*</value>
  </property>

  <property>
        <name>hive.execution.engine</name>
        <value>spark</value>
  </property>

Step 5: Add the following to the Spark configuration, so that Spark's metastore client matches the cluster's Hive version:

sudo vi /etc/spark/conf/spark-defaults.conf

spark.sql.hive.metastore.version        2.3.0
spark.sql.hive.metastore.jars       /usr/lib/hive/lib/*:/usr/lib/hadoop/client/*

Step 6: Stop/Start services.

Finally, run these commands to stop and then start the following services. On EMR releases below 5.30 (which use upstart):

    sudo stop hadoop-yarn-timelineserver 
    sudo stop hadoop-yarn-resourcemanager 
    sudo stop hadoop-yarn-proxyserver 
    sudo stop hive-hcatalog-server
    sudo stop spark-history-server
    sudo stop hive-server2

    sudo start hadoop-yarn-timelineserver 
    sudo start hadoop-yarn-resourcemanager 
    sudo start hadoop-yarn-proxyserver 
    sudo start hive-hcatalog-server
    sudo start hive-server2
    sudo start spark-history-server

    For EMR 5.30 and above (which use systemd) -

    sudo systemctl stop hadoop-yarn-timelineserver 
    sudo systemctl stop hadoop-yarn-resourcemanager 
    sudo systemctl stop hadoop-yarn-proxyserver 
    sudo systemctl stop hive-hcatalog-server
    sudo systemctl stop spark-history-server
    sudo systemctl stop hive-server2

    sudo systemctl start hadoop-yarn-timelineserver 
    sudo systemctl start hadoop-yarn-resourcemanager 
    sudo systemctl start hadoop-yarn-proxyserver 
    sudo systemctl start hive-hcatalog-server
    sudo systemctl start spark-history-server
    sudo systemctl start hive-server2
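The twelve commands above can also be expressed as one loop. The sketch below prints each restart command instead of executing it (drop the echo to run them for real, and substitute the stop/start pairs for systemctl restart on pre-5.30 releases):

```shell
# Preview loop over the services touched by this setup (EMR 5.30+ /
# systemd syntax). The commands are collected and printed, not executed.
cmds=$(for svc in hadoop-yarn-timelineserver hadoop-yarn-resourcemanager \
                  hadoop-yarn-proxyserver hive-hcatalog-server \
                  spark-history-server hive-server2; do
    echo "sudo systemctl restart $svc"
done)
echo "$cmds"
```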

Step 7: Test the Hive on Spark engine.

Run a query on a table in Hive and see how long it takes to complete. With the configuration above in place, the query should launch a Spark application on YARN instead of a MapReduce or Tez job; running "set hive.execution.engine;" in the Hive CLI confirms which engine is active.

========================================================================================

Reference:

  1. https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

  2. https://www.linkedin.com/pulse/hive-spark-configuration-common-issues-mohamed-k

  3. https://spark.apache.org/docs/latest/sql-distributed-sql-engine.html#running-the-spark-sql-cli

  4. https://docs.cloudera.com/documentation/enterprise/latest/topics/admin_hos_tuning.html
