Enabling Spark for Hive on AWS EMR - isgaur/AWS-BigData-Solutions GitHub Wiki
HIVE ON SPARK
On EMR, Hive currently ships with two execution engines: MapReduce (mr) and Tez. Hive can, however, also use Spark as its execution engine. The steps below show how to enable the Spark execution engine for Hive on an EMR cluster:
EMR Version Used: 5.28.0 (also tested with EMR 5.24.1, 5.17.0)
Applications Used: Hive and Spark
- SSH to the Master Node.
After creating an AWS EMR cluster with the Hive 2.3.6, Spark 2.4.4, Tez 0.9.2, and Livy 0.6.0 applications, SSH to the master node.
Step 1: sudo vi /usr/lib/hive/bin/hive
Add the following loop after the HIVE_LIB block, so the Spark jars are appended to Hive's classpath:
for f in ${SPARK_HOME}/jars/*.jar; do
CLASSPATH=${CLASSPATH}:$f;
done
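The effect of the loop can be sketched locally; here a temporary directory stands in for the real /usr/lib/spark/jars, and the starting CLASSPATH value is a made-up placeholder:

```shell
# Sketch of the Step 1 loop against a temp directory (paths are stand-ins
# for /usr/lib/spark/jars and whatever HIVE_LIB already put on the classpath).
SPARK_HOME=$(mktemp -d)
mkdir -p "${SPARK_HOME}/jars"
touch "${SPARK_HOME}/jars/scala-library.jar" "${SPARK_HOME}/jars/spark-core.jar"
CLASSPATH=/usr/lib/hive/lib/hive-exec.jar
for f in ${SPARK_HOME}/jars/*.jar; do
  CLASSPATH=${CLASSPATH}:$f
done
echo "${CLASSPATH}"
```

Each Spark jar ends up as one more colon-separated entry after the existing Hive classpath.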
Step 2: Set up symbolic links (the jar versions vary by EMR release; adjust the file names to match the jars present under /usr/lib/spark/jars):
sudo ln -s /usr/lib/spark/jars/scala-library-2.11.12.jar /usr/lib/hive/lib/scala-library.jar
sudo ln -s /usr/lib/spark/jars/spark-core_2.11-2.4.3.jar /usr/lib/hive/lib/spark-core.jar
sudo ln -s /usr/lib/spark/jars/spark-network-common_2.11-2.4.3.jar /usr/lib/hive/lib/spark-network-common.jar
sudo ln -s /usr/lib/spark/jars/spark-unsafe_2.11-2.4.3.jar /usr/lib/hive/lib/spark-unsafe.jar
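After creating the links you can confirm each one resolves to the intended jar with readlink. A sketch against a temp directory (real targets live under /usr/lib/spark/jars, the links under /usr/lib/hive/lib):

```shell
# Verify a symlink points at the intended target, using temp files in place
# of the real Spark/Hive paths.
tmp=$(mktemp -d)
touch "${tmp}/spark-core_2.11-2.4.3.jar"
ln -s "${tmp}/spark-core_2.11-2.4.3.jar" "${tmp}/spark-core.jar"
readlink "${tmp}/spark-core.jar"   # prints the versioned target path
```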
Step 3: Create an HDFS directory for the Spark jars. This lets YARN cache the required Spark dependency jars on the nodes, so they do not have to be distributed each time an application runs. As of Hive 2.2.0, upload all jars in $SPARK_HOME/jars to an HDFS folder:
hadoop fs -mkdir /spark-jars
hadoop fs -put /usr/lib/spark/jars/*.jar /spark-jars/
hadoop fs -rm /spark-jars/*hive*1.2.1*
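The rm glob strips the Hive 1.2.1 jars that Spark 2.x bundles, which would clash with the cluster's Hive 2.x libraries. The pattern can be sanity-checked against a local temp directory (the real command operates on HDFS via hadoop fs; jar names here are illustrative):

```shell
# Sketch of the Step 3 pruning glob on a local directory.
jars=$(mktemp -d)
touch "${jars}/hive-exec-1.2.1-spark2.jar" "${jars}/spark-core_2.11-2.4.3.jar"
rm "${jars}"/*hive*1.2.1*   # mirrors: hadoop fs -rm /spark-jars/*hive*1.2.1*
ls "${jars}"                # only the Spark jars remain
```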
Step 4: Add the spark-jars location and set the execution engine in hive-site.xml (replace xxxx with your master node's host name):
sudo vi /usr/lib/hive/conf/hive-site.xml
<property>
<name>spark.yarn.jars</name>
<value>hdfs://xxxx:8020/spark-jars/*</value>
</property>
<property>
<name>hive.execution.engine</name>
<value>spark</value>
</property>
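A quick way to confirm the engine setting took effect in the file is to grep the name/value pair. A sketch against a temp copy (on the cluster, grep /usr/lib/hive/conf/hive-site.xml directly):

```shell
# Sanity-check that hive.execution.engine is set to spark in a
# hive-site.xml-style fragment written to a temp file.
site=$(mktemp)
cat > "${site}" <<'EOF'
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
EOF
grep -A1 'hive.execution.engine' "${site}" | grep -o 'spark' | head -n1
```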
Step 5: Add the following to the Spark configuration:
sudo vi /etc/spark/conf/spark-defaults.conf
spark.sql.hive.metastore.version 2.3.0
spark.sql.hive.metastore.jars /usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
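spark-defaults.conf entries are whitespace-separated key/value pairs, so the settings can be read back with awk. A sketch against a temp copy (the real file is /etc/spark/conf/spark-defaults.conf):

```shell
# Read the metastore settings back out of a spark-defaults.conf-style file.
conf=$(mktemp)
printf '%s\n' \
  'spark.sql.hive.metastore.version 2.3.0' \
  'spark.sql.hive.metastore.jars /usr/lib/hive/lib/*:/usr/lib/hadoop/client/*' > "${conf}"
awk '$1 == "spark.sql.hive.metastore.version" { print $2 }' "${conf}"
```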
Step 6: Stop/Start services:
Finally, run these commands to stop and start the following services (EMR releases below 5.30 use upstart-style commands):
sudo stop hadoop-yarn-timelineserver
sudo stop hadoop-yarn-resourcemanager
sudo stop hadoop-yarn-proxyserver
sudo stop hive-hcatalog-server
sudo stop spark-history-server
sudo stop hive-server2
sudo start hadoop-yarn-timelineserver
sudo start hadoop-yarn-resourcemanager
sudo start hadoop-yarn-proxyserver
sudo start hive-hcatalog-server
sudo start hive-server2
sudo start spark-history-server
For EMR 5.30 and above (which use systemd):
sudo systemctl stop hadoop-yarn-timelineserver
sudo systemctl stop hadoop-yarn-resourcemanager
sudo systemctl stop hadoop-yarn-proxyserver
sudo systemctl stop hive-hcatalog-server
sudo systemctl stop spark-history-server
sudo systemctl stop hive-server2
sudo systemctl start hadoop-yarn-timelineserver
sudo systemctl start hadoop-yarn-resourcemanager
sudo systemctl start hadoop-yarn-proxyserver
sudo systemctl start hive-hcatalog-server
sudo systemctl start spark-history-server
sudo systemctl start hive-server2
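The stop/start pairs above can also be collapsed into a loop. A hypothetical convenience sketch (systemd form for EMR 5.30+; the commands are printed here rather than executed, so drop the echo to actually run them):

```shell
# Restart every affected service in one pass (printed, not executed).
services="hadoop-yarn-timelineserver hadoop-yarn-resourcemanager \
hadoop-yarn-proxyserver hive-hcatalog-server spark-history-server hive-server2"
for s in ${services}; do
  echo "sudo systemctl restart ${s}"
done
```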
Step 7: Test Hive on the Spark engine.
Run any query in Hive against a table and see how long it takes to complete. See the attached snapshots for how the screen looks after applying all of the above configuration and running Hive on Spark.
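One simple smoke test is to print the active engine and run a count from the Hive CLI on the master node. The table name sample_table below is a placeholder; the command is echoed here rather than executed:

```shell
# Example smoke test to run on the master node (sample_table is hypothetical).
echo 'hive -e "set hive.execution.engine; SELECT count(*) FROM sample_table;"'
```

If the engine switch worked, the set output shows hive.execution.engine=spark and the query launches Spark stages instead of MapReduce or Tez tasks.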
========================================================================================
Reference:
- https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
- https://www.linkedin.com/pulse/hive-spark-configuration-common-issues-mohamed-k
- https://spark.apache.org/docs/latest/sql-distributed-sql-engine.html#running-the-spark-sql-cli
- https://docs.cloudera.com/documentation/enterprise/latest/topics/admin_hos_tuning.html