Install Spark on an existing Hadoop cluster

Prerequisites

  • Existing Hadoop cluster
  • Spark release
  1. Download and extract the Spark distribution. Note that it should be the Spark binary archive without embedded Hadoop:
cd ~
wget https://archive.apache.org/dist/spark/spark-2.1.2/spark-2.1.2-bin-without-hadoop.tgz
tar xvzf spark-2.1.2-bin-without-hadoop.tgz
mv spark-2.1.2-bin-without-hadoop spark
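As a quick sanity check, the launcher scripts should now be present under ~/spark/bin (note that this "without hadoop" build will not run jobs until SPARK_DIST_CLASSPATH is configured in the next step):
ls ~/spark/bin/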
  2. Spark relies on the Hadoop installation already present on the node, so configuring Spark is mostly a matter of telling it where Hadoop lives. This is done by setting environment variables in spark-env.sh; the commands below append them, and the expected result is sketched after this block:
cp ~/spark/conf/spark-defaults.conf.template ~/spark/conf/spark-defaults.conf
cp ~/spark/conf/spark-env.sh.template ~/spark/conf/spark-env.sh
echo "export HADOOP_HOME=/home/ubuntu/hadoop/" >> /home/ubuntu/spark/conf/spark-env.sh
echo "export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/” >> /home/ubuntu/spark/conf/spark-env.sh
echo "export SPARK_HOME=/home/ubuntu/spark/" >> /home/ubuntu/spark/conf/spark-env.sh
echo "export SPARK_DIST_CLASSPATH=$(hadoop classpath)" >> /home/ubuntu/spark/conf/spark-env.sh
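After these commands, the tail of ~/spark/conf/spark-env.sh should look roughly like the sketch below (assuming Hadoop is installed at /home/ubuntu/hadoop; the SPARK_DIST_CLASSPATH value will be the long classpath string that hadoop classpath printed on your node):
# illustrative tail of ~/spark/conf/spark-env.sh
export HADOOP_HOME=/home/ubuntu/hadoop/
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/
export SPARK_HOME=/home/ubuntu/spark/
export SPARK_DIST_CLASSPATH=...   # actual value is the output of `hadoop classpath` on this node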
  3. There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used to request resources from YARN. The following shows how to run spark-shell on YARN in client mode (a cluster-mode example is sketched at the end of this page):
cd ~/spark
./bin/spark-shell --master yarn --deploy-mode client
  4. Then you can follow the URL http://node1:8088/cluster/apps to verify that the application has been launched.

  5. Follow the YARN proxy URL http://node1:8088/proxy/application_.../executors/ to reach the Spark shell application UI and see its executors.
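The steps above cover client mode only. As a sketch, a cluster-mode submission of the SparkPi example bundled with the distribution could look like the following; the examples jar name depends on the Scala version of the build, so the path below is an assumption (check ~/spark/examples/jars/ for the exact file):
cd ~/spark
# Cluster mode: the driver runs inside the YARN application master, not in this shell.
# The examples jar path below is an assumption; adjust it to the file present in examples/jars/.
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_2.11-2.1.2.jar 100
You can also list running applications from the command line with yarn application -list instead of using the web UI.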