Install Spark on an existing Hadoop cluster

Prerequisites

  • Existing Hadoop cluster
  • Spark release
  1. Download and extract the Spark distribution. Note that it should be the Spark binary archive without embedded Hadoop:
cd ~
wget https://archive.apache.org/dist/spark/spark-2.1.2/spark-2.1.2-bin-without-hadoop.tgz
tar xvzf spark-2.1.2-bin-without-hadoop.tgz
mv spark-2.1.2-bin-without-hadoop spark
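As a quick sanity check, the launcher scripts should now be present under ~/spark/bin (note that this "without hadoop" build will not run jobs until SPARK_DIST_CLASSPATH is configured in the next step):
ls ~/spark/bin/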
  2. Spark relies on the Hadoop installation already present on the node, so configuring Spark is mostly a matter of telling it where Hadoop lives. This is done by setting environment variables in spark-env.sh; the commands below append them, and the expected result is sketched after this block:
cp ~/spark/conf/spark-defaults.conf.template ~/spark/conf/spark-defaults.conf
cp ~/spark/conf/spark-env.sh.template ~/spark/conf/spark-env.sh
echo "export HADOOP_HOME=/home/ubuntu/hadoop/" >> /home/ubuntu/spark/conf/spark-env.sh
echo "export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/” >> /home/ubuntu/spark/conf/spark-env.sh
echo "export SPARK_HOME=/home/ubuntu/spark/" >> /home/ubuntu/spark/conf/spark-env.sh
echo "export SPARK_DIST_CLASSPATH=$(hadoop classpath)" >> /home/ubuntu/spark/conf/spark-env.sh
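After these commands, the tail of ~/spark/conf/spark-env.sh should look roughly like the sketch below (assuming Hadoop is installed at /home/ubuntu/hadoop; the SPARK_DIST_CLASSPATH value will be the long classpath string that hadoop classpath printed on your node):
# illustrative tail of ~/spark/conf/spark-env.sh
export HADOOP_HOME=/home/ubuntu/hadoop/
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/
export SPARK_HOME=/home/ubuntu/spark/
export SPARK_DIST_CLASSPATH=...   # actual value is the output of `hadoop classpath` on this node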
  3. There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used to request resources from YARN. The following shows how to run spark-shell on YARN in client mode (a cluster-mode example is sketched at the end of this page):
cd ~/spark
./bin/spark-shell --master yarn --deploy-mode client
  4. Then you can follow the URL http://node1:8088/cluster/apps to verify that the application has been launched.

  5. Follow the YARN proxy URL http://node1:8088/proxy/application_.../executors/ to reach the Spark shell application UI and see its executors.
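The steps above cover client mode only. As a sketch, a cluster-mode submission of the SparkPi example bundled with the distribution could look like the following; the examples jar name depends on the Scala version of the build, so the path below is an assumption (check ~/spark/examples/jars/ for the exact file):
cd ~/spark
# Cluster mode: the driver runs inside the YARN application master, not in this shell.
# The examples jar path below is an assumption; adjust it to the file present in examples/jars/.
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_2.11-2.1.2.jar 100
You can also list running applications from the command line with yarn application -list instead of using the web UI.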