Spark Cluster Setup (Multi-node) - cchantra/bigdata.github.io GitHub Wiki
This guide summarizes how to install and configure a multi-node Apache Spark cluster (1 master + 1 or more workers) on Linux.
Source references:

- DataFlair: https://data-flair.training/blogs/install-apache-spark-multi-node-cluster/
- LinkedIn: https://www.linkedin.com/pulse/how-setup-install-apache-spark-311-cluster-ubuntu-shrivastava/
- Spark example: https://notebook.community/mohanprasath/BigDataExercises/week4/Spark%20Cluster%20Setup
Add the nodes' IPs and hostnames to `/etc/hosts` on every node (`IP1`/`IP2` are placeholders for the actual addresses):

```
IP1   server1   # master
IP2   server2   # worker
```
Install Java and Scala:

```
sudo apt update
sudo apt install openjdk-8-jdk openjdk-8-jre
sudo apt install scala
```
Download Spark. Choose one of the following:

- https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
- https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
Extract the archive and rename the directory:

```
tar -xvzf spark-*-bin-hadoop3.tgz
mv spark-*-bin-hadoop3 spark
```
Add the following to `~/.bashrc` on the master (adjust the paths and replace `<Master-IP>`):

```
export JAVA_HOME="$(jrunscript -e 'java.lang.System.out.println(java.lang.System.getProperty("java.home"));')"
export PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
export CLASSPATH=./
export SPARK_HOME=/home/hadoop/spark
export PATH=$PATH:$SPARK_HOME/bin
export SPARK_MASTER_HOST='<Master-IP>'
export SPARK_MASTER_WEBUI_PORT=8080
```
Reload:

```
source ~/.bashrc
```
Create the config files from their templates:

```
cd $SPARK_HOME/conf
cp spark-env.sh.template spark-env.sh
cp workers.template workers
```
Edit `spark-env.sh` and add:

```
export JAVA_HOME="$(jrunscript -e 'java.lang.System.out.println(java.lang.System.getProperty("java.home"));')"
export SPARK_WORKER_CORES=4
```
Edit `workers` and add the worker hostname after `localhost`:

```
localhost
server2
```
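Each non-comment line in `workers` starts one worker daemon, so a duplicated hostname would start two workers on the same machine. A quick sanity check for the file (a minimal sketch; the file contents are hard-coded here as an assumption, in practice read `$SPARK_HOME/conf/workers`):

```python
from collections import Counter

# Hypothetical contents of $SPARK_HOME/conf/workers; in practice use
# open("/home/hadoop/spark/conf/workers").readlines() instead.
lines = ["localhost", "server2"]

# Ignore blank lines and comments, then look for repeated hostnames.
hosts = [ln.strip() for ln in lines if ln.strip() and not ln.lstrip().startswith("#")]
dups = sorted(h for h, c in Counter(hosts).items() if c > 1)
print("duplicate hosts:", dups)  # an empty list means one worker per host
```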
On each node, add to `~/.bashrc` (`SPARK_LOCAL_IP` is the node's own IP; `SPARK_MASTER_IP` is the master's IP):

```
export SPARK_LOCAL_IP=10.3.135.xxx
export SPARK_MASTER_IP=10.3.135.xxx
export SPARK_MASTER_WEBUI_PORT=8080
```
Reload:

```
source ~/.bashrc
```
Start the cluster from the master:

```
cd $SPARK_HOME
sbin/start-all.sh
```
Verify the daemons with `jps` on each node:

```
jps
```

You should see:

- `Master` on server1
- `Worker` on server2
Open the Web UI port in the firewall:

```
sudo ufw allow 8080/tcp
```
Option A: SSH tunnel (recommended)

```
ssh -L 8080:10.3.135.xxx:8080 <user>@<master-IP>
```

(Replace `<user>@<master-IP>` with your login on the master.)
Open browser:
http://localhost:8080
Option B: Direct
http://10.3.135.xxx:8080
⚠️ If `workers` contains duplicated entries (e.g. `server2` twice), multiple workers will appear in the UI.
Install JupyterLab and generate a config:

```
sudo apt install python3-pip
pip3 install jupyterlab --user
jupyter server --generate-config
```
Example (passwordless, remote access):

```python
c.NotebookApp.token = ''
c.NotebookApp.password = u''
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8887
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.allow_remote_access = True
```
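Note: `jupyter server --generate-config` writes `~/.jupyter/jupyter_server_config.py`, where these options live under `c.ServerApp` rather than the older `c.NotebookApp`. An equivalent sketch of that config fragment (same values as above):

```python
c.ServerApp.token = ''
c.ServerApp.password = ''
c.ServerApp.open_browser = False
c.ServerApp.port = 8887
c.ServerApp.ip = '0.0.0.0'
c.ServerApp.allow_remote_access = True
```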
Docs: https://docs.jupyter.org/en/latest/use/jupyter-directories.html
Create a systemd service so JupyterLab starts automatically:

```
sudo vi /etc/systemd/system/Jupyter.service
```

```ini
[Unit]
Description=JupyterLab

[Service]
Type=simple
User=hadoop
Group=hadoop
WorkingDirectory=/home/hadoop
ExecStart=jupyter-lab
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target
```

(Older systemd versions require an absolute path in `ExecStart`; a `pip3 install --user` install typically places `jupyter-lab` in `/home/hadoop/.local/bin/`.)
```
sudo systemctl enable Jupyter.service
sudo systemctl start Jupyter.service
sudo systemctl status Jupyter.service
```
Open the Jupyter port:

```
sudo ufw allow 8887/tcp
```
Run Jupyter on the master and try the code example below. Note the `.master()` call (`setMaster()` when using `SparkConf`), which points the session at this cluster.
(Ref: https://notebook.community/mohanprasath/BigDataExercises/week4/Spark%20Cluster%20Setup)

```python
from random import random
from operator import add
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("pi")
         .master("spark://server1:7077")
         .getOrCreate())

partitions = 100000
n = 100000 * partitions

def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

count = (spark.sparkContext
         .parallelize(range(1, n + 1), partitions)
         .map(f)
         .reduce(add))

print("Pi is roughly %f" % (4.0 * count / n))
spark.stop()
```
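The estimate can be sanity-checked without a cluster: the same Monte Carlo sampling in plain Python (a minimal local sketch with a much smaller `n`, seeded for repeatability):

```python
from random import random, seed

seed(0)  # deterministic runs

def inside_unit_circle(_):
    # Same test as f() in the Spark job: sample a point in [-1, 1] x [-1, 1]
    # and count it if it falls inside the unit circle.
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

n = 100_000  # far smaller than the cluster run
count = sum(inside_unit_circle(i) for i in range(n))
pi_estimate = 4.0 * count / n
print("Pi is roughly %f" % pi_estimate)
```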
- Check the Spark Web UI → Jobs / Executors
- Confirm tasks are distributed across workers
- Observe CPU & memory usage
Try varying:

- Number of workers
- `SPARK_WORKER_CORES`
- Number of partitions
👉 Observe how execution time and resource utilization change.
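For comparing these configurations, wall-clock timing around the Spark action is usually enough. A small helper (a sketch: `timed` is a name introduced here, and it is demonstrated on a cheap local computation rather than the actual Spark job):

```python
import time

def timed(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

# On the cluster you would wrap the Spark action, e.g.:
#   count, secs = timed(lambda: rdd.map(f).reduce(add))
# Demonstrated here on a trivial stand-in computation:
result, secs = timed(lambda: sum(range(1_000_000)))
print("result=%d, took %.4f s" % (result, secs))
```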