Spark Cluster setup *Multi‐node* - cchantra/bigdata.github.io GitHub Wiki

Spark Cluster Setup (Multi-Node)

This guide summarizes how to install and configure a multi-node Apache Spark cluster (1 master + 1 or more workers) on Linux.

Architecture

Role | Hostname | Example IP
-- | -- | --
Master | server1 | IP1
Worker | server2 | IP2

Prerequisites

1. Edit /etc/hosts (on all nodes)

```
IP1 server1   # master
IP2 server2   # worker
```
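Each `/etc/hosts` line maps an IP to one or more hostnames, with `#` starting a comment. A minimal Python sketch of how such lines resolve to a hostname → IP mapping (the IPs below are made-up example values in place of IP1/IP2):

```python
# Parse /etc/hosts-style text into a hostname -> IP mapping.
def parse_hosts(text):
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split()
        ip, hostnames = parts[0], parts[1:]
        for name in hostnames:
            mapping[name] = ip
    return mapping

# Example entries mirroring the table above (addresses are placeholders):
hosts = "10.3.135.101 server1 # master\n10.3.135.102 server2 # worker\n"
print(parse_hosts(hosts))
```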

Spark Installation (All Nodes)

2. Install Java (OpenJDK 8)

```bash
sudo apt update
sudo apt install openjdk-8-jdk openjdk-8-jre
```

3. (Optional) Install Scala

```bash
sudo apt install scala
```

4. Download Apache Spark

Choose one of the following:

https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz

https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz

5. Extract and rename

```bash
tar -xvzf spark-*-bin-hadoop3.tgz
mv spark-*-bin-hadoop3 spark
```

Master Node Configuration

6. Update .bashrc (example user: hadoop)

```bash
export JAVA_HOME="$(jrunscript -e 'java.lang.System.out.println(java.lang.System.getProperty("java.home"));')"
export PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
export CLASSPATH=./
export SPARK_HOME=/home/hadoop/spark
export PATH=$PATH:$SPARK_HOME/bin
export SPARK_MASTER_HOST='<Master-IP>'
export SPARK_MASTER_WEBUI_PORT=8080
```

Reload:

```bash
source ~/.bashrc
```

7. Copy Spark config templates

```bash
cd $SPARK_HOME/conf
cp spark-env.sh.template spark-env.sh
cp workers.template workers
```

8. Edit spark-env.sh

```bash
export JAVA_HOME="$(jrunscript -e 'java.lang.System.out.println(java.lang.System.getProperty("java.home"));')"
export SPARK_WORKER_CORES=4
```

9. Edit workers

Add worker hostname after localhost:

```
server2
```
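The `workers` file lists one hostname per line, and a duplicated hostname makes Spark start an extra worker instance on that host (see the Web UI note later). A small Python sketch, with example hostnames, that builds the file contents with duplicates removed while preserving order:

```python
# Build the contents of $SPARK_HOME/conf/workers:
# one hostname per line, duplicates dropped, order preserved.
def workers_file(hostnames):
    seen = []
    for h in hostnames:
        h = h.strip()
        if h and h not in seen:
            seen.append(h)
    return "\n".join(seen) + "\n"

print(workers_file(["localhost", "server2", "server2"]))
```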

Worker (Slave) Node Configuration

10. Update .bashrc on worker

```bash
export SPARK_LOCAL_IP=10.3.135.xxx    # this worker's own IP
export SPARK_MASTER_IP=10.3.135.xxx   # the master's IP
export SPARK_MASTER_WEBUI_PORT=8080
```

Reload:

```bash
source ~/.bashrc
```

Start the Spark Cluster

11. Start cluster (on master)

```bash
cd $SPARK_HOME
sbin/start-all.sh
```

12. Verify processes

Master

```bash
jps
```

Worker

```bash
jps
```

You should see:

  • Master on server1

  • Worker on server2


Spark Web UI

13. Open firewall on master

```bash
sudo ufw allow 8080/tcp
```

14. Access Web UI

Option A: SSH tunnel (recommended)

```bash
ssh -L 8080:10.3.135.xxx:8080 user@10.3.135.xxx
```

Open browser:

http://localhost:8080

Option B: Direct

http://10.3.135.xxx:8080
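Whether the UI port is actually reachable can be checked from the client side with a simple TCP probe — a generic sketch, not a Spark tool; the host and port are whatever you exposed above:

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: probe the Spark master Web UI (via the tunnel this is localhost:8080).
print(port_open("localhost", 8080))
```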

⚠️ If workers contains duplicated entries (e.g. server2 twice), multiple workers will appear in the UI.


Preparation for Coding (PySpark + Jupyter)

15. Install pip3

```bash
sudo apt install python3-pip
```

16. Install JupyterLab

```bash
pip3 install jupyterlab --user
```

(Optional) Jupyter Configuration

17. Generate config

```bash
jupyter server --generate-config
```

18. Edit jupyter_server_config.py

Example (passwordless, remote access):

(In `jupyter_server_config.py` these options live under `c.ServerApp`; the `c.NotebookApp` names apply to the older classic-notebook config file.)

```python
c.ServerApp.token = ''
c.ServerApp.password = ''
c.ServerApp.open_browser = False
c.ServerApp.port = 8887
c.ServerApp.ip = '0.0.0.0'
c.ServerApp.allow_remote_access = True
```

Docs: https://docs.jupyter.org/en/latest/use/jupyter-directories.html


Run Jupyter as a Service (Optional)

19. Create service file

```bash
sudo vi /etc/systemd/system/Jupyter.service
```

```ini
[Unit]
Description=JupyterLab

[Service]
Type=simple
User=hadoop
Group=hadoop
WorkingDirectory=/home/hadoop
# Use the absolute path created by "pip3 install --user"
ExecStart=/home/hadoop/.local/bin/jupyter-lab
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target
```

20. Enable and start

```bash
sudo systemctl daemon-reload
sudo systemctl enable Jupyter.service
sudo systemctl start Jupyter.service
sudo systemctl status Jupyter.service
```

21. Allow Jupyter port

```bash
sudo ufw allow 8887/tcp
```

Run Jupyter on the master node and try the code example below. Note that the master URL (`setMaster()` on a SparkConf, or `.master()` on `SparkSession.builder`) must point to this cluster, e.g. `spark://server1:7077`.


(Ref: https://notebook.community/mohanprasath/BigDataExercises/week4/Spark%20Cluster%20Setup)


Run Spark Job from Jupyter (Example)

22. PySpark Example (Pi Calculation)

```python
from random import random
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("pi") \
    .master("spark://server1:7077") \
    .getOrCreate()

partitions = 100000
n = 100000 * partitions

def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x**2 + y**2 <= 1 else 0

count = spark.sparkContext \
    .parallelize(range(1, n + 1), partitions) \
    .map(f) \
    .reduce(add)

print("Pi is roughly %f" % (4.0 * count / n))
spark.stop()
```
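The Monte Carlo logic itself can be sanity-checked in plain Python before submitting to the cluster; the sketch below reuses the same sampling function `f` with a local loop (the seed and sample count here are arbitrary choices, not from the original example):

```python
import random

def f(_):
    # Sample a point in the 2x2 square centered at the origin;
    # count it if it falls inside the unit circle.
    x = random.random() * 2 - 1
    y = random.random() * 2 - 1
    return 1 if x**2 + y**2 <= 1 else 0

random.seed(42)  # fixed seed for a reproducible run
n = 100_000
count = sum(f(i) for i in range(n))
print("Pi is roughly %f" % (4.0 * count / n))
```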

Verification

  • Check Spark Web UI → Jobs / Executors

  • Confirm tasks are distributed across workers

  • Observe CPU & memory usage


Next Exploration

Try varying:

  • Number of workers

  • SPARK_WORKER_CORES

  • Number of partitions

👉 Observe how execution time and resource utilization change.
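As a rough model of the partition experiment: `parallelize(range(1, n + 1), partitions)` splits the range into `partitions` nearly equal slices, and worker cores process slices in parallel. The sketch below uses the generic even-split rule (an assumption for illustration, not Spark's exact internal slicing code):

```python
# Split n items into p nearly equal slices, as a model of how
# parallelize(range(n), p) distributes work across worker cores.
def slice_sizes(n, p):
    base, extra = divmod(n, p)
    return [base + (1 if i < extra else 0) for i in range(p)]

sizes = slice_sizes(1_000_000, 8)
print(sizes)       # eight slices of 125000 items each
print(sum(sizes))  # 1000000
```

More partitions mean smaller slices and better load balancing, at the cost of more per-task scheduling overhead — which is exactly the trade-off to observe when varying the numbers above.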
