Spark Cluster setup *Multi‐node* - cchantra/bigdata.github.io GitHub Wiki

Spark Cluster Setup (Multi-Node)

This guide summarizes how to install and configure a multi-node Apache Spark cluster (1 master + 1 or more workers) on Linux.

Architecture

Role | Hostname | Example IP
-- | -- | --
Master | server1 | IP1
Worker | server2 | IP2

Prerequisites

1. Edit /etc/hosts (on all nodes)

```
IP1 server1   # master
IP2 server2   # worker
```
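Each `/etc/hosts` line maps an IP to one or more hostnames, with `#` starting a comment. A minimal Python sketch of how such lines resolve to a hostname → IP mapping (the IPs below are made-up example values in place of IP1/IP2):

```python
# Parse /etc/hosts-style text into a hostname -> IP mapping.
def parse_hosts(text):
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split()
        ip, hostnames = parts[0], parts[1:]
        for name in hostnames:
            mapping[name] = ip
    return mapping

# Example entries mirroring the table above (addresses are placeholders):
hosts = "10.3.135.101 server1 # master\n10.3.135.102 server2 # worker\n"
print(parse_hosts(hosts))
```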

Spark Installation (All Nodes)

2. Install Java (OpenJDK 8)

```bash
sudo apt update
sudo apt install openjdk-8-jdk openjdk-8-jre
```

3. (Optional) Install Scala

```bash
sudo apt install scala
```

4. Download Apache Spark

Choose one of the following:

https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz

https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz

5. Extract and rename

```bash
tar -xvzf spark-*-bin-hadoop3.tgz
mv spark-*-bin-hadoop3 spark
```

Master Node Configuration

6. Update .bashrc (example user: hadoop)

```bash
export JAVA_HOME="$(jrunscript -e 'java.lang.System.out.println(java.lang.System.getProperty("java.home"));')"
export PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
export CLASSPATH=./
export SPARK_HOME=/home/hadoop/spark
export PATH=$PATH:$SPARK_HOME/bin
export SPARK_MASTER_HOST='<Master-IP>'
export SPARK_MASTER_WEBUI_PORT=8080
```

Reload:

```bash
source ~/.bashrc
```

7. Copy Spark config templates

```bash
cd $SPARK_HOME/conf
cp spark-env.sh.template spark-env.sh
cp workers.template workers
```

8. Edit spark-env.sh

```bash
export JAVA_HOME="$(jrunscript -e 'java.lang.System.out.println(java.lang.System.getProperty("java.home"));')"
export SPARK_WORKER_CORES=4
```

9. Edit workers

Add worker hostname after localhost:

```
server2
```
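The `workers` file lists one hostname per line, and a duplicated hostname makes Spark start an extra worker instance on that host (see the Web UI note later). A small Python sketch, with example hostnames, that builds the file contents with duplicates removed while preserving order:

```python
# Build the contents of $SPARK_HOME/conf/workers:
# one hostname per line, duplicates dropped, order preserved.
def workers_file(hostnames):
    seen = []
    for h in hostnames:
        h = h.strip()
        if h and h not in seen:
            seen.append(h)
    return "\n".join(seen) + "\n"

print(workers_file(["localhost", "server2", "server2"]))
```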

Worker (Slave) Node Configuration

10. Update .bashrc on worker

```bash
export SPARK_LOCAL_IP=10.3.135.xxx    # this worker's own IP
export SPARK_MASTER_IP=10.3.135.xxx   # the master's IP
export SPARK_MASTER_WEBUI_PORT=8080
```

Reload:

```bash
source ~/.bashrc
```

Start the Spark Cluster

11. Start cluster (on master)

```bash
cd $SPARK_HOME
sbin/start-all.sh
```

12. Verify processes

Master

```bash
jps
```

Worker

```bash
jps
```

You should see:

  • Master on server1

  • Worker on server2


Spark Web UI

13. Open firewall on master

```bash
sudo ufw allow 8080/tcp
```

14. Access Web UI

Option A: SSH tunnel (recommended)

```bash
ssh -L 8080:10.3.135.xxx:8080 user@10.3.135.xxx
```

Open browser:

http://localhost:8080

Option B: Direct

http://10.3.135.xxx:8080
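Whether the UI port is actually reachable can be checked from the client side with a simple TCP probe — a generic sketch, not a Spark tool; the host and port are whatever you exposed above:

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: probe the Spark master Web UI (via the tunnel this is localhost:8080).
print(port_open("localhost", 8080))
```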

⚠️ If workers contains duplicated entries (e.g. server2 twice), multiple workers will appear in the UI.


Preparation for Coding (PySpark + Jupyter)

15. Install pip3

```bash
sudo apt install python3-pip
```

16. Install JupyterLab

```bash
pip3 install jupyterlab --user
```

(Optional) Jupyter Configuration

17. Generate config

```bash
jupyter server --generate-config
```

18. Edit jupyter_server_config.py

Example (passwordless, remote access):

(In `jupyter_server_config.py` these options live under `c.ServerApp`; the `c.NotebookApp` names apply to the older classic-notebook config file.)

```python
c.ServerApp.token = ''
c.ServerApp.password = ''
c.ServerApp.open_browser = False
c.ServerApp.port = 8887
c.ServerApp.ip = '0.0.0.0'
c.ServerApp.allow_remote_access = True
```

Docs: https://docs.jupyter.org/en/latest/use/jupyter-directories.html


Run Jupyter as a Service (Optional)

19. Create service file

```bash
sudo vi /etc/systemd/system/Jupyter.service
```

```ini
[Unit]
Description=JupyterLab

[Service]
Type=simple
User=hadoop
Group=hadoop
WorkingDirectory=/home/hadoop
# Use the absolute path created by "pip3 install --user"
ExecStart=/home/hadoop/.local/bin/jupyter-lab
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target
```

20. Enable and start

```bash
sudo systemctl daemon-reload
sudo systemctl enable Jupyter.service
sudo systemctl start Jupyter.service
sudo systemctl status Jupyter.service
```

21. Allow Jupyter port

```bash
sudo ufw allow 8887/tcp
```

Run Jupyter on the master node and try the code example below. Note that the master URL (`setMaster()` on a SparkConf, or `.master()` on `SparkSession.builder`) must point to this cluster, e.g. `spark://server1:7077`.


(Ref: https://notebook.community/mohanprasath/BigDataExercises/week4/Spark%20Cluster%20Setup)


Run Spark Job from Jupyter (Example)

22. PySpark Example (Pi Calculation)

```python
from random import random
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("pi") \
    .master("spark://server1:7077") \
    .getOrCreate()

partitions = 100000
n = 100000 * partitions

def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x**2 + y**2 <= 1 else 0

count = spark.sparkContext \
    .parallelize(range(1, n + 1), partitions) \
    .map(f) \
    .reduce(add)

print("Pi is roughly %f" % (4.0 * count / n))
spark.stop()
```
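The Monte Carlo logic itself can be sanity-checked in plain Python before submitting to the cluster; the sketch below reuses the same sampling function `f` with a local loop (the seed and sample count here are arbitrary choices, not from the original example):

```python
import random

def f(_):
    # Sample a point in the 2x2 square centered at the origin;
    # count it if it falls inside the unit circle.
    x = random.random() * 2 - 1
    y = random.random() * 2 - 1
    return 1 if x**2 + y**2 <= 1 else 0

random.seed(42)  # fixed seed for a reproducible run
n = 100_000
count = sum(f(i) for i in range(n))
print("Pi is roughly %f" % (4.0 * count / n))
```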

Verification

  • Check Spark Web UI → Jobs / Executors

  • Confirm tasks are distributed across workers

  • Observe CPU & memory usage


Next Exploration

Try varying:

  • Number of workers

  • SPARK_WORKER_CORES

  • Number of partitions

👉 Observe how execution time and resource utilization change.
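As a rough model of the partition experiment: `parallelize(range(1, n + 1), partitions)` splits the range into `partitions` nearly equal slices, and worker cores process slices in parallel. The sketch below uses the generic even-split rule (an assumption for illustration, not Spark's exact internal slicing code):

```python
# Split n items into p nearly equal slices, as a model of how
# parallelize(range(n), p) distributes work across worker cores.
def slice_sizes(n, p):
    base, extra = divmod(n, p)
    return [base + (1 if i < extra else 0) for i in range(p)]

sizes = slice_sizes(1_000_000, 8)
print(sizes)       # eight slices of 125000 items each
print(sum(sizes))  # 1000000
```

More partitions mean smaller slices and better load balancing, at the cost of more per-task scheduling overhead — which is exactly the trade-off to observe when varying the numbers above.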
