Spark, PySpark, and GraphFrame - Nantawat6510545543/big-data-summary GitHub Wiki

Apache Spark Installation

  1. Download and extract Spark:
wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
tar xvf spark-3.1.1-bin-hadoop3.2.tgz
mv spark-3.1.1-bin-hadoop3.2 spark
  2. Add environment variables:
nano ~/.bashrc

Add:

export SPARK_HOME=/home/hadoop/spark
export PATH=$PATH:$SPARK_HOME/bin

Apply:

source ~/.bashrc
  3. Start Spark:
cd ~/spark
./bin/spark-shell

Or, to run locally with 4 worker threads:

./bin/spark-shell --master "local[4]"
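
Before moving on, it can help to confirm that the steps above produced the expected layout. The following is a minimal sketch in plain Python (standard library only). The helper names `missing_spark_files` and `report` are hypothetical, and the list of files checked is an assumption based on the commands used on this page, not an official manifest.

```python
import os

# Entries this page relies on, relative to SPARK_HOME. This list is an
# assumption drawn from the commands on this page.
EXPECTED = ["bin/spark-shell", "bin/pyspark", "jars"]


def missing_spark_files(spark_home):
    """Return the entries from EXPECTED that do not exist under spark_home."""
    return [rel for rel in EXPECTED
            if not os.path.exists(os.path.join(spark_home, rel))]


def report(spark_home=None):
    """Print a short status line for SPARK_HOME (defaults to the env var)."""
    spark_home = spark_home or os.environ.get("SPARK_HOME", "")
    if not spark_home:
        print("SPARK_HOME is not set; run 'source ~/.bashrc' first")
        return False
    missing = missing_spark_files(spark_home)
    if missing:
        print("Missing under", spark_home, ":", ", ".join(missing))
        return False
    print("Spark layout looks complete at", spark_home)
    return True
```

Calling `report()` from a Python prompt after `source ~/.bashrc` checks the directory named by `SPARK_HOME`; passing a path explicitly checks that directory instead.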

Spark MySQL JDBC Installation

  1. Download and move JDBC driver:
wget https://repo1.maven.org/maven2/com/mysql/mysql-connector-j/8.0.31/mysql-connector-j-8.0.31.jar -O mysql.jar
mv mysql.jar /home/hadoop/spark/jars/
  2. Add to .bashrc:
nano ~/.bashrc

Add:

export CLASSPATH="$CLASSPATH:/home/hadoop/spark/jars/*"

Apply:

source ~/.bashrc
  3. Launch PySpark with MySQL connector:
cd /home/hadoop/spark
./bin/pyspark --jars jars/mysql.jar
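
Once the PySpark shell is running with the connector on its classpath, a table can be read through Spark's JDBC data source. The sketch below is illustrative only: the helper names `mysql_jdbc_options` and `read_mysql_table` are hypothetical, and the host, port, database, table, and credentials are placeholders to replace with your own. It assumes the Connector/J 8.x driver class `com.mysql.cj.jdbc.Driver`.

```python
# Driver class shipped in MySQL Connector/J 8.x.
MYSQL_DRIVER = "com.mysql.cj.jdbc.Driver"


def mysql_jdbc_options(database, table, user, password,
                       host="localhost", port=3306):
    """Build the option dict for Spark's JDBC data source."""
    return {
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        "driver": MYSQL_DRIVER,
        "dbtable": table,
        "user": user,
        "password": password,
    }


def read_mysql_table(spark, **kwargs):
    """Load one MySQL table as a DataFrame using an existing SparkSession."""
    opts = mysql_jdbc_options(**kwargs)
    return spark.read.format("jdbc").options(**opts).load()
```

Inside the shell, where `spark` already exists, a call such as `read_mysql_table(spark, database="testdb", table="employees", user="hadoop", password="secret")` would return a DataFrame; the database, table, and credentials in that call are made-up placeholders.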

GraphFrames Installation (Spark 3.1.1)

  1. Download and move GraphFrames jar:
wget https://repos.spark-packages.org/graphframes/graphframes/0.8.2-spark3.1-s_2.12/graphframes-0.8.2-spark3.1-s_2.12.jar
mv graphframes-0.8.2-spark3.1-s_2.12.jar /home/hadoop/spark/jars
  2. Download and extract Ivy dependencies:
wget https://github.com/cchantra/bigdata.github.io/raw/refs/heads/master/spark/jars.tar.gz
tar xvf jars.tar.gz
mkdir -p ~/.ivy2/jars
mv jars/* ~/.ivy2/jars/
  3. Create/edit spark-env.sh:
cp /home/hadoop/spark/conf/spark-env.sh.template /home/hadoop/spark/conf/spark-env.sh
nano /home/hadoop/spark/conf/spark-env.sh

Add:

export PYTHONPATH=$PYTHONPATH:/home/hadoop/.ivy2/jars:/home/hadoop/spark/jars:.
export PATH=$PATH:/home/hadoop/.local/bin
export SPARK_HOME=/home/hadoop/spark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
  4. Launch PySpark with GraphFrames:
pyspark --jars /home/hadoop/spark/jars/graphframes-0.8.2-spark3.1-s_2.12.jar

Or with the --packages option:

pyspark --packages graphframes:graphframes:0.8.2-spark3.1-s_2.12
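
With the shell started by either command above, a small graph can be built to confirm that GraphFrames loads. This is a minimal sketch with made-up toy data; `build_demo_graph` and `show_basics` are hypothetical helper names, and `spark` is the session the PySpark shell provides.

```python
# Toy data: (id, name) vertices and (src, dst, relationship) edges.
VERTICES = [("a", "Alice"), ("b", "Bob"), ("c", "Carol")]
EDGES = [("a", "b", "follows"), ("b", "c", "follows"), ("c", "a", "follows")]


def build_demo_graph(spark):
    """Create a small GraphFrame from the toy data above."""
    # Imported here so graphframes is only required when a graph is built.
    from graphframes import GraphFrame

    # GraphFrames expects an "id" column on vertices and "src"/"dst" on edges.
    v = spark.createDataFrame(VERTICES, ["id", "name"])
    e = spark.createDataFrame(EDGES, ["src", "dst", "relationship"])
    return GraphFrame(v, e)


def show_basics(g):
    """Print in-degrees and PageRank scores for the graph."""
    g.inDegrees.show()
    ranks = g.pageRank(resetProbability=0.15, maxIter=5)
    ranks.vertices.select("id", "pagerank").show()
```

In the shell, `show_basics(build_demo_graph(spark))` prints both tables. If `from graphframes import GraphFrame` fails after launching with `--jars` alone, the Python package inside the jar is probably not on the Python path; the `--packages` form, or passing the same jar to `--py-files` as well, is the usual way to expose it.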

Use GraphFrames in Jupyter

PYSPARK_DRIVER_PYTHON=jupyter-lab PYSPARK_DRIVER_PYTHON_OPTS="--no-browser" \
pyspark --py-files /home/hadoop/spark/jars/graphframes-0.8.2-spark3.1-s_2.12.jar \
--jars /home/hadoop/spark/jars/graphframes-0.8.2-spark3.1-s_2.12.jar

Download demo notebook:

wget https://raw.githubusercontent.com/cchantra/bigdata.github.io/refs/heads/master/spark/graphframe.ipynb