Spark, PySpark, and GraphFrames
Apache Spark Installation
- Download and extract Spark (run these from the home directory, /home/hadoop, so the paths used below match):
wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
tar xvf spark-3.1.1-bin-hadoop3.2.tgz
mv spark-3.1.1-bin-hadoop3.2 spark
- Add environment variables:
nano ~/.bashrc
Add:
export SPARK_HOME=/home/hadoop/spark
export PATH=$PATH:$SPARK_HOME/bin
Apply:
source ~/.bashrc
- Start the Spark shell with the default master, or run it locally with 4 worker threads using --master local[4]:
cd ~/spark
./bin/spark-shell
./bin/spark-shell --master local[4]
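Before moving on, it can help to confirm that the extracted directory looks the way the steps above expect. The following is a minimal sketch, not part of the original instructions: the function name check_spark_home and the shape of its report are hypothetical, and it only inspects files on disk without starting Spark.

```python
import os
from pathlib import Path


def check_spark_home(spark_home=None):
    """Report what is present under the Spark install directory.

    Hypothetical helper: falls back to the SPARK_HOME environment
    variable when no path is passed in.
    """
    value = spark_home or os.environ.get("SPARK_HOME")
    if not value:
        raise ValueError("SPARK_HOME is not set and no path was given")
    root = Path(value)
    launchers = ["spark-shell", "pyspark", "spark-submit"]
    return {
        "spark_home": str(root),
        # Launch scripts expected under bin/ in a Spark binary distribution.
        "missing_launchers": [
            name for name in launchers if not (root / "bin" / name).is_file()
        ],
        # Number of jar files under jars/ (extra drivers are placed here too).
        "jar_count": len(list((root / "jars").glob("*.jar"))),
    }
```

After running source ~/.bashrc, calling print(check_spark_home()) in a Python session should show an empty missing_launchers list if the archive was extracted and renamed as described.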
Spark MySQL JDBC Installation
- Download and move JDBC driver:
wget https://repo1.maven.org/maven2/com/mysql/mysql-connector-j/8.0.31/mysql-connector-j-8.0.31.jar -O mysql.jar
mv mysql.jar /home/hadoop/spark/jars/
- Add to .bashrc:
nano ~/.bashrc
Add:
export CLASSPATH="$CLASSPATH:/home/hadoop/spark/jars/*"
Apply:
source ~/.bashrc
- Launch PySpark with MySQL connector:
cd /home/hadoop/spark
./bin/pyspark --jars jars/mysql.jar
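Once the PySpark shell is running with the connector loaded, a table can be read through Spark's JDBC data source. The sketch below is an illustration under assumptions: the helper names mysql_jdbc_options and read_mysql_table are hypothetical, and the database, table, user, and password in the usage note are placeholders to replace with your own values. The driver class com.mysql.cj.jdbc.Driver is the one shipped with MySQL Connector/J 8.

```python
def mysql_jdbc_options(database, table, user, password,
                       host="localhost", port=3306):
    """Build the option dict for Spark's JDBC data source (hypothetical helper)."""
    return {
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        # Driver class provided by MySQL Connector/J 8.x.
        "driver": "com.mysql.cj.jdbc.Driver",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def read_mysql_table(spark, options):
    """Load one MySQL table as a DataFrame via spark.read.format("jdbc")."""
    return spark.read.format("jdbc").options(**options).load()
```

In the PySpark shell, where the spark session already exists, usage might look like df = read_mysql_table(spark, mysql_jdbc_options("shop", "orders", "hadoop", "secret")) followed by df.printSchema(). The names shop, orders, hadoop, and secret are placeholders, not values from this guide.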
GraphFrames Installation (Spark 3.1.1)
- Download and move GraphFrames jar:
wget https://repos.spark-packages.org/graphframes/graphframes/0.8.2-spark3.1-s_2.12/graphframes-0.8.2-spark3.1-s_2.12.jar
mv graphframes-0.8.2-spark3.1-s_2.12.jar /home/hadoop/spark/jars
- Download and extract Ivy dependencies:
wget https://github.com/cchantra/bigdata.github.io/raw/refs/heads/master/spark/jars.tar.gz
tar xvf jars.tar.gz
mkdir -p ~/.ivy2/jars
mv jars/* ~/.ivy2/jars/
- Create/edit spark-env.sh:
cp /home/hadoop/spark/conf/spark-env.sh.template /home/hadoop/spark/conf/spark-env.sh
nano /home/hadoop/spark/conf/spark-env.sh
Add:
export PYTHONPATH=$PYTHONPATH:/home/hadoop/.ivy2/jars:/home/hadoop/spark/jars:.
export PATH=$PATH:/home/hadoop/.local/bin
export SPARK_HOME=/home/hadoop/spark
# The py4j archive name depends on the Spark release; check it with: ls $SPARK_HOME/python/lib
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
- Launch PySpark with GraphFrames:
pyspark --jars /home/hadoop/spark/jars/graphframes-0.8.2-spark3.1-s_2.12.jar
Or with the --packages option:
pyspark --packages graphframes:graphframes:0.8.2-spark3.1-s_2.12
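With the shell started in one of these ways, a small graph can be built to confirm that GraphFrames loads. This is a minimal sketch with made-up sample data: the names VERTICES, EDGES, dangling_edges, and build_graph are hypothetical. GraphFrames expects the vertex DataFrame to have an id column and the edge DataFrame to have src and dst columns. The Python import inside build_graph only succeeds if the GraphFrames jar is also visible to Python, which is assumed here to be the case when launching with --packages or --py-files.

```python
# Hypothetical sample data: (id, name) vertices and (src, dst, relationship) edges.
VERTICES = [("a", "Alice"), ("b", "Bob"), ("c", "Carol")]
EDGES = [("a", "b", "follows"), ("b", "c", "follows"), ("c", "a", "follows")]


def dangling_edges(vertices, edges):
    """Return the edges whose src or dst is not a known vertex id."""
    ids = {v[0] for v in vertices}
    return [e for e in edges if e[0] not in ids or e[1] not in ids]


def build_graph(spark):
    """Create a GraphFrame from the sample data using an existing SparkSession."""
    # Imported lazily: the module is only available when the jar is loaded.
    from graphframes import GraphFrame

    v = spark.createDataFrame(VERTICES, ["id", "name"])
    e = spark.createDataFrame(EDGES, ["src", "dst", "relationship"])
    return GraphFrame(v, e)
```

In the PySpark shell, g = build_graph(spark) followed by g.inDegrees.show() or g.pageRank(resetProbability=0.15, maxIter=5).vertices.show() exercises the library. The dangling_edges check is plain Python and can be run on your own data before building the graph.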
Use GraphFrames in Jupyter
PYSPARK_DRIVER_PYTHON=jupyter-lab PYSPARK_DRIVER_PYTHON_OPTS="--no-browser" \
pyspark --py-files /home/hadoop/spark/jars/graphframes-0.8.2-spark3.1-s_2.12.jar \
--jars /home/hadoop/spark/jars/graphframes-0.8.2-spark3.1-s_2.12.jar
Download demo notebook:
wget https://raw.githubusercontent.com/cchantra/bigdata.github.io/refs/heads/master/spark/graphframe.ipynb