SparkML
Download the data set
We use a dataset from Kaggle (bank.csv):
https://www.kaggle.com/rouseguy/bankbalanced/version/1
wget https://raw.githubusercontent.com/cchantra/bigdata.github.io/master/spark/bank.csv
Copy bank.csv to HDFS (or keep it on the local file system). For HDFS:
hdfs dfs -put bank.csv /bank.csv
Install pyspark for the hadoop user
pip3 install pyspark --user
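To verify the installation, you can run a quick optional check in python3:

import pyspark
print(pyspark.__version__)   # should print the installed pyspark version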
Install the Jupyter notebook as the hadoop user (you can skip this if you already have it)
sudo apt update
sudo apt install python3-pip python3-dev
export LC_ALL=C
pip3 install jupyter
This will take a while. Set up the library path for Python: add the following lines to your .bashrc. Don't forget to source it afterwards.
export PATH=$PATH:/home/hadoop/.local/bin
export SPARK_HOME=/home/hadoop/spark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.8.1-src.zip:$PYTHONPATH
Then
source ~/.bashrc
Set up a password for the Jupyter notebook (optional):
jupyter notebook password
Enter password: <enter your pass>
Verify password: <enter it again>
[NotebookPasswordApp] Wrote hashed password to /home/user1/.jupyter/jupyter_notebook_config.json
Alternatively, you can set up Jupyter with NO password:
jupyter notebook --generate-config
Then edit jupyter_notebook_config.py in the .jupyter folder in your home directory. Add the following lines:
c.NotebookApp.token = ''
c.NotebookApp.password = u''
c.NotebookApp.open_browser = True
c.NotebookApp.ip = 'localhost'
Next, create an SSH tunnel to the Jupyter notebook port.
Suppose you run the Jupyter notebook on port 8889, i.e.
jupyter notebook --no-browser --port=8889
On your local computer, create the SSH tunnel:
ssh -N -L 8889:localhost:8889 [email protected]
where [email protected] is your username@ipaddr_vm
The first 8889 is your local computer port and second 8889 is remote computer port.
Whatever running on the second 8889 is shown in the first 8889 in your local computer.
Then, in your web browser, go to http://localhost:8889 (Jupyter serves plain HTTP unless you configure TLS).
Then open a new notebook with the Python 3 kernel.
Try the dataset
Copy the following code into the notebook and click the Run button.
from pyspark.sql import SparkSession

# create (or reuse) a Spark session
spark = SparkSession.builder.appName('ml-bank').getOrCreate()
# read bank.csv from HDFS; adjust the path to wherever you put the file
df = spark.read.csv('hdfs://localhost:9000/bank.csv', header=True, inferSchema=True)
df.printSchema()
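Once the schema prints, you can take a quick look at the data in a new cell (optional; standard DataFrame calls only):

# show the first five rows and count the records
df.show(5)
print(df.count())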
The full code is on GitHub to follow along. Or get it with:
wget https://raw.githubusercontent.com/cchantra/bigdata.github.io/master/spark/test-pyspark-ml.ipynb
Open it in the Jupyter notebook, change the location of bank.csv to match your HDFS path, and run each cell. Don't forget to install pandas:
pip3 install pandas matplotlib
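In the notebook, pandas is used to display query results as tables. A minimal sketch (assuming df from the first cell; the notebook's exact cells may differ):

# pull a small sample of the Spark DataFrame into pandas for display/plotting
sample_pdf = df.limit(5).toPandas()
print(sample_pdf)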
This code tries several models: decision tree, random forest, gradient-boosted tree, etc. It shows a machine learning pipeline built with SparkML, with the area under the ROC curve (AUC) as the evaluation metric.
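As a rough sketch of that pipeline (not the notebook's exact code: the label column 'deposit' and the numeric feature columns below are assumptions based on bank.csv, so check them against your printSchema() output):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# index the string label ('deposit' = yes/no) into a numeric 'label' column
label_indexer = StringIndexer(inputCol='deposit', outputCol='label')

# assemble the numeric columns into a single feature vector
numeric_cols = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
assembler = VectorAssembler(inputCols=numeric_cols, outputCol='features')

# decision tree classifier; the rest of the pipeline is model-agnostic
dt = DecisionTreeClassifier(featuresCol='features', labelCol='label')

pipeline = Pipeline(stages=[label_indexer, assembler, dt])

# split the data, fit on the training part, predict on the test part
train, test = df.randomSplit([0.7, 0.3], seed=42)
model = pipeline.fit(train)
predictions = model.transform(test)

# area under the ROC curve on the held-out data
evaluator = BinaryClassificationEvaluator(metricName='areaUnderROC')
print('Test AUC:', evaluator.evaluate(predictions))

Swapping DecisionTreeClassifier for RandomForestClassifier or GBTClassifier changes the model while the rest of the pipeline stays the same.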