SparkML
Download the data set
We use a dataset from Kaggle (bank.csv):
https://www.kaggle.com/rouseguy/bankbalanced/version/1
wget https://raw.githubusercontent.com/cchantra/bigdata.github.io/master/spark/bank.csv
Copy bank.csv to HDFS (or keep it on the local file system). For HDFS:
hdfs dfs -put bank.csv /bank.csv
Install pyspark for the hadoop user
pip3 install pyspark --user
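To verify the installation, you can run a quick optional check in python3:

import pyspark
print(pyspark.__version__)   # should print the installed pyspark version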
Install the Jupyter notebook as the hadoop user (you can skip this if you already have it)
sudo apt update
sudo apt install python3-pip python3-dev
export LC_ALL=C
pip3 install jupyter
This will take a while. Set up the library path for Python: add the following lines to your .bashrc. Don't forget to source it afterwards.
export PATH=$PATH:/home/hadoop/.local/bin
export SPARK_HOME=/home/hadoop/spark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.8.1-src.zip:$PYTHONPATH
Then
source ~/.bashrc
Set up a password for the Jupyter notebook (optional):
jupyter notebook password
Enter password: <enter your pass>
Verify password: <enter it again>
[NotebookPasswordApp] Wrote hashed password to /home/user1/.jupyter/jupyter_notebook_config.json
Alternatively, you can set up Jupyter with NO password:
jupyter notebook --generate-config
Then edit jupyter_notebook_config.py in the .jupyter folder in your home directory. Add the following lines:
c.NotebookApp.token = ''
c.NotebookApp.password = u''
c.NotebookApp.open_browser = True
c.NotebookApp.ip = 'localhost'
Next, create an SSH tunnel to the Jupyter notebook port.
Suppose you run the Jupyter notebook on port 8889, i.e.
jupyter notebook --no-browser --port=8889
On your local computer, create the SSH tunnel:
ssh -N -L 8889:localhost:8889 [email protected]
where [email protected] is your username@ipaddr_vm
The first 8889 is your local computer port and second 8889 is remote computer port.
Whatever running on the second 8889 is shown in the first 8889 in your local computer.
Then, in your web browser, go to http://localhost:8889 (Jupyter serves plain HTTP unless you configure TLS).
Then open a new notebook with the Python 3 kernel.
Try the dataset
Copy the following code into the notebook and click the Run button.
from pyspark.sql import SparkSession

# create (or reuse) a Spark session
spark = SparkSession.builder.appName('ml-bank').getOrCreate()
# read bank.csv from HDFS; adjust the path to wherever you put the file
df = spark.read.csv('hdfs://localhost:9000/bank.csv', header=True, inferSchema=True)
df.printSchema()
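Once the schema prints, you can take a quick look at the data in a new cell (optional; standard DataFrame calls only):

# show the first five rows and count the records
df.show(5)
print(df.count())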
The full code is on GitHub to follow along. Or get it with:
wget https://raw.githubusercontent.com/cchantra/bigdata.github.io/master/spark/test-pyspark-ml.ipynb
Open it in the Jupyter notebook, change the location of bank.csv to match your HDFS path, and run each cell. Don't forget to install pandas:
pip3 install pandas matplotlib
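In the notebook, pandas is used to display query results as tables. A minimal sketch (assuming df from the first cell; the notebook's exact cells may differ):

# pull a small sample of the Spark DataFrame into pandas for display/plotting
sample_pdf = df.limit(5).toPandas()
print(sample_pdf)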
This code tries several models: decision tree, random forest, gradient-boosted tree, etc. It shows a machine learning pipeline built with SparkML, with the area under the ROC curve (AUC) as the evaluation metric.
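As a rough sketch of that pipeline (not the notebook's exact code: the label column 'deposit' and the numeric feature columns below are assumptions based on bank.csv, so check them against your printSchema() output):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# index the string label ('deposit' = yes/no) into a numeric 'label' column
label_indexer = StringIndexer(inputCol='deposit', outputCol='label')

# assemble the numeric columns into a single feature vector
numeric_cols = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
assembler = VectorAssembler(inputCols=numeric_cols, outputCol='features')

# decision tree classifier; the rest of the pipeline is model-agnostic
dt = DecisionTreeClassifier(featuresCol='features', labelCol='label')

pipeline = Pipeline(stages=[label_indexer, assembler, dt])

# split the data, fit on the training part, predict on the test part
train, test = df.randomSplit([0.7, 0.3], seed=42)
model = pipeline.fit(train)
predictions = model.transform(test)

# area under the ROC curve on the held-out data
evaluator = BinaryClassificationEvaluator(metricName='areaUnderROC')
print('Test AUC:', evaluator.evaluate(predictions))

Swapping DecisionTreeClassifier for RandomForestClassifier or GBTClassifier changes the model while the rest of the pipeline stays the same.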