ICP 14 - awais546/Big-Data-Programming-Hadoop-Pyspark GitHub Wiki
Big Data Programming Hadoop/Pyspark
Spark MLlib
Introduction
MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives.
Tasks
1. Classification
Naïve Bayes
Use the following line of code to import the Naive Bayes classifier.
from pyspark.ml.classification import NaiveBayes
Use the following line of code to import the Decision Tree classifier.
from pyspark.ml.classification import DecisionTreeClassifier
Use the following line to import the Random Forest classifier.
from pyspark.ml.classification import RandomForestClassifier
To download the dataset, go to the following link.
https://archive.ics.uci.edu/ml/datasets/Adult
The results of the three classifiers are shown in the screenshot below.
The Decision Tree gives the best accuracy. The differences between these algorithms are given below.
Naive Bayes vs Decision Tree
The accuracies of the Decision Tree and Random Forest classifiers are almost equal.
When to Use Decision Tree?
- When you want your model to be simple and explainable.
- When you want a non-parametric model.
- When you don't want to worry about feature selection, regularization, or multi-collinearity.
- When you can afford to overfit the tree because you are sure the validation or test data set will be a subset of (or almost overlap with) the training data set, rather than unseen data.
When to use Random Forest?
- When you don't care much about interpreting the model but want better accuracy.
- A Random Forest reduces the variance part of the error rather than the bias part, so on a given training data set a Decision Tree may be more accurate than a Random Forest; but on unseen validation data, the Random Forest generally wins in terms of accuracy.
2. Clustering
K-Means Clustering
To use K-Means clustering, use the following line of code.
from pyspark.ml.clustering import KMeans
Use the following link to download the dataset.
https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008
The results of K-Means clustering are as follows.
The K-means algorithm identifies k centroids and then allocates every data point to the nearest cluster, while keeping the clusters as compact as possible.
3. Regression
- Linear regression
Simple linear regression is useful for finding the relationship between two continuous variables: one is the predictor (independent) variable and the other is the response (dependent) variable. It looks for a statistical relationship, not a deterministic one.
To import linear regression module use the following piece of code.
from pyspark.ml.regression import LinearRegression
To download the dataset use the following link.
https://archive.ics.uci.edu/ml/datasets/Automobile
The root mean square error is as follows.
- Logistic Regression
Logistic regression is a statistical model that, in its basic form, uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model (a form of binary regression).
Use the following line to import logistic regression.
from pyspark.ml.classification import LogisticRegression
The results are as follows.
Difference between classification and clustering
Classification is supervised learning: a model is trained on labeled examples and then predicts the class of new samples. Clustering is unsupervised learning: it groups unlabeled data points by similarity, without predefined classes.