ICP 14: Apache Spark MLIB - acikgozmehmet/BigDataProgramming GitHub Wiki
ICP 14: Apache Spark MLIB
Objectives
- Clustering
- Classification
- Regression
- Recommendation
Overview
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as:
- ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
- Featurization: feature extraction, transformation, dimensionality reduction, and selection
- Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
- Persistence: saving and loading algorithms, models, and Pipelines
- Utilities: linear algebra, statistics, data handling, etc.
Spark MLlib is Apache Spark’s Machine Learning component. One of the major attractions of Spark is the ability to scale computation massively, which is exactly what machine learning algorithms need. The limitation is that not all machine learning algorithms can be effectively parallelized. Each algorithm has its own challenges for parallelization, whether it is task parallelism or data parallelism.
Having said that, Spark is becoming the de facto platform for building machine learning algorithms and applications. The developers working on Spark MLlib are implementing more and more machine learning algorithms in a scalable and concise manner in the Spark framework. Through this blog, we will learn the concepts of Machine Learning, Spark MLlib, its utilities, algorithms, and a complete use case of a Movie Recommendation System.
Installation Requirements
- PySpark (Spark 2.1.0) is used.
In Class Programming
1. Classification:
Dataset: https://archive.ics.uci.edu/ml/datasets/Adult
We apply the following algorithms to the adult dataset given above.
- Naïve Bayes
- Decision Tree
- Random Forest
This is a classification problem. Column X in the dataset indicates whether a person makes more or less than 50K. The label is column X, while the features (predictors) are age, education-num, and hours-per-week.
Creating the Spark session, loading the data, and preparing the data are exactly the same for all three algorithms, so these steps are shown here only once.
# Imports shared by the classification examples
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
# Creating the Spark session
spark = SparkSession.builder.appName("DecisionTree App").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
# Loading the data
data = spark.read.format("csv").option("header", True) \
    .option("inferSchema", True) \
    .option("delimiter", ",") \
    .load("D:\\UMKC\\__Spring2020\\CS5590BDP\\Module-2\\Lesson-7\\MachineLearning\\data\\adult.data")
data.printSchema()
# Encoding the income column X as a numeric label (0 for <=50K, 1 for >50K)
data = data.withColumn("X", F.when(F.col("X") == ' <=50K', 0).when(F.col("X") == ' >50K', 1))
data = data.withColumnRenamed("X", "label")
data = data.select(data.label.cast("double"), "age", "education-num", "hours-per-week")
data.show()
# Assembling the predictor columns into a single "features" vector column
assembler = VectorAssembler(inputCols=data.columns[1:], outputCol="features")
data = assembler.transform(data)
data.show()
# Splitting the data into training and test sets
training, test = data.select("label", "features").randomSplit([0.70, 0.30])
a. Naïve Bayes
Naive Bayes is a simple multiclass classification algorithm that assumes independence between every pair of features. It can be trained very efficiently: in a single pass over the training data, it computes the conditional probability distribution of each feature given the label, then applies Bayes’ theorem to compute the conditional probability distribution of the label given an observation and uses it for prediction.
In the RDD-based API, NaiveBayes implements multinomial naive Bayes. It takes an RDD of LabeledPoint, an optional smoothing parameter lambda, and an optional model type parameter (default is “multinomial”) as input, and outputs a NaiveBayesModel, which can be used for evaluation and prediction. The PySpark code in this section uses the DataFrame-based NaiveBayes estimator instead, which is fit directly on the training DataFrame.
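To make the two steps described above concrete (one pass to collect per-label feature statistics, then Bayes' theorem at prediction time), here is a minimal plain-Python sketch of multinomial naive Bayes with additive smoothing. It is an illustration only, not MLlib's implementation; the function names `train_nb` and `predict_nb` and the toy rows are hypothetical.

```python
import math
from collections import defaultdict

def train_nb(rows, smoothing=1.0):
    """rows: list of (label, feature_counts). Returns (log priors, log likelihoods)."""
    n_features = len(rows[0][1])
    label_counts = defaultdict(int)
    feature_sums = defaultdict(lambda: [0.0] * n_features)
    # Single pass over the training data: count labels and sum features per label.
    for label, features in rows:
        label_counts[label] += 1
        for i, value in enumerate(features):
            feature_sums[label][i] += value
    total = len(rows)
    log_prior = {c: math.log(n / total) for c, n in label_counts.items()}
    log_likelihood = {}
    for c, sums in feature_sums.items():
        denom = sum(sums) + smoothing * n_features
        log_likelihood[c] = [math.log((s + smoothing) / denom) for s in sums]
    return log_prior, log_likelihood

def predict_nb(model, features):
    """Pick the label with the highest log posterior (up to a shared constant)."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(x * w for x, w in zip(features, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical toy data: two labels, two count-valued features.
rows = [(0.0, [3, 0]), (0.0, [2, 1]), (1.0, [0, 3]), (1.0, [1, 2])]
model = train_nb(rows)
prediction = predict_nb(model, [0, 4])
print(prediction)
```

Working in log space avoids numeric underflow when many feature probabilities are multiplied together.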
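from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics
# Create the Naive Bayes model and fit it to the training dataset
nb = NaiveBayes()
model = nb.fit(training)
# Generate predictions from the test dataset
pred = model.transform(test)
# Evaluate the accuracy of the model (the evaluator's default metric is F1, so accuracy is requested explicitly)
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
accuracy = evaluator.evaluate(pred)
# Show model accuracy
print("Accuracy:", accuracy)
# Report: precision, recall, and F-measure are computed for the positive class (label 1.0, i.e. >50K)
predAndLabels = pred.select("prediction", "label").rdd
metrics = MulticlassMetrics(predAndLabels)
print("Confusion Matrix", metrics.confusionMatrix())
print("Precision", metrics.precision(1.0))
print("Recall", metrics.recall(1.0))
print("F-measure", metrics.fMeasure(1.0))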
b. Decision Tree
Decision trees are widely used since they are easy to interpret, handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions. The decision tree is a greedy algorithm that performs a recursive binary partitioning of the feature space. The tree predicts the same label for each bottommost (leaf) partition. Each partition is chosen greedily by selecting the best split from a set of possible splits, in order to maximize the information gain at a tree node.
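The greedy split selection described above can be sketched in plain Python for a single numeric feature. This is a minimal illustration under stated assumptions, not Spark's implementation: it uses entropy as the impurity measure (Spark's `DecisionTreeClassifier` also supports Gini impurity through its `impurity` parameter), and the helper names and toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

def best_split(values, labels):
    """Try every candidate threshold and keep the one with the largest gain."""
    best = (None, 0.0)
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        gain = information_gain(labels, left, right)
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Hypothetical toy data: hours-per-week values and income labels.
hours = [20, 30, 40, 50, 60, 70]
labels = [0, 0, 0, 1, 1, 1]
threshold, gain = best_split(hours, labels)
print(threshold, gain)
```

A full tree repeats this search recursively on each child partition until a stopping rule, such as a maximum depth, is reached.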
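from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics
# Create the decision tree model and fit it to the training dataset
dt = DecisionTreeClassifier()
model = dt.fit(training)
# Predictions
pred = model.transform(test)
# Accuracy (the evaluator's default metric is F1, so accuracy is requested explicitly)
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
accuracy = evaluator.evaluate(pred)
print("Accuracy", accuracy)
# Report: precision, recall, and F-measure are computed for the positive class (label 1.0)
predAndLabels = pred.select("prediction", "label").rdd
metrics = MulticlassMetrics(predAndLabels)
print("Confusion Matrix", metrics.confusionMatrix())
print("Precision", metrics.precision(1.0))
print("Recall", metrics.recall(1.0))
print("F-measure", metrics.fMeasure(1.0))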
c. Random Forest
Random forests are a popular family of classification and regression methods. Random forests are ensembles of decision trees. Random forests combine many decision trees in order to reduce the risk of overfitting.
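As a rough illustration of how an ensemble combines its members, the sketch below takes a majority vote over three hand-written rules standing in for trained trees. It is a simplified picture under stated assumptions, not how MLlib trains or aggregates its trees; the rules, thresholds, and the `majority_vote` helper are hypothetical.

```python
from collections import Counter

def majority_vote(trees, features):
    """Return the class predicted by the largest number of trees."""
    votes = [tree(features) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical "trees", each a simple rule over
# (age, education_num, hours_per_week).
trees = [
    lambda f: 1 if f[1] >= 13 else 0,   # education-num rule
    lambda f: 1 if f[2] >= 45 else 0,   # hours-per-week rule
    lambda f: 1 if f[0] >= 40 else 0,   # age rule
]

prediction = majority_vote(trees, (35, 14, 50))
print(prediction)
```

Because each member errs in different places, the combined vote tends to be more stable than any single tree, which is the overfitting reduction mentioned above.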
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics
# Splitting the data into training and test sets
training, test = data.select("label", "features").randomSplit([0.70, 0.30])
# Create the Random Forest model and fit it to the training dataset
rf = RandomForestClassifier()
model = rf.fit(training)
# Generate predictions from the test dataset
pred = model.transform(test)
# Evaluate the accuracy of the model (the evaluator's default metric is F1, so accuracy is requested explicitly)
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
accuracy = evaluator.evaluate(pred)
# Show model accuracy
print("Accuracy:", accuracy)
# Report: precision, recall, and F-measure are computed for the positive class (label 1.0)
predictionAndLabels = pred.select("prediction", "label").rdd
metrics = MulticlassMetrics(predictionAndLabels)
print("Confusion Matrix:", metrics.confusionMatrix())
print("Precision:", metrics.precision(1.0))
print("Recall:", metrics.recall(1.0))
print("F-measure:", metrics.fMeasure(1.0))
2. Clustering:
Dataset: https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008
To group the diabetes patient records into clusters, the following features are used to train the model:
- admission_type_id
- discharge_disposition_id
- admission_source_id
- time_in_hospital
- num_lab_procedures
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
# Creating the Spark session
spark = SparkSession.builder.appName("KMeans App").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
# Loading the data
data = spark.read.format("csv").option("header", True).option("inferSchema", True).option("delimiter", ",") \
    .load("D:\\UMKC\\__Spring2020\\CS5590BDP\\Module-2\\Lesson-7\\MachineLearning\\data\\diabetic_data.csv")
data = data.select("admission_type_id", "discharge_disposition_id", "admission_source_id", "time_in_hospital", "num_lab_procedures")
data.show()
# Assembling the selected columns into a single "features" vector column
assembler = VectorAssembler(inputCols=data.columns, outputCol="features")
data = assembler.transform(data)
data.show()
# Train a k-means model with two clusters
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(data)
# Make predictions (assign each row to a cluster)
predictions = model.transform(data)
# Show the result
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
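Once the cluster centers are known, assigning a record to a cluster comes down to finding the nearest center. The plain-Python sketch below illustrates that step with squared Euclidean distance; the center values, the sample record, and the helper names are hypothetical and are not output from the model above.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_center(point, centers):
    """Index of the center closest to the point."""
    distances = [squared_distance(point, c) for c in centers]
    return distances.index(min(distances))

# Hypothetical centers over (admission_type_id, discharge_disposition_id,
# admission_source_id, time_in_hospital, num_lab_procedures).
centers = [
    [1.0, 1.0, 7.0, 3.0, 40.0],
    [2.0, 3.0, 5.0, 6.0, 60.0],
]
record = [1, 1, 7, 4, 45]
cluster = nearest_center(record, centers)
print(cluster)
```

Since the distance is computed on raw values, a wide-range feature such as num_lab_procedures can dominate the assignment; scaling the features before clustering is a common way to balance them.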
3. Regression:
Dataset : https://archive.ics.uci.edu/ml/datasets/Automobile
In order to predict the wheel-base, the following features are used to train the model:
- length
- width
- height
Creating the Spark session and loading the data
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
spark = SparkSession.builder.appName("LinearReg App").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
# Loading the data
data = spark.read.format("csv").option("header", True).option("inferSchema", True).option("delimiter", ",") \
    .load("D:\\UMKC\\__Spring2020\\CS5590BDP\\Module-2\\Lesson-7\\MachineLearning\\data\\imports-85.data")
data.printSchema()
a. Linear Regression
from pyspark.ml.regression import LinearRegression
# Use wheel-base as the label and keep only the predictor columns
data = data.withColumnRenamed("wheel-base", "label").select("label", "length", "width", "height")
data.show()
# Create the vector assembler for the feature columns
assembler = VectorAssembler(inputCols=data.columns[1:], outputCol="features")
data = assembler.transform(data)
data.show()
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
# Fit the model
model = lr.fit(data)
# Print the coefficients and intercept for linear regression
print("Coefficients: %s" % str(model.coefficients))
print("Intercept: %s" % str(model.intercept))
# Summarize the model over the training set and print out some metrics
trainingSummary = model.summary
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show()
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)
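For readers who want to see what the two summary metrics measure, the following plain-Python sketch computes RMSE and r2 from label and prediction pairs using the textbook definitions. The sample values and helper names are hypothetical, and this is an illustration rather than Spark's own code.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error: square root of the mean squared residual."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(labels))

def r2(labels, predictions):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

# Hypothetical wheel-base labels and model predictions.
labels = [94.5, 99.8, 105.8, 101.2]
predictions = [95.5, 99.8, 104.8, 101.2]

error = rmse(labels, predictions)
score = r2(labels, predictions)
print("RMSE: %f" % error)
print("r2: %f" % score)
```

RMSE is expressed in the same units as the label, while r2 is unitless and approaches 1 as the predictions explain more of the variance in the label.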
b. Logistic Regression
from pyspark.ml.classification import LogisticRegression
# Note: `data` here is the DataFrame as loaded above, before the linear regression transformations
# Label is 1 for four-door cars and 0 otherwise
data = data.withColumn("label", F.when(F.col("num-of-doors") == "four", 1).otherwise(0)).select("label", "length", "width", "height")
data.show()
# Create the vector assembler for the feature columns
assembler = VectorAssembler(inputCols=data.columns[1:], outputCol="features")
data = assembler.transform(data)
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
# Fit the model
model = lr.fit(data)
# Print the coefficients and intercept for logistic regression
print("Coefficients: " + str(model.coefficients))
print("Intercept: " + str(model.intercept))
# We can also use the multinomial family for binary classification
mlr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, family="multinomial")
# Fit the model
mlr_model = mlr.fit(data)
# Print the coefficients and intercepts for logistic regression with multinomial family
print("Multinomial coefficients: " + str(mlr_model.coefficientMatrix))
print("Multinomial intercepts: " + str(mlr_model.interceptVector))
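The printed coefficients and intercept of the binomial model can be turned into a prediction by hand. The sketch below applies the logistic (sigmoid) function to the linear margin and thresholds the result at 0.5; the coefficient values and the sample car are hypothetical, not output from the model above.

```python
import math

def predict_probability(coefficients, intercept, features):
    """Probability of label 1 under a binary logistic regression model."""
    margin = intercept + sum(w * x for w, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-margin))

# Hypothetical coefficients for (length, width, height) and a hypothetical intercept.
coefficients = [0.02, 0.0, 0.05]
intercept = -6.0
features = [176.6, 66.2, 54.3]

probability = predict_probability(coefficients, intercept, features)
label = 1 if probability > 0.5 else 0
print(probability, label)
```

A coefficient of exactly zero, as for width in this made-up example, is the kind of result that the L1 part of the elastic net penalty can produce when it removes a feature from the model.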
Bonus
The following tasks are carried out as part of the questions in the previous section; see the code for each question for details.
- Show confusion matrix for any machine learning algorithm
- Calculate Precision, Recall, and F1-score
- Inference on custom data for any algorithm of your own.
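The first two tasks can also be checked by hand. The sketch below, written in plain Python with hypothetical (prediction, label) pairs, counts the four confusion matrix cells for a binary problem and derives precision, recall, and F1 for the positive class. It is meant as a way to verify the numbers reported by Spark, not as a replacement for them.

```python
def confusion_counts(pairs, positive=1.0):
    """Count true/false positives and negatives from (prediction, label) pairs."""
    tp = fp = fn = tn = 0
    for prediction, label in pairs:
        if prediction == positive and label == positive:
            tp += 1
        elif prediction == positive:
            fp += 1
        elif label == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for the positive class (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical (prediction, label) pairs.
pairs = [(1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (0.0, 0.0), (1.0, 1.0), (0.0, 0.0)]
tp, fp, fn, tn = confusion_counts(pairs)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print("Confusion counts (TP, FP, FN, TN):", tp, fp, fn, tn)
print("Precision:", precision, "Recall:", recall, "F1:", f1)
```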