ICP 14
PROBLEM STATEMENT
1. Classification:
   a. Naïve Bayes
   b. Decision Tree
   c. Random Forest
2. Clustering:
   a. KMeans
3. Regression:
   a. Linear Regression
   b. Logistic Regression
FEATURES
For this in-class programming, PySpark, the Python API for Apache Spark, is used. Apache Spark offers a machine learning library called MLlib, and PySpark exposes this machine learning API in Python as well. In this in-class programming, MLlib is used to run various machine learning algorithms.
CONFIGURATIONS
- Python (version 2.7) has been installed and its environment variables are set.
- Spark is installed and its environment variables have been configured.
- winutils is installed and set up.
- Finally, pip is installed and all the necessary packages are configured in the PyCharm IDE.
APPROACH
1. NAIVE BAYES
This algorithm is based on the Naive Bayes theorem. A dataset was already given on which Naive Bayes was to be performed. Here the adult.csv file is taken and an RDD of labeled points is created from it; this is given as input and a Naive Bayes model is generated as output. The snippet below shows how the code works, followed by the output.
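A minimal sketch of these steps with pyspark.ml is shown below; the file path, the choice of numeric feature columns (hours-per-week, education-num, capital-gain), and the 70/30 split are assumptions, not necessarily the exact values used in the original snippet.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("ICP14").getOrCreate()

# Load adult.csv (path and column names are assumptions)
data = spark.read.csv("adult.csv", header=True, inferSchema=True)

# Rename the age field to "label" and keep a few numeric columns as features
df = data.withColumnRenamed("age", "label") \
         .select("label", "hours-per-week", "education-num", "capital-gain")
assembler = VectorAssembler(
    inputCols=["hours-per-week", "education-num", "capital-gain"],
    outputCol="features")
df = assembler.transform(df).select("label", "features")

# Split the data for training and testing, fit the model, and compute accuracy
train, test = df.randomSplit([0.7, 0.3], seed=42)
model = NaiveBayes(smoothing=1.0, modelType="multinomial").fit(train)
predictions = model.transform(test)
accuracy = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction",
    metricName="accuracy").evaluate(predictions)
print("Naive Bayes test accuracy = %g" % accuracy)
```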
As can be seen from the above code, age, one of the fields in the CSV file, is renamed as label, and particular columns such as hours-per-week are selected as features. In the next step, the data is split for both training and testing. Finally, the model is fit and the accuracy is generated.
As can be seen, the accuracy is generated as the final output.
2. DECISION TREES
Decision trees are greedy algorithms that perform recursive binary partitioning. Each partition is chosen greedily by selecting the best split from a set of possible splits, in order to maximize the information gain at a tree node. As mentioned above, the labeled data is given as input, a DecisionTreeClassifier is run, and the accuracy is generated as output.
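A sketch of the same flow with a DecisionTreeClassifier is given below, reusing the (label, features) DataFrame built above; the maxDepth value and the 70/30 split are assumptions.

```python
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Reuse the (label, features) DataFrame prepared for Naive Bayes
train, test = df.randomSplit([0.7, 0.3], seed=42)
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=5)
model = dt.fit(train)

# Evaluate the accuracy on the held-out split
predictions = model.transform(test)
accuracy = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction",
    metricName="accuracy").evaluate(predictions)
print("Decision tree test accuracy = %g" % accuracy)
```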
3. RANDOM FOREST
Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (for classification) or the mean prediction (for regression) of the individual trees.
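The code follows the same pattern as the other classifiers; the sketch below assumes the same (label, features) training and test DataFrames as above, and the numTrees value is an assumption.

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Train an ensemble of decision trees (numTrees is an assumed value)
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=20)
model = rf.fit(train)

# Predictions are the majority vote of the individual trees
predictions = model.transform(test)
accuracy = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction",
    metricName="accuracy").evaluate(predictions)
print("Random forest test accuracy = %g" % accuracy)
```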
CLUSTERING
K-MEANS
k-means is one of the most commonly used clustering algorithms; it clusters the data points into a predefined number of clusters. As can be seen from the code below, a CSV file on diabetes is loaded and an RDD is created from it. A vector is created from the feature columns, training is done, and the model is fitted. Finally, predictions are made and the cluster centers are displayed as shown.
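A minimal sketch of the described steps is given below; the file name diabetes.csv, the feature column names, and k=3 are assumptions.

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Load the diabetes CSV file (file name and column names are assumptions)
data = spark.read.csv("diabetes.csv", header=True, inferSchema=True)

# Assemble the feature columns into a single vector column
assembler = VectorAssembler(
    inputCols=["Glucose", "BloodPressure", "BMI", "Age"],
    outputCol="features")
dataset = assembler.transform(data)

# Train a KMeans model with a predefined number of clusters
kmeans = KMeans(k=3, seed=1)
model = kmeans.fit(dataset)

# Make predictions and display the cluster centers
predictions = model.transform(dataset)
print("Cluster centers:")
for center in model.clusterCenters():
    print(center)
```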
REGRESSION
LINEAR REGRESSION
In this code, linear regression is performed on a given CSV file. The maximum number of iterations is set (10 in this case) and a transformation is performed on the data. The data is fit and the LinearRegression model is run. The generated output includes the number of iterations that took place, the residuals, and the RMSE (root mean squared error).
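A sketch of this flow is shown below; the file name and the feature column names x1 and x2 are assumptions, while the training summary exposes the iteration count, residuals and RMSE mentioned above.

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Load the regression CSV file (file name and column names are assumptions)
data = spark.read.csv("regression_data.csv", header=True, inferSchema=True)
df = VectorAssembler(inputCols=["x1", "x2"], outputCol="features") \
    .transform(data).select("features", "label")

# Fit a linear regression model with a maximum of 10 iterations
lr = LinearRegression(maxIter=10)
model = lr.fit(df)

# Training summary: number of iterations, residuals and RMSE
summary = model.summary
print("Number of iterations: %d" % summary.totalIterations)
summary.residuals.show()
print("RMSE: %f" % summary.rootMeanSquaredError)
```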
LOGISTIC REGRESSION
Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. With respect to the code, LogisticRegression is fit on the data, which is taken from a CSV file that was already given. The output of this code includes the coefficients and intercepts, along with the multinomial coefficients and multinomial intercepts.
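A sketch along these lines is shown below; it assumes a DataFrame train_df with a binary label column and a features vector column (a hypothetical name), and the regularization parameters are assumptions.

```python
from pyspark.ml.classification import LogisticRegression

# Binomial logistic regression (train_df with binary "label" and "features"
# columns is assumed; the regularization values are assumptions)
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(train_df)
print("Coefficients: " + str(lr_model.coefficients))
print("Intercept: " + str(lr_model.intercept))

# Multinomial logistic regression on the same data
mlr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8,
                         family="multinomial")
mlr_model = mlr.fit(train_df)
print("Multinomial coefficients: " + str(mlr_model.coefficientMatrix))
print("Multinomial intercepts: " + str(mlr_model.interceptVector))
```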