ICP 14 - manaswinivedula/Big-Data-Programming GitHub Wiki
Task 1 - Classification
Naive Bayes
- It is a classification technique based on Bayes' Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
- Initially, the dataset is read, a few columns are label encoded, and their names are changed.
- Then the input columns are selected and the data is split into training and testing sets in a 70:30 ratio.
- The model is trained on the training data with the Naive Bayes algorithm, and the test data is evaluated against the trained model.
- Finally, the essential metrics are calculated.
The following is the source code for Naive Bayes
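Since the code itself is not reproduced on this page, the steps above can be sketched as follows. This is an illustrative sketch assuming a scikit-learn implementation (the course code may instead use Spark MLlib), and the built-in iris dataset stands in for the wiki's data:

```python
# Illustrative Naive Bayes pipeline; dataset and class names are stand-ins.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report

X, y = load_iris(return_X_y=True)

# Label-encode string class names, mirroring the label-encoding step above.
names = np.array(["setosa", "versicolor", "virginica"])[y]
y_enc = LabelEncoder().fit_transform(names)

# 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_enc, test_size=0.30, random_state=42)

# Train Naive Bayes on the training data, then predict on the held-out data.
model = GaussianNB().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Essential metrics.
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```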
The following are the schemas and the features that are considered for all the classification models.
The following are the calculated metrics for the Naive Bayes classification.
Decision Tree
- The Decision Tree algorithm belongs to the family of supervised learning algorithms. Unlike many other supervised learning algorithms, it can be used for solving both regression and classification problems.
- Initially, the dataset is read, a few columns are label encoded, and their names are changed.
- Then the input columns are selected and the data is split into training and testing sets in a 70:30 ratio.
- The model is trained on the training data with the Decision Tree algorithm, and the test data is evaluated against the trained model.
- Finally, the essential metrics are calculated.
The following is the source code for the Decision tree.
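As the code is not reproduced here, a minimal sketch of the same pipeline, assuming scikit-learn (the iris dataset is a stand-in for the wiki's data):

```python
# Illustrative Decision Tree pipeline; the dataset is a stand-in.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)   # 70:30 split

# Train the tree on the training data, then test on the held-out data.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = tree.predict(X_test)

# Essential metrics.
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```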
The following are the calculated metrics for the Decision tree.
Random Forest
- Random Forest is a supervised learning algorithm. The "forest" it builds is an ensemble of decision trees, usually trained with the "bagging" method. The general idea of bagging is that a combination of learning models improves the overall result.
- Initially, the dataset is read, a few columns are label encoded, and their names are changed.
- Then the input columns are selected and the data is split into training and testing sets in a 70:30 ratio.
- The model is trained on the training data with the Random Forest algorithm, and the test data is evaluated against the trained model.
- Finally, the essential metrics are calculated.
The following is the source code for the Random Forest.
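A minimal sketch of the Random Forest pipeline described above, again assuming scikit-learn with a stand-in dataset:

```python
# Illustrative Random Forest pipeline; the dataset is a stand-in.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)   # 70:30 split

# An ensemble of decision trees trained via bagging (bootstrap sampling).
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

# Essential metrics.
print("Accuracy:", accuracy_score(y_test, y_pred))
```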
The following are the calculated metrics for the Random Forest.
Clustering
K Means
- The K-Means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
- Initially, the dataset is read, a few columns are label encoded, and their names are changed.
- Then the input columns are selected and the data is split into training and testing sets in a 70:30 ratio.
- The K-Means model is fit on the training data, and the test data is evaluated against the trained model.
- Finally, the essential metrics are calculated.
The following is the source code for the K Means.
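A minimal sketch of the K-Means steps above, assuming scikit-learn; k=3 and the iris features are stand-ins for the wiki's actual setup:

```python
# Illustrative K-Means pipeline; k and the dataset are stand-ins.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)
X_train, X_test = train_test_split(X, test_size=0.30, random_state=42)  # 70:30

# Fit K-Means on the training data, then assign test points to clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
test_clusters = kmeans.predict(X_test)

# The calculated cluster centers (one row per cluster).
print("Centers:\n", kmeans.cluster_centers_)
```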
The following are the schemas and the features that are considered for all the clustering model.
The following are the calculated centers for the K Means.
Regression
Linear regression
- Linear Regression is a supervised machine learning algorithm where the predicted output is continuous and has a constant slope. It is used to predict values within a continuous range (e.g. sales, price) rather than trying to classify them into categories.
- Initially, the dataset is read, a few columns are label encoded, and their names are changed.
- Then the input columns are selected and the data is split into training and testing sets in a 70:30 ratio.
- The model is trained on the training data with the Linear Regression algorithm, and the test data is evaluated against the trained model.
- Finally, the essential metrics are calculated.
The following is the source code for the Linear regression.
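A minimal sketch of the Linear Regression pipeline above, assuming scikit-learn; the built-in diabetes dataset stands in for the wiki's data:

```python
# Illustrative Linear Regression pipeline; the dataset is a stand-in.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)   # 70:30 split

# Fit the regression line on the training data, predict on the test data.
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

# Essential metrics for a continuous target.
print("R^2:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
```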
The following are the schemas and the features that are considered for all the Regression models.
The following is the output of Linear Regression.
Logistic Regression
- Logistic Regression is a machine learning algorithm used for classification problems; it is a predictive analysis algorithm based on the concept of probability.
- Initially, the dataset is read, a few columns are label encoded, and their names are changed.
- Then the input columns are selected and the data is split into training and testing sets in a 70:30 ratio.
- The model is trained on the training data with the Logistic Regression algorithm, and the test data is evaluated against the trained model.
- Finally, the essential metrics are calculated.
The following is the source code for the logistic regression.
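A minimal sketch of the Logistic Regression pipeline above, assuming scikit-learn; the built-in breast-cancer dataset stands in for the wiki's data:

```python
# Illustrative Logistic Regression pipeline; the dataset is a stand-in.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)   # 70:30 split

# max_iter raised so the solver converges on unscaled features.
clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Essential metrics.
print("Accuracy:", accuracy_score(y_test, y_pred))
```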
The following is the output of Logistic Regression.