ICP_14 - PallaviArikatla/Big-Data-Programming GitHub Wiki

OBJECTIVE: Apply machine learning algorithms using Apache Spark.

IMPLEMENTATION:

Complete the initial Spark setup.

Upload all the given datasets.

Question 1:

Read the dataset required for the task and display its contents.
Create a DataFrame.

Cast the columns of the dataset to integer type.
Print the schema.

Select a particular column as the label and print the schema.
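The casting step can be sketched in plain Python, without Spark. This is only an illustration: the sample data and the column names (`age`, `hours`, `label`) are hypothetical, since the page does not name the dataset or its columns. In PySpark the same idea is usually expressed with a column `cast("int")` followed by `printSchema()`.

```python
import csv
import io

# Hypothetical sample data; the real dataset and its column names
# are not given on this page.
SAMPLE_CSV = "age,hours,label\n39,40,0\n50,13,1\n38,40,0\n"


def read_as_int_rows(text):
    """Parse CSV text and cast every value to int."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: int(value) for name, value in row.items()} for row in reader]


def schema(rows):
    """Return (column, type name) pairs, similar in spirit to a printed schema."""
    return [(name, type(value).__name__) for name, value in rows[0].items()]


rows = read_as_int_rows(SAMPLE_CSV)
for name, type_name in schema(rows):
    print(f"{name}: {type_name}")
```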

1) Naive Bayes.

Obtain accuracy using the Naive Bayes algorithm.
Calculate accuracy with the smoothing value set to 1.0.

Calculate accuracy with the smoothing value set to 10.0.
Compare the two results.

The accuracy is the same for both smoothing values, so changing the smoothing from 1.0 to 10.0 has no effect on this dataset.
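What the smoothing value does can be shown with a small multinomial Naive Bayes written in plain Python. This is a sketch, not the Spark implementation: the toy data and the function names `train_nb`, `predict_nb` and `accuracy` are made up for illustration. In Spark ML the corresponding setting is the `smoothing` parameter of `NaiveBayes`.

```python
import math
from collections import defaultdict


def train_nb(samples, labels, smoothing=1.0):
    """Fit multinomial Naive Bayes on count vectors with additive smoothing."""
    n_features = len(samples[0])
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: [0.0] * n_features)
    for x, y in zip(samples, labels):
        class_counts[y] += 1
        for j, value in enumerate(x):
            feature_counts[y][j] += value
    model = {}
    for y, n_y in class_counts.items():
        total = sum(feature_counts[y])
        log_prior = math.log(n_y / len(labels))
        # Smoothing is added to every feature count, so no probability is zero.
        log_likelihood = [
            math.log((feature_counts[y][j] + smoothing)
                     / (total + smoothing * n_features))
            for j in range(n_features)
        ]
        model[y] = (log_prior, log_likelihood)
    return model


def predict_nb(model, x):
    """Return the class with the highest log posterior score."""
    def score(y):
        log_prior, log_likelihood = model[y]
        return log_prior + sum(v * ll for v, ll in zip(x, log_likelihood))
    return max(model, key=score)


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical toy data: four count vectors, two classes.
X = [[3, 0, 1], [2, 0, 0], [0, 3, 1], [0, 2, 2]]
y = [0, 0, 1, 1]

results = {}
for s in (1.0, 10.0):
    model = train_nb(X, y, smoothing=s)
    results[s] = accuracy([predict_nb(model, x) for x in X], y)
    print(s, results[s])
```

On this toy data the two smoothing values also give the same accuracy. Smoothing mostly matters for features with zero or very small counts, so when the classes are clearly separated a larger value may not change any prediction.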

2) Decision Tree.

Split the input DataFrame into training and test sets.
Train the Decision Tree algorithm on the training set and apply it to the test set.
Calculate the accuracy on the test set.
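The split step and the core of a decision tree (choosing the split with the lowest Gini impurity) can be sketched in plain Python. The 70/30 ratio, the seed and the toy rows are assumptions, not values from this page. In PySpark the split is typically done with `DataFrame.randomSplit`.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows):
    """Return (weighted Gini, feature index, threshold) of the best split.

    Each row is a (features, label) pair.
    """
    best = None
    for j in range(len(rows[0][0])):
        for threshold in sorted({x[j] for x, _ in rows}):
            left = [y for x, y in rows if x[j] <= threshold]
            right = [y for x, y in rows if x[j] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, j, threshold)
    return best


# Hypothetical toy rows: one feature, two classes.
data = [([1], 0), ([2], 0), ([3], 1), ([4], 1)]
print(best_split(data))
```

A full decision tree repeats `best_split` recursively on the left and right parts until a stopping rule is met.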

3) Random Forest.

Apply the Random Forest algorithm.
Calculate accuracy with the number of trees set to 10 initially.

Then calculate accuracy with the number of trees set to 100.
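The role of the number of trees can be illustrated with a very small forest in plain Python. This is a simplified sketch under stated assumptions: each "tree" is only a depth-1 stump trained on a bootstrap sample with one randomly chosen feature, and the toy rows and function names are hypothetical. In Spark ML the tree count is the `numTrees` parameter of `RandomForestClassifier`.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, feature):
    """Depth-1 tree on one feature: pick the threshold with fewest training errors."""
    best = None
    for threshold in sorted({x[feature] for x, _ in rows}):
        left = [y for x, y in rows if x[feature] <= threshold]
        right = [y for x, y in rows if x[feature] > threshold]
        left_label = majority(left)
        right_label = majority(right) if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label


def fit_forest(rows, num_trees=10, seed=0):
    """Train num_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for _ in range(num_trees):
        sample = [rng.choice(rows) for _ in rows]  # sample with replacement
        feature = rng.randrange(n_features)        # random feature for this tree
        forest.append(fit_stump(sample, feature))
    return forest


def predict_forest(forest, x):
    """Majority vote over all trees."""
    votes = [left if x[f] <= t else right for f, t, left, right in forest]
    return majority(votes)


# Hypothetical toy rows: two features, two classes.
rows = [([1, 10], 0), ([2, 11], 0), ([3, 12], 0),
        ([7, 20], 1), ([8, 21], 1), ([9, 22], 1)]

small_forest = fit_forest(rows, num_trees=10)
large_forest = fit_forest(rows, num_trees=100)
print(predict_forest(small_forest, [0, 9]), predict_forest(large_forest, [10, 30]))
```

With more trees the vote averages over more bootstrap samples, which usually makes the prediction more stable at the cost of longer training.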

Question 2: Clustering.

K-means algorithm.

Create a Spark DataFrame from the diabetes dataset.
Apply the K-means algorithm.
Assign the rows to clusters and calculate the cluster centers.
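The two K-means steps, assigning points to the nearest center and recomputing the centers, can be sketched in plain Python. The points, `k`, the iteration count and the seed are assumptions chosen for illustration and are not taken from the diabetes dataset. In Spark ML the fitted `KMeansModel` exposes the centers through `clusterCenters()`.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans(points, k, iterations=20, seed=1):
    """Lloyd's algorithm: returns (centers, assignments)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignments


# Hypothetical toy points forming two well separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0), (9.0, 9.0)]

centers, assignments = kmeans(points, k=2)
print(centers)
print(assignments)
```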

Question 3: Regression.

Linear Regression.

Create a DataFrame from the import-85 dataset.
Apply the Linear Regression algorithm.
Calculate the root mean square error (RMSE) and R-squared (R2).
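The two metrics can be computed by hand as follows. This is a plain-Python sketch with a single-feature least-squares fit. The numbers are hypothetical and are not taken from the import-85 dataset. In Spark ML the same metrics are available from `RegressionEvaluator` with `metricName` set to `"rmse"` or `"r2"`.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 8.0, 9.0]

slope, intercept = fit_line(xs, ys)
predicted = [slope * x + intercept for x in xs]
print(slope, intercept)
print(rmse(ys, predicted), r_squared(ys, predicted))
```

A lower RMSE means smaller prediction errors in the units of the target, while an R2 close to 1 means the model explains most of the variance in the target.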

Logistic Regression.

Apply the Logistic Regression algorithm.
Use the multinomial option (multinomial logistic regression).
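Multinomial logistic regression turns one linear score per class into probabilities with the softmax function. The sketch below shows only that prediction step in plain Python. The weights, biases and input are hypothetical values, not a trained model. In Spark ML this mode is presumably what the step above refers to, selected with `family="multinomial"` on `LogisticRegression`.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, biases, x):
    """One linear score per class, followed by softmax."""
    scores = [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


# Hypothetical parameters for three classes and two features.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]

probabilities = predict_proba(weights, biases, [2.0, 0.5])
predicted_class = probabilities.index(max(probabilities))
print(probabilities, predicted_class)
```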