ICP_14 - PallaviArikatla/Big-Data-Programming GitHub Wiki

OBJECTIVE: Work with machine learning algorithms using Spark.

IMPLEMENTATION:

  • Perform the initial setup.

  • Upload all the given datasets.

Question 1:

  • Read the dataset required for the task and display its contents.
  • Create a DataFrame.

  • Convert the data in the dataset to integer type.
  • Print the schema.

  • Select a particular label column and print its schema.
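
The steps above can be sketched in PySpark roughly as follows. This is only a sketch, since the source does not show its code: the function name `load_as_integers` and its `path` and `label_col` arguments are hypothetical, and reading the file as CSV with a header row is an assumption.

```python
def load_as_integers(spark, path, label_col):
    """Read a CSV file, cast every column to integer, and print the schema.

    `spark` is an existing SparkSession. `path` and `label_col` are
    placeholders: the source names neither the file nor the label column.
    """
    # Imported here so the module still loads where PySpark is not installed.
    from pyspark.sql.functions import col

    # Assumption: the dataset is a CSV file with a header row.
    df = spark.read.csv(path, header=True, inferSchema=True)
    df.show(5)  # display the first few rows

    # Cast each column to integer; values that cannot be cast become null.
    df_int = df.select([col(c).cast("int").alias(c) for c in df.columns])
    df_int.printSchema()

    # Look at the chosen label column on its own.
    df_int.select(label_col).printSchema()
    return df_int
```

With an active `SparkSession` named `spark`, a call might look like `load_as_integers(spark, "data.csv", "label")`, where both the file name and the column name are placeholders.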

1) Naive Bayes.

  • Obtain the accuracy using the Naive Bayes algorithm.
  • Calculate the accuracy with smoothing set to 1.0.
  • Calculate the accuracy with smoothing set to 10.0.
  • Compare the two results.
  • Result: the accuracy is the same for both smoothing values.
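
One possible explanation for the identical accuracies, not confirmed by the source, is that smoothing changes the estimated probabilities without changing which class scores highest. The pure-Python sketch below illustrates this on made-up count data with a small multinomial Naive Bayes. All function and variable names are hypothetical, and class priors are left unsmoothed for brevity. In PySpark, the corresponding setting is the `smoothing` parameter of `NaiveBayes`.

```python
import math
from collections import defaultdict


def train_nb(rows, smoothing):
    """Fit multinomial Naive Bayes on (count_vector, label) pairs."""
    n_features = len(rows[0][0])
    feature_totals = defaultdict(lambda: [0.0] * n_features)
    class_counts = defaultdict(int)
    for x, y in rows:
        class_counts[y] += 1
        for j, value in enumerate(x):
            feature_totals[y][j] += value

    model = {}
    for y, totals in feature_totals.items():
        # Additive smoothing: add `smoothing` to every feature count.
        denom = sum(totals) + smoothing * n_features
        log_likelihood = [math.log((t + smoothing) / denom) for t in totals]
        # Class priors are left unsmoothed to keep the sketch short.
        log_prior = math.log(class_counts[y] / len(rows))
        model[y] = (log_prior, log_likelihood)
    return model


def predict(model, x):
    """Return the label with the highest log posterior score."""
    def score(y):
        log_prior, log_likelihood = model[y]
        return log_prior + sum(v * ll for v, ll in zip(x, log_likelihood))
    return max(model, key=score)


def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(model, x) == y for x, y in rows) / len(rows)


# Toy count data, made up for illustration (not the dataset from the source).
train = [([3, 0], 0), ([4, 1], 0), ([0, 3], 1), ([1, 4], 1)]
held_out = [([2, 0], 0), ([0, 2], 1), ([3, 1], 0)]

model_1 = train_nb(train, smoothing=1.0)
model_10 = train_nb(train, smoothing=10.0)
acc_1 = accuracy(model_1, held_out)
acc_10 = accuracy(model_10, held_out)
print(acc_1, acc_10)
```

On this toy data the per-feature likelihoods differ between the two models, yet every held-out row receives the same predicted label, so the accuracy does not move.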

2) Decision Tree.

  • Split the input DataFrame into training and test sets.
  • Train a Decision Tree classifier on the training set and apply it to the test set.
  • Calculate the accuracy.
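
A minimal PySpark sketch of these steps is shown below. It is not the author's code: the function name `decision_tree_accuracy`, the 70/30 split ratio, and the seed are assumptions, and `feature_cols` and `label_col` stand in for column names the source does not give.

```python
def decision_tree_accuracy(df, feature_cols, label_col, seed=42):
    """Train a decision tree on a random split of `df` and return test accuracy.

    Assumes `label_col` holds numeric class indices starting at 0.
    """
    # Imported here so the module still loads where PySpark is not installed.
    from pyspark.ml.classification import DecisionTreeClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler

    # Spark ML estimators expect all features in a single vector column.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    data = assembler.transform(df).select("features", label_col)

    # Assumption: a 70/30 split; the source does not state the ratio.
    train, test = data.randomSplit([0.7, 0.3], seed=seed)

    tree = DecisionTreeClassifier(labelCol=label_col, featuresCol="features")
    predictions = tree.fit(train).transform(test)

    evaluator = MulticlassClassificationEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="accuracy"
    )
    return evaluator.evaluate(predictions)
```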

3) Random Forest.

  • Apply the Random Forest algorithm.
  • Calculate the accuracy with the number of trees set to 10.

  • Calculate the accuracy with the number of trees set to 100.
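
The comparison between 10 and 100 trees could be sketched in PySpark as follows. The function name `random_forest_accuracies` is hypothetical, and the sketch assumes `train` and `test` are DataFrames that already contain a `features` vector column and a numeric label column.

```python
def random_forest_accuracies(train, test, label_col, tree_counts=(10, 100), seed=42):
    """Return {number_of_trees: test accuracy} for each value in `tree_counts`.

    Assumes `train` and `test` already have a "features" vector column.
    """
    # Imported here so the module still loads where PySpark is not installed.
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    evaluator = MulticlassClassificationEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="accuracy"
    )

    results = {}
    for n in tree_counts:
        forest = RandomForestClassifier(
            labelCol=label_col, featuresCol="features", numTrees=n, seed=seed
        )
        # Fit on the training set, then score the held-out test set.
        predictions = forest.fit(train).transform(test)
        results[n] = evaluator.evaluate(predictions)
    return results
```

Fixing the seed keeps the comparison between the two forest sizes repeatable, so any change in accuracy comes from `numTrees` rather than from a different random draw.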

Question 2: Clustering.

K-means algorithm.

  • Create a Spark DataFrame from the diabetes dataset.
  • Apply the K-means algorithm.
  • Assign the points to clusters and compute the cluster centers.
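
To show what "assigning points to clusters and computing centers" means, here is a small pure-Python sketch of the standard K-means iteration (Lloyd's algorithm) on made-up 2-D points. It is an illustration, not Spark's implementation: the initial centers are fixed by hand, and all names are hypothetical. In PySpark, the fitted `KMeansModel` exposes the centers through `clusterCenters()`.

```python
import math


def kmeans(points, centers, iterations=10):
    """Alternate between assigning points and recomputing centers."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)

        # Update step: each center moves to the mean of its assigned points.
        # An empty cluster keeps its previous center.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# Toy points, made up for illustration (not the diabetes dataset).
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, clusters = kmeans(points, centers=[points[0], points[3]])
print(centers)
```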

Question 3: Regression.

Linear Regression.

  • Create a DataFrame from the import-85 dataset.
  • Apply the Linear Regression algorithm.
  • Calculate the Root Mean Squared Error (RMSE) and R-squared (R2).
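
For reference, the two metrics can be computed by hand as in the pure-Python sketch below. The function names and the sample values are made up for illustration. In PySpark, the fitted model's `summary` exposes the same quantities as `rootMeanSquaredError` and `r2`.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2(y_true, y_pred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration (not results from the source).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(rmse(y_true, y_pred), r2(y_true, y_pred))
```

A lower RMSE means smaller typical errors in the units of the target, while an R2 closer to 1 means the model explains more of the variance in the target.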

Logistic Regression.

  • Apply the Logistic Regression algorithm.
  • Fit the model using the multinomial family.
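
As a rough illustration of what the multinomial family does, the pure-Python sketch below turns one linear score per class into class probabilities with a softmax. The weights, intercepts, and function names are hypothetical, not values from the source. In PySpark, this mode is selected with `LogisticRegression(family="multinomial")`.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(x, weights, intercepts):
    """One linear score per class, followed by a softmax."""
    scores = [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(weights, intercepts)
    ]
    return softmax(scores)


# Hypothetical model with 3 classes and 2 features (illustration only).
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
intercepts = [0.0, 0.0, 0.0]
probs = predict_proba([1.0, 2.0], weights, intercepts)
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```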