ICP_14 - PallaviArikatla/Big-Data-Programming GitHub Wiki
OBJECTIVE: Working with machine learning algorithms using Spark.
IMPLEMENTATION:
- Make initial setup.
- Upload all the given datasets.
Question 1:
- Read the dataset required for the task and display its contents.
- Create a dataframe.
- Convert the data in the dataset to integer type.
- Print the schema.
- Consider a particular label and print the schema.
1) Naive Bayes.
- Obtain accuracy using the Naive Bayes algorithm.
- Calculate accuracy with a smoothing value of 1.0.
- Calculate accuracy with a smoothing value of 10.0.
- Compare the two results: there is no difference in the accuracy values obtained.
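Why two smoothing values can give the same accuracy is easier to see in a small sketch. The following is plain Python, not the Spark API, and the toy count data and function names (`train_nb`, `predict_nb`, `accuracy`) are assumptions made for illustration. Additive smoothing changes the estimated probabilities, but accuracy only depends on which class scores highest, so it can stay the same.

```python
import math
from collections import defaultdict


def train_nb(rows, labels, smoothing):
    """Fit a multinomial Naive Bayes model with additive smoothing.

    rows: list of non-negative integer feature-count vectors.
    labels: class label for each row.
    Returns (log_priors, log_likelihoods), both keyed by label.
    """
    n_features = len(rows[0])
    class_counts = defaultdict(int)
    feature_sums = defaultdict(lambda: [0] * n_features)
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        for j, value in enumerate(row):
            feature_sums[label][j] += value

    log_priors, log_likelihoods = {}, {}
    for label, count in class_counts.items():
        # Priors are left unsmoothed here to keep the sketch short.
        log_priors[label] = math.log(count / len(rows))
        total = sum(feature_sums[label]) + smoothing * n_features
        log_likelihoods[label] = [
            math.log((s + smoothing) / total) for s in feature_sums[label]
        ]
    return log_priors, log_likelihoods


def predict_nb(model, row):
    """Return the label with the highest log posterior score."""
    log_priors, log_likelihoods = model

    def score(label):
        return log_priors[label] + sum(
            v * lp for v, lp in zip(row, log_likelihoods[label])
        )

    return max(log_priors, key=score)


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical toy data: two classes, three count features.
rows = [[3, 0, 1], [2, 0, 0], [4, 1, 0], [0, 3, 1], [0, 2, 2], [1, 4, 0]]
labels = [0, 0, 0, 1, 1, 1]

results = {}
for smoothing in (1.0, 10.0):
    model = train_nb(rows, labels, smoothing)
    # Evaluated on the training rows only, to keep the example tiny.
    predictions = [predict_nb(model, row) for row in rows]
    results[smoothing] = accuracy(predictions, labels)
    print(smoothing, results[smoothing])
```

On this toy data both smoothing values classify every row correctly even though the underlying likelihood estimates differ. On a dataset where the classes are less clearly separated, the two settings could give different accuracies.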
2) Decision Tree.
- Split the data into train and test sets.
- Apply the Decision Tree algorithm to the train and test sets created from the input dataframe.
- Calculate and obtain accuracy.
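The split, fit, and score steps above can be sketched in plain Python. This is not the Spark implementation: it fits a depth-1 tree (a decision stump) on a single feature using Gini impurity, and the data, the 70/30 split ratio, the seed, and all function names are assumptions chosen for illustration.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows with a fixed seed and cut it in two."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def majority(labels):
    """Most frequent label in the list."""
    return max(set(labels), key=labels.count)


def fit_stump(train):
    """Pick the threshold that minimises the weighted Gini impurity.

    train: list of (feature_value, label) pairs.
    Returns (threshold, left_label, right_label).
    """
    values = sorted({x for x, _ in train})
    best = None
    # Candidate thresholds are midpoints between consecutive values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in train if x <= threshold]
        right = [y for x, y in train if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(train)
        if best is None or weighted < best[0]:
            best = (weighted, threshold, majority(left), majority(right))
    return best[1:]


def predict_stump(stump, x):
    threshold, left_label, right_label = stump
    return left_label if x <= threshold else right_label


# Hypothetical data: class 0 has small values, class 1 has large values.
data = [(float(x), 0) for x in range(10)] + [(float(x), 1) for x in range(20, 30)]
train, test = train_test_split(data)
stump = fit_stump(train)
correct = sum(predict_stump(stump, x) == y for x, y in test)
test_accuracy = correct / len(test)
print(len(train), len(test), test_accuracy)
```

Accuracy is measured on the held-out test rows only, which is why the split comes before fitting.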
3) Random Forest.
- Apply the Random Forest algorithm.
- Calculate accuracy with the number of trees set to 10 initially.
- Calculate accuracy again with the number of trees set to 100.
Question 2: Clustering.
K-means algorithm.
- Create a Spark dataframe from the diabetes dataset.
- Apply the K-means algorithm.
- Assign the data points to clusters and calculate the cluster centers.
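The assign-then-recompute loop behind the cluster centers can be sketched as follows. This is a minimal plain-Python version of Lloyd's algorithm, not the Spark implementation. The points are made-up values rather than rows from the diabetes dataset, and the naive "first k points" initialisation is an assumption made to keep the example deterministic.

```python
import math


def closest(point, centers):
    """Index of the center nearest to the point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans(points, k, iterations=20):
    """Run Lloyd's algorithm and return (centers, assignments)."""
    # Naive initialisation: use the first k points as starting centers.
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: group each point with its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest(p, centers)].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    assignments = [closest(p, centers) for p in points]
    return centers, assignments


# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centers, assignments = kmeans(points, k=2)
print(centers)
print(assignments)
```

Each center ends up at the mean of the points assigned to it, which is what the reported cluster centers represent.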
Question 3: Regression.
Linear Regression.
- Create a dataframe from the import-85 dataset.
- Apply the Linear Regression algorithm.
- Calculate the Root Mean Square Error (RMSE) and R-squared (R2).
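For reference, the two metrics can be computed by hand as in the sketch below. It is plain Python with made-up actual and predicted values, not output from the import-85 model, and the function names are assumptions.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: square root of the mean squared residual."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


def r_squared(actual, predicted):
    """R2: 1 minus residual sum of squares over total sum of squares."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical values for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]

error = rmse(actual, predicted)
fit = r_squared(actual, predicted)
print(error, fit)
```

RMSE is in the same units as the target, so lower is better. R2 has no units, and values closer to 1 mean the model explains more of the variance in the target.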
Logistic Regression.
- Apply the Logistic Regression algorithm.
- Initialize the model in its multinomial form.
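As background for the multinomial case, a multinomial logistic regression model turns one linear score per class into class probabilities with the softmax function. The sketch below shows only that prediction step in plain Python. The weights, intercepts, and feature values are hypothetical, not coefficients fitted in this exercise.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(features, weights, intercepts):
    """Class probabilities from one weight vector and intercept per class."""
    scores = [
        sum(w * x for w, x in zip(class_weights, features)) + b
        for class_weights, b in zip(weights, intercepts)
    ]
    return softmax(scores)


# Hypothetical three-class model with two features.
weights = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.1]]
intercepts = [0.0, 0.0, 0.0]
features = [1.0, 2.0]

probabilities = predict_proba(features, weights, intercepts)
predicted_class = probabilities.index(max(probabilities))
print(probabilities, predicted_class)
```

The predicted class is the one with the highest probability. With only two classes, this reduces to the usual sigmoid form of logistic regression.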