Module 2: Lab #2 - SnehaMishra28/BigData_Programming_Summer2018 GitHub Wiki

Team: 12
Professor: Yugyung Lee

Name: Sneha Mishra
Class ID: 11
Email: [email protected]
MyGitHub

Technical Partner:
Name: Aditya Soman
Class ID: 19
Email: [email protected]
GitHub

YouTube Link explaining the Lab work can be found here
The report for the Lab work is here
The source code for this lab work can be found here
The available datasets formats can be found here

Objective

Understand the Spark classification, Spark Streaming, and Spark GraphX tasks.

Features

Use classification algorithms (Naïve Bayes, Decision Tree, and Random Forest) for attribute classification.
Report the confusion matrix and the accuracy based on F-measure, precision, and recall for all the algorithms.
Explain why one of the algorithms outperforms the rest.
Perform word count on Twitter streaming data using Spark.
Perform PageRank on the given dataset.
State the importance of using GraphX on the chosen dataset.
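The evaluation measures named in the features above (confusion matrix, accuracy, precision, recall, and F-measure) can be sketched in plain Python. This is only an illustration of how the numbers relate to each other, not the Spark code used in the lab. The `actual` and `predicted` label lists are made-up toy values, and the function names are hypothetical.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """matrix[i][j] = number of samples whose true label is labels[i]
    and whose predicted label is labels[j]."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0

def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall, f_measure)} for each class."""
    metrics = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fp = sum(row[i] for row in matrix) - tp   # predicted as label, but wrong
        fn = sum(matrix[i]) - tp                  # true label missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f_measure)
    return metrics

# Toy labels, invented for illustration only (not lab results).
labels = ["yes", "no"]
actual = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
predicted = ["yes", "no", "yes", "no", "no", "yes", "no", "yes"]

matrix = confusion_matrix(actual, predicted, labels)
metrics = per_class_metrics(matrix, labels)
print(matrix)
print(accuracy(matrix))
print(metrics)
```

Rows of the matrix are true classes and columns are predicted classes, so precision is read down a column and recall across a row.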

Steps:

Part 1: Spark Classification Task

This task involves three algorithms:

1. Naïve Bayes:

Naïve Bayes is a classification technique based on Bayes’ theorem. The classifier assumes that, given the class, the presence of a particular feature is unrelated to the presence of any other feature.
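To make the independence assumption concrete, here is a minimal sketch of a categorical Naïve Bayes classifier in plain Python. It is not the Spark implementation used in this lab. The function names, the add-one (Laplace) smoothing, and the toy weather-style rows are all assumptions made for illustration. Because each feature is treated as independent given the class, the class score is the log prior plus a sum of per-feature log likelihoods.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (label, feature index) -> value counts
    feature_values = defaultdict(set)      # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
            feature_values[i].add(value)
    return class_counts, feature_counts, feature_values

def predict(model, row):
    """Return the class with the highest log posterior score."""
    class_counts, feature_counts, feature_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior P(class)
        for i, value in enumerate(row):
            seen = feature_counts[(label, i)][value]
            # Independence assumption: each feature contributes its own
            # likelihood term; add-one smoothing avoids log(0).
            score += math.log((seen + 1) / (count + len(feature_values[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data, invented for illustration only.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("sunny", "cool"), ("rainy", "hot")]
labels = ["no", "no", "yes", "yes", "yes", "no"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("rainy", "cool")))
print(predict(model, ("sunny", "hot")))
```

Summing log probabilities instead of multiplying raw probabilities keeps the score numerically stable when there are many features.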

Code for the Algorithm: (Split is 60% - 40%)

Output after running the Algorithm:

2. Decision Tree:

Code for the Algorithm: (Split is 70% - 30%)

Output after running the Algorithm:

3. Random Forest:

Code for the Algorithm: (Split is 70% - 30%)

Output after running the Algorithm:

State the reasons why one of the algorithms outperforms the rest:

Comparing the classification accuracy of Naïve Bayes, Random Forest, and Decision Tree, the results indicate that Decision Tree achieved the highest average accuracy of the three, but the difference is not statistically significant.