Module 2: Lab #2
Team: 12
Professor: Yugyung Lee
Name: Sneha Mishra
Class ID: 11
Email: [email protected]
MyGitHub
Technical Partner:
Name: Aditya Soman
Class ID: 19
Email: [email protected]
GitHub
YouTube Link explaining the Lab work can be found here
The report for the Lab work is here
The source code for this lab work can be found here
The available datasets formats can be found here
Objective
Understand the Spark classification, Spark Streaming, and Spark GraphX tasks.
Features
- Use classification algorithms such as Naïve Bayes, Decision Tree, and Random Forest for attribute classification.
- Report the confusion matrix and the accuracy based on F-measure, precision, and recall for all the algorithms.
- Explain why one of the algorithms outperforms the rest.
- Perform Word-Count on Twitter streaming data using Spark.
- Perform PageRank on the given dataset.
- State the importance of using GraphX on the chosen dataset.
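The evaluation metrics named in the list above can be sketched in plain Python. This is an illustration only, not the Spark code used in the lab: the function names and the label lists are made up for the example, and the numbers are not the lab's results.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Build a matrix whose rows are actual labels and columns are predicted labels."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

def precision_recall_f1(matrix, index):
    """Per-class precision, recall and F-measure for the class at `index`."""
    tp = matrix[index][index]
    predicted_total = sum(row[index] for row in matrix)  # column sum
    actual_total = sum(matrix[index])                    # row sum
    precision = tp / predicted_total if predicted_total else 0.0
    recall = tp / actual_total if actual_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0

# Made-up labels for illustration (assumption, not lab data).
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]
labels = [0, 1]

cm = confusion_matrix(actual, predicted, labels)
print(cm)
print(accuracy(cm))
print(precision_recall_f1(cm, 1))
```

Precision and recall are computed per class here; averaging them across classes is one common way to get a single figure for comparing classifiers.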
Steps:
Part 1: Spark Classification Task
This task applies three algorithms:
1. Naïve Bayes:
It is a classification technique based on Bayes’ theorem. The Naïve Bayes classifier assumes that, within a class, the presence of a particular feature is unrelated to the presence of any other feature.
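To make the independence assumption concrete, here is a minimal plain-Python sketch of a categorical Naïve Bayes classifier with Laplace smoothing. It is not the Spark implementation used in the lab; the function names and the toy weather rows are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (label, feature index) -> value counts
    vocab = defaultdict(set)             # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab

def predict(model, row):
    """Return the class with the highest log-posterior score."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        # log P(class) + sum over features of log P(value | class);
        # the sum is where the independence assumption is applied.
        score = math.log(n / total)
        for i, value in enumerate(row):
            count = value_counts[(label, i)][value]
            score += math.log((count + 1) / (n + len(vocab[i])))  # Laplace smoothing
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data (assumption, not one of the lab datasets).
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_naive_bayes(rows, labels)
print(predict(model, ("rainy", "cool")))
print(predict(model, ("sunny", "hot")))
```

Because each feature contributes its own term to the score, the model needs only per-feature counts, which is what keeps training cheap.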
Code for the Algorithm: (Split is 60% - 40%)
Output after running the Algorithm:
2. Decision Tree:
Code for the Algorithm: (Split is 70% - 30%)
Output after running the Algorithm:
3. Random Forest:
Code for the Algorithm: (Split is 70% - 30%)
Output after running the Algorithm:
State the reasons why one of the algorithms outperforms the rest:
Comparing the classification accuracy of Naïve Bayes, Random Forest, and Decision Tree, the results indicate that Decision Tree has the highest average accuracy, but the difference is not statistically significant.
Part 2: Spark Streaming Task
In this task we perform Word-Count on Twitter Streaming Data using Spark. First we get the Twitter data and then we stream it.
Collecting Tweets code:
Output of Collecting Tweets code:
Stream Twitter data code:
Here we used a 5-second window to scan the tweets.
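The idea of counting words per 5-second window can be sketched without Spark as follows. This is a plain-Python illustration that assumes non-overlapping windows and timestamped input; the function name and the sample tweets are hypothetical and do not come from the lab's collected data.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 5  # matches the 5-second window described above

def windowed_word_count(events, window=WINDOW_SECONDS):
    """Group (timestamp, text) events into fixed windows and count words in each."""
    windows = defaultdict(Counter)
    for timestamp, text in events:
        # Start time of the window this event falls into.
        window_start = int(timestamp // window) * window
        windows[window_start].update(text.lower().split())
    return dict(windows)

# Hypothetical tweets with timestamps in seconds (assumption).
events = [
    (0.5, "Spark streaming is fun"),
    (3.2, "spark makes streaming easy"),
    (6.0, "GraphX runs on Spark"),
]

counts = windowed_word_count(events)
for start in sorted(counts):
    print(start, counts[start].most_common(3))
```

In a streaming job the same grouping happens continuously as data arrives, rather than over a finished list as it does here.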
Output of the Twitter Streaming data:
Part 3: Spark GraphX Task
In this task we perform PageRank.
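As a rough illustration of what PageRank computes, the sketch below runs the textbook power-iteration formulation in plain Python. It is not the GraphX call used in the lab: the function name, the damping factor of 0.85, the iteration count, and the four-node edge list are all assumptions, and the absolute values may differ from GraphX's output because the ranks here are normalized to sum to 1.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, destination) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node gets a base share from the random-jump term.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # A node with no outgoing links spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

# Hypothetical graph (assumption): a cycle a -> b -> c -> a, plus d -> a.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
ranks = pagerank(edges)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 4))
```

Node `a` ends up ranked highest because it receives links from both `c` and `d`, while `d` has no incoming links and keeps only the base share.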
References:
- https://www.linkedin.com/pulse/apache-spark-streaming-twitter-python-laurent-weichberger/
- https://github.com/stefanobaghino/spark-twitter-stream-example
- https://www.researchgate.net/publication/318056374_Comparison_of_Naive_Bayes_Random_Forest_Decision_Tree_Support_Vector_Machines_and_Logistic_Regression_Classifiers_for_Text_Reviews_Classification
Data-sets provided:
- Absenteeism at work: https://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work
- Immunotherapy Dataset: https://archive.ics.uci.edu/ml/datasets/Immunotherapy+Dataset
- Nashville-meetup Dataset: https://www.kaggle.com/stkbailey/nashville-meetup
- Word Game Dataset: https://www.kaggle.com/anneloes/wordgame
- Cyber Crime Motive: https://www.kaggle.com/sunilkumarsv/indiacybercrimestats2013