Lab Assignment 4 - sirisha1206/Spark GitHub Wiki
Name: Naga Sirisha Sunkara
Class ID: 21
Team ID: 5
Technical partner details:
Name: Vinay Santhosham
Class ID: 17
Objective:
The objective of this assignment is to compare machine learning algorithms (Naive Bayes, decision tree, and random forest) and to perform a word count on Twitter data in Apache Spark.
Task 1: Comparison of different machine learning algorithms
We used the immunotherapy dataset for the implementation.
We trained the models on columns such as age, area, and induration diameter, and predicted the output.
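The two train/test splits compared below (80/20 and 70/30) and the accuracy measure can be sketched in plain Python. This is only an illustration under stated assumptions: the rows are made-up placeholders rather than the immunotherapy data, and the helper names `split_rows` and `accuracy` are hypothetical. In Spark itself the split would typically be done with `randomSplit`, which yields approximately (not exactly) the requested proportions.

```python
import random


def split_rows(rows, train_fraction, seed=42):
    """Shuffle the rows and split them into (train, test) by train_fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("no labels to evaluate")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical rows: (age, area, induration_diameter, label). Not real data.
rows = [(20 + i, 10 * i, 5 + i % 7, i % 2) for i in range(10)]

train_80, test_20 = split_rows(rows, 0.8)  # 80% training, 20% testing
train_70, test_30 = split_rows(rows, 0.7)  # 70% training, 30% testing
```

Fixing the seed keeps the split reproducible, so the three algorithms can be compared on the same training and testing rows.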
Dataset:
Naive Bayes Algorithm:
Output for 80% training and 20% testing data:
Output for 70% training data and 30% testing data:
Decision Tree Algorithm:
Output for 80% training and 20% testing data:
Output for 70% training data and 30% testing data:
Random Forest Algorithm:
Output for 80% training and 20% testing data:
Output for 70% training data and 30% testing data:
Observations:
Among the machine learning algorithms we tried on the immunotherapy dataset, the decision tree algorithm gave the best accuracy, 0.69. Depending on the implementation, decision trees can handle missing values much like any other value of a variable. They also run fast even with many observations and variables, and tree-based methods can be applied to supervised learning and, in some variants, to unsupervised learning.
Task 2: Word count of twitter data in Apache Spark
1. To get the consumer key, consumer secret, access token, and access token secret, log in to apps.twitter.com and create an application; the access keys and tokens are listed there.
2. Using the Python tweepy module, we collected Twitter data on different emotions such as sad, happy, and joyful.
3. Save the tweets into a text file and load it into HDFS.
4. Perform the map and reduce steps for the word count and save the results into a data file.
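The map and reduce steps of the word count can be sketched in plain Python as follows. This is a minimal illustration, not the code used in the lab: the sample tweets are invented, the function names are hypothetical, and lowercasing the words is an assumption. In Spark, the same idea is usually expressed with `flatMap` to split lines into words, `map` to emit `(word, 1)` pairs, and `reduceByKey` to sum the counts.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated word."""
    # Lowercasing is an assumption so that "Happy" and "happy" count together.
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(pairs):
    """Reduce step: sum the counts for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return reduce_counts(pairs)


# Hypothetical sample tweets, one per line, standing in for the collected file.
tweets = ["so happy today", "Happy and joyful", "feeling sad today"]
counts = word_count(tweets)
```

Here `counts` maps each word to the number of times it occurs across all tweets; writing that mapping out line by line would give the data file mentioned in step 4.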