Lab Assignment 4

Name: Naga Sirisha Sunkara

Class ID: 21

Team ID: 5

Technical partner's details:

Name: Vinay Santhosham

Class ID: 17

Source Code Link:

Video Link:

Objective:

The objective of this assignment is to compare different machine learning algorithms, namely Naive Bayes, Decision Tree, and Random Forest, and to perform a word count on Twitter data in Apache Spark.

Task 1: Comparison of different machine learning algorithms

For this implementation we used the Immunotherapy dataset.

We trained the models on features such as age, area, and induration diameter, and predicted the result of treatment.
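
A minimal PySpark sketch of the data preparation (the file name and column names are assumptions based on the UCI Immunotherapy dataset, not necessarily the exact code used):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("ImmunotherapyComparison").getOrCreate()

# File name and column names are assumptions; adjust to the actual dataset.
data = spark.read.csv("Immunotherapy.csv", header=True, inferSchema=True)

# Assemble the assumed feature columns into a single vector column.
assembler = VectorAssembler(
    inputCols=["age", "area", "induration_diameter"],
    outputCol="features")
dataset = (assembler.transform(data)
           .withColumn("label", col("Result_of_Treatment").cast("double"))
           .select("features", "label"))

# 80/20 split; use [0.7, 0.3] for the 70/30 experiment.
train, test = dataset.randomSplit([0.8, 0.2], seed=42)
```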

Dataset:

Naive Bayes Algorithm:
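
A sketch of training and evaluating Naive Bayes on the split prepared above:

```python
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Multinomial Naive Bayes requires non-negative features, which holds
# for age, area, and induration diameter.
nb = NaiveBayes(featuresCol="features", labelCol="label")
predictions = nb.fit(train).transform(test)

evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
print("Naive Bayes accuracy:", evaluator.evaluate(predictions))
```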

Output for 80% training and 20% testing data:

Output for 70% training data and 30% testing data:

Decision Tree Algorithm:
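
A corresponding sketch for the decision tree, reusing the same split and evaluator:

```python
from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(featuresCol="features", labelCol="label")
predictions = dt.fit(train).transform(test)
print("Decision Tree accuracy:", evaluator.evaluate(predictions))
```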

Output for 80% training and 20% testing data:

Output for 70% training data and 30% testing data:

Random Forest Algorithm:
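
And for the random forest (numTrees=10 is an assumed setting, not necessarily the one used):

```python
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(featuresCol="features", labelCol="label",
                            numTrees=10)
predictions = rf.fit(train).transform(test)
print("Random Forest accuracy:", evaluator.evaluate(predictions))
```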

Output for 80% training and 20% testing data:

Output for 70% training data and 30% testing data:

Observations:

Among the different machine learning algorithms tried on the Immunotherapy dataset, the Decision Tree algorithm gave the best accuracy, 0.69. Decision trees handle missing values as easily as any normal value of a variable, they run fast even with many observations and variables, and tree-based methods can be used for both supervised and unsupervised learning.

Task 2: Word count of Twitter data in Apache Spark

1. To get the consumer key, consumer secret, access token, and access token secret, log in to apps.twitter.com, create an application, and find the access keys and tokens on the application's page.

2. Using the Python tweepy module, we collected Twitter data on different emotions such as sad, happy, and joyful (a sketch is shown under "Python code for collecting Twitter data" below).

3. Save the tweets into a text file and load it into HDFS.

4. Perform the map and reduce steps for the word count and save the results to an output directory (see the Spark shell sketch below).

Generating keys and tokens in Twitter:

Python code for collecting Twitter data:
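
A minimal sketch of such a collection script (the credentials are placeholders, and tweets.txt is an assumed file name):

```python
import tweepy

# Placeholder credentials from apps.twitter.com (see step 1 above).
consumer_key = "..."
consumer_secret = "..."
access_token = "..."
access_token_secret = "..."

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Collect tweets for a few emotion keywords and save them to a text file.
with open("tweets.txt", "w", encoding="utf-8") as f:
    for emotion in ["sad", "happy", "joyful"]:
        # `api.search` in tweepy 3.x; renamed to `api.search_tweets` in 4.x.
        for tweet in tweepy.Cursor(api.search, q=emotion, lang="en").items(100):
            f.write(tweet.text.replace("\n", " ") + "\n")
```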

Tweets collected:

Steps to be performed in the Spark shell:
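
A minimal word count sketch in the PySpark shell, assuming the tweets were uploaded to an HDFS path like /user/cloudera/tweets.txt (the path is an assumption):

```python
# In the PySpark shell the SparkContext is already available as `sc`.
text = sc.textFile("hdfs:///user/cloudera/tweets.txt")  # assumed path

# Map each word to (word, 1), then reduce by key to sum the counts.
counts = (text.flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("hdfs:///user/cloudera/wordcount_output")
```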

Commands for loading the dataset into HDFS and retrieving the word count output directory:
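
Assuming the same paths as above (illustrative, not necessarily the exact ones used), the commands might look like:

```
hdfs dfs -put tweets.txt /user/cloudera/tweets.txt     # load the tweets into HDFS
hdfs dfs -ls /user/cloudera/wordcount_output           # list the output directory
hdfs dfs -cat /user/cloudera/wordcount_output/part-*   # view the word counts
```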

Word count output: