Lab Assignment 4: Spark MLlib classification algorithms and word count on Twitter streaming

Team Id : 14

Member 1 : Ruthvic Punyamurtula

Class Id : 16

Member 2 : Shankar Pentyala

Class Id : 15

Source Code : https://github.com/Ruthvicp/CS5590_BigDataProgramming/tree/master/Lab/Lab4/Source

Video/Demo : https://youtu.be/42IJBhnslpk

Introduction

This lab assignment applies Spark MLlib classification algorithms to the given data set and runs a word count on Twitter streaming data.

Objective

1. Classification algorithms used: Decision Tree, Naive Bayes, Random Forest

Approach

Read the data set and convert the column data from string to float/double type. We perform the classification based on the columns "Month of Absence", "Day of the week", "Height", "Travel expenses", "Distance", and "Body Mass Index".

Workflow

We split the data into 70% training and 30% test sets. We then evaluate the model by predicting the results for the test set, calculate the confusion matrix from those predictions, and find the precision and recall values.
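The confusion matrix, precision, and recall mentioned above can be sketched in plain Python. This is a minimal illustration, not the lab's Spark code: the function names and the sample labels are hypothetical placeholders, not results from the data set.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def precision_recall(matrix, label):
    """Precision and recall for one label, read off the confusion matrix."""
    true_positives = matrix[(label, label)]
    predicted_as_label = sum(n for (a, p), n in matrix.items() if p == label)
    actually_label = sum(n for (a, p), n in matrix.items() if a == label)
    precision = true_positives / predicted_as_label if predicted_as_label else 0.0
    recall = true_positives / actually_label if actually_label else 0.0
    return precision, recall


# Hypothetical labels for illustration only (not taken from the lab output).
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 0, 1, 1]

matrix = confusion_matrix(actual, predicted)
precision, recall = precision_recall(matrix, label=1)
print(matrix)
print(precision, recall)
```

Precision divides the correctly predicted rows for a label by everything predicted as that label, while recall divides by everything that actually carries the label.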

1. Decision Tree

We create a VectorAssembler on the input data columns "label (Height)" and "Distance" and use the DecisionTreeClassifier on the indexed label and indexed features.
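A rough PySpark sketch of this step is given here. It is an assumption about how the pieces fit together, not the code from the lab repository: the function names, the output column names, and the seed are hypothetical, and the sketch indexes only the label (the lab text also mentions indexed features). The imports sit inside the functions so the file loads without Spark installed; the functions themselves need an active SparkSession.

```python
def build_decision_tree_pipeline(feature_cols, label_col):
    """Assemble feature columns, index the label, and attach a decision tree.

    Call this only after a SparkSession has been created, because PySpark ML
    stages need a running JVM.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import DecisionTreeClassifier
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    # Combine the numeric input columns into a single vector column.
    assembler = VectorAssembler(inputCols=list(feature_cols), outputCol="features")
    # handleInvalid="skip" drops test rows whose label was never seen in training.
    label_indexer = StringIndexer(
        inputCol=label_col, outputCol="indexedLabel", handleInvalid="skip"
    )
    tree = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features")
    return Pipeline(stages=[assembler, label_indexer, tree])


def train_and_score(df, feature_cols, label_col, seed=42):
    """Split 70/30, fit on the training part, return accuracy on the test part."""
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    train, test = df.randomSplit([0.7, 0.3], seed=seed)
    model = build_decision_tree_pipeline(feature_cols, label_col).fit(train)
    predictions = model.transform(test)
    evaluator = MulticlassClassificationEvaluator(
        labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy"
    )
    return evaluator.evaluate(predictions)
```

The DataFrame passed to train_and_score is expected to already hold numeric (double) columns, as described in the Approach section.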

2. Naive Bayes

For the same data set and columns, we apply Naive Bayes to predict absenteeism at work.

3. Random Forest

We use the RandomForestClassifier for the input columns Height and Distance and create a VectorAssembler on this indexed data.

Data set and Parameter

The input file for this can be found at https://github.com/Ruthvicp/CS5590_BigDataProgramming/raw/master/Lab/Lab4/Source/Absenteeism_at_work.csv and also at https://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work

Evaluation

We compare the accuracy and error calculated for each of the above classifiers and pick the best one based on the highest accuracy or the lowest error.
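The comparison can be sketched as follows, assuming that error is defined as 1 minus accuracy. The classifier names and predictions are hypothetical placeholders, not the lab's measurements.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def best_classifier(accuracies):
    """Name of the classifier with the highest accuracy."""
    return max(accuracies, key=accuracies.get)


# Placeholder data for illustration only.
actual = [0, 1, 1, 0, 1]
predictions = {
    "classifier_a": [0, 1, 1, 0, 0],
    "classifier_b": [1, 1, 0, 0, 0],
}

accuracies = {name: accuracy(actual, pred) for name, pred in predictions.items()}
# Assumption: error is the complement of accuracy.
errors = {name: 1.0 - acc for name, acc in accuracies.items()}
best = best_classifier(accuracies)
print(accuracies, errors, best)
```

Under that assumption, picking the highest accuracy and picking the lowest error select the same classifier.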

Output

The snapshot for the output for Decision Tree is given below

The snapshot for the output for Naive Bayes is given below

The snapshot for the output for Random Forest is given below

Conclusion

Based on accuracy, for the columns chosen, Decision Tree performs best, with the highest accuracy of 95% and the lowest error of 5%. The confusion matrix is also computed for all three classifiers and is shown in the output images above.

2. Word count on Twitter streaming data

Introduction

We create a socket on a host, bind it to an available port, and send the Twitter stream over it. Once the receiver connects to the same host and port, the data is sent across. On this data we perform the word count in a Spark streaming context to get the results.

Objectives

a) Create a Twitter streaming class that connects to Twitter using the auth credentials

b) Create a listening class that binds to the same host and port

c) Perform the word count on it

Approach

Bind the socket using "s.bind((host, port))". Accept the connection with "c, addr = s.accept()". Then send the data with "sendData(c)".

Inside on_data(): extract the tweet text and encode it before sending: self.client_socket.send(msg['text'].encode('utf-8'))
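The bind, accept, and send steps can be tried locally without Twitter. The sketch below is a stand-in under stated assumptions: the function names and the sample messages are hypothetical, a newline is appended to each message as a delimiter, and the sender serves a single client on 127.0.0.1.

```python
import socket
import threading


def serve_once(server_socket, messages):
    """Accept one client and send each message as UTF-8 bytes, one per line."""
    conn, _addr = server_socket.accept()
    with conn:
        for text in messages:
            conn.sendall((text + "\n").encode("utf-8"))


def receive_all(host, port):
    """Connect to the sender, read until it closes, and decode the bytes."""
    chunks = []
    with socket.create_connection((host, port)) as client:
        while True:
            data = client.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8").splitlines()


server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
server.listen(1)
host, port = server.getsockname()

sample = ["fifa world cup", "goal ⚽ fifa"]  # stand-in for tweet text
worker = threading.Thread(target=serve_once, args=(server, sample))
worker.start()
received = receive_all(host, port)
worker.join()
server.close()
print(received)
```

Decoding only after all chunks are joined avoids splitting a multi-byte UTF-8 character across two recv() calls.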

Workflow

Twitter Streaming class

Listening class and then the word count
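A minimal sketch of the listening side with the Spark Streaming DStream API is shown here. It is an assumption about the structure, not the lab's code: the function names, the application name, and the 5-second batch interval are hypothetical. The Spark imports are local to run_word_count so the helper functions can be used without Spark installed.

```python
def split_words(line):
    """Split one line of tweet text into whitespace-separated words."""
    return line.split()


def to_pair(word):
    """Map a word to a (word, 1) pair for counting."""
    return (word, 1)


def add_counts(a, b):
    """Combine two partial counts for the same word."""
    return a + b


def run_word_count(host, port, batch_seconds=5):
    """Read text lines from host:port and print word counts for each batch."""
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="TwitterWordCount")
    ssc = StreamingContext(sc, batch_seconds)
    lines = ssc.socketTextStream(host, port)
    counts = lines.flatMap(split_words).map(to_pair).reduceByKey(add_counts)
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
```

run_word_count blocks until the streaming context is stopped, so it would be started after the Twitter streaming class is already listening on the same host and port.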

Datasets & parameters

Set the authentication credentials given below

consumer_key
consumer_secret
access_token
access_secret

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

In send_data(), use the code below to get the Twitter stream. We filter the tweets on the keyword 'fifa'.

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
twitter_stream = Stream(auth, TweetsListener(c_socket))
twitter_stream.filter(track=['fifa'])

Evaluation

To get correct word count results, encode the data on the sending side and decode it on the receiving end. Set the window size and reduce the streaming context batch duration so that the word count is produced as quickly as possible.
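The encode and decode round trip can be illustrated in plain Python. This is a small sketch with hypothetical sample text and a hypothetical function name; it counts whitespace-separated words and does not use Spark.

```python
from collections import Counter


def count_words(encoded_messages):
    """Decode each UTF-8 message and count its whitespace-separated words."""
    counts = Counter()
    for raw in encoded_messages:
        counts.update(raw.decode("utf-8").split())
    return counts


# Sending side: text is encoded to bytes before it goes over the socket.
sent = [text.encode("utf-8") for text in ["fifa final tonight", "fifa café ⚽"]]

# Receiving side: bytes are decoded back to text before counting.
counts = count_words(sent)
print(counts.most_common(3))
```

Using the same encoding on both ends keeps non-ASCII words such as "café" intact in the counts.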

Output

Screenshot of running the Twitter streaming class:

Screenshot of the word count output on the tweets:

Conclusion

The word count for the 'fifa' tweets is computed from the Twitter stream by creating a streaming context in Spark. We performed the word count on 7237 tweets over a duration of 3 minutes.
