Lab Assignment 4: Spark MLlib classification algorithms and word count on Twitter streaming
Team Id : 14
Member 1 : Ruthvic Punyamurtula
Class Id : 16
Member 2 : Shankar Pentyala
Class Id : 15
Source Code : https://github.com/Ruthvicp/CS5590_BigDataProgramming/tree/master/Lab/Lab4/Source
Video/Demo : https://youtu.be/42IJBhnslpk
Introduction
This lab assignment applies Spark MLlib classification algorithms to the given data set and runs a word count on Twitter streaming data.
Objective
1. Classification algorithms used: Decision Tree, Naive Bayes, Random Forest
Approach
Read the data set and convert the column data from string to float/double type. We perform the classification based on the columns "Month of Absence", "Day of the week", "Height", "Travel expenses", "Distance", and "Body Mass Index".
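The string-to-float conversion described above can be illustrated with a minimal sketch. It uses plain Python's csv module rather than Spark so that it runs without a cluster. The sample rows, the ";" delimiter, and the helper name read_numeric_rows are assumptions made for illustration; the real file's delimiter and header names may differ.

```python
import csv
import io

# Hypothetical sample in the rough shape of the data set; these values,
# the header names, and the delimiter are assumptions for illustration.
SAMPLE = """Height;Distance;Body Mass Index
172;36;30
178;13;31
170;51;31
"""


def read_numeric_rows(text, columns, delimiter=";"):
    """Parse delimited text and cast the selected columns from str to float."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [{col: float(row[col]) for col in columns} for row in reader]


rows = read_numeric_rows(SAMPLE, ["Height", "Distance"])
print(rows[0])
```

In Spark the same step would be a cast on each DataFrame column; the sketch only shows the idea of selecting columns and converting them to a numeric type.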
Workflow
We split the data into 70% training and 30% test sets. We then evaluate the model by predicting the results for the test set, calculate the confusion matrix using the above columns, and find the precision and recall values.
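The workflow above (a 70/30 split, then a confusion matrix with precision and recall) can be sketched in plain Python. This is not the Spark code used in the lab; the function names, the fixed seed, and the toy labels and predictions are all assumptions chosen to keep the example small and runnable.

```python
import random
from collections import Counter


def split_70_30(rows, seed=42):
    """Shuffle a copy of rows and split it into 70% train and 30% test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def confusion_matrix(labels, predictions):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(labels, predictions))


def precision_recall(matrix, positive):
    """Precision and recall for one class, treating it as the positive class."""
    tp = matrix[(positive, positive)]
    fp = sum(n for (actual, pred), n in matrix.items()
             if pred == positive and actual != positive)
    fn = sum(n for (actual, pred), n in matrix.items()
             if actual == positive and pred != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


train, test = split_70_30(range(10))

# Toy labels and predictions (hypothetical, not results from the lab).
labels = [1, 1, 1, 0, 0, 1]
predictions = [1, 0, 1, 0, 1, 1]
matrix = confusion_matrix(labels, predictions)
precision, recall = precision_recall(matrix, positive=1)
print(matrix, precision, recall)
```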
1. Decision Tree
We create a vector assembler on the input data columns "label (Height)" and "Distance", and use the DecisionTreeClassifier on the indexed label and indexed features.
2. Naive Bayes
For the same data set and columns, we apply Naive Bayes to predict absenteeism at work.
3. Random Forest
We use the Random Forest classifier on the input columns Height and Distance, and create a vector assembler on this indexed data.
Data set and parameters
The input file for this can be found at https://github.com/Ruthvicp/CS5590_BigDataProgramming/raw/master/Lab/Lab4/Source/Absenteeism_at_work.csv and also at https://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work
Evaluation
We compare the accuracy and error of the three classifiers above and pick the best one, based on the highest accuracy or the lowest error.
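The comparison can be sketched as follows, taking error as one minus accuracy. The model names and predictions are placeholders, not the lab's results, and the accuracy helper is a hypothetical stand-in for the evaluator used in Spark.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


labels = [1, 0, 1, 1, 0]

# Placeholder predictions for three unnamed models (hypothetical values).
predictions_by_model = {
    "model_a": [1, 0, 1, 1, 0],
    "model_b": [1, 1, 0, 1, 0],
    "model_c": [1, 0, 1, 0, 0],
}

scores = {name: accuracy(labels, preds)
          for name, preds in predictions_by_model.items()}
errors = {name: 1.0 - acc for name, acc in scores.items()}

# The model with the highest accuracy also has the lowest error.
best = max(scores, key=scores.get)
print(best, scores[best], errors[best])
```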
Output
The snapshot for the output for Decision Tree is given below
The snapshot for the output for Naive Bayes is given below
The snapshot for the output for Random Forest is given below
Conclusion
Based on the accuracy for the chosen columns, Decision Tree performs best, with the highest accuracy of 95% and the lowest error of 5%. The confusion matrix is also plotted for all three classifiers and is shown in the output images above.
2. Word count on Twitter streaming data
Introduction
We create a streaming context on a host, bind it to an available port, and send the stream over it. Once the receiver starts listening on the same port, the data is sent across. We then perform the word count on this data.
Objectives
a) Create a Twitter streaming class that connects to Twitter using the auth credentials
b) Create a listening class that binds to the same host and port
c) Perform the word count on the received data
Approach
Bind the socket using "s.bind((host, port))". Accept the connection with "c, addr = s.accept()". Then send the data with "sendData(c)".
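The bind, accept, and send sequence above can be tried locally with a minimal sketch. A plain socket client stands in for the Spark receiver, and send_data pushes two fixed lines instead of live tweets. The helper names, the sample lines, and the use of port 0 (which lets the OS pick a free port) are assumptions for illustration.

```python
import socket
import threading


def send_data(conn):
    """Stand-in for sendData(c): push a few fixed lines to the client."""
    for line in ["hello fifa\n", "fifa world cup\n"]:
        conn.sendall(line.encode("utf-8"))  # sendall retries until all bytes are sent


def serve_once(server):
    """Accept a single connection, send the data, then close the connection."""
    conn, addr = server.accept()  # blocks until the receiver connects
    with conn:
        send_data(conn)


server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0: the OS chooses an available port
server.listen(1)
host, port = server.getsockname()

worker = threading.Thread(target=serve_once, args=(server,))
worker.start()

# Receiver side: connect to the same host and port and read until EOF.
chunks = []
with socket.create_connection((host, port)) as client:
    while True:
        data = client.recv(1024)
        if not data:
            break
        chunks.append(data)

worker.join()
server.close()
received = b"".join(chunks).decode("utf-8")
print(received)
```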
Inside on_data(), extract the tweet text and encode it before sending: self.client_socket.send(msg['text'].encode('utf-8'))
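The extract-and-encode step can be isolated into a small function, sketched below under the assumption that each incoming message is a JSON string that may or may not carry a 'text' field. The function name extract_tweet_bytes and the sample payloads are hypothetical; the actual listener sends the bytes over the client socket instead of returning them.

```python
import json


def extract_tweet_bytes(raw_data):
    """Return the UTF-8 encoded tweet text, or None if there is no 'text' field."""
    msg = json.loads(raw_data)
    text = msg.get("text")
    if text is None:
        return None  # some stream messages carry no tweet text
    return text.encode("utf-8")


# Hypothetical payloads in the shape of a stream message.
payload = json.dumps({"text": "Mbappé scores for France"})
encoded = extract_tweet_bytes(payload)
skipped = extract_tweet_bytes('{"delete": {}}')
print(encoded, skipped)
```

Using msg.get('text') instead of msg['text'] avoids a KeyError on messages without tweet text.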
Workflow
Twitter Streaming class
Listening class and then the word count
Datasets & parameters
Set the authentication credentials given below
- consumer_key
- consumer_secret
- access_token
- access_secret
The credentials are used in send_data() to authenticate and open the Twitter stream. The tweets are filtered on the keyword 'fifa':
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
twitter_stream = Stream(auth, TweetsListener(c_socket))
twitter_stream.filter(track=['fifa'])
Evaluation
To get proper word count results, encode the data on the sending side and decode it on the receiving end. Set the window size and reduce the streaming context duration so that the word count is produced as quickly as possible.
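The decode-then-count step and the effect of a window can be sketched in plain Python. This is a stand-in for the Spark Streaming window operations, not the lab's code: each inner list plays the role of one batch of received bytes, and the function names, sample batches, and window size are assumptions.

```python
from collections import Counter


def count_words(encoded_lines):
    """Decode each received line from UTF-8 and count whitespace-separated words."""
    counts = Counter()
    for raw in encoded_lines:
        counts.update(raw.decode("utf-8").split())
    return counts


def windowed_counts(batches, window_size):
    """Word counts over a sliding window covering the last window_size batches."""
    results = []
    for end in range(1, len(batches) + 1):
        window = batches[max(0, end - window_size):end]
        total = Counter()
        for batch in window:
            total += count_words(batch)
        results.append(total)
    return results


# Hypothetical batches of encoded tweet text, one list per batch interval.
batches = [[b"fifa world cup"], [b"fifa final"], [b"world final"]]
results = windowed_counts(batches, window_size=2)
print(results[-1])
```

A shorter batch interval gives results sooner, while a larger window aggregates counts over more batches.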
Output
Screenshot of running the Twitter streaming class:
Screenshot of the word count output on the tweets:
Conclusion
The word count for the 'fifa' tweets is computed from the Twitter stream by creating a streaming context in Spark. We counted words across 7237 tweets in 3 minutes.
References
- https://spark.apache.org/docs/latest/ml-decision-tree.html
- https://spark.apache.org/docs/2.2.0/mllib-naive-bayes.html
- https://weiminwang.blog/2016/06/09/pyspark-tutorial-building-a-random-forest-binary-classifier-on-unbalanced-dataset/
- https://stackoverflow.com/questions/43872281/pyspark-find-number-of-tweets-that-contain-a-word-hashtag