CASE3 - RoshiniVarada/BDP_Project2 GitHub Wiki

SPARK STREAMING

OBJECTIVE:

Perform Word-Count on Twitter Streaming Data.

INTRODUCTION:

Spark Streaming processes real-time data using operations such as map, reduce, and join.
In this project we use the MapReduce approach to perform a word count on streamed data.

IDEA OF THE PROJECT:

The idea of the project is to apply the concepts learned so far in Big Data Programming. Here we use the MapReduce method to calculate word frequency.

USAGE OF PROJECT IN REAL-TIME:

The Spark StreamingContext is used to process real-time data streams. In practice, this kind of streaming supports prediction, analysis, and other data-processing workloads.

IMPLEMENTATION:

Create a socket object using the local machine's IP address and a service-specific port number.
Bind the socket to the host and port.
Wait for the client to connect.
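
The socket steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `open_server` and `wait_for_client`, the address `127.0.0.1`, and the port `5555` are all assumptions chosen for the example.

```python
import socket


def open_server(host="127.0.0.1", port=5555):
    """Create a TCP socket, bind it to host:port, and start listening.

    The default host and port are placeholders, not the project's values.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow the same port to be reused quickly when the script is restarted.
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server


def wait_for_client(server):
    """Block until one client connects; return the connection and its address."""
    conn, addr = server.accept()
    return conn, addr
```

A typical use would be to call `open_server()` once, then `wait_for_client(server)`, which blocks until the Spark Streaming job connects to the same port.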

Create a Twitter developer account to get API access.
Get authorization and collect tweets on the desired topic.
Here I have extracted tweets on the topic "football".

Create a class that extracts only the text content from the collected tweet data.
Collect the data and store it on the instance (self).
Run the program; the status is displayed.
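
As a rough sketch of the text-extraction step, the snippet below pulls the `text` field out of a JSON-encoded tweet and forwards it over a socket. It assumes each tweet arrives as a JSON object with a `text` key. The names `extract_text` and `TweetForwarder` are hypothetical and stand in for the project's listener class, which is not reproduced on this page.

```python
import json


def extract_text(raw_tweet):
    """Return the 'text' field of one JSON-encoded tweet, or None if unavailable."""
    try:
        payload = json.loads(raw_tweet)
    except (TypeError, ValueError):
        return None  # not valid JSON
    if not isinstance(payload, dict):
        return None
    text = payload.get("text")
    return text if isinstance(text, str) else None


class TweetForwarder:
    """Hypothetical listener: keeps the client socket on self and sends tweet text."""

    def __init__(self, client_socket):
        self.client_socket = client_socket

    def on_data(self, raw_tweet):
        text = extract_text(raw_tweet)
        if text is None:
            return False
        # One tweet per line, so a line-based reader can split the stream.
        line = text.replace("\n", " ") + "\n"
        self.client_socket.sendall(line.encode("utf-8"))
        return True
```

Keeping the parsing in a separate function makes it possible to check the extraction logic without a live Twitter connection.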

OUTPUT:

Next, initialize the streaming part.
Use the same port number as before, so that the client receives the data from the server.
Use flatMap to split each line into words, with a space as the delimiter.
Using the map and reduce methods, count how often each word is repeated.
Store the resulting word frequencies and print them to obtain the output.
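
To make the flatMap, map, and reduce steps concrete, here is a plain-Python sketch of the same logic applied to one batch of lines. It is an illustration only, not the project's Spark code: in PySpark the corresponding DStream calls are `socketTextStream`, `flatMap`, `map`, `reduceByKey`, and `pprint`. The function name `word_count` is an assumption made for this example.

```python
from collections import Counter
from itertools import chain


def word_count(lines):
    """Count word frequency in one batch of text lines."""
    # flatMap step: split every line on spaces and flatten into one word sequence.
    words = chain.from_iterable(line.split(" ") for line in lines)
    # map step: pair each non-empty word with the count 1.
    pairs = ((word, 1) for word in words if word)
    # reduce step: sum the counts for each distinct word.
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return dict(counts)
```

For example, `word_count(["goal goal football", "football match"])` gives a count of 2 for "goal", 2 for "football", and 1 for "match".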

OUTPUT:

CHALLENGES FACED:

There were minor issues while writing tweets to the socket, as this step takes comparatively more time. Getting access from Twitter also takes time. Apart from these two issues, the entire execution went smoothly.

MILESTONES AND INTEGRATION OF THE PROJECT:

The work went smoothly and without obstacles. Splitting the work among the team made the tasks easy to carry out. Because no task depends on another, each member took one task and completed it.

TEAM MEMBERS AND CONTRIBUTION:

Roshini Varada -- Hadoop MapReduce Algorithm -- https://github.com/RoshiniVarada/BDP_Project2/wiki/CASE1
Sarika Reddy Kota -- Spark DataFrames -- https://github.com/RoshiniVarada/BDP_Project2/wiki/CASE2
Pallavi Arikatla -- Spark Streaming -- https://github.com/RoshiniVarada/BDP_Project2/wiki/CASE3
Zakari, Abdulmuhaymin -- Spark GraphX -- https://github.com/RoshiniVarada/BDP_Project2/wiki/CASE4

Video Link:

https://youtu.be/UtsiVZaijyg