Lab 2: Spark Streaming Task

Table of Contents:

  • Objectives
  • Overview
  • Count Hashtags in a Twitter Stream Using Windows
  • Implementation
  • References


Overview

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards.

Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
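To make the micro-batch model concrete, here is a minimal word-count sketch using the classic DStream API. The localhost:9999 socket and the 5-second batch interval are illustrative assumptions, not part of this lab's code:

```python
# Minimal sketch of the micro-batch model with the classic DStream API.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 5)  # divide the live stream into 5-second batches

# Each batch of lines read from the socket becomes an RDD in the DStream.
lines = ssc.socketTextStream("localhost", 9999)  # placeholder host/port
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print the per-batch word counts to the console

ssc.start()
ssc.awaitTermination()
```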

Objective:

  • Perform Word-Count on Twitter Streaming Data using Spark

Count Hashtags in a Twitter Stream Using Windows

We can implement the word-count algorithm on Twitter messages with either a static or a dynamic approach. The static approach is to process all the tweets and perform the word count from the moment the Spark application started. However, this approach is not the most accurate one, since Twitter messages are time-bounded and follow trends. The other approach is to use window operations to find the trends in Twitter messages. Here we implement the second approach.

First of all, we connect to a socket where Twitter messages are streamed in. To do that, we need to create a Twitter Developer account and obtain credentials to use the streaming API. We then use code built on the tweepy library to stream the Twitter messages to the console, filtering for English-language tweets only.

Please feel free to click on the link to reach the code.
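The linked code is the authoritative version; purely as an illustration, a streamer of that shape could look like the sketch below, assuming the tweepy 3.x StreamListener API. The credentials, tracked keyword, host, and port are all placeholders:

```python
# Hypothetical sketch: forward English-language tweets to a local socket
# that Spark can read from. Assumes tweepy 3.x; credentials and the
# host/port are placeholders.
import json
import socket

from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

CONSUMER_KEY = "..."      # placeholder Twitter Developer credentials
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_SECRET = "..."


class TweetListener(StreamListener):
    """Forwards the text of each incoming tweet to a connected socket."""

    def __init__(self, client_socket):
        super().__init__()
        self.client_socket = client_socket

    def on_data(self, raw_data):
        try:
            tweet = json.loads(raw_data)
            self.client_socket.send((tweet["text"] + "\n").encode("utf-8"))
        except Exception as e:
            print("Error:", e)
        return True

    def on_error(self, status_code):
        print("Stream error:", status_code)
        return False  # stop the stream on errors


if __name__ == "__main__":
    # Wait for Spark to connect, then start streaming tweets to it.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("localhost", 5555))  # placeholder host/port
    server.listen(1)
    conn, _ = server.accept()

    auth = OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    stream = Stream(auth, TweetListener(conn))
    # Keep only English-language tweets; the tracked keyword is a placeholder.
    stream.filter(track=["spark"], languages=["en"])
```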

On the Spark side, we connect to the socket where the Twitter messages are streamed in. I would like to emphasize that we are also interested in the timestamp of each Twitter message, since we will apply window operations to the messages.
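With PySpark Structured Streaming, that connection might be set up as in the sketch below. The localhost:5555 address matches the hypothetical streamer above, and includeTimestamp attaches the arrival timestamp of each line, which the window operations need:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TwitterHashtagCount").getOrCreate()

# Read raw tweet lines from the socket. includeTimestamp adds a
# `timestamp` column recording when each line arrived.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")      # placeholder host/port from the
         .option("port", 5555)             # hypothetical streamer above
         .option("includeTimestamp", True)
         .load())
# Resulting schema: value (string), timestamp (timestamp)
```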

To count the hashtags, we tokenize every Twitter message into words. All words that do not start with a hashtag are marked as nonTag. Then we add these tags to the dataframe. Finally, we perform the following transformations to get the hashtag counts using window operations (a sketch follows the list):

  • We filter out all rows that contain nonTag words.
  • We group by window and count the hashtags within each window interval. We use a window size of 50 seconds and a sliding interval of 30 seconds, so the counts are refreshed every 30 seconds over the most recent 50 seconds of tweets.
  • Finally, within every window we group by tag, count the frequency of every hashtag, and sort the counts in descending order.
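Putting those three steps together, a sketch of the windowed aggregation, building on the hypothetical lines dataframe above, could be:

```python
from pyspark.sql.functions import col, desc, explode, split, when, window

# Tokenize each tweet into words; words not starting with '#' become "nonTag".
words = lines.select(
    explode(split(col("value"), " ")).alias("word"),
    col("timestamp"),
)
tags = words.withColumn(
    "tag",
    when(col("word").startswith("#"), col("word")).otherwise("nonTag"),
)

hashtag_counts = (
    tags
    .filter(col("tag") != "nonTag")               # step 1: drop nonTag rows
    .groupBy(window(col("timestamp"),             # step 2: 50-second window,
                    "50 seconds", "30 seconds"),  #         sliding every 30 s
             col("tag"))
    .count()
    .orderBy(desc("window"), desc("count"))       # step 3: most frequent first
)

# Print every window's hashtag counts to the console as they update.
query = (hashtag_counts.writeStream
         .outputMode("complete")
         .format("console")
         .option("truncate", False)
         .start())
query.awaitTermination()
```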

Please feel free to click on the link to reach the code.

Implementation:

References: