Project Increment 1 - acikgozmehmet/BigDataProgramming GitHub Wiki

Twitter Analysis on Trending Tweeters

Dataset

Twitter makes public Tweets and replies available to developers, and allows developers to post Tweets via its API. Developers can access Tweets by searching for specific keywords or by requesting a sample of Tweets from specific accounts. These endpoints can easily be used to identify, understand and counter misinformation around public health initiatives.

We collected tweets on the topic “coronavirus” and saved them to files for further analysis with Hive and the MapReduce framework.

The data we collected on the "coronavirus" pandemic is huge, so processing it in a traditional manner does not seem feasible. We therefore decided to perform our analysis on the Hadoop Distributed File System (HDFS) to help us better understand the problem.

Detail Design of Features

MapReduce Framework:

One of the questions we had in mind was to find out which people tweet more than others. Identifying the most active Twitter users requires a framework that can handle a huge data set, so we employed a Java-based MapReduce job to perform the analysis needed to find the trends among tweeters.

A tweet object contains a user object with information about the owner of the tweet. We created a design that finds the number of tweets per user within a time interval. To do this, we used the java-json library in addition to the default Cloudera Hadoop distributed file system.

Algorithm

So, in order to find the top tweeters in a given snapshot, we need to:

  1. Process all tweets and parse out the "user.id_str" token from each.
  2. Count the occurrences of each 'user.id_str'.
  3. Sort the counts to find the top n 'user.id_str' values.
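The three steps above can be sketched in plain Java, outside Hadoop. This is a minimal sketch assuming one tweet JSON object per line; the regex is a simplified stand-in for the java-json parsing the actual job uses, and the class and method names (`TopTweeters`, `extractUserId`) are hypothetical:

```java
import java.util.*;
import java.util.regex.*;
import java.util.stream.*;

public class TopTweeters {
    // Simplified stand-in for JSON parsing: pull "id_str" out of the
    // tweet's embedded user object. The real job uses java-json instead.
    private static final Pattern USER_ID =
        Pattern.compile("\"user\"\\s*:\\s*\\{[^}]*\"id_str\"\\s*:\\s*\"(\\d+)\"");

    public static String extractUserId(String tweetJson) {
        Matcher m = USER_ID.matcher(tweetJson);
        return m.find() ? m.group(1) : null;
    }

    // Steps 1-2: count tweets per user.id_str.
    public static Map<String, Integer> countByUser(List<String> tweets) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tweets) {
            String id = extractUserId(t);
            if (id != null) counts.merge(id, 1, Integer::sum);
        }
        return counts;
    }

    // Step 3: top-n user ids, sorted by tweet count, descending.
    public static List<String> topN(Map<String, Integer> counts, int n) {
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

On a real cluster, counting and sorting are split across the two MapReduce jobs described below; this single-process version only illustrates the logic.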

Please follow the link to reach the source code.

Implementation

In order to perform this analysis we created two separate MapReduce jobs: the first covers Steps 1 and 2 below, and the second covers Steps 3 and 4.

Step 1: Mapper

After the data is distributed to the cluster, each tweet is tokenized and then de-serialized into a tweet object. Finally, the "user.id_str" field, which identifies the owner of the tweet, is emitted as the key of a key-value pair.

Step 2: Reducer

  1. At this step, the reducer groups the pairs that share the same user.id_str and sums them up to get the aggregate results.

  2. The reducer thus determines the total number of tweets by each user.id. Its output is sorted by key (user.id_str), because the shuffle-and-sort step of the mapping phase orders the pairs alphabetically by key.
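The reduce step can be sketched in plain Java (not the actual Hadoop Reducer signature; the class name `CountReducer` is hypothetical). The shuffle groups the mapper's (user.id_str, 1) pairs by key, and the reducer sums each group; a TreeMap mirrors the key-sorted output Hadoop produces:

```java
import java.util.*;

public class CountReducer {
    // Plain-Java analogue of the reduce step: each user.id_str key arrives
    // with the list of 1s the mapper emitted for it; the reducer sums them.
    // TreeMap keeps the output sorted by key, as the shuffle-and-sort does.
    public static SortedMap<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        SortedMap<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int one : e.getValue()) sum += one;
            totals.put(e.getKey(), sum);
        }
        return totals;
    }
}
```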

To get the desired output, sorted by the number of occurrences of each user.id, the key-value pairs have to be sorted by value. So we pass this output to a second MapReduce job, which swaps the key and the value and then performs the sort.
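The hand-off to the second job can be sketched in plain Java (a simplified, single-process analogue; the class name `SwapSort` is hypothetical). Each (user.id_str, count) pair is swapped so the count becomes the key, and a descending sort stands in for the custom Comparator the job installs on the shuffle:

```java
import java.util.*;

public class SwapSort {
    // Analogue of the second job's mapper plus shuffle: swap each
    // (user.id_str, count) pair to (count, user.id_str), then sort by
    // the new key. Sorting in descending order here mimics the custom
    // Comparator, so the largest counts come first.
    public static List<Map.Entry<Integer, String>> swapAndSort(Map<String, Integer> counts) {
        List<Map.Entry<Integer, String>> swapped = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            swapped.add(new AbstractMap.SimpleEntry<>(e.getValue(), e.getKey()));
        }
        swapped.sort((a, b) -> Integer.compare(b.getKey(), a.getKey()));
        return swapped;
    }
}
```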

Step 3: Mapper 2

In the second mapper phase, we tokenize the input and swap the tokens, putting the second token (the count) as the key and the first token (the user.id) as the value. During mapping, the shuffle-and-sort step orders the pairs by key.

Step 4: Reducer 2

Since keys are sorted in ascending order by default, we used a custom Comparator in the mapper to get the desired descending listing. Reducer-2 then swaps the pairs back into the desired (user.id, occurrences) form.
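Reducer-2's swap-back can be sketched as follows (plain-Java analogue; the class name `SwapBack` is hypothetical). The (count, user.id) pairs arrive already in descending order thanks to the Comparator, so the reducer only restores the original (user.id, occurrences) orientation:

```java
import java.util.*;

public class SwapBack {
    // Plain-Java analogue of the second reducer: turn the sorted
    // (count, user.id_str) pairs back into (user.id_str, count).
    // LinkedHashMap preserves the descending-by-count order.
    public static LinkedHashMap<String, Integer> swapBack(List<Map.Entry<Integer, String>> sorted) {
        LinkedHashMap<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> e : sorted) {
            result.put(e.getValue(), e.getKey());
        }
        return result;
    }
}
```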

Preliminary Results and Test

Test

In order to check the results, we created another Java application, which confirmed that our MapReduce job produces the correct results.

Please follow the link to reach the code.

References

https://help.twitter.com/en/rules-and-policies/twitter-api

https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object

https://github.com/stleary/JSON-java

https://stackoverflow.com/questions/26659753/processing-json-using-java-mapreduce

https://anirudhbhatnagar.com/2013/05/08/using-map-reduce-to-find-the-twitter-trends/amp/