Project Increment 2

Twitter Data Analysis, Visualization & Classification

Increment-2

Team 1a: (Mehmet, Jonathan)

Introduction

Advances in science and technology have made the modern lifestyle more interactive through social media platforms such as Twitter. People have become increasingly active on Twitter, sharing their ideas, criticism, and support. It has become a new way for people to communicate with one another, alongside established uses in business, marketing, and education.

Beyond this personal use, Twitter also provides valuable information about global threats such as the COVID-19 pandemic. We believe Twitter messages offer an instant picture of the threat that can shed light on otherwise unseen aspects of the pandemic.

Background

The work presented in this document is an extension of the previous work (Increment-1), which included analysis of the data set, a MapReduce implementation to gain preliminary insight into the data, and sentiment analysis in Hive.

In Phase 2, additional analyses are performed in Spark SQL and visualized with current tooling. Building on this in-depth analysis of the data set, a classification model will be constructed by examining correlations between fields in the data set's schema.
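
As a minimal sketch of how such correlations could be explored, assuming hypothetical candidate fields like user.friends_count and user.followers_count (the actual fields for the model are still to be selected):

from pyspark.sql import SparkSession

# Standalone sketch: load the collected tweets and compute a Pearson correlation
spark = SparkSession.builder.appName("Correlation Sketch").master("local[*]").getOrCreate()
tweetsDF = spark.read.json("all.txt")

# Selecting nested fields yields flat columns named friends_count / followers_count
corr = tweetsDF.select("user.friends_count", "user.followers_count") \
    .stat.corr("friends_count", "followers_count")
print("friends/followers correlation:", corr)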

Model

The flow of the model starts with the creation of a Twitter Developer account to stream Twitter data. The data is collected with the Twitter4j library, filtered to the English language so the data set contains a single language; this approach saves valuable time when cleaning the data. The data is then stored in HDFS for further analysis, and a set of MapReduce jobs is run to obtain the preliminary insight that feeds into the modelling part of the flow.
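
The project streams tweets with Twitter4j (a Java library); purely as an illustration in Python, a roughly equivalent language-filtered stream could look like the sketch below, using tweepy (v4, Twitter API v1.1). The credentials and track terms are placeholders:

import tweepy

class TweetCollector(tweepy.Stream):
    def on_data(self, raw_data):
        # Append each raw JSON tweet as one line; the resulting file can
        # later be pushed to HDFS (e.g. with `hdfs dfs -put all.txt /tweets/`)
        with open("all.txt", "ab") as f:
            f.write(raw_data + b"\n")

# Placeholder credentials from a Twitter Developer account
stream = TweetCollector("CONSUMER_KEY", "CONSUMER_SECRET",
                        "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
# Restrict the stream to English, mirroring the Twitter4j language filter
stream.filter(track=["covid"], languages=["en"])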

Dataset

o Detailed description of Dataset

The data set used in this project was collected between 2:30 and 6:51 UTC on March 13 and March 14. It contains almost 200,000 tweets.

o Detailed design of Features with diagram

Analysis of Data

As we all know, Hadoop and the MapReduce framework have been around for a long time in big data analytics. But these frameworks require many read-write operations on the hard disk, which makes them very expensive in terms of time and speed.

Apache Spark is one of the most widely used data processing frameworks in enterprises today. It is true that Spark can be costly to run, since in-memory computation requires a lot of RAM, but it remains a firm favorite among data scientists and big data engineers.

Spark SQL blends relational processing with Spark's functional programming. It supports a variety of data sources and makes it possible to run SQL queries over them, resulting in a powerful tool for analyzing structured data at scale.

Here are some of the Spark SQL features:

  • Query structured data within Spark programs
  • Hive compatibility
  • Uniform data access across sources
  • Performance and scalability
  • User-defined functions (UDFs; see the sketch below)

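As a brief illustration of the last point, a UDF can be registered and called directly from SQL. A minimal sketch, assuming the tweetsDF / "table" view created below and a hypothetical helper that buckets tweets by text length:

from pyspark.sql.types import StringType

# Hypothetical UDF: bucket tweets by the length of their text
def length_bucket(text):
    return "long" if text is not None and len(text) > 140 else "short"

spark.udf.register("length_bucket", length_bucket, StringType())
spark.sql("SELECT length_bucket(full_text) AS bucket, COUNT(*) AS count "
          "FROM table GROUP BY bucket").show()
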
We ran the following Spark SQL queries to better understand the data; this understanding will eventually help us build the model. Alongside Apache Spark we used other tools such as pandas and Matplotlib.

import pandas
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Twitter PySpark Application").master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
tweetsDF = spark.read.json("all.txt", multiLine=False)  # one JSON tweet per line
tweetsDF.createOrReplaceTempView("table")
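
Since tweets are deeply nested JSON, it helps to inspect the inferred schema before writing queries (output omitted here):

tweetsDF.printSchema()  # nested fields such as user, place and entities are queried below with dot notation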

1. Top 10 Countries Where Tweets Originate

The countries tweets come from are important because they indicate which countries are struggling with the COVID-19 pandemic, and they provide information about how people there feel about it. In our data the USA ranks first, followed by India, Australia, and Great Britain.

# Tweets with no country code (place is often null); collect()[0] is a Row, so read its "count" field
number_of_tweets_from_null_country = spark.sql("SELECT COUNT(*) AS count FROM table WHERE place.country_code IS NULL").collect()[0]["count"]
tweets_from_country = spark.sql("SELECT place.country_code, COUNT(*) AS count FROM table WHERE place.country_code IS NOT NULL GROUP BY place.country_code ORDER BY count DESC")

tweets_from_country_pd = tweets_from_country.toPandas()
x = tweets_from_country_pd["country_code"].values.tolist()[:10]
number_of_tweets_from_country = tweets_from_country_pd["count"].values.tolist()
y = number_of_tweets_from_country[:10]

# Sanity check: the two query totals should add up to the full tweet count
print('number of all tweets', sum(number_of_tweets_from_country) + number_of_tweets_from_null_country)

plt.rcParams.update({'axes.titlesize': 'small'})
plt.barh(x, y, color='red')
plt.title("Top 10 Country Codes Available In Tweets")
plt.ylabel("Countries")
plt.xlabel("Number of Tweets")

2. Tweet Distribution in the USA

tweets_from_USA = spark.sql("SELECT user.location, COUNT(*) AS count FROM table WHERE user.location LIKE '%USA%' GROUP BY user.location ORDER BY count DESC")
tweets_from_USA_pd = tweets_from_USA.toPandas()
labels = tweets_from_USA_pd["location"].values.tolist()[:10]
sizes = tweets_from_USA_pd["count"].values.tolist()[:10]
explode = (0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0)  # only "explode" the 1st slice
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=False, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.title("Tweets Distribution in USA")

3. Top 10 Tweeters

tweets_dist_person = spark.sql("SELECT user.id_str, COUNT(user.id_str) AS count FROM table WHERE user.id_str IS NOT NULL GROUP BY user.id_str ORDER BY count DESC")
tweets_dist_person_pd = tweets_dist_person.toPandas()
x = tweets_dist_person_pd["id_str"].values.tolist()[:10]
y = tweets_dist_person_pd["count"].values.tolist()[:10]

figure = plt.figure()
axes = figure.add_axes([0.35, 0.1, 0.60, 0.85])
plt.barh(x, y, color='blue')
plt.title("Top 10 Tweeters")
plt.ylabel("User id")
plt.xlabel("Number of Tweets")

4. Top 10 People Who Have Most Friends

# Keep only each user's most recent tweet so the friend count is their latest value
# (note: created_at is compared lexicographically as a string here)
friendsCountDF = spark.sql("SELECT user.screen_name, user.friends_count AS friendsCount FROM table WHERE (user.id_str, created_at) IN (SELECT user.id_str, MAX(created_at) AS created_at FROM table GROUP BY user.id_str) ORDER BY friendsCount DESC")
friendsCount_pd = friendsCountDF.toPandas()
x = friendsCount_pd["screen_name"].values.tolist()[:10]
y = friendsCount_pd["friendsCount"].values.tolist()[:10]

figure = plt.figure()
axes = figure.add_axes([0.3, 0.1, 0.65, 0.85])
plt.rcParams.update({'axes.titlesize': 'small'})
plt.barh(x, y, color='green')
plt.title("Top 10 People Who Have Most Friends")
plt.ylabel("Screen Name")
plt.xlabel("Number of Friends")

5. Hashtags Distribution

hashtagsDF = spark.sql("SELECT hashtags, COUNT(*) AS count FROM (SELECT explode(entities.hashtags.text) AS hashtags FROM table) WHERE hashtags IS NOT NULL GROUP BY hashtags ORDER BY count DESC")

hashtags_pd = hashtagsDF.toPandas()
labels = hashtags_pd["hashtags"].values.tolist()[:10]
sizes = hashtags_pd["count"].values.tolist()[:10]
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=False, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title("Hashtags Distribution")

6. Tweet Distribution over Time (Time Series)

# SUBSTRING extracts "HH:MM" from created_at (e.g. "Fri Mar 13 02:30:00 +0000 2020")
tweet_distributionDF1 = spark.sql("SELECT SUBSTRING(created_at,12,5) AS time_in_hour, COUNT(*) AS count FROM table GROUP BY time_in_hour ORDER BY time_in_hour")
from pyspark.sql import functions as F
tweet_distributionDF = tweet_distributionDF1.filter(F.col("count") > 2)  # drop sparse minutes

# Convert "HH:MM" to fractional hours for plotting
tweet_distribution_pd = tweet_distributionDF.toPandas()
x = pandas.to_numeric(tweet_distribution_pd["time_in_hour"].str[:2].tolist()) + pandas.to_numeric(
    tweet_distribution_pd["time_in_hour"].str[3:5].tolist()) / 60
y = tweet_distribution_pd["count"].values.tolist()

tick_spacing = 1
fig, ax = plt.subplots(1, 1)
ax.plot(x, y)
ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))

plt.title("Tweets Distribution By Minute")
plt.xlabel("Hours (UTC)")
plt.ylabel("Number of Tweets")

7. Top 10 Devices Used in the Tweets

df = spark.sql("SELECT source, COUNT(*) AS  total_count FROM table WHERE source IS NOT NULL GROUP BY source ORDER BY total_count DESC")
first = df.toPandas()["source"].str.index(">") + 1
last = df.toPandas()["source"].str.index("</a>")

text = df.toPandas()["source"].values.tolist()[:10]
x = []
for i in range(len(text)):
    x.append(text[i][first[i]:last[i]])

y = df.toPandas()["total_count"].values.tolist()[:10]

figure = plt.figure()
axes = figure.add_axes([0.3, 0.1, 0.65, 0.85])
plt.barh(x, y, color='blue')
plt.ylabel("Device name")
plt.xlabel("Number of Devices")
plt.title("Top Devices Used in the Tweets")

8. Tweets by Verified & Unverified Users

verified_usersDF = spark.sql("SELECT user.verified, COUNT(*) AS count FROM table GROUP BY user.verified ORDER BY user.verified ASC")

verified_pd = verified_usersDF.toPandas()
labels = ["Unverified", "Verified"]  # the query orders user.verified ASC: false first, then true
sizes = verified_pd["count"].values.tolist()[:2]
explode = (0, 0.1)  # only "explode" the 2nd (verified) slice
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=False, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title("Tweets by Verified & Unverified Users")

Sentiment Analysis

The AFINN lexicon is one of the simplest and most popular lexicons for sentiment analysis. AFINN is a list of words, each rated with an integer between minus five (negative) and plus five (positive), with zero being neutral. Before using the Python Afinn library to score the tweets, we removed special characters, emojis, RT tags, and website links, and used the cleaned text to obtain the sentiment score.
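
To illustrate the scoring, a quick sketch (ratings come from the AFINN word list, so exact values may vary by version):

from afinn import Afinn

afinn = Afinn()
print(afinn.score("This is excellent"))   # positive, e.g. 3.0
print(afinn.score("This is a disaster"))  # negative, e.g. -2.0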

import re
from afinn import Afinn

df = tweetsDF.select("full_text").toPandas()
afinn = Afinn()
positive = 0
neutral = 0
negative = 0
for i in range(len(df)):
    txt = df.loc[i]["full_text"]
    txt = re.sub(r'@[A-Z0-9a-z_:]+', '', str(txt))  # remove @username tags
    txt = re.sub(r'^[RT]+', '', str(txt))  # remove leading RT tags
    txt = re.sub('https?://[A-Za-z0-9./]+', '', str(txt))  # remove URLs
    txt = re.sub("[^a-zA-Z]", " ", str(txt))  # keep letters only (drops hashtags, emojis, punctuation)
    df.at[i, "full_text"] = txt
    sentiment_score = afinn.score(txt)
    if sentiment_score > 0:
        positive = positive + 1
    elif sentiment_score < 0:
        negative = negative + 1
    else:
        neutral = neutral + 1

labels = ["Positive" , "Negative", "Neutral"]
sizes = [positive, negative, neutral]
explode = (0, 0.1, 0)  # only "explode" the 2nd (negative) slice

fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=False, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title("Sentiment Analysis")
plt.savefig(plots_folder + filename + ".png", dpi=1200)  # plots_folder and filename are assumed to be defined elsewhere in the script

o Data Pre-processing through integration of lectures (at least 4) from modules 1 and 2

o Graph model with explanation

Implementation

o Algorithms / Pseudocode

o Explanation of implementation

Results

o Diagrams for results with detailed explanation

Project Management

o Implementation status report

Work completed (should be >= 85%):

Responsibility (Task, Person)

Data collection

  1. Analysing Streaming API and Twitter Developer Access - Jonathan (100 %)

  2. Data Streaming Code - Jonathan (100 %)

  3. Documentation - Mehmet (50 %) Jonathan (50 %)

Data Cleaning

  1. Data Cleaning code - Jonathan (70 %) Mehmet (30 %)

  2. Data Merging - Mehmet (100%)

  3. Documentation - Mehmet (50%) Jonathan (50%)

Sentiment Analysis

  1. Algorithm Analysis & Design - Mehmet (100 %)

  2. Coding - Mehmet (100 %)

  3. Visualization - Mehmet (100 %)

  4. Documentation - Mehmet (100 %)

Data Analysis and Visualization

  1. Algorithm Analysis & Design - Mehmet (80 %), Jonathan (20 %)
  2. Coding - Mehmet (100 %)
  3. Visualization - Mehmet (100 %)

MapReduce Framework

  1. Design for Mapper, Reducer, Main - Jonathan (80 %), Mehmet (20%)

  2. Coding for Mapper - Jonathan (100 %)

  3. Coding for Reducer - Jonathan (100 %)

  4. Documentation - Jonathan (100 %)

Work to be completed

• Description

• Responsibility (Task, Person)

• Issues/Concerns

References/Bibliography

https://github.com/JAWolfe04/CS5590-Group-Project/wiki

https://www.analyticsvidhya.com/blog/2020/02/hands-on-tutorial-spark-sql-analyze-data/