Project Proposal - VidyullathaKaza/BigData_Programming_Spring2020 GitHub Wiki

A system to store, analyze, and visualize Twitter tweets on COVID-19

TEAM-4

Vidyullatha Lakshmi Kaza - 8

Aparna Manda - 11

Lohitha Yenugu - 19

Overview:

The initial phase of the project focuses on collecting Twitter data. More than 100,000 tweets have been collected. The hashtags used in these tweets were extracted with code, and a word count was then run to identify how many times each hashtag is used across the tweets. This provides a foundation for the data analysis to be performed on the collected and filtered information.
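The hashtag extraction and counting step described above could look roughly like the following. This is a minimal sketch, not the team's actual code: the function names `extract_hashtags` and `count_hashtags`, the regular expression, the lowercasing of tags, and the sample tweets are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed pattern: '#' followed by letters, digits, or underscores.
# A tag such as "#COVID-19" would be captured as "COVID" under this pattern.
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the hashtags in one tweet, lowercased and without the '#'."""
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]


def count_hashtags(tweets):
    """Count how often each hashtag appears across an iterable of tweet texts."""
    counts = Counter()
    for text in tweets:
        counts.update(extract_hashtags(text))
    return counts


# Made-up sample tweets, used only to exercise the functions.
sample_tweets = [
    "Stay home and stay safe #COVID19 #StayHome",
    "New guidance released today #covid19",
    "No hashtags in this one",
]
hashtag_counts = count_hashtags(sample_tweets)
print(hashtag_counts.most_common(2))  # [('covid19', 2), ('stayhome', 1)]
```

Lowercasing merges variants such as #COVID19 and #covid19 into one count; if the case-sensitive variants should be counted separately, the `.lower()` call can be dropped.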

Tools Used:

Team members worked on Windows-based machines. Apache Hadoop was the primary tool for extracting and filtering hashtags, and MapReduce was used to perform the word count on the gathered data.
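To make the MapReduce word count concrete, here is a small single-process simulation of the map, shuffle/sort, and reduce phases, written in Python. It is only a sketch of the general pattern: the proposal does not say whether the job was written in Java or run through Hadoop Streaming, and the names `mapper`, `reducer`, and `run_local`, along with the sample lines, are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit (hashtag, 1) for every hashtag token in every line."""
    for line in lines:
        for token in line.split():
            if token.startswith("#") and len(token) > 1:
                yield token.lower(), 1


def reducer(sorted_pairs):
    """Reduce phase: sum the counts for each key in key-sorted input."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


# Made-up input lines, used only to exercise the pipeline.
result = run_local(["#covid19 cases rise", "stay safe #COVID19 #corona"])
print(result)
```

On a real cluster the sort between the two phases is done by Hadoop itself; the `sorted` call here stands in for that shuffle step so the reducer sees all pairs for a key together.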

Languages Used:

Python, Java

Key Components:

• Number of tweets used (extracted) - 100,000

• Keywords used - COVID-19, COVID19, covid-19, covid19, corona, CORONA

• Columns extracted - date, user, is_retweet, is_quoted, text, quoted_text
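The keyword filter and column selection listed above can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: it assumes the tweets are available as CSV rows, matches keywords case-insensitively (which collapses the six listed spellings to three), and uses hypothetical names (`matches_keyword`, `select_rows`) and made-up sample rows, including an extra `lang` column that exists only to show columns being dropped.

```python
import csv
import io

# The listed keywords, lowercased; matching is case-insensitive by assumption.
KEYWORDS = ("covid-19", "covid19", "corona")
# Columns named in the proposal.
COLUMNS = ["date", "user", "is_retweet", "is_quoted", "text", "quoted_text"]


def matches_keyword(text):
    """Return True if the tweet text contains any keyword, ignoring case."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


def select_rows(reader):
    """Yield matching rows, keeping only the columns listed in COLUMNS."""
    for row in reader:
        if matches_keyword(row.get("text", "")):
            yield {col: row.get(col, "") for col in COLUMNS}


# Made-up sample data; the "lang" column is hypothetical.
raw = io.StringIO(
    "date,user,is_retweet,is_quoted,text,quoted_text,lang\n"
    "2020-03-01,alice,False,False,Corona updates for today,,en\n"
    "2020-03-01,bob,False,False,Nice weather outside,,en\n"
)
rows = list(select_rows(csv.DictReader(raw)))
print(len(rows), rows[0]["user"])  # 1 alice
```

Substring matching is deliberately loose, so a word like "coronation" would also match "corona"; a stricter filter could match on word boundaries instead.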