Project Increment 2 - bhargavi1411/BigDataProgramming GitHub Wiki

Twitter Sentiment analysis

Team 7: Madhuri Sarode : 24, Bhargavi Saipoojitha Chennupati : 4, Bhavana Deepthi Kota : 16

Increment - 2

Introduction:

This phase extends and improves on Increment 1, in which tweet extraction for movies and hashtag counting were implemented using a MapReduce algorithm. In Increment 2, we perform sentiment analysis on tweets about movies released in 2019 and identify the most popular and successful genre of 2019.

Use case : Research the success of movies in different categories. This gives filmmakers an idea of which genres are popular with and liked by the audience, and offers a commercial advantage when deciding where to invest.

Model Architecture Diagram with explanation

Dataset

Detailed description of Dataset : Our input data is tweet data streamed at run time from the Twitter API. We collected movie tweets belonging to the following genres:

  1. Comedy
  2. Action
  3. Science fiction
  4. Kids

The sample dataset is shown below. Tweets about the comedy movie Noelle are extracted, and this becomes our primary dataset.

Detail design of Features with diagram : The dataset is obtained using a Python program that streams Twitter tweet data for a given input keyword.

We supply the search query q = "#moviename" and the file path to which the tweets are streamed. The output location then contains the files that hold the tweets for the input keywords.

We extract tweets for 4 movies of each genre. As seen below, we have 4 comedy movie tweet files, 4 action movie tweet files, and so on.

Analysis of data

Data Pre-processing: Each tweet data file holds hundreds of tweets, and each tweet is a mix of username, retweet tag, hashtags, smileys, punctuation marks, hyperlinks, etc., in both lower and upper case. All of this unwanted content is filtered out, and only the tweet message is kept, along with any hashtags it contains. A lengthy regex pattern filters and replaces the unwanted strings.

Sample tweet file before preprocessing

Sample tweet file after pre-processing.
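The cleaning step described above can be sketched in Python as follows. The project's actual regex is not shown on this page, so the patterns and the function name `clean_tweet` are assumptions made for illustration only.

```python
import re

# Hypothetical patterns; the project's real regex is not shown on this page.
URL_RE = re.compile(r"https?://\S+|www\.\S+")      # hyperlinks
MENTION_RE = re.compile(r"@\w+:?")                 # usernames such as "@user:"
RETWEET_RE = re.compile(r"^\s*RT\b", re.IGNORECASE)  # leading retweet tag
# Keep lowercase letters, digits, whitespace and '#' so hashtags survive.
UNWANTED_RE = re.compile(r"[^a-z0-9#\s]")
SPACES_RE = re.compile(r"\s+")


def clean_tweet(raw):
    """Strip retweet tag, links, mentions, smileys and punctuation from a tweet."""
    text = RETWEET_RE.sub(" ", raw)
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.lower()                  # normalise mixed upper/lower case
    text = UNWANTED_RE.sub(" ", text)    # drops punctuation and emoji
    return SPACES_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_tweet("RT @user: Loved #Noelle!! :) https://t.co/abc"))
```

Links and mentions are removed before punctuation so that their `:`, `/` and `@` characters do not leave stray fragments behind.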

Implementation

Algorithms

Step 1: Extract the tweet data for each movie of each genre.
Step 2: Pre-process the data and store it.
Step 3: Count the hashtags across all the files and find the hashtag with the highest count, which is the most discussed hashtag.
Step 4: Iterate over each file and each tweet in it, and compare the tweet with words from the positive, negative and neutral data dictionaries. Segregate the tweets and write them into separate files.
Step 5: Count the positive, negative and neutral tweets for each movie to determine its effect on the audience.
Step 6: Sum the positive tweet counts of all the movies belonging to a specific genre. The genre with the highest positive response is concluded to be the successful genre of 2019.

Explanation of implementation

Clean input tweets and hashtag analysis

  1. Tweets are extracted from Twitter in text format for 4 movies of each genre, using a Python program. The tweet count is limited to roughly 600 to 700. The number of input files is not hardcoded: however many input files are supplied, all of them are processed. (FilePath : Input/MovieName.txt)
  2. Tweets are cleaned to remove tags, usernames, hyperlinks, etc. and written to a file that contains only the tweet content. (FilePath: Output/CleanedTweets)
  3. Tweets are read to extract hashtags. Each tweet is searched for the symbol "#", and the word that follows it is taken as a hashtag for that tweet and written to a text file. (FilePath : HashTagListFromTweets/MovieName.txt)
  4. The hashtags from all the movie tweets are counted using a MapReduce job. Each movie's hashtag list is processed, and the result is written to a file that contains each hashtag along with its count. (FilePath : Result/HashTagCompleteList.txt)
  5. The hashtag list with counts from the previous step is taken as input and sorted in descending order of count to find the most discussed movie hashtag. The sorted list also shows how many tweets each hashtag has. (FilePath : Result/HashTagCount)
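As a rough illustration of steps 3 to 5, the sketch below extracts hashtags and counts them in a map/reduce style. It is an in-memory Python stand-in, not the project's MapReduce job, and all function names are hypothetical.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")  # '#' followed by the hashtag word


def extract_hashtags(tweet):
    """Return the lower-cased hashtags found in one tweet."""
    return [tag.lower() for tag in HASHTAG_RE.findall(tweet)]


def map_phase(tweets):
    """Map: emit a (hashtag, 1) pair for every hashtag occurrence."""
    for tweet in tweets:
        for tag in extract_hashtags(tweet):
            yield tag, 1


def reduce_phase(pairs):
    """Reduce: sum the emitted counts per hashtag."""
    counts = Counter()
    for tag, one in pairs:
        counts[tag] += one
    return counts


def hashtags_by_count(tweets):
    """Hashtags sorted by descending count; ties are broken alphabetically."""
    counts = reduce_phase(map_phase(tweets))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    sample = ["loved #noelle #disney", "#Noelle was fun", "no tags here"]
    for tag, count in hashtags_by_count(sample):
        print(tag, count)
```

The first entry of the sorted list is the most discussed hashtag, and slicing it with `[:5]` gives the top five.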

Tweets analysis and classification

  1. Lists of positive, negative and neutral responses are compiled by observing the tweets.
  2. The input tweets for each movie are read from the cleaned tweets file, and each one is analyzed against the 3 data lists of positive, negative and neutral responses.
  3. When the words in the tweet match those in the positive response list, the tweet is classified as a positive response tweet and written to the file MovieName/PositiveTweets.txt (FilePath : Result/MovieName.txt/PositiveTweets.txt)
  4. When the words in the tweet match those in the negative response list, the tweet is classified as a negative response tweet and written to the file MovieName/NegetiveTweets.txt (FilePath : Result/MovieName.txt/NegetiveTweets.txt)
  5. When the words in the tweet match those in the neutral response list, the tweet is classified as a neutral response tweet and written to the file MovieName/NeutralTweets.txt (FilePath : Result/MovieName.txt/NeutralTweets.txt)
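The word-list matching above can be sketched as follows. The word lists here are made-up placeholders, and the rule for tweets that match more than one list (most matches wins, a tie counts as neutral) is an assumption, since the page does not say how such tweets are handled.

```python
# Hypothetical word lists; the project's real lists are not shown on this page.
POSITIVE = {"good", "great", "loved", "awesome", "funny"}
NEGATIVE = {"bad", "boring", "hated", "awful", "worst"}
NEUTRAL = {"okay", "average", "watched", "fine"}

WORD_LISTS = {"positive": POSITIVE, "negative": NEGATIVE, "neutral": NEUTRAL}


def classify_tweet(cleaned_tweet):
    """Return 'positive', 'negative', 'neutral', or None when no list matches."""
    words = cleaned_tweet.lower().split()
    # Count how many words of the tweet appear in each list.
    hits = {label: sum(w in vocab for w in words)
            for label, vocab in WORD_LISTS.items()}
    best = max(hits.values())
    if best == 0:
        return None
    winners = [label for label, n in hits.items() if n == best]
    # Assumption: a tie between lists is treated as neutral.
    return winners[0] if len(winners) == 1 else "neutral"


def split_by_sentiment(cleaned_tweets):
    """Group tweets into the three buckets that become the three output files."""
    buckets = {"positive": [], "negative": [], "neutral": []}
    for tweet in cleaned_tweets:
        label = classify_tweet(tweet)
        if label is not None:
            buckets[label].append(tweet)
    return buckets


if __name__ == "__main__":
    print(split_by_sentiment(["loved it great movie", "worst movie boring"]))
```

Each bucket would then be written to its own file under Result/MovieName.txt/.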

Result extraction

  1. The positive, negative and neutral tweet counts are extracted for each movie and written to a file. This result tells us what impact the movie had on the audience. If the positive tweet count is the highest, the movie was well received by the audience. If the negative tweet count is the highest, the movie was not well received and the majority of the audience did not like it. If the neutral tweet count is higher than both the positive and negative counts, the movie failed to make an impression on the audience, and the majority of the reviewers could not form an opinion about it. (FilePath : Conclusion/Conclusion.txt)
  2. The results are also recorded in CSV format as ConclusionData.csv so that the data can be loaded into a database and queried for the required results. (FilePath : Conclusion/ConclusionData.csv)
  3. For all the movies belonging to a genre, we extract the positive tweet counts, sum them, and list the totals. The genre with the greatest number of positive tweets is the most popular, so we conclude that this genre did well in 2019. (FilePath : Conclusion/GenrePopularity.txt)
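A minimal sketch of the per-movie verdict and the genre total is given below. The function names, the verdict labels and the handling of ties are assumptions, and the movie names and counts in the usage example are invented for illustration, not project results.

```python
from collections import defaultdict


def verdict(positive, negative, neutral):
    """Label a movie by whichever tweet count is strictly the highest."""
    if positive > negative and positive > neutral:
        return "well received"
    if negative > positive and negative > neutral:
        return "not well received"
    if neutral > positive and neutral > negative:
        return "no clear impression"
    return "undecided"  # assumption: ties are left undecided


def genre_popularity(movie_counts, movie_genre):
    """Sum positive tweet counts per genre and sort with the highest first."""
    totals = defaultdict(int)
    for movie, counts in movie_counts.items():
        totals[movie_genre[movie]] += counts["positive"]
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    counts = {
        "MovieA": {"positive": 10, "negative": 3, "neutral": 2},
        "MovieB": {"positive": 5, "negative": 6, "neutral": 1},
        "MovieC": {"positive": 12, "negative": 2, "neutral": 4},
    }
    genres = {"MovieA": "Comedy", "MovieB": "Comedy", "MovieC": "Action"}
    for movie, c in counts.items():
        print(movie, verdict(c["positive"], c["negative"], c["neutral"]))
    print(genre_popularity(counts, genres))
```

The first entry of the list returned by `genre_popularity` corresponds to the genre that would be reported as the most popular.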

Results

Diagrams for results with detailed explanation

Picture 1 : Hashtag vs count in descending order. This result shows the movie hashtag trend and how actively the movies were discussed. The hashtags were taken from all the input movie files, counted collectively, and sorted in descending order.

Picture 2 : Positive tweets categorized from movie tweets. Each movie file is checked against the positive, negative and neutral word lists. If a tweet has positive words, it is categorized as a positive tweet and written to the file.

Picture 3 : Negative tweets categorized from movie tweets. Each movie file is checked against the positive, negative and neutral word lists. If a tweet has negative words, it is categorized as a negative tweet and written to the file.

Picture 4 : Neutral tweets categorized from movie tweets. Each movie file is checked against the positive, negative and neutral word lists. If a tweet has neutral words, it is categorized as a neutral tweet and written to the file.

Picture 5 : Movie review based on positive tweet count. Each movie's positive, negative and neutral tweet counts are compared. If the positive tweet count is the highest, the movie is categorized as successful, well received by the audience, and hence popular. If the negative tweet count is the highest, the movie is categorized as not well received by the audience, with certain elements that they did not like. If the neutral tweet count is the highest, the movie is categorized as one that failed to make either a positive or a negative impression on the audience. Such a movie may have no great elements, but it also lacks elements that the audience dislikes.

Picture 6 : The grand result of genre popularity

The picture shows 2 results

  1. The most discussed movie hashtag along with the top 5 movie hashtags : The previously calculated hashtag list with tweet counts is used, and its first record is the most discussed movie hashtag. The top 5 are also extracted and listed in the result file.
  2. The most popular genre among the genres considered : 4 movies released in 2019 are considered for each genre. Their positive reviews/tweets are summed, and the totals are listed per genre. The genre with the highest number of positive reviews is concluded to be the most popular and successful genre of 2019.

The output data is also written to a CSV file so that it can be imported into Hive for querying.
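Writing the per-movie counts as CSV could look like the sketch below. The column names are assumptions, since the layout of ConclusionData.csv is not shown on this page, and the sample row is invented.

```python
import csv
import io

# Assumed column layout; the real ConclusionData.csv columns are not shown here.
FIELDS = ["movie", "genre", "positive", "negative", "neutral"]


def write_conclusion_csv(rows, out):
    """Write one header line plus one line per movie to a text stream."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


def conclusion_csv_text(rows):
    """Return the CSV content as a string (handy for checking the output)."""
    buf = io.StringIO()
    write_conclusion_csv(rows, buf)
    return buf.getvalue()


if __name__ == "__main__":
    # Invented row, for illustration only.
    rows = [{"movie": "MovieA", "genre": "Comedy",
             "positive": 10, "negative": 3, "neutral": 2}]
    print(conclusion_csv_text(rows), end="")
```

To write to disk instead, the same function can be given a file opened with `open(path, "w", newline="")`. If the header line is kept, the Hive table would need to skip it, for example with the `skip.header.line.count` table property.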

Project Management

Work Completed :

• Description: We implemented the functionality for the most discussed movie hashtag and genre popularity.
• Responsibility:
  1. Python twitter tweet extraction, Tableau graph plotting : Bhargavi
  2. Tweets cleaning, hashtag analysis, Tableau graph plotting : Bhavana
  3. Tweets count and genre popularity, Hive integration : Madhuri
• Contribution:
  1. Python twitter tweet extraction, Tableau graph plotting : Bhargavi - 33%
  2. Tweets cleaning, hashtag analysis, Tableau graph plotting : Bhavana - 33%
  3. Tweets count and genre popularity, Hive integration : Madhuri - 34%

Work to be completed :

  1. Tableau - graph plotting
  2. Hive Integration - The output file is written as CSV and loaded into Hive for quickly querying the results.

• Responsibility:
  1. Madhuri : Hive Integration
  2. Bhavana + Bhargavi : Tableau - graph plotting

• Issues/Concerns : None