Sentiment Extraction - SubhasisDutta/Text-Analysis GitHub Wiki
Sentiment extraction is a complex task in which a variety of factors need to be taken into consideration to produce accurate results. After considerable research, we have found that the following factors influence the sentiment of a given text:
- The domain to which the text belongs, such as Electronics, Movies, Books, Food, etc.
- The source of the text to be analyzed, such as Blogs, Reviews, Tweets, etc.
- The presence of Acronyms (LOL, OMG, etc.), Emoticons ( :), (:, etc. ) and Hashtags in social media text.
We use the following components to ensure that the correct sentiment is extracted from a given text regardless of its domain, its source, or the presence of non-standard text such as emoticons:
- A TextFilter to handle non-standard text such as Internet Abbreviations ( LOL, ASAP ), Emoticons ( :) , (: ), Spelling variations and Hashtags.
- Machine learning based models trained for these top domains with relevant training data: Products, Electronics and Technology, Movies, Services, Books, Food, Hotels and Bars, Music, Places, Restaurants, Travel and Tour.
- Machine learning based models trained for 25 sub domains with relevant training data: Camera & Photo, Mobile Phones, Computer, Video games, Software, Video, Apparel, Automotive, Baby, Beauty, Food, Grocery, Personal, Jewelry, Watches, Kitchen & housewares, Magazines, Musical Instruments, Office products, Outdoor living, Sports, Tools, Hardware, Toys & Games, Health-care.
- A module for unified handling of different sources of text such as Blogs, Reviews and Tweets, which also assigns variable scores to different parts of a given source such as the Title, First Paragraph and Last Paragraph.
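To make the TextFilter idea more concrete, here is a minimal sketch of how non-standard text could be normalized before scoring. It is not the project's implementation: the function name `filter_text`, the lookup tables and the `EMO_POS` / `EMO_NEG` placeholder tokens are all assumptions made for illustration.

```python
import re

# Hypothetical lookup tables; the project's real lists are not shown on this page.
ABBREVIATIONS = {
    "lol": "laughing out loud",
    "omg": "oh my god",
    "asap": "as soon as possible",
}
EMOTICONS = {":)": "EMO_POS", "(:": "EMO_POS", ":(": "EMO_NEG", "):": "EMO_NEG"}


def filter_text(text):
    """Normalize abbreviations, emoticons, elongated spellings and hashtags."""
    tokens = []
    for raw in text.split():
        # Replace known emoticons with a placeholder token.
        if raw in EMOTICONS:
            tokens.append(EMOTICONS[raw])
            continue
        # Split CamelCase hashtags into words: "#LoveThisPhone" -> love this phone
        if raw.startswith("#") and len(raw) > 1:
            words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", raw[1:])
            tokens.extend(w.lower() for w in words)
            continue
        word = raw.lower().strip(".,!?;:")
        # Collapse runs of three or more identical characters: "sooooo" -> "soo"
        word = re.sub(r"(.)\1{2,}", r"\1\1", word)
        if word:
            # Expand known abbreviations, otherwise keep the word as is.
            tokens.append(ABBREVIATIONS.get(word, word))
    return " ".join(tokens)


print(filter_text("OMG this phone is sooooo good :) #LoveThisPhone"))
```

Mapping emoticons to placeholder tokens rather than deleting them keeps their sentiment signal available to a downstream model.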
Our sentiment extraction algorithm uses the above modules to assign a score between -5 and +5 to a given input text, where -5 represents very negative and +5 represents very positive sentiment. Some examples of the sentiment extraction algorithm are provided below:
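As a rough illustration of how scores for different parts of a source might be combined and kept within the -5 to +5 range, consider the sketch below. It is not the project's actual code or output: the function `combine_section_scores`, the section names and the weight values are hypothetical.

```python
def combine_section_scores(section_scores, weights=None):
    """Return a weighted average of per-section scores, clamped to [-5, 5]."""
    if weights is None:
        # Hypothetical weights: title and opening/closing paragraphs count more.
        weights = {
            "title": 2.0,
            "first_paragraph": 1.5,
            "body": 1.0,
            "last_paragraph": 1.5,
        }
    total = 0.0
    weight_sum = 0.0
    for section, score in section_scores.items():
        w = weights.get(section, 1.0)  # unknown sections get a neutral weight
        total += w * score
        weight_sum += w
    if weight_sum == 0:
        return 0.0  # nothing to score
    # Clamp so the result always stays on the -5 .. +5 scale.
    return max(-5.0, min(5.0, total / weight_sum))


print(combine_section_scores({"title": 4.0, "body": 1.0}))
```

A weighted average keeps the result on the same scale as the inputs, so the clamp only matters when a per-section score is itself out of range.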
This module uses the algorithms and tools shown below:
- Twokenize - Parses the relevant parts of the text from Tweets
- RAKE - Extracts keywords from the given text
- Scikit-Learn - Computes TF-IDF vectors of the text, trains Support Vector Regression models and prunes features using Chi-square feature selection
- Labeled training data characterizing the different domains.
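The Scikit-Learn step listed above can be sketched as a small pipeline. This is a minimal example under stated assumptions, not the project's training code: the toy texts and scores are made up, and `k=10`, the linear kernel and the helper `predict_score` are illustrative choices.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR

# Toy labeled data, invented for illustration; scores use the -5 .. +5 scale.
texts = [
    "excellent camera, amazing battery life",
    "great phone, works really well",
    "decent tablet, nothing special",
    "mediocre speakers, slightly disappointing",
    "poor screen, bad build quality",
    "terrible laptop, awful support",
]
scores = [5, 4, 1, -1, -4, -5]

# TF-IDF vectors -> Chi-square feature pruning -> Support Vector Regression.
# Note: chi2 treats the integer scores as class labels when ranking features.
model = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(chi2, k=10),  # keep the 10 highest-ranked features (assumed value)
    SVR(kernel="linear"),
)
model.fit(texts, scores)


def predict_score(text):
    """Predict a sentiment score and clip it to the -5 .. +5 range."""
    raw = model.predict([text])[0]
    return float(np.clip(raw, -5.0, 5.0))


print(predict_score("amazing camera"))
```

In practice one such model would be trained per domain or sub domain on its own labeled data, with the number of selected features tuned on held-out data.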