Sentiment Extraction - SubhasisDutta/Text-Analysis GitHub Wiki

Sentiment extraction is a complex task in which a variety of factors need to be taken into consideration to produce accurate results. After considerable research, we have found that the following criteria determine the sentiment of a given text:

The domain the text belongs to, such as Electronics, Movies, Books or Food.
The source of the text to be analyzed, such as Blogs, Reviews or Tweets.
The presence of acronyms (LOL, OMG etc.), emoticons ( :), (: etc. ) and hashtags in social media text.

We use the following components to ensure that the right sentiment is extracted from a given text regardless of its domain, its source or the presence of non-standard text such as emoticons.

A TextFilter to handle non-standard text such as Internet abbreviations (LOL, ASAP), emoticons ( :), (: ), spelling variations and hashtags.
Machine learning models trained with relevant training data for these top-level domains: Products, Electronics and Technology, Movies, Services, Books, Food, Hotels and Bars, Music, Places, Restaurants, Travel and Tour.
Machine learning models trained with relevant training data for 25 sub-domains: Camera & Photo, Mobile Phones, Computer, Video games, Software, Video, Apparel, Automotive, Baby, Beauty, Food, Grocery, Personal, Jewelry, Watches, Kitchen & housewares, Magazines, Musical Instruments, Office products, Outdoor living, Sports, Tools, Hardware, Toys & Games, Health-care.
A module for unified handling of different sources of text such as Blogs, Reviews and Tweets. It also assigns variable scores to different parts of a given source, such as the Title, First Paragraph and Last Paragraph.
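
As an illustration of what a TextFilter of this kind might do, here is a minimal sketch. The lookup tables, the function name `filter_text` and the normalization rules are assumptions made for this example; the project's actual lists and logic are not shown on this page.

```python
import re

# Hypothetical lookup tables (assumptions); the real lists are not shown here.
ABBREVIATIONS = {
    "lol": "laughing out loud",
    "omg": "oh my god",
    "asap": "as soon as possible",
}
EMOTICONS = {":)": "happy", "(:": "happy", ":(": "sad"}


def filter_text(text):
    """Normalize emoticons, hashtags, elongated words and abbreviations."""
    tokens = []
    for token in text.split():
        # Replace a known emoticon with a plain sentiment word.
        if token in EMOTICONS:
            tokens.append(EMOTICONS[token])
            continue
        # Turn "#GreatPhone" into "Great Phone" by splitting before capitals.
        if token.startswith("#"):
            token = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token[1:])
        # Collapse a character repeated 3+ times to two: "sooooo" -> "soo".
        token = re.sub(r"(.)\1{2,}", r"\1\1", token)
        # Expand known abbreviations; surrounding punctuation is dropped on a match.
        word = token.strip(".,!?").lower()
        tokens.append(ABBREVIATIONS.get(word, token))
    return " ".join(tokens)


print(filter_text("OMG this phone is sooooo good :) #GreatPhone"))
```

The filtered text can then be passed to the domain models, so that they see ordinary words instead of social-media shorthand.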

Our sentiment extraction algorithm uses the above modules to assign a score between -5 and +5 to a given input text, where -5 represents very negative and +5 represents very positive sentiment. Some examples of the sentiment extraction algorithm's output are provided below:

Sentiment Examples
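
The variable scoring of parts such as the Title, First Paragraph and Last Paragraph, combined with the -5 to +5 range, could be sketched roughly as follows. The weights, the section names and the function `combine_scores` are hypothetical; this page does not state the actual values or the combination rule, so a clamped weighted average is used here purely as an assumption.

```python
# Hypothetical weights (assumptions); the actual values are not documented here.
SECTION_WEIGHTS = {
    "title": 2.0,
    "first_paragraph": 1.5,
    "last_paragraph": 1.5,
    "body": 1.0,
}


def combine_scores(section_scores):
    """Return the weighted average of per-section scores, clamped to [-5, 5]."""
    total = 0.0
    weight_sum = 0.0
    for section, score in section_scores.items():
        # Sections without a configured weight default to 1.0.
        weight = SECTION_WEIGHTS.get(section, 1.0)
        total += weight * score
        weight_sum += weight
    if weight_sum == 0:
        return 0.0  # No sections supplied: treat as neutral.
    return max(-5.0, min(5.0, total / weight_sum))


print(combine_scores({"title": 4, "body": 1}))
```

With these assumed weights, a strongly positive title pulls the overall score up more than the same sentiment in the body would.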

This module uses the algorithms and tools shown below:

Twokenize - Parses the relevant parts of text from Tweets
RAKE - Extracts keywords from a given text
Scikit-Learn - Computes TF-IDF vectors of text, trains support vector regression models and prunes features using the chi-square transformation
Labeled Training data characterizing the different domains.
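
The Scikit-Learn steps listed above (TF-IDF vectors, chi-square feature pruning and support vector regression) can be chained in a single pipeline. The sketch below is only an illustration: the toy texts and scores, the value `k=10` and the linear kernel are assumptions, not the project's training data or settings. It also assumes that the integer scores can be treated as class labels for the chi-square step, since scikit-learn's `chi2` expects categorical targets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# Hypothetical toy data; real training would use the labeled domain corpora.
texts = [
    "great camera, love the sharp photos",
    "excellent battery and great screen",
    "terrible battery, awful screen",
    "awful camera, hate the blurry photos",
    "okay phone, nothing special",
    "love it, excellent value",
]
scores = [4, 5, -4, -5, 0, 4]  # Integer sentiment scores in the -5..5 range.

model = Pipeline([
    ("tfidf", TfidfVectorizer()),        # Text -> TF-IDF vectors.
    ("chi2", SelectKBest(chi2, k=10)),   # Keep the 10 highest-scoring features.
    ("svr", SVR(kernel="linear")),       # Regress the sentiment score.
])
model.fit(texts, scores)

prediction = float(model.predict(["great screen, love the photos"])[0])
# Clamp to the documented output range.
clamped = max(-5.0, min(5.0, prediction))
print(clamped)
```

One such pipeline could be trained per domain or sub-domain, with the domain of the input text deciding which model is applied.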