Introducing Sentimentron
Sentimentron is a tool that shows you how a website's sentiment changes over time. Its full and official title is An Automated Tool for Detecting Bias in Internet Based Hypermedia, which is pretty much what it allows you to do.
Core technologies
Sentimentron is mostly written in Python. Most of the analysis happens in two components, pysen and pydate.
pysen analyses the sentiment of English text. It divides documents into unambiguous segments called phrases, made up of adjectives, adverbs, nouns and verbs, and classifies each phrase using SentiWordNet. It then treats the phrase values as a vector and correlates them against a library of pre-trained phrases, mostly taken from Pang and Lee's sentence polarity dataset (version 2). Finally, a decision tree classifier, trained on Pang and Lee's polarity dataset, determines document-level sentiment. At each stage of the analysis, pysen uses a probability model to exclude phrases and sentences that are likely to have been misclassified, so they do not feed into further classification decisions.
pydate probabilistically extracts dates from HTML using dateutil and BeautifulSoup.
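To make pysen's filtering step more concrete, here is a minimal sketch of confidence gating: phrases whose classification is probably wrong are dropped before anything is aggregated. This is not pysen's actual code. The `ScoredPhrase` type, the `threshold` value and both function names are hypothetical, and a confidence-weighted mean stands in for the decision tree classifier that pysen uses at document level.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoredPhrase:
    """Hypothetical container for one classified phrase (not a pysen type)."""
    text: str
    polarity: float    # assumed range [-1, 1]; below zero means negative sentiment
    confidence: float  # assumed estimate of P(classification is correct)


def filter_confident(phrases: List[ScoredPhrase],
                     threshold: float = 0.6) -> List[ScoredPhrase]:
    """Keep only phrases whose estimated correctness meets the threshold."""
    return [p for p in phrases if p.confidence >= threshold]


def aggregate_polarity(phrases: List[ScoredPhrase],
                       threshold: float = 0.6) -> Optional[float]:
    """Confidence-weighted mean polarity of the phrases that survive filtering.

    Returns None when every phrase is filtered out, so the caller can skip
    the sentence instead of passing a guess on to the next stage.
    """
    kept = filter_confident(phrases, threshold)
    if not kept:
        return None
    total_weight = sum(p.confidence for p in kept)
    return sum(p.polarity * p.confidence for p in kept) / total_weight


if __name__ == "__main__":
    sample = [
        ScoredPhrase("great film", 0.8, 0.9),
        ScoredPhrase("not bad", -0.6, 0.3),   # low confidence: excluded
        ScoredPhrase("dull ending", -0.4, 0.7),
    ]
    print(aggregate_polarity(sample))
```

Returning `None` rather than a neutral score means an unreliable sentence is dropped instead of being counted as a neutral one, which matches the idea of keeping likely misclassifications out of later decisions.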
The rest of Sentimentron consists of a Flask-based API for communicating with the website, and a boto-based backend. Sentimentron's website uses jQuery extensively, and Flot for charts.
Sentimentron's data comes from the CommonCrawl, using pages sourced from the top 4000 sites linked to from Wikipedia. Most of the data is from 2008, because processing anything more would be too costly.
Reporting bugs and concerns
Sentimentron is currently being tested, so it has some rough edges and some omissions. If you encounter any, email [email protected].