Twitter NLP - lmucs/grapevine GitHub Wiki

Twitter-NLP

Twitter-NLP is a graduate research project by Alan Ritter. It was designed to extract meaningful world-event data from massive amounts of public Twitter posts. For example, if 1000 people tweet about the iPhone 6S coming out on October 12, the NLP pipeline can pick out the important entities and the date from those posts and create an event on a calendar. A demo calendar is available at http://statuscalendar.com

His project is available on GitHub at https://github.com/aritter/twitter_nlp

He has a few papers and presentations on the research and the final product:
Open Domain Event Extraction from Twitter
Named Entity Recognition in Tweets: An Experimental Study
Unsupervised Modeling of Twitter Conversations
Talk: Extracting Knowledge from Informal Text

Setup

First, clone the repository from https://github.com/aritter/twitter_nlp.
Twitter-NLP runs exclusively on Linux. We tested our results on Ubuntu 14.04 LTS.
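It also requires Python, a Java Runtime Environment, and the GCC C and C++ compilers (packaged as gcc-c++ on some distributions; on Ubuntu the C++ compiler is the g++ package).

# C and C++ compilers
sudo apt-get install gcc g++
# Python
sudo apt-get install python
# Java
sudo apt-get install default-jre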

The rest of the dependencies can be built by running sh build.sh

After installing the dependencies, text can be run through the processor by piping a txt file into python/ner/extractEntities2.py.
The repository comes with a sample of Twitter posts in test.1k.txt, which is a good data set for a first run. The parsed output can be redirected into another txt file if necessary.

# run the example data set with the generic entity extraction
# store results in testresults.txt
cat test.1k.txt | python python/ner/extractEntities2.py > testresults.txt

The output contains the tokenized and tagged words separated by spaces, with each word and its tag separated by a forward slash '/'.
Example output:

The/B-movie Town/I-movie might/O be/O one/O of/O the/O best/O movies/O I/O have/O seen/O all/O year/O ./O

Looking at just one word:

The/B-movie

BIO encoding is used to mark phrases (named entities, event phrases, and chunks), for example:

The/B-movie Town/I-movie might/O ...

This indicates that the word "The" begins a named entity of type movie, "Town" continues that entity, and "might" is outside of any entity mention. For more details see: http://nltk.org/book/ch07.html
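To make the BIO scheme concrete, here is a minimal Python sketch that turns one line of tagged output back into entity phrases. The helper names parse_tagged_line and bio_to_entities are our own and are not part of twitter_nlp. The sketch assumes the single-tag word/TAG format shown above and treats an I- tag with no preceding B- tag as outside any entity.

```python
def parse_tagged_line(line):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the last slash so a word that itself contains '/' stays intact.
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


def bio_to_entities(pairs):
    """Group B-/I- tagged words into (entity_text, entity_type) tuples."""
    entities = []
    current_words, current_type = [], None
    for word, tag in pairs:
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            # An I- tag of the same type continues the open entity.
            current_words.append(word)
        else:
            # 'O' (or a stray I- tag) ends the open entity, if there is one.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


line = ("The/B-movie Town/I-movie might/O be/O one/O of/O the/O best/O "
        "movies/O I/O have/O seen/O all/O year/O ./O")
entities = bio_to_entities(parse_tagged_line(line))
print(entities)
```

For the example line this yields a single entity, "The Town", of type movie.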

Other options are available to get more specific and better results.

# run the example data set with the classified entity extraction
# --classify classifies the entities into categories
cat test.1k.txt | python python/ner/extractEntities2.py --classify > testresults_classify.txt

# run the example data set with the classified entity extraction plus part-of-speech tags
# --pos tags every word with its part of speech and improves the classification results
cat test.1k.txt | python python/ner/extractEntities2.py --classify --pos > testresults_classifypos.txt

# run the example data set with the classified entity extraction and recognize event phrases
# --event marks potential event phrases
cat test.1k.txt | python python/ner/extractEntities2.py --classify --pos --event > testresults_events.txt
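When several of these runs have to be repeated, the shell pipelines can be wrapped in a small Python helper. The following is only a sketch: build_command and run_extractor are hypothetical names of our own, the script is assumed to be started from the repository root, and the event flag is spelled --event as in the analysis headings on this page, so adjust it if your checkout expects a different spelling.

```python
import subprocess

# Script path as used in the shell commands on this page.
EXTRACTOR = "python/ner/extractEntities2.py"


def build_command(classify=False, pos=False, event=False):
    """Return the argument list for one extractor run (hypothetical helper)."""
    cmd = ["python", EXTRACTOR]
    if classify:
        cmd.append("--classify")
    if pos:
        cmd.append("--pos")
    if event:
        # Assumed spelling of the event flag; see the note above.
        cmd.append("--event")
    return cmd


def run_extractor(input_path, output_path, **flags):
    """Feed input_path to the extractor on stdin and write its stdout to output_path."""
    with open(input_path, "rb") as src, open(output_path, "wb") as dst:
        # Equivalent to: cat input_path | python ... > output_path
        subprocess.check_call(build_command(**flags), stdin=src, stdout=dst)
```

For example, run_extractor("test.1k.txt", "testresults_classifypos.txt", classify=True, pos=True) would reproduce the second command above.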

Our Analysis

Here is an example of the results we received with one of our sample tweets:
Not sure what is open today? The Lair is open 9am-7pm, Iggy's from 11am-1am, C-Lion Levy 11am-2am, and C-Lion Del Rey 10am- 1:30am!

Entity Extraction

Not/O sure/O what/O is/O open/O today/O ?/O The/B-ENTITY Lair/I-ENTITY is/O open/O 9am-7pm/O ,/O Iggy/B-ENTITY 's/O from/O 11am-1am/O ,/O C-Lion/B-ENTITY Levy/I-ENTITY 11am-2am/O ,/O and/O C-Lion/B-ENTITY Del/I-ENTITY Rey/I-ENTITY 10am-/O 1:30/O am/O !/O

Entity Extraction with --classify --pos

Not/O/RB sure/O/JJ what/O/WP is/O/VBZ open/O/RB today/O/NN ?/O/. The/B-geo-loc/DT Lair/I-geo-loc/NNP is/O/VBZ open/O/RP 9am-7pm/O/NN ,/O/, Iggy/B-other/NNP 's/O/POS from/O/IN 11am-1am/O/CD ,/O/, C-Lion/O/NNP Levy/O/NNP 11am-2am/O/CD ,/O/, and/O/CC C-Lion/B-facility/NNP Del/I-facility/NNP Rey/I-facility/NNP 10am-/O/CD 1:30/O/CD am/O/RB !/O/.

Entity Extraction with --classify --pos --event

Not/O/RB/O sure/O/JJ/O what/O/WP/O is/O/VBZ/O open/O/RB/O today/O/NN/O ?/O/./O The/B-product/DT/O Lair/I-product/NNP/O is/O/VBZ/B-EVENT open/O/RP/I-EVENT 9am-7pm/O/NN/O ,/O/,/O Iggy/B-other/NNP/O 's/O/POS/O from/O/IN/O 11am-1am/O/CD/O ,/O/,/O C-Lion/O/NNP/O Levy/O/NNP/O 11am-2am/O/CD/O ,/O/,/O and/O/CC/O C-Lion/B-other/NNP/O Del/I-other/NNP/O Rey/I-other/NNP/O 10am-/O/CD/O 1:30/O/CD/O am/O/RB/O !/O/./O

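The entity extraction is rather unpredictable. It picks out the event phrase "is open" and the major entities The Lair, Iggy, and C-Lion Del Rey. However, once --classify is enabled it no longer picks up C-Lion Levy, which the plain run had tagged as an entity. The categorization is even less predictable: The Lair is labeled geo-loc in one run and product in the next, and C-Lion Del Rey changes from facility to other.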
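The differences between the three runs can also be checked mechanically. The sketch below uses excerpts of the outputs above (the leading question and the trailing times are omitted) and a helper of our own, entity_spans, which is not part of twitter_nlp. It assumes the number of tag fields per token is known for each run: 1 for the plain run, 2 with --classify --pos, and 3 once event tags are added.

```python
def entity_spans(tagged_line, n_layers):
    """Map each entity's text to its type, reading only the first tag layer."""
    spans = {}
    words, etype = [], None
    for token in tagged_line.split():
        # Split from the right so punctuation tokens such as ',/O/,' parse correctly.
        fields = token.rsplit("/", n_layers)
        word, tag = fields[0], fields[1]
        if tag.startswith("B-"):
            if words:
                spans[" ".join(words)] = etype
            words, etype = [word], tag[2:]
        elif tag.startswith("I-") and words:
            words.append(word)
        else:
            if words:
                spans[" ".join(words)] = etype
            words, etype = [], None
    if words:
        spans[" ".join(words)] = etype
    return spans


plain = (
    "The/B-ENTITY Lair/I-ENTITY is/O open/O 9am-7pm/O ,/O Iggy/B-ENTITY 's/O "
    "from/O 11am-1am/O ,/O C-Lion/B-ENTITY Levy/I-ENTITY 11am-2am/O ,/O and/O "
    "C-Lion/B-ENTITY Del/I-ENTITY Rey/I-ENTITY"
)
classified = (
    "The/B-geo-loc/DT Lair/I-geo-loc/NNP is/O/VBZ open/O/RP 9am-7pm/O/NN ,/O/, "
    "Iggy/B-other/NNP 's/O/POS from/O/IN 11am-1am/O/CD ,/O/, C-Lion/O/NNP "
    "Levy/O/NNP 11am-2am/O/CD ,/O/, and/O/CC C-Lion/B-facility/NNP "
    "Del/I-facility/NNP Rey/I-facility/NNP"
)
with_events = (
    "The/B-product/DT/O Lair/I-product/NNP/O is/O/VBZ/B-EVENT open/O/RP/I-EVENT "
    "9am-7pm/O/NN/O ,/O/,/O Iggy/B-other/NNP/O 's/O/POS/O from/O/IN/O "
    "11am-1am/O/CD/O ,/O/,/O C-Lion/O/NNP/O Levy/O/NNP/O 11am-2am/O/CD/O ,/O/,/O "
    "and/O/CC/O C-Lion/B-other/NNP/O Del/I-other/NNP/O Rey/I-other/NNP/O"
)

plain_spans = entity_spans(plain, 1)
classified_spans = entity_spans(classified, 2)
event_spans = entity_spans(with_events, 3)

# Entities found by the plain run but lost once --classify is enabled.
missing = sorted(set(plain_spans) - set(classified_spans))
# Entities whose category changes between the --classify --pos run and the event run.
relabeled = {
    name: (etype, event_spans.get(name))
    for name, etype in classified_spans.items()
    if event_spans.get(name) != etype
}
print(missing)
print(relabeled)
```

On these excerpts the comparison reports C-Lion Levy as dropped by the classified run, and The Lair and C-Lion Del Rey as relabeled between the two classified runs.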

While twitter-nlp provides good entity extraction tailored to tweets, it does not provide temporal extraction. In his paper, Alan Ritter suggests using TempEx for this, a temporal expression tagger commonly applied to news articles.

The group decided that while twitter-nlp gave some promising results, it was not exactly what we were looking for. Its results were unpredictable and it would be difficult to integrate with our current design. We may still use it if we decide to grab entities from the texts as keywords, but for now we are looking into training our own document classifier to assign tag categories to event posts.