Natural Language Processing - lmucs/grapevine GitHub Wiki
We ended up using Chrono-Node, Python NLTK, and Scikit-Learn to build our natural language processing pipeline that extracts date/time information and labels events into categories:
Date Extraction
- Chrono-Node: extracts dates and times, allowing us to decipher which posts are events
Tag Classification
- Python NLTK: preprocesses and tokenizes text
- Scikit-Learn: converts text to features and builds a multi-label classifier
Of course, before building the pipeline we did a lot of research and testing to figure out the best tools for our use case.
(info from opensource-nlp-tools)
- Stanford's Core NLP Suite - A GPL-licensed framework of tools for processing English, Chinese, and Spanish. Includes tools for tokenization (splitting of text into words), part of speech tagging, grammar parsing (identifying things like noun and verb phrases), named entity recognition, and more. Once you've got the basics, be sure to check out the other projects from the same group at Stanford.
- Natural Language Toolkit - If your language of choice is Python, then look no further than NLTK for many of your NLP needs. Similar to the Stanford library, it includes capabilities for tokenizing, parsing, and identifying named entities as well as many more features.
- Twitter-NLP for Events - Users of Social Networking sites frequently discuss events which will occur in the near future. By annotating Named Entities and resolving temporal expressions (for example "next Friday"), we are able to automatically extract a calendar of popular events occurring in the near future from Twitter. Check out Demo
- Apache Lucene and Solr - While not technically targeted at solving NLP problems, Lucene and Solr contain a powerful number of tools for working with text ranging from advanced string manipulation utilities to powerful and flexible tokenization libraries to blazing fast libraries for working with finite state automatons. On top of it all, you get a search engine for free!
- Apache OpenNLP - Using a different underlying approach than Stanford's library, the OpenNLP project is an Apache-licensed suite of tools to do tasks like tokenization, part of speech tagging, parsing, and named entity recognition. While not necessarily state of the art anymore in its approach, it remains a solid choice that is easy to get up and running.
- GATE and Apache UIMA - As your processing capabilities evolve, you may find yourself building complex NLP workflows which need to integrate several different processing steps. In these cases, you may want to work with a framework like GATE or UIMA that standardizes and abstracts much of the repetitive work that goes into building a complex NLP application.
- Free book: Natural Language Processing with Python by Steven Bird, Ewan Klein, and Edward Loper is the definitive guide for NLTK, walking users through tasks like classification, information extraction and more.
- Written by Drew Farris, Tom Morton, and Grant Ingersoll, Taming Text is aimed at programmers getting started in NLP and Search. Each chapter explains the concepts behind functionality like search, named entity recognition, clustering, and classification. Each chapter also shows working examples using well-known open source projects.
- If academic rigor is what you are looking for, Christopher Manning and Hinrich Schütze's Foundations of Statistical Natural Language Processing is a great place to start. It not only explains the concepts behind many of the techniques of NLP, but provides the math to back it up.
We're unlikely to find a single API that extracts all the relevant information from the data. Instead, we might chain several specialized APIs into a pipeline. For example, we could run text through dateutil, a Python library that extracts dates, and then pass it to MeaningCloud, which produces tags such as food or sports. Combining tools should let us extract more data, although the tools may disagree with each other about what they extract.
dateutil is a Python library that extracts dates from strings. Usage:
from dateutil.parser import parse

# fuzzy=True tells the parser to skip tokens it does not recognize
date = parse("String with some date like Wednesday Sep 16 at 5pm", fuzzy=True)
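This library seems promising for finding start times. End times are a different story: in the two examples below the end of each range does not come back as a time. Judging by the tzoffset values (-18000 s is -5 h, -36000 s is -10 h), the parser appears to read "-5pm" and "-10:00" as UTC offsets.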
>>> date1 = parse("The pool @BurnsRecCenter is open tomorrow, 8/30 from 3pm-5pm for #LMU19! Sign the waiver at Family Fest!", fuzzy=True)
>>> date1
datetime.datetime(2015, 8, 30, 15, 0, tzinfo=tzoffset(None, -18000)) # Start time was captured
>>> date2 = parse("The Black Heritage Poetry Slam is this Thursday from 7:00-10:00 PM in the Living Room! @LMUEIS", fuzzy=True)
>>> date2
datetime.datetime(2015, 9, 17, 19, 0, tzinfo=tzoffset(None, -36000))
If no date is found, it simply returns the current date, with the time of day set to midnight.
>>> date = parse("Win a Chipotle gift card! Follow LMU Seaver on Instagram, post your pic of the new building and tag it #lmuseaver!", fuzzy=True)
>>> date
datetime.datetime(2015, 9, 17, 0, 0)
Unfortunately, if a number is not followed by an am or pm, it will not be recognized as a time. We might be able to fix this ourselves.
>>> date = parse("#GAMEDAY at Sullivan today at 4! Come out and support for our first home game of the season against Bucknell!", fuzzy=True)
>>> date
datetime.datetime(2015, 9, 17, 0, 0) # Today was captured, 4pm was not.
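One way we might fix this ourselves is to normalize the text before it reaches the parser. The sketch below is only an illustration: `add_meridiem` is a hypothetical helper, and the rule "a bare hour after 'at' means pm" is an assumption that suits afternoon and evening events but will be wrong for morning ones.

```python
import re

# "at 4" or "at 4:30" with no am/pm after it. The second lookahead stops the
# regex from matching only part of a longer number or time.
BARE_HOUR = re.compile(
    r"\bat (\d{1,2})(?::\d{2})?(?!\s*[ap]\.?m\b)(?![\d:])",
    re.IGNORECASE,
)


def add_meridiem(text):
    """Append "pm" to bare "at N" times (assumed heuristic, see lead-in)."""
    def fix(match):
        hour = int(match.group(1))
        if not 1 <= hour <= 12:
            return match.group(0)  # leave 24-hour style or invalid hours alone
        return match.group(0) + "pm"

    return BARE_HOUR.sub(fix, text)


print(add_meridiem("#GAMEDAY at Sullivan today at 4! Come out and support"))
```

The rewritten string can then be passed to `parse(..., fuzzy=True)` as before.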
In this example, the string cannot be parsed because the 12 is not followed by an am or pm.
>>> mydate = parse("SFTV students! Head over to the lawn in front of Comm Arts for today's learn about the various organizations, programs, events, and get to know your fellow students! 12-1:30pm",fuzzy=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "dateutil/parser.py", line 1008, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "dateutil/parser.py", line 404, in parse
    ret = default.replace(**repl)
ValueError: hour must be in 0..23
# 12 has been changed to 12pm
>>> mydate = parse("SFTV students! Head over to the lawn in front of Comm Arts for today's learn about the various organizations, programs, events, and get to know your fellow students! 12pm-1:30pm",fuzzy=True)
>>> mydate
datetime.datetime(2015, 9, 17, 12, 0, tzinfo=tzoffset(None, -5400))
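The manual change from "12" to "12pm" could be automated with a small preprocessing step. This is a sketch under assumptions, not something the library provides: `copy_meridiem` is a hypothetical helper that gives the start of a range the am/pm of its end, and flips it when the start is later on the clock than the end (so "11-1pm" becomes "11am-1pm").

```python
import re

# "<start>-<end><am|pm>" where only the end carries a meridiem.
TIME_RANGE = re.compile(
    r"\b(\d{1,2}(?::\d{2})?)\s*-\s*(\d{1,2}(?::\d{2})?)\s*([ap]m)\b",
    re.IGNORECASE,
)


def copy_meridiem(text):
    """Add an explicit am/pm to the start of ranges such as "12-1:30pm"."""
    def fix(match):
        start, end = match.group(1), match.group(2)
        meridiem = match.group(3).lower()
        start_hour = int(start.split(":")[0]) % 12  # 12 counts as 0
        end_hour = int(end.split(":")[0]) % 12
        start_meridiem = meridiem
        if start_hour > end_hour:
            # The range crosses noon or midnight, e.g. 11-1pm or 10-2am.
            start_meridiem = "am" if meridiem == "pm" else "pm"
        return f"{start}{start_meridiem}-{end}{meridiem}"

    return TIME_RANGE.sub(fix, text)


print(copy_meridiem("get to know your fellow students! 12-1:30pm"))
```

Ranges that already have a meridiem on both sides, such as "3pm-5pm", are left untouched.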
This library cannot recognize times in the form of "x mins from now".
>>> date = parse("Lions in black uniforms for first time. Ever. Tipoff in 22 min", fuzzy=True)
>>> date
datetime.datetime(2015, 9, 22, 0, 0)
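Relative expressions like this could be handled separately from the parser. The sketch below assumes we know when the post was published (the `posted_at` argument, which is our assumption and not something dateutil supplies) and only covers "in N min" and "in N hours" style phrases; `relative_time` is a hypothetical helper.

```python
import re
from datetime import datetime, timedelta

RELATIVE = re.compile(
    r"\bin (\d+)\s*(min|mins|minute|minutes|hr|hrs|hour|hours)\b",
    re.IGNORECASE,
)


def relative_time(text, posted_at):
    """Return posted_at shifted by an "in N min/hours" phrase, or None."""
    match = RELATIVE.search(text)
    if match is None:
        return None
    amount = int(match.group(1))
    if match.group(2).lower().startswith("h"):
        return posted_at + timedelta(hours=amount)
    return posted_at + timedelta(minutes=amount)


posted = datetime(2015, 9, 22, 18, 0)  # example timestamp, made up
print(relative_time("Lions in black uniforms. Tipoff in 22 min", posted))
```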
It seems like hashtags and screen names cause trouble for the parser:
>>> date = parse("Lions in black uniforms for first time. Ever. #GAMEDAY. Tipoff in 22 min on @kxlu889 #LMULions #WCCHoops", fuzzy=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "dateutil/parser.py", line 1008, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "dateutil/parser.py", line 404, in parse
    ret = default.replace(**repl)
ValueError: month must be in 1..12
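If hashtags and screen names really are the cause (we have not confirmed which token triggers the error), one cheap workaround would be to strip them before parsing. A minimal sketch, with `strip_social_tokens` as a hypothetical helper:

```python
import re

MENTION_OR_HASHTAG = re.compile(r"[@#]\w+")


def strip_social_tokens(text):
    """Remove @mentions and #hashtags, then collapse leftover whitespace."""
    cleaned = MENTION_OR_HASHTAG.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


tweet = ("Lions in black uniforms for first time. Ever. #GAMEDAY. "
         "Tipoff in 22 min on @kxlu889 #LMULions #WCCHoops")
print(strip_social_tokens(tweet))
```

The trade-off is that a hashtag sometimes carries useful information (for example #GAMEDAY), so the original text should be kept for the tagging step.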
Unfortunately, if the string contains multiple dates, the parser only keeps the last one:
>>> mydate = parse("tonight at 7pm and tomorrow at 10am and Wednesday Sept 16, 2015 at 2pm", fuzzy=True)
>>> mydate
datetime.datetime(2015, 9, 16, 14, 0)
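To get at every time in a post, we would have to scan the text ourselves. The sketch below is a partial illustration only: it collects explicit am/pm clock times with a regular expression and ignores the dates they belong to. `find_clock_times` is a hypothetical helper, not part of dateutil.

```python
import re
from datetime import time

CLOCK = re.compile(r"\b(\d{1,2})(?::(\d{2}))?\s*([ap]m)\b", re.IGNORECASE)


def find_clock_times(text):
    """Return every explicit am/pm time in text, in order of appearance."""
    found = []
    for match in CLOCK.finditer(text):
        hour = int(match.group(1))
        minute = int(match.group(2) or 0)
        if not (1 <= hour <= 12 and minute < 60):
            continue  # not a valid 12-hour clock time
        hour %= 12  # 12am -> 0, 12pm -> 0 (+12 below)
        if match.group(3).lower() == "pm":
            hour += 12
        found.append(time(hour, minute))
    return found


text = "tonight at 7pm and tomorrow at 10am and Wednesday Sept 16, 2015 at 2pm"
print(find_clock_times(text))
```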
Pros:
- Can extract dates from strings
- Can find start times
Cons:
- Only finds one date at a time
- Times are only recognized when they carry an explicit am or pm suffix
Chrono is a natural language date parser in JavaScript, designed for extracting date information from any given text. Source Code
Setup:
Node.js
npm install chrono-node
Bower
npm install bower
bower install chrono
Usage: Simply pass a string to the function chrono.parseDate or chrono.parse
> var chrono = require('chrono-node')
> chrono.parseDate('An appointment on Sep 12-13')
Fri Sep 12 2014 12:00:00 GMT-0500 (CDT)
> chrono.parse('An appointment on Sep 12-13');
[ { index: 18,
text: 'Sep 12-13',
tags: { ENMonthNameMiddleEndianParser: true },
start:
{ knownValues: [Object],
impliedValues: [Object] },
end:
{ knownValues: [Object],
impliedValues: [Object] } } ]
Examples:
> var results = chrono.parse('ServeLA is Sept. 19 from 7:30 am-2:00 pm! Sign up for the free event here: http://bit.ly/1JXtzu7 @csa_lmu #ServeLA')
> results
[ { ref: Tue Sep 15 2015 16:48:06 GMT-0700 (PDT),
index: 11,
text: 'Sept. 19 from 7:30 am-2:00 pm',
tags:
{ ENMonthNameMiddleEndianParser: true,
ENTimeExpressionParser: true,
ENMergeDateAndTimeRefiner: true },
start: { knownValues: [Object], impliedValues: [Object] },
end: { knownValues: [Object], impliedValues: [Object] } } ]
> results[0].start.date()
Sat Sep 19 2015 07:30:00 GMT-0700 (PDT)
> results[0].end.date()
Sat Sep 19 2015 14:00:00 GMT-0700 (PDT)
> var results = chrono.parse('Not sure what is open today? The Lair is open 9am-7pm, Iggys from 11am-1am, C-Lion Levy 11am-2am, and C-Lion Del Rey 10am- 1:30am!')
> results
[ { ref: Tue Sep 15 2015 17:01:01 GMT-0700 (PDT),
index: 22,
text: 'today',
tags: { ENCasualDateParser: true },
start: { knownValues: [Object], impliedValues: [Object] } },
{ ref: Tue Sep 15 2015 17:01:01 GMT-0700 (PDT),
index: 46,
text: '9am-7pm',
tags: { ENTimeExpressionParser: true },
start: { knownValues: [Object], impliedValues: [Object] },
end: { knownValues: [Object], impliedValues: [Object] } },
{ ref: Tue Sep 15 2015 17:01:01 GMT-0700 (PDT),
index: 61,
text: 'from 11am-1am',
tags: { ENTimeExpressionParser: true },
start: { knownValues: [Object], impliedValues: [Object] },
end: { knownValues: [Object], impliedValues: [Object] } },
{ ref: Tue Sep 15 2015 17:01:01 GMT-0700 (PDT),
index: 88,
text: '11am-2am',
tags: { ENTimeExpressionParser: true },
start: { knownValues: [Object], impliedValues: [Object] },
end: { knownValues: [Object], impliedValues: [Object] } },
{ ref: Tue Sep 15 2015 17:01:01 GMT-0700 (PDT),
index: 117,
text: '10am- 1:30am',
tags: { ENTimeExpressionParser: true },
start: { knownValues: [Object], impliedValues: [Object] },
end: { knownValues: [Object], impliedValues: [Object] } } ]
> results[2].start.date()
Tue Sep 15 2015 11:00:00 GMT-0700 (PDT)
> results[2].end.date()
Tue Sep 15 2015 01:00:00 GMT-0700 (PDT)
Pros:
- Can recognize multiple events in one string
- Can be customized to recognize specific events (example can be found in the project's wiki)
- Parsed results can be refined further after extraction
Cons:
- Treats the word 'today' on its own as a separate result (easily filtered out)
- Doesn't treat an end time that runs past midnight as belonging to the next day. For example, Iggys is open from 11am-1am, but chrono puts the 1am end time on the same day as the start
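The overnight problem looks straightforward to correct after parsing. A minimal sketch, assuming the start and end values from chrono have already been converted to Python datetimes on our side; `fix_overnight_end` is a hypothetical helper, and the rule "an end that is not after the start belongs to the next day" is our assumption.

```python
from datetime import datetime, timedelta


def fix_overnight_end(start, end):
    """Move end to the following day when it is not after start."""
    if end <= start:
        end += timedelta(days=1)
    return end


# Values taken from the Iggys example above: 11am-1am on Sep 15, 2015.
start = datetime(2015, 9, 15, 11, 0)
end = datetime(2015, 9, 15, 1, 0)
print(fix_overnight_end(start, end))
```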