Integrating Stanford NLP - Texera/texera GitHub Wiki

Author(s): Feng Hong, Yang Jiao

## Synopsis

The Stanford NLP package is a powerful Java library for natural language processing. The goal is to integrate some of its features into Texera as an operator that lets users extract named entities or parts of speech.

## Status

As of 6/13/2016: FINISHED

## Modules

edu.uci.ics.texera.dataflow.nlpextractor

## Stanford NLP package

Stanford NLP is a set of natural language analysis tools written in Java. Given raw human-language text, the tools output the base forms of words, their parts of speech, and whether they are names of companies, people, locations, etc. The package includes a POS tagger, a syntactic parser, and a named entity recognizer. Its analyses provide the foundational building blocks for higher-level and domain-specific text-understanding applications.

The purpose of this project is to implement Stanford NLP as an extractor in Texera. Users specify an NLP constant that selects what to extract, chosen from 8 Named Entity classes (Number, Location, Person, Organization, Money, Percent, Date, Time) and 4 Part of Speech types (Adjective, Adverb, Noun, Verb).
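As a rough illustration of how the four Part of Speech types could be derived, the sketch below maps Penn Treebank tags (the tag set used by the Stanford English POS tagger) to coarse categories by tag prefix. The names `PosCategoryMapper` and `PosCategory` are hypothetical, and this is an assumption about one reasonable mapping, not necessarily how the Texera operator does it.

```java
import java.util.Optional;

// Hypothetical helper for illustration; not Texera's actual class.
final class PosCategoryMapper {
    private PosCategoryMapper() {}

    /** Maps a Penn Treebank tag (e.g. "NNS", "VBD") to a coarse category, if any. */
    static Optional<PosCategory> fromPennTag(String tag) {
        if (tag == null) return Optional.empty();
        if (tag.startsWith("JJ")) return Optional.of(PosCategory.ADJECTIVE); // JJ, JJR, JJS
        if (tag.startsWith("RB")) return Optional.of(PosCategory.ADVERB);    // RB, RBR, RBS
        if (tag.startsWith("NN")) return Optional.of(PosCategory.NOUN);      // NN, NNS, NNP, NNPS
        if (tag.startsWith("VB")) return Optional.of(PosCategory.VERB);      // VB, VBD, VBG, VBN, VBP, VBZ
        return Optional.empty(); // determiners, prepositions, punctuation, ...
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"JJ", "RBR", "NNP", "VBZ", "DT"}) {
            System.out.println(tag + " -> " + fromPennTag(tag));
        }
    }
}

// Hypothetical enum covering the four Part of Speech types listed above.
enum PosCategory { ADJECTIVE, ADVERB, NOUN, VERB }
```

Matching on the two-letter prefix keeps the mapping short, at the cost of ignoring finer distinctions such as singular versus plural nouns.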

Common uses of the Stanford NLP package:

- Named Entity Recognition: for example, names (PERSON, LOCATION, ORGANIZATION, MISC), numerical values (MONEY, NUMBER, ORDINAL, PERCENT), and temporal expressions (DATE, TIME, DURATION, SET).
- Lemmatization: reduce a word to its base form.
- Part-of-Speech tagging: determine whether a word is a noun, verb, adjective, etc.

## Presentation Slides

4/11/2016 Presentation: Project Overview

4/18/2016 Presentation: Stanford NLP introduction

4/25/2016 Presentation: [Status Report](https://docs.google.com/presentation/d/1ek18Zr0OqQ0RONj8D7W2aSGs9sz1etnf9bEnWTEA2ag/edit?usp=sharing)

## Performance Test

Machine configuration: MacBook Pro (Late-2015), Intel Core i5, SSD hard drive, 8 GB memory.

Data set: 100k MEDLINE records, about 150 MB.

Performance results (average time):

| | All Named Entities | Part of Speech |
|---|---|---|
| NlpExtractor | 2937 s | 209 s |

On average: 34 docs/sec for Named Entity Recognition and 480 docs/sec for Part of Speech tagging.

Data set: 1M MEDLINE records, about 1.5 GB.

| | All Named Entities | Part of Speech |
|---|---|---|
| NlpExtractor | Too slow | 2110 s |

On average, about 470 docs/sec for Part of Speech tagging (1,000,000 records in 2110 s). Named Entity Recognition is too slow at this scale.
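
The throughput figures follow directly from the record counts and the measured times. A minimal sketch of that arithmetic, using a hypothetical `Throughput` helper (the class and method names are illustrative only):

```java
// Hypothetical helper for illustration: documents processed per second.
final class Throughput {
    private Throughput() {}

    static double docsPerSecond(long documents, double seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        return documents / seconds;
    }

    public static void main(String[] args) {
        // Record counts and times are taken from the tables in this section.
        System.out.println("NER, 100k records: " + docsPerSecond(100_000, 2937));
        System.out.println("POS, 100k records: " + docsPerSecond(100_000, 209));
        System.out.println("POS, 1M records:   " + docsPerSecond(1_000_000, 2110));
    }
}
```

The Part of Speech rate is nearly the same for both data sets, which suggests the running time grows roughly linearly with the number of records.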

## TODOs

According to the performance test, Named Entity extraction is very slow, and future optimization is needed. One possible reason is that MEDLINE records have many fields and we use the NLP package to process one field at a time. If a record has 10 fields and we want to extract information from all of them, we need to build 10 NLP pipelines to process them, which takes a lot of time. One possible improvement is to concatenate the fields into a single string and build only one pipeline to process it.
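
The concatenation idea could be sketched as follows. The sketch assumes the NLP step reports character offsets for each match, so a match found in the joined text can be mapped back to the field it came from. `JoinedFields` and its methods are hypothetical names, the blank-line separator is an untested assumption, and whether this actually removes the bottleneck would have to be measured.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Texera or Stanford NLP.
final class JoinedFields {
    // Blank line between fields; assumed (not verified) to keep the NLP
    // tools from treating text in two adjacent fields as one sentence.
    static final String SEPARATOR = "\n\n";

    final String text;          // all fields joined into one string
    private final int[] starts; // starts[i] = offset where field i begins
    private final int[] ends;   // ends[i] = offset just past the end of field i

    private JoinedFields(String text, int[] starts, int[] ends) {
        this.text = text;
        this.starts = starts;
        this.ends = ends;
    }

    /** Joins the fields and records where each one sits in the joined text. */
    static JoinedFields of(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        int[] starts = new int[fields.size()];
        int[] ends = new int[fields.size()];
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) sb.append(SEPARATOR);
            starts[i] = sb.length();
            sb.append(fields.get(i));
            ends[i] = sb.length();
        }
        return new JoinedFields(sb.toString(), starts, ends);
    }

    /** Returns the index of the field containing the offset, or -1 for a separator. */
    int fieldAt(int offset) {
        for (int i = 0; i < starts.length; i++) {
            if (offset >= starts[i] && offset < ends[i]) return i;
        }
        return -1;
    }

    /** Converts an offset in the joined text to an offset within the given field. */
    int localOffset(int fieldIndex, int offset) {
        return offset - starts[fieldIndex];
    }

    public static void main(String[] args) {
        JoinedFields joined = JoinedFields.of(Arrays.asList("Aspirin reduces fever.", "Boston, MA"));
        int hit = joined.text.indexOf("Boston"); // stand-in for an offset reported by the NLP step
        int field = joined.fieldAt(hit);
        // prints: field 1, local offset 0
        System.out.println("field " + field + ", local offset " + joined.localOffset(field, hit));
    }
}
```

With this layout the NLP package would be invoked once per record on `text`, and each reported span would be translated back with `fieldAt` and `localOffset`.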