Evaluating OpenNLP

Author: Hailey Pan

Reviewed by: Chen Li

##Example Code

sandbox/src/main/java/edu/uci/ics/texera/sandbox/OpenNLPexample
https://github.com/Texera/texera/pull/308

##Introduction

The Apache OpenNLP library (https://opennlp.apache.org/) is a machine-learning-based toolkit for processing natural language text.

Both Stanford CoreNLP and OpenNLP require tokenization before any extraction. Stanford CoreNLP can extract every requested tag type through a single annotator configuration (Figure 1). OpenNLP, by contrast, has a separately trained model for each specific kind of tag: one model for POS tags, one for location NER tags, one for person NER tags, and so on.

Figure 1: Stanford NLP Sample Code: annotation property setup
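As a rough illustration of the kind of annotator property setup Figure 1 refers to, the sketch below builds the `Properties` object that a CoreNLP pipeline reads. It is not the code from the figure: the class name `AnnotatorSetupSketch` and the particular annotator list are assumptions chosen for illustration, and the block uses only the JDK so it runs without the CoreNLP jars.

```java
import java.util.Properties;

// Sketch only: builds the configuration a CoreNLP pipeline would read.
// The class name and the chosen annotator list are illustrative assumptions.
final class AnnotatorSetupSketch {

    // The "annotators" key lists the annotators to run, in order.
    static Properties buildProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProps();
        // With the CoreNLP jars on the classpath, this object would be passed
        // to new StanfordCoreNLP(props) to create the pipeline.
        System.out.println(props.getProperty("annotators"));
    }
}
```

One configuration covers tokenization, POS tagging, and NER together, which is the contrast with OpenNLP's one-model-per-tag design described above.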

In terms of named-entity recognition, Stanford CoreNLP works better on general-purpose text. OpenNLP might be a better choice when one wants to extract information from text using custom models trained on one's own corpus.

##Models

Available models can be found at http://opennlp.sourceforge.net/models-1.5/. For instance, the "Part of Speech Tagger" marks each token with its corresponding word type based on the token itself and its context. A token might have multiple possible POS tags, so the tagger uses a probability model to predict the correct one. To limit the number of possible tags for a token, a tag dictionary can be used, which improves both the tagging accuracy and the runtime performance of the tagger.
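The effect of a tag dictionary can be shown with a toy example. The sketch below is not OpenNLP's implementation: the class `TagDictionarySketch`, the method `bestTag`, and the scores are all hypothetical. It only demonstrates the idea that restricting the candidate tags for a token leaves fewer options to score and can change which tag wins.

```java
import java.util.Map;
import java.util.Set;

// Toy illustration of a tag dictionary; names and scores are hypothetical.
final class TagDictionarySketch {

    // Returns the highest-scoring tag, considering only the tags the
    // dictionary allows for this token. Tokens that are missing from the
    // dictionary keep all candidate tags.
    static String bestTag(String token,
                          Map<String, Double> scores,
                          Map<String, Set<String>> tagDictionary) {
        Set<String> allowed = tagDictionary.get(token);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (allowed != null && !allowed.contains(e.getKey())) {
                continue; // tag ruled out by the dictionary
            }
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up model scores for the token "run".
        Map<String, Double> scores = Map.of("JJ", 0.45, "NN", 0.30, "VB", 0.25);
        Map<String, Set<String>> dict = Map.of("run", Set.of("NN", "VB"));

        System.out.println(bestTag("run", scores, Map.of())); // no dictionary entry
        System.out.println(bestTag("run", scores, dict));     // restricted to NN, VB
    }
}
```

Without the dictionary entry the highest raw score (`JJ`) wins; with it, only `NN` and `VB` are considered, so `NN` is chosen.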

##Performance & Accuracy

Dataset: abstract_100.txt

Machine: 2016 MacBook Pro with 8GB RAM and 256GB SSD

Runtime (tokenizing runtime included)

| | OpenNLP | Stanford CoreNLP |
| --- | --- | --- |
| POS | 11.65s | 2.69s |
| NER | 11.26s | 18.04s |

Stanford CoreNLP is much faster than OpenNLP at POS tagging (2.69s vs. 11.65s). OpenNLP is faster than Stanford CoreNLP at NER tagging (11.26s vs. 18.04s), but the comparison is not like-for-like: Stanford CoreNLP extracts all NER tags, while OpenNLP extracts only location tags.
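A measurement of this kind can be sketched as follows. This is not the harness used for the numbers above: `RuntimeSketch` and `timeSeconds` are hypothetical names, and a whitespace split stands in for the real tokenizer and tagger. The point is only that tokenization sits inside the timed region, so it counts toward the reported runtime.

```java
// Minimal timing sketch; the workload is a placeholder, not a real tagger.
final class RuntimeSketch {

    // Runs the task once and returns the elapsed wall-clock time in seconds.
    static double timeSeconds(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1e9;
    }

    public static void main(String[] args) {
        String text = "OpenNLP and Stanford CoreNLP were run on the same abstracts .";
        double seconds = timeSeconds(() -> {
            // Tokenization happens inside the timed region, so its cost is
            // included in the total, as in the table above.
            String[] tokens = text.split("\\s+");
            int processed = 0;
            for (String token : tokens) {
                if (!token.isEmpty()) {
                    processed++; // a real run would tag the token here
                }
            }
        });
        System.out.printf("%.4fs%n", seconds);
    }
}
```

A single run like this is sensitive to JVM warm-up and model loading, so repeated runs would give steadier numbers.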

POS Tag Results

Most irrelevant results (e.g., punctuation) are eliminated for both packages.

| OpenNLP | Stanford CoreNLP |
| --- | --- |
| 26,360 results | 25,919 results |

The results produced by the two tools are very similar. The difference in the number of results is likely due to their tokenizers: the Stanford CoreNLP tokenizer handles punctuation better. For example, OpenNLP treats [.”] as a single token and tags it as a coordinating conjunction ("CC"), whereas Stanford CoreNLP does not tag it. This is the main reason that OpenNLP produces more results than Stanford CoreNLP. Stanford CoreNLP may therefore be more accurate than OpenNLP because of its more accurate tokenization.
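One way to keep tokens such as [.”] out of the counts is to drop every token that contains no letter or digit before comparing results. The sketch below assumes that rule; the filtering actually applied in this evaluation is not specified, and `PunctuationFilterSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of a punctuation filter; the rule itself is an assumption.
final class PunctuationFilterSketch {

    // Matches non-empty tokens made up entirely of characters that are
    // neither Unicode letters nor Unicode digits (covers curly quotes too).
    private static final Pattern NO_LETTER_OR_DIGIT =
            Pattern.compile("[^\\p{L}\\p{N}]+");

    // Returns the tokens that contain at least one letter or digit.
    static List<String> dropPunctuationTokens(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!NO_LETTER_OR_DIGIT.matcher(token).matches()) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("New", "York", ".\u201D", "and", ",", "N.Y.");
        System.out.println(dropPunctuationTokens(tokens));
    }
}
```

Tokens like “N.Y.” survive because they contain letters, while [.”] and [,] are removed.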

More results are available in the following links:

Stanford CoreNLP: https://drive.google.com/open?id=0B-d4eVox97e3UXpUMUpfckxSb28
OpenNLP: https://drive.google.com/open?id=0B-d4eVox97e3bnZyT2gzX1dJX3c

NER Results

OpenNLP uses only a location NER model here, so we compare only the location NER results. OpenNLP reports each result as a span of offsets covering the whole entity, e.g., “New York” as a single result, while Stanford CoreNLP produces “New” and “York” separately. For easy comparison, the OpenNLP results are split word by word. OpenNLP does not recognize location abbreviations that contain punctuation, while Stanford CoreNLP does. For example, OpenNLP doesn’t tag “N.Y.” but Stanford CoreNLP does. Stanford CoreNLP can also recognize non-English words, while OpenNLP needs another model to do so. Overall, Stanford CoreNLP tends to be more accurate.
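The word-by-word split described above can be sketched as follows. The class `SpanSplitSketch` is hypothetical, and spans are represented as plain `{start, end}` pairs of token indices with an exclusive end, which is an assumption about how the name finder's offsets are laid out rather than code taken from the evaluation.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: expand entity spans into one result per word.
final class SpanSplitSketch {

    // Each span is {start, end}: token indices, with end exclusive (assumed).
    static List<String> splitSpans(String[] tokens, int[][] spans) {
        List<String> words = new ArrayList<>();
        for (int[] span : spans) {
            for (int i = span[0]; i < span[1]; i++) {
                words.add(tokens[i]);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String[] tokens = {"She", "moved", "to", "New", "York", "from", "Paris", "."};
        int[][] spans = {{3, 5}, {6, 7}}; // "New York" and "Paris"
        System.out.println(splitSpans(tokens, spans)); // prints [New, York, Paris]
    }
}
```

After this step, both tools yield one entry per tagged word, so the result counts can be compared directly.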

| OpenNLP | Stanford CoreNLP |
| --- | --- |
| 150 results | 173 results |

More results are available in the following links:

Stanford CoreNLP: https://drive.google.com/open?id=0B-d4eVox97e3Ym5HRm9nbEdiSk0
OpenNLP: https://drive.google.com/open?id=0B-d4eVox97e3OHNfRmNZcU92ZGs

##Popularity of OpenNLP

Source: https://java.libhunt.com/project/corenlp/vs/apache-opennlp