Evaluating OpenNLP
Author: Hailey Pan
Reviewed by: Chen Li
##Example Code
sandbox/src/main/java/edu/uci/ics/texera/sandbox/OpenNLPexample
- https://github.com/Texera/texera/pull/308
##Introduction
The Apache OpenNLP library (https://opennlp.apache.org/) is a machine learning based toolkit for the processing of natural language text.
Both Stanford CoreNLP and OpenNLP require tokenization before doing any extraction. With Stanford CoreNLP, a single annotator extracts all the tags of its kind, e.g., one NER annotator produces every entity type (Figure 1). OpenNLP, on the other hand, has a separate trained model for each specific tag: a model for POS tags, a model for location NER tags, a model for person NER tags, and so on.
Figure 1: Stanford NLP Sample Code: annotation property setup
For named-entity recognition, Stanford CoreNLP works better on general-purpose text. OpenNLP might be a better choice when one wants to extract information from text using custom models trained on one's own corpus.
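The setup difference described above can be sketched with plain Java collections, without either library on the classpath. This is an illustration rather than a reproduction of Figure 1: the annotator list is a typical one, the map keys are hypothetical labels, and the model file names are the ones we assume are listed on the OpenNLP models page.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

final class PipelineSetupSketch {

    // CoreNLP style: one Properties object lists every annotator to run.
    // In real code this object would be passed to `new StanfordCoreNLP(props)`.
    static Properties coreNlpProperties() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        return props;
    }

    // OpenNLP style: one pre-trained model file per task or entity type.
    // In real code each file would be loaded from an InputStream into its own
    // model class (for example POSModel or TokenNameFinderModel).
    // The keys are hypothetical labels used only in this sketch.
    static Map<String, String> openNlpModelFiles() {
        Map<String, String> models = new LinkedHashMap<>();
        models.put("tokenizer", "en-token.bin");
        models.put("pos", "en-pos-maxent.bin");
        models.put("ner-location", "en-ner-location.bin");
        models.put("ner-person", "en-ner-person.bin");
        return models;
    }

    public static void main(String[] args) {
        System.out.println("CoreNLP annotators: "
                + coreNlpProperties().getProperty("annotators"));
        for (Map.Entry<String, String> e : openNlpModelFiles().entrySet()) {
            System.out.println("OpenNLP " + e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Adding another entity type means one more word in the CoreNLP annotator list at most, but one more model file (and one more tagging pass) on the OpenNLP side.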
##Models
Available models can be found at http://opennlp.sourceforge.net/models-1.5/. For instance, the "Part of Speech Tagger" marks each token with its corresponding word type, based on the token itself and its context. A token might have multiple possible POS tags, so the tagger uses a probability model to predict the correct one. To limit the number of possible tags for a token, a tag dictionary can be used, which improves both the tagging accuracy and the runtime performance of the tagger.
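To make the role of a tag dictionary concrete, here is a toy sketch in plain Java. It is not OpenNLP's implementation: the class, the method name, and the scores are made up for illustration. It only shows the idea that the dictionary narrows the set of candidate tags before the highest-scoring one is picked.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class TagDictionarySketch {

    // Returns the highest-scoring tag, considering only the tags that the
    // dictionary allows for this token. Tokens that are missing from the
    // dictionary are unrestricted. Returns null if no scored tag is allowed.
    static String bestTag(String token,
                          Map<String, Double> tagScores,
                          Map<String, Set<String>> tagDictionary) {
        Set<String> allowed = tagDictionary.get(token);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : tagScores.entrySet()) {
            if (allowed != null && !allowed.contains(e.getKey())) {
                continue; // the dictionary rules this tag out for the token
            }
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up scores in which the model wrongly prefers NN.
        Map<String, Double> scores = new HashMap<>();
        scores.put("NN", 0.5);
        scores.put("DT", 0.4);

        Map<String, Set<String>> dictionary = new HashMap<>();
        dictionary.put("the", new HashSet<>(Arrays.asList("DT")));

        System.out.println(bestTag("the", scores, dictionary)); // prints DT
        System.out.println(bestTag("cat", scores, dictionary)); // prints NN
    }
}
```

Because fewer candidate tags are scored per token, a restriction of this kind can both prevent implausible tags and reduce work.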
##Performance & Accuracy
Dataset: abstract_100.txt
Machine: 2016 MacBook Pro with 8GB RAM and 256GB SSD
Runtime (tokenizing runtime included)
Task | OpenNLP | Stanford CoreNLP |
---|---|---|
POS | 11.65s | 2.69s |
NER | 11.26s | 18.04s |
Stanford CoreNLP is much faster than OpenNLP at POS tagging (2.69s vs. 11.65s). OpenNLP is faster than Stanford CoreNLP at NER tagging (11.26s vs. 18.04s), but the comparison is not like-for-like: Stanford CoreNLP extracts all NER tags, while OpenNLP extracts only location tags.
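A measurement of this kind, with tokenization inside the timed region, could be taken with a small wall-clock helper like the one below. This is a sketch only: the helper name is hypothetical, the whitespace split and the dummy tagging loop stand in for the real tokenizer and tagger, and the source does not say whether model loading was part of the reported times.

```java
final class RuntimeSketch {

    // Runs the task once and returns the elapsed wall-clock time in seconds.
    static double timeSeconds(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000_000.0;
    }

    public static void main(String[] args) {
        final String text = "OpenNLP and Stanford CoreNLP both tokenize text before tagging it.";

        double seconds = timeSeconds(() -> {
            // Tokenizing happens inside the timed region, as in the table above.
            String[] tokens = text.split("\\s+"); // stand-in for a real tokenizer
            StringBuilder tagged = new StringBuilder();
            for (String token : tokens) {
                tagged.append(token).append("/X "); // stand-in for a real tagger
            }
        });

        System.out.printf("elapsed: %.6fs%n", seconds);
    }
}
```

A single run like this is sensitive to JVM warm-up, so repeating the measurement and averaging would give steadier numbers.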
POSTag Results
Most irrelevant results (e.g., punctuation marks) were eliminated for both packages.
OpenNLP | Stanford CoreNLP |
---|---|
26,360 results | 25,919 results |
The results produced by the two tools are very similar. The difference in the number of results is likely due to their tokenizers: the Stanford CoreNLP tokenizer handles punctuation better. For example, OpenNLP recognizes [.”] as a token and tags it as a coordinating conjunction ("CC"), whereas Stanford CoreNLP does not tag it. This is the main reason OpenNLP produces more results than Stanford CoreNLP. Stanford CoreNLP may therefore be more accurate than OpenNLP, owing to its more accurate tokenization.
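The source does not say how punctuation results were eliminated. One plausible approach, shown here as a sketch with hypothetical names, is to drop every tagged token that contains no letter or digit. Note that a token such as “N.Y.” survives this filter, while [.”] does not.

```java
import java.util.ArrayList;
import java.util.List;

final class PunctuationFilterSketch {

    // True if the token contains at least one letter or digit.
    static boolean hasLetterOrDigit(String token) {
        for (int i = 0; i < token.length(); ) {
            int codePoint = token.codePointAt(i);
            if (Character.isLetterOrDigit(codePoint)) {
                return true;
            }
            i += Character.charCount(codePoint);
        }
        return false;
    }

    // Each entry is a {token, tag} pair. Keeps only the pairs whose token
    // is not made up entirely of punctuation or other symbols.
    static List<String[]> dropPunctuation(List<String[]> taggedTokens) {
        List<String[]> kept = new ArrayList<>();
        for (String[] pair : taggedTokens) {
            if (hasLetterOrDigit(pair[0])) {
                kept.add(pair);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> tagged = new ArrayList<>();
        tagged.add(new String[] {"York", "NNP"});
        tagged.add(new String[] {".\u201D", "CC"}); // the [.”] token from the text
        tagged.add(new String[] {"N.Y.", "NNP"});

        for (String[] pair : dropPunctuation(tagged)) {
            System.out.println(pair[0] + "/" + pair[1]);
        }
    }
}
```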
More results are available in the following links:
- Stanford CoreNLP: https://drive.google.com/open?id=0B-d4eVox97e3UXpUMUpfckxSb28
- OpenNLP: https://drive.google.com/open?id=0B-d4eVox97e3bnZyT2gzX1dJX3c
NER Results
OpenNLP uses only a location NER model here, so we compare only the location NER results. OpenNLP reports each result as a span of offsets, so a multi-word name such as “New York” is a single result, while Stanford CoreNLP produces “New” and “York” as separate results. For easy comparison, the OpenNLP results are split word by word. OpenNLP cannot recognize abbreviated location names that contain punctuation, while Stanford CoreNLP can. For example, OpenNLP doesn’t tag “N.Y.” but Stanford CoreNLP does. Also, Stanford CoreNLP can recognize words written in non-English alphabets, while OpenNLP needs another model to do so. Overall, Stanford CoreNLP tends to be more accurate.
OpenNLP | Stanford CoreNLP |
---|---|
150 results | 173 results |
More results are available in the following links:
- Stanford CoreNLP: https://drive.google.com/open?id=0B-d4eVox97e3Ym5HRm9nbEdiSk0
- OpenNLP: https://drive.google.com/open?id=0B-d4eVox97e3OHNfRmNZcU92ZGs
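The word-by-word separation of OpenNLP results used in the NER comparison above can be sketched as follows. We assume each result is a span of token indices with an exclusive end, which is how OpenNLP's name finder reports spans as far as we know. Spans are modeled here as plain `int[]{start, end}` pairs so that the example runs without the library, and the sentence and span values are made up.

```java
import java.util.ArrayList;
import java.util.List;

final class SpanSplitSketch {

    // Expands token spans [start, end) into one entry per covered token,
    // so that "New York" becomes the two results "New" and "York".
    static List<String> splitSpans(String[] tokens, int[][] spans) {
        List<String> words = new ArrayList<>();
        for (int[] span : spans) {
            for (int i = span[0]; i < span[1]; i++) {
                words.add(tokens[i]);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String[] tokens = {"She", "moved", "to", "New", "York", "from", "Boston", "."};
        int[][] locationSpans = {{3, 5}, {6, 7}}; // "New York" and "Boston"

        System.out.println(splitSpans(tokens, locationSpans)); // prints [New, York, Boston]
    }
}
```

Counting results after this split is what makes the OpenNLP total comparable with the per-token results from Stanford CoreNLP.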
##Popularity of OpenNLP
Source: https://java.libhunt.com/project/corenlp/vs/apache-opennlp