Named Entity Recognition (NER) - SoojungHong/TextMining GitHub Wiki
What is Named-entity recognition
Named-entity recognition (NER), also known as entity identification, entity chunking, and entity extraction, is a subtask of information extraction that seeks to locate named entities in text and classify them into pre-defined categories such as person names, organizations, locations, time expressions, quantities, monetary values, and percentages.
Most research on NER systems has been structured as taking an unannotated block of text, such as this one:
Jim bought 300 shares of Acme Corp. in 2006.
And producing an annotated block of text that highlights the names of entities:
[Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.
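The transformation above can be sketched in a few lines of Python. This is only an illustration of the input and output shapes, not an NER system: the character offsets in `spans` are written by hand, and the `annotate` helper is a hypothetical name introduced here.

```python
from typing import List, Tuple

# (start, end, label) character offsets into the text; end is exclusive.
Span = Tuple[int, int, str]

def annotate(text: str, spans: List[Span]) -> str:
    """Wrap each labeled span in [..]Label and leave the other text untouched."""
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])               # text before the entity
        out.append(f"[{text[start:end]}]{label}")    # the entity itself
        cursor = end
    out.append(text[cursor:])                        # remaining tail
    return "".join(out)

text = "Jim bought 300 shares of Acme Corp. in 2006."
# Hand-written annotations standing in for the output of a real recognizer.
spans = [(0, 3, "Person"), (25, 35, "Organization"), (39, 43, "Time")]
print(annotate(text, spans))
```

A real system would produce `spans` automatically; the rendering step stays the same.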
NER platforms
- Dataturks - an online annotation tool for building NER datasets. Its visual UI makes it easy to build domain-specific NER datasets in a format compatible with libraries such as OpenNLP and CoreNLP.
- BRAT - an open-source, downloadable annotation tool for NER. It requires some setup but works well for single-user use cases.
- spaCy - features fast statistical NER as well as an open-source named entity visualizer.
- GATE - supports NER across many languages and domains out of the box, usable via a graphical interface or a Java API.
- OpenNLP - includes rule-based and statistical named-entity recognition.
- Stanford Named Entity Recognizer - the NER implementation provided by Stanford University.
Problem definition
Full named-entity recognition is often broken down, conceptually and possibly also in implementations, into two distinct problems: detection of names, and classification of the names by the type of entity they refer to (e.g. person, organization, location, or other). The first phase is typically simplified to a segmentation problem: names are defined to be contiguous spans of tokens, with no nesting, so that "Bank of America" is a single name, disregarding the fact that the substring "America" inside it is itself a name. This segmentation problem is formally similar to chunking. The second phase requires choosing an ontology by which to organize categories of things.
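One common way to encode contiguous, non-nested spans is a per-token BIO tagging scheme (B = begins a name, I = inside a name, O = outside any name). The scheme is an assumption added here for illustration and is not prescribed by the text above. The sketch below decodes such tags back into spans; `bio_to_spans` is a hypothetical helper name.

```python
from typing import List, Tuple

def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIO tags to (start, end, type) token spans; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        inside = tag.startswith("I-") and tag[2:] == current
        if not inside:
            if current is not None:
                spans.append((start, i, current))  # close the open span
                start, current = None, None
            if tag.startswith("B-") or tag.startswith("I-"):
                # An I- tag with no matching open span is treated as a span start.
                start, current = i, tag[2:]
    return spans

tokens = ["She", "works", "at", "Bank", "of", "America", "."]
tags = ["O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O"]
for start, end, label in bio_to_spans(tags):
    print(" ".join(tokens[start:end]), label)
```

Because each token carries exactly one tag, this encoding cannot represent the nested name "America" inside "Bank of America", which matches the no-nesting simplification described above.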
Approaches
NER systems have been built using linguistic grammar-based techniques as well as statistical models trained with machine learning. Hand-crafted grammar-based systems typically obtain better precision, but at the cost of lower recall and months of work by experienced computational linguists. Statistical NER systems typically require a large amount of manually annotated training data. Semi-supervised approaches have been suggested to avoid part of the annotation effort.
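To make the precision/recall trade-off concrete, the sketch below scores a set of predicted entity spans against a gold standard using exact span matching. The data is made up purely for illustration: it mimics a cautious rule-based system that finds few entities but gets them all right. `precision_recall` is a hypothetical helper name.

```python
def precision_recall(gold, predicted):
    """Exact-match precision and recall over sets of (start, end, type) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Made-up spans for illustration only.
gold = {(0, 1, "PER"), (5, 7, "ORG"), (8, 9, "TIME"), (12, 13, "LOC")}
rule_based = {(0, 1, "PER"), (5, 7, "ORG")}  # few predictions, all correct

print(precision_recall(gold, rule_based))  # (1.0, 0.5)
```

Here every predicted span is correct (precision 1.0), but half of the gold entities are missed (recall 0.5).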
Many different classifier types have been used to perform machine-learned NER, with conditional random fields being a typical choice.
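Sequence classifiers such as conditional random fields are usually trained on per-token feature representations rather than raw strings. The sketch below shows one plausible feature function using only the standard library. The feature names and the `token_features` helper are assumptions chosen for illustration, not a fixed standard, and the training step itself is left to whichever CRF toolkit is used.

```python
from typing import Dict, List

def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build a simple feature dict for the token at position i."""
    word = tokens[i]
    feats = {
        "lower": word.lower(),        # normalized word form
        "is_title": word.istitle(),   # capitalization often signals a name
        "is_upper": word.isupper(),
        "is_digit": word.isdigit(),   # useful for quantities and years
        "suffix3": word[-3:],
    }
    # Neighboring words give the classifier local context.
    feats["prev_lower"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next_lower"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats

tokens = ["Jim", "bought", "300", "shares", "of", "Acme", "Corp.", "in", "2006", "."]
sentence_features = [token_features(tokens, i) for i in range(len(tokens))]
print(sentence_features[0])
```

Each sentence becomes a list of feature dicts, paired during training with a list of per-token labels of the same length.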