coreference_desc - apache/ctakes GitHub Wiki

This module performs coreference resolution for several types of coreference, excluding person mentions and some rare pronouns.

Installation: This module references several other cTAKES modules, especially the Constituency Parser. The links inside this project use relative path names, so they should be portable as long as all modules are placed in the same directory.

Types:

The output of this module is a set of data types added to the CAS. These types are as follows:

Markable - Subtyped into NEMarkable (named entities), PronounMarkable (pronouns), and DemMarkable (certain demonstrative and relative pronouns). Markables are discovered automatically and serve as input to the coreference resolution algorithm. These types are needed in addition to the SHARP entity types because of differences in spans and in type inheritance.

CoreferenceRelation - A type containing two Markables that are believed to co-refer. A CoreferenceRelation has two arguments of type RelationArgument, each with a "role" field whose value is either "anaphor" or "antecedent". Each also has an "argument" field containing the Markable that fulfills the role.
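The shape of a CoreferenceRelation can be sketched in plain Java. The classes below are hypothetical stand-ins written for illustration only; they are not the cTAKES or UIMA classes, and the `byRole` helper and the example pair are assumptions rather than part of the module's API or its documented output.

```java
import java.util.Arrays;

// Hypothetical stand-ins for illustration; the real types are UIMA JCas classes.
class Markable {
    final String text; // covered text of the mention
    Markable(String text) { this.text = text; }
}

class RelationArgument {
    final String role;       // "anaphor" or "antecedent"
    final Markable argument; // the Markable fulfilling the role
    RelationArgument(String role, Markable argument) {
        this.role = role;
        this.argument = argument;
    }
}

class CoreferenceRelation {
    final RelationArgument arg1;
    final RelationArgument arg2;
    CoreferenceRelation(RelationArgument arg1, RelationArgument arg2) {
        this.arg1 = arg1;
        this.arg2 = arg2;
    }

    // Assumed helper: return the Markable with the given role, or null if absent.
    Markable byRole(String role) {
        for (RelationArgument a : Arrays.asList(arg1, arg2)) {
            if (role.equals(a.role)) {
                return a.argument;
            }
        }
        return null;
    }
}

class CorefPairDemo {
    // Illustrative pair; which mention is the anaphor here is an assumption.
    static CoreferenceRelation example() {
        return new CoreferenceRelation(
                new RelationArgument("anaphor", new Markable("the pain")),
                new RelationArgument("antecedent", new Markable("immense leg pain")));
    }

    public static void main(String[] args) {
        CoreferenceRelation rel = example();
        // prints: the pain -> immense leg pain
        System.out.println(rel.byRole("anaphor").text + " -> " + rel.byRole("antecedent").text);
    }
}
```

Looking arguments up by role rather than by position avoids depending on which of the two slots holds the anaphor.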

CollectionTextRelation - A linked list holding a chain of Annotations that the classifier judges to refer to the same entity. It is derived from the set of CoreferenceRelation elements described above. It contains a list of UIMA type NonEmptyFSList, as well as a size field. Singletons are lists of length 1; actual chains have a size greater than 1. Each node in the list is of type NonEmptyFSList, which has a head field and a tail field. The head points to the data for the node, which is a Markable, and the tail points to the next element in the list, or to a node of type EmptyFSList when the chain is complete.
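The head/tail traversal described above can be sketched as follows. This is a minimal, self-contained model: `FsListNode`, `NonEmptyNode`, and `EmptyNode` are hypothetical stand-ins that only mirror the shape of the UIMA list types, and a plain string takes the place of the Markable held in each head.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins that mirror the head/tail structure; NOT the UIMA classes.
abstract class FsListNode {}

// Marks the end of a chain (plays the role of EmptyFSList).
final class EmptyNode extends FsListNode {}

// One chain element (plays the role of NonEmptyFSList).
final class NonEmptyNode extends FsListNode {
    final String head;     // stands in for the Markable (its covered text)
    final FsListNode tail; // next node, or an EmptyNode when the chain is complete
    NonEmptyNode(String head, FsListNode tail) {
        this.head = head;
        this.tail = tail;
    }
}

class ChainWalkDemo {
    // Collect chain members by following tail pointers until the empty node.
    static List<String> members(FsListNode first) {
        List<String> out = new ArrayList<>();
        FsListNode node = first;
        while (node instanceof NonEmptyNode) {
            NonEmptyNode n = (NonEmptyNode) node;
            out.add(n.head);
            node = n.tail;
        }
        return out;
    }

    // A three-element chain using mention texts as illustrative data.
    static FsListNode exampleChain() {
        return new NonEmptyNode("immense leg pain",
                new NonEmptyNode("the pain",
                        new NonEmptyNode("pain", new EmptyNode())));
    }

    public static void main(String[] args) {
        // prints: [immense leg pain, the pain, pain]
        System.out.println(members(exampleChain()));
    }
}
```

With the real UIMA types the loop would have the same structure: test whether the current node is non-empty, read its head, then move to its tail.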

UIMA Annotators: This module is released with several UIMA processing classes which can be included in pipelines.

desc/analysis_engine/CorefUMLSProcessor.xml: An end-to-end aggregate annotator mainly used for demos and debugging. You can use it in the CVD (CAS Visual Debugger) to test your setup as follows:

- Run launcher resources/launcher/cvd
- Load descriptor desc/analysis_engine/CorefUMLSProcessor.xml
- Open file resources/testfakenote.txt
- Run AE (Ctrl-R)
- Inspect results
  - Should be 13 markables: 10 NEs, 2 pronouns, and 1 dem (under Annotation index)
  - Should be 9 CollectionTextRelation; most are 1 element (singletons).
    - Chain 3 has 3 elements: "immense leg pain", "the pain", and "pain".
    - Chain 6 has 2 elements: "a small lesion..." and "the lesion"
    - Chain 8 has 2 elements: "imaging" and "which"
  - These chains are decomposed in the CoreferenceRelation index into pairs.
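To illustrate how pairs and chains relate, the sketch below groups anaphor/antecedent pairs into chains with a small union-find structure. This is one common way to do such grouping and is not confirmed to be how the module builds its chains. The specific pairs, the placeholder singleton, and the use of mention text as an identifier are all assumptions made for the example; real markables are distinguished by their spans.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative union-find over mention strings; not the cTAKES implementation.
class ChainGroupingDemo {
    // Maps each mention to its parent; a root is its own parent.
    private final Map<String, String> parent = new LinkedHashMap<>();

    // Register a mention as a singleton if it has not been seen yet.
    void add(String mention) {
        parent.putIfAbsent(mention, mention);
    }

    // Follow parent links up to the representative of the mention's chain.
    String find(String mention) {
        String root = mention;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        return root;
    }

    // Merge the chains containing the two mentions of one pair.
    void link(String anaphor, String antecedent) {
        add(anaphor);
        add(antecedent);
        parent.put(find(anaphor), find(antecedent));
    }

    // Group all mentions by their representative; singletons become lists of length 1.
    Collection<List<String>> chains() {
        Map<String, List<String>> byRoot = new LinkedHashMap<>();
        for (String mention : parent.keySet()) {
            byRoot.computeIfAbsent(find(mention), k -> new ArrayList<>()).add(mention);
        }
        return byRoot.values();
    }

    // Assumed pairs, chosen to be consistent with the three chains listed above.
    static ChainGroupingDemo example() {
        ChainGroupingDemo g = new ChainGroupingDemo();
        g.link("the pain", "immense leg pain");
        g.link("pain", "the pain");
        g.link("the lesion", "a small lesion...");
        g.link("which", "imaging");
        g.add("(placeholder unlinked mention)"); // made-up singleton for illustration
        return g;
    }

    public static void main(String[] args) {
        for (List<String> chain : example().chains()) {
            System.out.println(chain.size() + " " + chain);
        }
    }
}
```

Any mention that never appears in a pair stays in its own group, which matches the description of singletons as lists of length 1.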

desc/collection_processing_engine/Coref-resolver_CPE.xml: This is a collection processing engine (CPE). It wraps the above AE with a collection reader and a consumer. CPEs can be run with the resources/launch/UIMA_CPE_coref-resolver.launch Eclipse launch configuration. Use File->Load to load the CPE above; the CPE GUI will then show text boxes with associated file chooser buttons for the input and output files.

The remaining descriptor files are mostly not meant to be used independently. Please feel free to email the authors if you are curious about their usage and want help figuring it out.

If you want to use the coreference module in a pipeline of your own, the recommended method is to make a copy of CorefUMLSProcessor.xml and add any other modules you require to that pipeline. Future releases will contain standalone pipelines with the minimum set of requirements, but in fact the CorefDBProcessor is pretty close to being that already; coreference resolution is simply dependent on a lot of earlier tasks.