core_desc - apache/ctakes GitHub Wiki

This project contains several annotators, including:

a sentence detector annotator (a wrapper around the OpenNLP sentence detector)

a tokenizer

an annotator that does not update the CAS in any way, which can be useful when the UIMA CPE GUI requires you to specify an analysis engine but you don't actually want one.

an annotator that creates a single Segment annotation encompassing the entire document text, which can be used when processing a plaintext document that has no section (aka segment) tags.
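To make the last two annotators concrete, here is a minimal sketch in Python. It is not the cTAKES implementation and does not use the UIMA API: `ToyCas`, `Annotation`, `noop_annotator`, and `whole_document_segment_annotator` are hypothetical names invented for this illustration, and the CAS is reduced to a text string plus a list of annotations.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A typed span over the document text (hypothetical stand-in for a UIMA annotation)."""
    kind: str
    begin: int
    end: int


@dataclass
class ToyCas:
    """Hypothetical, heavily simplified stand-in for a UIMA CAS."""
    text: str
    annotations: list = field(default_factory=list)


def noop_annotator(cas: ToyCas) -> None:
    """Leave the CAS untouched; acts only as a placeholder in a pipeline."""
    return None


def whole_document_segment_annotator(cas: ToyCas) -> None:
    """Add one Segment annotation spanning the entire document text."""
    cas.annotations.append(Annotation("Segment", 0, len(cas.text)))


if __name__ == "__main__":
    cas = ToyCas("Patient seen today.\nNo acute distress.")
    noop_annotator(cas)                      # no annotations are added
    whole_document_segment_annotator(cas)    # one Segment covering all of the text
    segment = cas.annotations[0]
    print(segment.kind, segment.begin, segment.end == len(cas.text))
```

In this sketch, downstream steps that expect every span of text to sit inside a segment can treat a plaintext document the same way as one with explicit section tags.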

Of particular interest:

End-of-line characters are considered end-of-sentence markers.

A sentence detector model is included with this project.

The model derives from a combination of GENIA, Penn Treebank (Wall Street Journal), and clinical data anonymized per Safe Harbor HIPAA guidelines. Before the model was built, the clinical data was de-identified with respect to patient names to preserve patient confidentiality. Any person name in the model therefore originates from a non-patient data source.
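The end-of-line behavior noted above can be illustrated with a short Python sketch. It assumes a naive punctuation rule as a stand-in for the trained OpenNLP model, and `detect_sentences` is a hypothetical name. It is not the cTAKES code. The point it shows is that a sentence span never crosses an end-of-line character, even when the line does not end in punctuation.

```python
import re

# One match per run of characters containing no end-of-line character.
LINE = re.compile(r"[^\r\n]+")
# Naive boundary (stand-in for a trained model): whitespace after ., ! or ?
BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def detect_sentences(text):
    """Return (begin, end) character offsets of sentences in text.

    End-of-line characters always terminate a sentence; inside a line,
    the naive punctuation rule proposes additional boundaries.
    """
    spans = []
    for line in LINE.finditer(text):
        start = line.start()
        for boundary in BOUNDARY.finditer(text, line.start(), line.end()):
            spans.append((start, boundary.start()))
            start = boundary.end()
        spans.append((start, line.end()))

    # Trim surrounding whitespace and drop spans that end up empty.
    sentences = []
    for begin, end in spans:
        while begin < end and text[begin].isspace():
            begin += 1
        while end > begin and text[end - 1].isspace():
            end -= 1
        if begin < end:
            sentences.append((begin, end))
    return sentences


if __name__ == "__main__":
    note = "No fever. No chills\nLungs clear\r\nHeart regular."
    for begin, end in detect_sentences(note):
        print(begin, end, repr(note[begin:end]))
```

One consequence of this design is that text hard-wrapped mid-sentence is split at each line break, so input with such wrapping may need to be unwrapped before sentence detection.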