ctakes core - apache/ctakes GitHub Wiki

Contains code and resources required by all or most other cTAKES modules.

Collection Readers
Annotation Engines
Output Writers
Utilities
Piper Files


Collection Readers

File Tree Reader

Reads document texts from text files in a directory tree.

Source class: FileTreeReader
Source package: org.apache.ctakes.core.cr
Parent class: org.apache.ctakes.core.cr.AbstractFileTreeReader
Products: Document Id, Document Id Prefix

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| InputDirectory | Directory for all input files. | String | Yes | |
| CRtoSpace | Change Windows-format CR + LF character sequences to LF + space. | boolean | No | |
| Encoding | The character encoding used by the input files. | String | No | |
| Extensions | The extensions of the files that the collection reader will read. | String[] | No | * |
| KeepCR | Keep Windows-format carriage return characters at line endings. This will only keep existing characters; it will not add them. | boolean | No | |
| PatientLevel | The level in the directory hierarchy at which patient identifiers exist. Default value is 1; directly under root input directory. | int | No | |
| StripQuotes | Replace document-enclosing quote characters with space characters. | boolean | No | |
| WriteBanner | Write a large banner at each major step of the pipeline. | String | No | no |
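
The effect of CRtoSpace and KeepCR on document text can be pictured with a small sketch. It is an illustration only, not the reader's implementation: the class `LineEndingSketch` and the method `normalize` are hypothetical, and it assumes that KeepCR takes precedence and that CR + LF collapses to a bare LF when neither option is set.

```java
/** Hypothetical sketch of the CRtoSpace / KeepCR options; not cTAKES code. */
final class LineEndingSketch {

    /**
     * @param text      raw document text, possibly with Windows CR + LF line endings
     * @param keepCr    mirrors KeepCR: leave existing carriage returns untouched
     * @param crToSpace mirrors CRtoSpace: turn CR + LF into LF + space
     */
    public static String normalize(String text, boolean keepCr, boolean crToSpace) {
        if (keepCr) {
            // Assumption: KeepCR wins; existing CR characters stay and none are added.
            return text;
        }
        // Assumption: without KeepCR, CR + LF becomes LF alone unless CRtoSpace is set.
        return text.replace("\r\n", crToSpace ? "\n " : "\n");
    }

    public static void main(String[] args) {
        String windows = "BP 120/80\r\nHR 72";
        String spaced = normalize(windows, false, true);
        // LF + space is as long as CR + LF, so character offsets are unchanged.
        System.out.println(spaced.length() == windows.length());
    }
}
```

Replacing two characters with two is presumably the point of CRtoSpace: annotation offsets computed on the cleaned text still line up with the original file.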

Files in Dir Cycle Reader

Reads document texts from text files in a directory, repeating for a number of iterations.

Source class: FilesInDirectoryCollectionCyclicalReads
Source package: org.apache.ctakes.core.cr
Parent class: org.apache.ctakes.core.cr.FilesInDirectoryCollectionReader
Products: Document Id

No available configuration parameters.

Files in Dir Reader

Reads document texts from text files in a directory.

Source class: FilesInDirectoryCollectionReader
Source package: org.apache.ctakes.core.cr
Parent class: org.apache.uima.collection.CollectionReader_ImplBase
Products: Document Id

No available configuration parameters.

JDBC Note Table Reader

Reads document texts from the fields of a database table.

Source class: JdbcNotesReader
Source package: org.apache.ctakes.core.cr.jdbc
Parent class: org.apache.uima.fit.component.JCasCollectionReader_ImplBase
Products: Document Id

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| DbDriver | JDBC driver ClassName. | String | Yes | |
| DbPass | Password for database authentication. | String | Yes | |
| DbUrl | JDBC URL that specifies database network location and name. | String | Yes | |
| DbUser | Username for database authentication. | String | Yes | |
| DocColumn | Name of column that contains the document text. | String | Yes | |
| SqlStatement | SQL statement to retrieve the document. | String | Yes | |
| BirthColumn | Name of column that contains the patient birth date. | String | No | |
| DateColumn | Name of column that contains the document original date. | String | No | |
| DbDecryptor | JDBC decryptor ClassName. | String | No | |
| DeathColumn | Name of column that contains the patient death date. | String | No | |
| DecryptPass | Password for text decryption. | String | No | |
| EncounterIdColumn | Name of column that contains the encounter id. | String | No | |
| FirstNameColumn | Name of column that contains the patient first name. | String | No | |
| FirstSoundexColumn | Name of column that contains the patient first name soundex. | String | No | |
| GenderColumn | Name of column that contains the patient gender. | String | No | |
| IdColumns | Specifies column names that will be used to form a document ID. | String[] | No | |
| IdDelimiter | Specifies delimiter used when document ID is built. | String | No | |
| InstanceIdColumn | Name of column that contains the document instance id. | String | No | |
| InstituteColumn | Name of column that contains the source institution. | String | No | |
| KeepAlive | Flag that determines whether to keep JDBC connection open no matter what. | String | No | |
| LastNameColumn | Name of column that contains the patient last name. | String | No | |
| LastSoundexColumn | Name of column that contains the patient last name soundex. | String | No | |
| MiddleNameColumn | Name of column that contains the patient middle name. | String | No | |
| NoteSubtypeColumn | Name of column that contains the note subtype. | String | No | |
| NoteTypeColumn | Name of column that contains the note type. | String | No | |
| PatientColumn | Name of column that contains the patient identifier. | String | No | |
| PatientIdColumn | Name of column that contains the patient id. | String | No | |
| RevisionColumn | Name of column that contains the document revision number. | String | No | |
| RevisionDateColumn | Name of column that contains the document revision date. | String | No | |
| SpecialtyColumn | Name of column that contains the author specialty. | String | No | |
| StandardColumn | Name of column that contains the document standard. | String | No | |
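
IdColumns and IdDelimiter together describe how a document ID is assembled from a result row. The sketch below shows one plausible reading, in which the values of the listed columns are joined in order with the delimiter. The class `DocIdSketch`, the method `buildDocId` and the column names are hypothetical, and a plain `Map` stands in for a JDBC result row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of building a document ID from column values; not cTAKES code. */
final class DocIdSketch {

    /** Joins the values of the named columns, in the given order, with the delimiter. */
    public static String buildDocId(Map<String, String> row, List<String> idColumns, String delimiter) {
        List<String> parts = new ArrayList<>();
        for (String column : idColumns) {
            parts.add(row.get(column));
        }
        return String.join(delimiter, parts);
    }

    public static void main(String[] args) {
        // Stand-in for one row of a result set; the column names are made up.
        Map<String, String> row = new LinkedHashMap<>();
        row.put("PATIENT_NUM", "1001");
        row.put("NOTE_NUM", "7");
        row.put("NOTE_TEXT", "Patient denies chest pain.");
        System.out.println(buildDocId(row, Arrays.asList("PATIENT_NUM", "NOTE_NUM"), "_"));
    }
}
```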

JDBC Reader

Reads document texts from database text fields.

Source class: JdbcCollectionReader
Source package: org.apache.ctakes.core.cr
Parent class: org.apache.uima.fit.component.JCasCollectionReader_ImplBase
Products: Document Id

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| DbConnResrcName | Name of external resource for database connection. | String | Yes | |
| DocTextColName | Name of column from resultset that contains the document text. | String | Yes | |
| SqlStatement | SQL statement to retrieve the document. | String | Yes | |
| DocIdColNames | Specifies column names that will be used to form a document ID. | String[] | No | |
| DocIdDelimiter | Specifies delimiter used when document ID is built. | String | No | |
| ValueFileResrcName | Name of external resource for prepared statement value file. | String | No | |

Lines in File Reader

Reads document texts from a single text file, treating each line as a document.

Source class: LinesFromFileCollectionReader
Source package: org.apache.ctakes.core.cr
Parent class: org.apache.uima.collection.CollectionReader_ImplBase
Products: Document Id

No available configuration parameters.

Lucene Field Reader

Reads document texts from Lucene text fields.

Source class: LuceneCollectionReader
Source package: org.apache.ctakes.core.cr
Parent class: org.apache.uima.fit.component.CasCollectionReader_ImplBase
Products: Document Id

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| IndexDirectory | Location of Lucene index | String | Yes | |
| FieldName | Field to look in for document text | String | No | |
| MaxWords | Maximum number of words to process (approximate -- actually based on characters) | int | No | |

Text Files Reader

Reads document texts from text files specified in a provided list.

Source class: TextReader
Source package: org.apache.ctakes.core.cr
Parent class: org.apache.uima.fit.component.JCasCollectionReader_ImplBase
Products: Document Id

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| files | The text files to be loaded | List | Yes | |

XMI Reader (1)

Reads document texts and annotations from XMI files specified in a provided list.

Source class: XMIReader
Source package: org.apache.ctakes.core.cr
Parent class: org.apache.uima.fit.component.JCasCollectionReader_ImplBase
Products: Document Id

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| files | The XMI files to be loaded | List | Yes | |

XMI Tree Reader

Reads document texts and annotations from XMI files in a directory tree.

Source class: XmiTreeReader
Source package: org.apache.ctakes.core.cr
Parent class: org.apache.ctakes.core.cr.AbstractFileTreeReader
Products: Document Id

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| InputDirectory | Directory for all input files. | String | Yes | |
| CRtoSpace | Change Windows-format CR + LF character sequences to LF + space. | boolean | No | |
| Encoding | The character encoding used by the input files. | String | No | |
| Extensions | The extensions of the files that the collection reader will read. | String[] | No | * |
| KeepCR | Keep Windows-format carriage return characters at line endings. This will only keep existing characters; it will not add them. | boolean | No | |
| PatientLevel | The level in the directory hierarchy at which patient identifiers exist. Default value is 1; directly under root input directory. | int | No | |
| StripQuotes | Replace document-enclosing quote characters with space characters. | boolean | No | |
| WriteBanner | Write a large banner at each major step of the pipeline. | String | No | no |

XMI in Dir Reader (1)

Reads document texts and annotations from XMI files in a directory.

Source class: XmiCollectionReaderCtakes
Source package: org.apache.ctakes.core.cr
Parent class: org.apache.uima.collection.CollectionReader_ImplBase
Products: Document Id

No available configuration parameters.


Annotation Engines

CCDA Sectionizer

Annotates Document Sections by detecting Section Headers using Regular Expressions provided in a File.

Source class: CDASegmentAnnotator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Document Id
Products: Section

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| sections_file | Path to File that contains the section header mappings | String | No | src/user/resources/org/apache/ctakes/core/sections/ccda_sections.txt |

End of Line Sentence Splitter

Re-annotates Sentences based upon short lines, preventing a Sentence from spanning over an intentional line break.

Source class: EolSentenceFixer
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Sentence

No available configuration parameters.

LabValueFinder

Associates Lab Mentions with values.

Source class: LabValueFinder
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Section, Base Token, Identified Annotation
Products: Generic Relation

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| labTUIs | TUIs indicating lab measurements | String[] | Yes | |
| allSections | Use all Annotatable sections. This ignores the value of sections | String | No | true |
| excludeCUIs | CUIs not indicating specific lab measurements | String[] | No | |
| maxLineCount | Maximum newlines between lab and value | int | No | |
| sections | Annotatable sections | String[] | No | |
| useDrugs | Use Medications in addition to Labs. | String | No | false |
| valueWords | Words indicating values | String[] | No | |

List Annotator

Annotates formatted List Sections by detecting them using Regular Expressions provided in an input File.

Source class: ListAnnotator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Section
Products: List

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| LIST_TYPES_PATH | Path to a file containing a list of regular expressions and corresponding list types. | String | Yes | org/apache/ctakes/core/list/DefaultListRegex.bsv |

List Entry Negator

Checks List Entries for negation, which may be exhibited differently from unstructured negation.

Source class: ListEntryNegator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: List, Identified Annotation

No available configuration parameters.

List Paragraph Fixer

Re-annotates Paragraphs based upon existing Lists, preventing a Paragraph from spanning more than one List.

Source class: ListParagraphFixer
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: List, Sentence

No available configuration parameters.

List Sentence Splitter

Re-annotates Sentences based upon existing List Entries, preventing a Sentence from spanning more than one List Entry.

Source class: ListSentenceFixer
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: List, Sentence

No available configuration parameters.

PTB Tokenizer

Annotates Document Penn TreeBank Tokens.

Source class: TokenizerAnnotatorPTB
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Section, Sentence
Products: Base Token

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| SegmentsToSkip | Set of segments that can be skipped | String[] | No | |

Paragraph Annotator

Annotates Paragraphs by detecting them using Regular Expressions provided in an input File or by empty text lines.

Source class: ParagraphAnnotator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Section
Products: Paragraph

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| PARAGRAPH_TYPES_PATH | Path to a file containing a list of regular expressions and corresponding paragraph types. | String | No | |

Paragraph Sentence Splitter

Re-annotates Sentences based upon existing Paragraphs, preventing a Sentence from spanning more than one Paragraph.

Source class: ParagraphSentenceFixer
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Paragraph, Sentence

No available configuration parameters.
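
The re-annotation step can be thought of as cutting sentence spans wherever a paragraph ends inside them. The sketch below works on bare character offsets rather than UIMA annotations; the class `SentenceSplitSketch` and the method `splitAtParagraphs` are hypothetical, and the sketch assumes paragraphs are supplied in document order and does no whitespace trimming.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of splitting sentence spans at paragraph boundaries; not cTAKES code. */
final class SentenceSplitSketch {

    /**
     * Spans are {begin, end} character offsets with an exclusive end.
     * Assumes paragraphs are sorted by offset.
     */
    public static List<int[]> splitAtParagraphs(List<int[]> sentences, List<int[]> paragraphs) {
        List<int[]> result = new ArrayList<>();
        for (int[] sentence : sentences) {
            int begin = sentence[0];
            for (int[] paragraph : paragraphs) {
                int boundary = paragraph[1];
                // A paragraph end strictly inside the sentence forces a split there.
                if (boundary > begin && boundary < sentence[1]) {
                    result.add(new int[] {begin, boundary});
                    begin = boundary;
                }
            }
            result.add(new int[] {begin, sentence[1]});
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> sentences = Collections.singletonList(new int[] {0, 40});
        List<int[]> paragraphs = Arrays.asList(new int[] {0, 25}, new int[] {25, 60});
        for (int[] span : splitAtParagraphs(sentences, paragraphs)) {
            System.out.println(span[0] + "-" + span[1]);
        }
    }
}
```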

Prose Sentence Detector

Sentence detector that uses BIO (begin, inside, outside) labels for determination. Useful for documents in which newlines may not indicate sentence boundaries.

Source class: SentenceDetectorAnnotatorBIO
Source package: org.apache.ctakes.core.ae
Parent class: org.cleartk.ml.CleartkAnnotator
Dependencies: Section
Products: Sentence

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| classifierFactoryClassName | Provides the full name of the ClassifierFactory class to be used. | String | No | org.cleartk.ml.jar.JarClassifierFactory |
| dataWriterFactoryClassName | Provides the full name of the DataWriterFactory class to be used. | String | No | org.cleartk.ml.jar.DefaultDataWriterFactory |
| FeatureConfiguration | | FEAT_CONFIG | No | |
| isTraining | Determines whether this annotator is writing training data or using a classifier to annotate. Normally inferred automatically based on whether or not a DataWriterFactory class has been set. | Boolean | No | |
| TokenFilename | | String | No | |

Regex Sectionizer

Annotates Document Sections by detecting Section Headers using Regular Expressions provided in a Bar-Separated-Value (BSV) File.

Source class: BsvRegexSectionizer
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.ctakes.core.ae.RegexSectionizer
Products: Section

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| SectionsBsv | Path to a BSV file containing a list of regular expressions and corresponding section types. | String | Yes | org/apache/ctakes/core/sections/DefaultSectionRegex.bsv |
| TagDividers | True if lines of divider characters ____, ----, === should divide sections | boolean | No | true |
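
To make the idea of regex-driven section headers concrete, here is a minimal sketch. The file layout it parses is an assumption made for illustration (one `NAME||HEADER_REGEX` entry per line, with `//` comment lines); the columns of the bundled DefaultSectionRegex.bsv may differ. The class `RegexSectionSketch` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of regex-based section header detection; not cTAKES code. */
final class RegexSectionSketch {

    /** Parses lines of the assumed form NAME||HEADER_REGEX, skipping blanks and // comments. */
    public static Map<String, Pattern> parseBsv(List<String> lines) {
        Map<String, Pattern> patterns = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("//")) {
                continue;
            }
            String[] columns = trimmed.split("\\|\\|", 2);
            if (columns.length == 2) {
                // MULTILINE lets ^ match at the start of every line of the document.
                patterns.put(columns[0].trim(), Pattern.compile(columns[1].trim(), Pattern.MULTILINE));
            }
        }
        return patterns;
    }

    /** Returns "NAME@offset" for every header match, grouped in pattern order. */
    public static List<String> findHeaders(String text, Map<String, Pattern> patterns) {
        List<String> found = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            Matcher matcher = entry.getValue().matcher(text);
            while (matcher.find()) {
                found.add(entry.getKey() + "@" + matcher.start());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, Pattern> patterns = parseBsv(Arrays.asList(
                "// made-up example entries",
                "HISTORY||^History:",
                "MEDS||^Medications:"));
        System.out.println(findHeaders("History:\nnone\nMedications:\naspirin", patterns));
    }
}
```

A sectionizer would then turn consecutive header offsets into Section spans; that step is left out here.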

Sectionizer

Annotates Document Sections by detecting Section Headers in template.

Source class: SectionSegmentAnnotator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.analysis_component.JCasAnnotator_ImplBase
Products: Section

No available configuration parameters.

Sentence Detector

Annotates Sentences based upon an OpenNLP model.

Source class: SentenceDetector
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Section
Products: Sentence

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| SentenceModelFile | Path to sentence detector model file | String | Yes | org/apache/ctakes/core/models/sentdetect/sd-med-model.zip |
| SegmentsToSkip | Set of segments that can be skipped | String[] | No | |

Single Sectionizer

Annotates Document as a single Section.

Source class: SimpleSegmentAnnotator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Products: Section

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| SegmentID | Name to give to all segments | String | No | SIMPLE_SEGMENT |

Tag Sectionizer

Annotates Document Sections by detecting start and end Section Tags.

Source class: SimpleSegmentWithTagsAnnotator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.analysis_component.JCasAnnotator_ImplBase
Products: Section

No available configuration parameters.

Thread Safe Sentence Detector

Annotates Sentences based upon an OpenNLP model.

Source class: ThreadSafeSentenceDetector
Source package: org.apache.ctakes.core.concurrent
Parent class: org.apache.ctakes.core.ae.SentenceDetector
Dependencies: Section
Products: Sentence

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| SentenceModelFile | Path to sentence detector model file | String | Yes | org/apache/ctakes/core/models/sentdetect/sd-med-model.zip |
| SegmentsToSkip | Set of segments that can be skipped | String[] | No | |

Thread Safe Sentence Detector BIO

Thread-safe sentence detector that uses BIO (begin, inside, outside) labels for determination. Useful for documents in which newlines may not indicate sentence boundaries.

Source class: ThreadSafeSentenceDetectorBio
Source package: org.apache.ctakes.core.concurrent
Parent class: org.apache.ctakes.core.ae.SentenceDetectorAnnotatorBIO
Dependencies: Section
Products: Sentence

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| classifierFactoryClassName | Provides the full name of the ClassifierFactory class to be used. | String | No | org.cleartk.ml.jar.JarClassifierFactory |
| dataWriterFactoryClassName | Provides the full name of the DataWriterFactory class to be used. | String | No | org.cleartk.ml.jar.DefaultDataWriterFactory |
| FeatureConfiguration | | FEAT_CONFIG | No | |
| isTraining | Determines whether this annotator is writing training data or using a classifier to annotate. Normally inferred automatically based on whether or not a DataWriterFactory class has been set. | Boolean | No | |
| TokenFilename | | String | No | |

Tokenizer

Annotates Document Tokens.

Source class: TokenizerAnnotator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.analysis_component.JCasAnnotator_ImplBase
Dependencies: Section
Products: Base Token

No available configuration parameters.


Output Writers

CUI Count Writer

Writes a two-column BSV file containing CUIs and their total counts in a document.

Source class: CuiCountFileWriter
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.uima.fit.component.CasConsumer_ImplBase
Dependencies: Document Id, Identified Annotation

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| OutputDirectory | Directory for all output files. | String | No | |
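
A two-column BSV of CUIs and counts is simple to picture. The sketch below tallies a list of CUI strings and renders one `CUI|count` line per distinct CUI. The class `CuiCountSketch` and the method `toBsv` are hypothetical, and the single-bar delimiter and sorted row order are assumptions rather than documented behavior of the writer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a CUI count table in bar-separated form; not cTAKES code. */
final class CuiCountSketch {

    /** Counts each CUI and renders "CUI|count" lines, sorted by CUI. */
    public static String toBsv(List<String> cuis) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String cui : cuis) {
            counts.merge(cui, 1, Integer::sum);
        }
        StringBuilder bsv = new StringBuilder();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            bsv.append(entry.getKey()).append('|').append(entry.getValue()).append('\n');
        }
        return bsv.toString();
    }

    public static void main(String[] args) {
        // Illustrative CUI strings standing in for the concepts found in one document.
        System.out.print(toBsv(Arrays.asList("C0008031", "C0011849", "C0008031")));
    }
}
```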

CUI List Writer

Writes a list of CUIs, covered text and preferred text to files.

Source class: CuiListFileWriter
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.ctakes.core.cc.AbstractJCasFileWriter
Dependencies: Document Id, Sentence, Base Token
Usables: Document Id Prefix, Identified Annotation, Event, Timex, Temporal Relation

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| OutputDirectory | Directory for all output files. | File | Yes | |
| SubDirectory | SubDirectory for files. | String | No | |

Document Text Writer

Writes Text files with original text from the document.

Source class: FilesInDirectoryCasConsumer
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.uima.collection.CasConsumer_ImplBase
Dependencies: Document Id

No available configuration parameters.

Document Text Writer (Dir)

Writes Text files with original text from the document in a specified directory.

Source class: NormalizedFilesInDirectoryCasConsumer
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.uima.collection.CasConsumer_ImplBase
Dependencies: Document Id, Base Token

No available configuration parameters.

HTML Table Writer

Writes HTML files with a Table representation of extracted information.

Source class: HtmlTableCasConsumer
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.uima.collection.CasConsumer_ImplBase
Dependencies: Base Token

No available configuration parameters.

HTML Writer

Writes html files with document text and simple markups (Semantic Group, CUI, Negation).

Source class: HtmlTextWriter
Source package: org.apache.ctakes.core.cc.html
Parent class: org.apache.ctakes.core.cc.AbstractJCasFileWriter
Dependencies: Document Id, Sentence, Base Token
Usables: Document Id Prefix, Identified Annotation, Event, Timex, Temporal Relation

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| OutputDirectory | Directory for all output files. | File | Yes | |
| SubDirectory | SubDirectory for files. | String | No | |

HTML Writer

Writes html files with document text and simple markups (Semantic Group, CUI, Negation).

Source class: HtmlTextWriter
Source package: org.apache.ctakes.core.cc.pretty.html
Parent class: org.apache.ctakes.core.cc.AbstractJCasFileWriter
Dependencies: Document Id, Sentence, Base Token
Usables: Document Id Prefix, Identified Annotation, Event, Timex, Temporal Relation

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| OutputDirectory | Directory for all output files. | File | Yes | |
| SubDirectory | SubDirectory for files. | String | No | |

I2b2JdbcWriter

Writes UMLS Concepts to a standard I2B2 Observation_Fact table.

Source class: I2b2JdbcWriter
Source package: org.apache.ctakes.core.cc.jdbc.i2b2
Parent class: org.apache.ctakes.core.cc.jdbc.AbstractJCasJdbcWriter
Dependencies: Identified Annotation

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| DbDriver | JDBC driver ClassName. | String | Yes | |
| DbPass | Password for database authentication. | String | Yes | |
| DbUrl | JDBC URL that specifies database network location and name. | String | Yes | |
| DbUser | Username for database authentication. | String | Yes | |
| FactOutputTable | Name of the Observation_Fact table for writing output. | String | Yes | |
| BatchSize | Number of statements to use in a batch. 0 or 1 denotes that batches should not be used. | String | No | |
| KeepAlive | Flag that determines whether to keep JDBC connection open no matter what. | String | No | |
| RepeatCuis | Repeat Concepts with the same Cui but possibly different Semantic Type or Preferred Text. | boolean | No | |

JDBC Writer (Template)

Stores extracted information and document metadata in a database.

Source class: JdbcWriterTemplate
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.ctakes.core.cc.AbstractJdbcWriter
Dependencies: Document Id, Identified Annotation

No available configuration parameters.

Medication Table Writer

Writes a table of Medication information to file, sorted by character index.

Source class: MedicationTableFileWriter
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.ctakes.core.cc.AbstractTableFileWriter
Dependencies: Document Id, Identified Annotation
Usables: Document Id Prefix

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| OutputDirectory | Directory for all output files. | File | Yes | |
| SubDirectory | SubDirectory for files. | String | No | |
| TableType | Type of Table to write to File. Possible values are: BSV, CSV, HTML, TAB | String | No | |

Pretty Text Writer

Writes text files with document text and simple markups (POS, Semantic Group, CUI, Negation).

Source class: PrettyTextWriterFit
Source package: org.apache.ctakes.core.cc.pretty.plaintext
Parent class: org.apache.ctakes.core.cc.AbstractJCasFileWriter
Dependencies: Document Id, Sentence, Base Token
Usables: Document Id Prefix, Identified Annotation, Event, Timex, Temporal Relation

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| OutputDirectory | Directory for all output files. | File | Yes | |
| SubDirectory | SubDirectory for files. | String | No | |

Pretty Text Writer (UIMA)

Writes text files with document text and simple markups (POS, Semantic Group, CUI, Negation).

Source class: PrettyTextWriterUima
Source package: org.apache.ctakes.core.cc.pretty.plaintext
Parent class: org.apache.uima.collection.CasConsumer_ImplBase
Dependencies: Document Id, Sentence, Base Token
Usables: Identified Annotation, Event, Timex, Temporal Relation

No available configuration parameters.

Property Text Writer

Writes text files with lists of annotations and properties (POS, Semantic Group, CUI, Negation).

Source class: PropertyTextWriterFit
Source package: org.apache.ctakes.core.cc.property.plaintext
Parent class: org.apache.uima.fit.component.CasConsumer_ImplBase
Dependencies: Document Id, Sentence, Identified Annotation

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| OutputDirectory | Directory for all output files. | String | No | |

Property Text Writer (UIMA)

Writes text files with lists of annotations and properties (POS, Semantic Group, CUI, Negation).

Source class: PropertyTextWriterUima
Source package: org.apache.ctakes.core.cc.property.plaintext
Parent class: org.apache.uima.collection.CasConsumer_ImplBase
Dependencies: Document Id, Sentence, Identified Annotation

No available configuration parameters.

Semantic Table Writer

Writes a table of Annotation information to file, grouped by Semantic Type.

Source class: SemanticTableFileWriter
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.ctakes.core.cc.AbstractTableFileWriter
Dependencies: Document Id, Identified Annotation
Usables: Document Id Prefix

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| OutputDirectory | Directory for all output files. | File | Yes | |
| SubDirectory | SubDirectory for files. | String | No | |
| TableType | Type of Table to write to File. Possible values are: BSV, CSV, HTML, TAB | String | No | |

Sentences Writer

Writes Text files with original text from the document, sentence by sentence.

Source class: SentenceTokensPrinter
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.uima.collection.CasConsumer_ImplBase
Dependencies: Document Id, Sentence, Base Token

No available configuration parameters.

Text Span Writer

Writes BSV files with original text for extracted annotations and their span offsets.

Source class: TextSpanWriter
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.uima.fit.component.CasConsumer_ImplBase
Dependencies: Identified Annotation

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| OutputDirectory | Directory for all output files. | String | No | |

Token Offset Writer

Writes a two-column BSV file containing Begin and End offsets of tokens in a document.

Source class: TokenOffsetsCasConsumer
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.uima.collection.CasConsumer_ImplBase
Dependencies: Document Id, Base Token

No available configuration parameters.
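
The output format can be illustrated with a sketch that emits one `begin|end` line per token. Whitespace splitting stands in for real tokenization here, since the writer itself consumes Base Token annotations produced upstream. The class `TokenOffsetSketch` and the method `offsetsBsv` are hypothetical, and exclusive end offsets are an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of a begin|end token offset listing; not cTAKES code. */
final class TokenOffsetSketch {

    // Stand-in tokenizer: every run of non-whitespace characters is one token.
    private static final Pattern TOKEN = Pattern.compile("\\S+");

    /** Renders one "begin|end" line per token, with an exclusive end offset. */
    public static String offsetsBsv(String text) {
        StringBuilder bsv = new StringBuilder();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            bsv.append(matcher.start()).append('|').append(matcher.end()).append('\n');
        }
        return bsv.toString();
    }

    public static void main(String[] args) {
        System.out.print(offsetsBsv("No fever"));
    }
}
```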

Token Table Writer

Writes a table of base tokens and their spans in a directory tree.

Source class: TokenTableFileWriter
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.ctakes.core.cc.AbstractTableFileWriter
Usables: Document Id Prefix, Base Token

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| OutputDirectory | Directory for all output files. | File | Yes | |
| SubDirectory | SubDirectory for files. | String | No | |
| TableType | Type of Table to write to File. Possible values are: BSV, CSV, HTML, TAB | String | No | |

Word Count Writer

Writes a two-column BSV file containing Words and their total counts in a document.

Source class: TokenFreqCasConsumer
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.uima.collection.CasConsumer_ImplBase
Dependencies: Base Token

No available configuration parameters.

XMI Writer

Writes XMI files with full representation of input text and all extracted information.

Source class: XmiWriterCasConsumerCtakes
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.uima.fit.component.CasConsumer_ImplBase
Dependencies: Document Id

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| OutputDirectory | Output directory to write xmi files | File | Yes | |

XMI Writer (Dir Tree)

Writes XMI files with full representation of input text and all extracted information.

Source class: FileTreeXmiWriter
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.ctakes.core.cc.AbstractJCasFileWriter
Dependencies: Document Id
Usables: Document Id Prefix

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| OutputDirectory | Directory for all output files. | File | Yes | |
| SubDirectory | SubDirectory for files. | String | No | |

XMI Writer 2

Writes XMI files with full representation of input text and all extracted information.

Source class: CasConsumer
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.uima.collection.CasConsumer_ImplBase
Dependencies: Document Id

No available configuration parameters.


Utilities

Annotation Remover

Removes annotations of a given type from the JCas.

Source class: FilterAnnotator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.analysis_component.JCasAnnotator_ImplBase
Dependencies: Base Token

No available configuration parameters.

CommandRunner

Runs an external process.

Source class: CommandRunner
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.ctakes.core.ae.AbstractCommandRunner

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| OutputDirectory | Directory for all output files. | File | Yes | |
| Command | A full command line to be executed. Make sure to quote. | String | No | |
| CommandDir | The Command Executable's directory. | String | No | |
| Log | A name for the streaming logger. Default is the Command. | String | No | |
| LogFile | File to which cTAKES output should be sent. | String | No | |
| Pause | Pause for some seconds. Default is 0 | int | No | |
| PerDoc | yes to run the command once per document. Default is no. | String | No | no |
| SetJavaHome | Set JAVA_HOME to the Java running cTAKES. Default is yes. | String | No | yes |
| Wait | Wait for the process to finish. Default is no. | String | No | no |
| WorkingDir | The working directory. | String | No | |

CtakesRunner

Starts a new instance of cTAKES with the given piper parameters.

Source class: CtakesRunner
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.ctakes.core.ae.PausableFileLoggerAE

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| OutputDirectory | Directory for all output files. | File | Yes | |
| Pipeline | Piper parameters. Make sure to quote. | String | Yes | |
| LogFile | File to which cTAKES output should be sent. | String | No | |
| Pause | Pause for some seconds. Default is 0 | int | No | |
| Wait | Wait for the process to finish. Default is no. | String | No | no |

Deprecated Finished Logger

Deprecated. Use FinishedLogger in the log subpackage (org.apache.ctakes.core.util.log) instead.

Source class: FinishedLogger
Source package: org.apache.ctakes.core.util
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase

No available configuration parameters.

Document ID Printer

Logs the Document ID to Log4j and Standard Output.

Source class: DocumentIdPrinterAnalysisEngine
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Document Id

No available configuration parameters.

ExitForcer

Forcibly Exits cTAKES. Use only at the end of a pipeline.

Source class: ExitForcer
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.ctakes.core.ae.inert.PausableAE

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| ForceExit | Forcibly exits the system when the value is yes. Yes by default. | String | No | yes |
| Pause | Pause for some seconds. Default is 0 | int | No | |
| Wait | Wait for the process to finish. Default is no. | String | No | no |

Finished Logger

Writes a banner message COMPLETE to the log when all processing is finished.

Source class: FinishedLogger
Source package: org.apache.ctakes.core.util.log
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase

No available configuration parameters.

JCas Copy Annotator

Copies document text and all annotations into a new JCas.

Source class: CopyAnnotator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| dataBindMap | Mapping between source methods and destination methods in a bar ("\|") separated format | String[] | Yes | |
| destObjClass | Name of destination class | String | Yes | |
| srcObjClass | Name of source class | String | Yes | |

Knowtator XML Reader (SHARP)

Reads annotations from SHARP schema Knowtator XML files in a directory.

Source class: SHARPKnowtatorXMLReader
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Products: Identified Annotation, Event, Timex, Location Relation, Degree Relation, Temporal Relation

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| SetDefaults | Whether or not to set default attribute values if no annotation is present | boolean | Yes | |
| TextDirectory | Directory containing the text files (if DocumentIDs are just filenames); defaults to assuming that DocumentIDs are full file paths | File | No | |

MrsDrSentenceJoiner

Joins Sentences that SentenceDetectorBIO has split after a person title such as Mr., Mrs. or Dr.

Source class: MrsDrSentenceJoiner
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Sentence

No available configuration parameters.
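
The repair amounts to merging a sentence that ends in a title with the sentence that follows it. The sketch below does this on plain strings; the annotator itself operates on Sentence annotations, so this only illustrates the idea. The class `TitleJoinSketch`, the method `join` and the exact title pattern are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of rejoining sentences split after a person title; not cTAKES code. */
final class TitleJoinSketch {

    // Matches a sentence whose last word is Mr., Mrs. or Dr.
    private static final Pattern ENDS_WITH_TITLE = Pattern.compile("(?:^|\\s)(?:Mr|Mrs|Dr)\\.$");

    /** Appends each sentence to its predecessor when the predecessor ends with a title. */
    public static List<String> join(List<String> sentences) {
        List<String> joined = new ArrayList<>();
        for (String sentence : sentences) {
            int last = joined.size() - 1;
            if (last >= 0 && ENDS_WITH_TITLE.matcher(joined.get(last)).find()) {
                joined.set(last, joined.get(last) + " " + sentence);
            } else {
                joined.add(sentence);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        System.out.println(join(Arrays.asList("Seen by Dr.", "Smith today.", "No complaints.")));
    }
}
```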

Null Annotator

Does absolutely nothing.

Source class: NullAnnotator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.analysis_component.JCasAnnotator_ImplBase

No available configuration parameters.

Overlap Annotator

Removes or modifies annotations that overlap.

Source class: OverlapAnnotator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.analysis_component.JCasAnnotator_ImplBase
Dependencies: Base Token

No available configuration parameters.
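
The description does not say which overlaps are removed or how annotations are modified, so the following Python sketch shows just one possible policy as an assumption: drop any span that is fully contained in another span and drop exact duplicates, while leaving partial overlaps alone. The function name and the span representation are hypothetical.

```python
def remove_contained(spans):
    """Keep only spans that are not contained in another span.

    spans is an iterable of (begin, end) offsets. Exact duplicates are collapsed.
    """
    # Sort by begin ascending, then by end descending, so that a containing
    # span always comes before the spans it contains.
    ordered = sorted(set(spans), key=lambda span: (span[0], -span[1]))
    kept = []
    max_end = -1
    for begin, end in ordered:
        # Every earlier span begins at or before this one, so this span is
        # contained in an earlier span exactly when its end does not exceed max_end.
        if end > max_end:
            kept.append((begin, end))
            max_end = end
    return kept


result = remove_contained([(0, 10), (2, 5), (8, 12), (8, 12), (20, 25)])
print(result)
```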

PatientNoteCollector

Caches each Document JCas in a Patient JCas as a View.

Source class: PatientNoteCollector
Source package: org.apache.ctakes.core.patient
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase

No available configuration parameters.

PiperFileRunEngine

Analysis Engine that executes the PiperFileRunner. This is a kludge for use with descriptor (desc) files in a UIMA Collection Processing Engine (CPE).

Source class: PiperFileRunEngine
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| PiperParams | Command line parameters normally used to run a piper file. | String | Yes | |

PythonPipper

Installs a specified Python package using pip.

Source class: PythonPipper
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.ctakes.core.ae.PythonRunner

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| OutputDirectory | Directory for all output files. | File | Yes | |
| PipPackage | Path of the Python package to install with pip. | String | Yes | |
| Command | A full command line to be executed. Make sure to quote. | String | No | |
| CommandDir | The command executable's directory. | String | No | |
| Log | A name for the streaming logger. Default is the Command. | String | No | |
| LogFile | File to which cTAKES output should be sent. | String | No | |
| Pause | Pause for some seconds. Default is 0. | int | No | |
| PerDoc | Set to yes to run the command once per document. Default is no. | String | No | no |
| VirtualEnv | Path to Python virtual environment. | String | No | |
| Wait | Wait for the process to finish. Default is no. | String | No | no |
| WorkingDir | The working directory. | String | No | |
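
As a rough illustration of how PipPackage and VirtualEnv could combine into a pip invocation, the Python sketch below builds the argument list for `python -m pip install`. That pip is invoked this way is an assumption, as are the function name and the example paths. The `bin` and `Scripts` subdirectories are the standard locations of the interpreter inside a virtual environment on POSIX and Windows respectively.

```python
import os
import sys


def pip_install_command(package, virtual_env=None):
    """Build the argument list for installing a package with pip.

    If virtual_env is given, its interpreter is used; otherwise the current one.
    """
    if virtual_env:
        bin_dir = "Scripts" if os.name == "nt" else "bin"
        python = os.path.join(virtual_env, bin_dir, "python")
    else:
        python = sys.executable
    return [python, "-m", "pip", "install", package]


# Hypothetical package path and environment location.
command = pip_install_command("/path/to/my_package", virtual_env="/path/to/venv")
print(command)
```

Returning a list rather than a single string avoids quoting problems when the package path contains spaces.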

PythonRunner

Starts a Python process with the given parameters.

Source class: PythonRunner
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.ctakes.core.ae.AbstractCommandRunner

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| OutputDirectory | Directory for all output files. | File | Yes | |
| Command | A full command line to be executed. Make sure to quote. | String | No | |
| CommandDir | The command executable's directory. | String | No | |
| Log | A name for the streaming logger. Default is the Command. | String | No | |
| LogFile | File to which cTAKES output should be sent. | String | No | |
| Pause | Pause for some seconds. Default is 0. | int | No | |
| PerDoc | Set to yes to run the command once per document. Default is no. | String | No | no |
| VirtualEnv | Path to Python virtual environment. | String | No | |
| Wait | Wait for the process to finish. Default is no. | String | No | no |
| WorkingDir | The working directory. | String | No | |
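
The interplay of Wait, Pause and WorkingDir can be pictured with a short Python sketch built on the standard `subprocess` module. It assumes that Pause is a delay applied after launching a process that is not waited for. That reading, along with the function and argument names, is a guess for illustration and does not describe the Java implementation.

```python
import subprocess
import sys
import time


def start_python(args, working_dir=None, wait=False, pause_seconds=0):
    """Start a Python process and either wait for it or pause briefly.

    args are the arguments passed to the interpreter, for example ["-m", "my_module"].
    """
    process = subprocess.Popen([sys.executable, *args], cwd=working_dir)
    if wait:
        # Block until the process exits so that its return code is available.
        process.wait()
    elif pause_seconds > 0:
        # Assumed meaning of Pause: give the process time to start up.
        time.sleep(pause_seconds)
    return process


finished = start_python(["-c", "import sys; sys.exit(3)"], wait=True)
print(finished.returncode)
```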

Start or Finish Logger

Simple annotator to place before and after other annotators that do not log their own start and finish.

Source class: StartFinishLogger
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase

| Parameter | Description | Class | Required | Default |
| --- | --- | --- | --- | --- |
| LOGGER_NAME | Provides the full name of the Annotator Engine for which start / end logging should be done. | String | Yes | StartEndProgressLogger |
| IS_START | Indicates whether this should log a start. | Boolean | No | |
| LOGGER_TASK | Provides the descriptive purpose of the Annotator Engine for which start / end logging should be done. | String | No | Processing ... |

Piper Files

Default Tokenizer Pipeline

Commands and parameters for a small tokenization pipeline.

$\textcolor{gray}{\textsf{// Commands and parameters for a small tokenization pipeline. }}$

$\textcolor{green}{\textbf{add}}$ SimpleSegmentAnnotator
$\textcolor{green}{\textbf{add}}$ SentenceDetector
$\textcolor{green}{\textbf{add}}$ TokenizerAnnotatorPTB

Full Tokenizer Pipeline

Commands and parameters for a small tokenization pipeline with sections, paragraphs and lists.

$\textcolor{gray}{\textsf{// Commands and parameters for a small tokenization pipeline with sections, paragraphs and lists. }}$

$\textcolor{gray}{\textsf{// Annotate sections by known regex }}$
$\textcolor{green}{\textbf{add}}$ BsvRegexSectionizer

$\textcolor{gray}{\textsf{// The sentence detector needs our custom model path, otherwise default values are used. }}$
$\textcolor{gray}{\textsf{//add SentenceDetectorAnnotatorBIO classifierJarPath=/org/apache/ctakes/core/models/sentdetect/model.jar }}$

$\textcolor{gray}{\textsf{// The SentenceDetectorAnnotatorBIO is a "lumper" that works well for notes in which end of line does not indicate a sentence. }}$
$\textcolor{gray}{\textsf{// If that is not your case, then you may get better results using the more standard SentenceDetector }}$
$\textcolor{green}{\textbf{add}}$ SentenceDetector

$\textcolor{gray}{\textsf{// By default, paragraphs are parsed using empty lines as separators and Part \#: }}$
$\textcolor{green}{\textbf{add}}$ ParagraphAnnotator
$\textcolor{gray}{\textsf{// Fix sentences so that no sentence spans across two or more paragraphs. }}$
$\textcolor{green}{\textbf{add}}$ ParagraphSentenceFixer

$\textcolor{gray}{\textsf{// Use regular expressions created for the Pitt notes to discover formatted lists and tables. }}$
$\textcolor{green}{\textbf{add}}$ ListAnnotator
$\textcolor{gray}{\textsf{// Fix sentences so that no sentence spans across two or more list entries. }}$
$\textcolor{green}{\textbf{add}}$ ListSentenceFixer

$\textcolor{gray}{\textsf{// Now we can finally tokenize, tag parts of speech and chunk using adjusted sentences. }}$
$\textcolor{green}{\textbf{add}}$ TokenizerAnnotatorPTB
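
The comments for ParagraphSentenceFixer and ListSentenceFixer in the pipeline above describe the same kind of adjustment: a sentence must not cross a paragraph or list-entry boundary. A minimal Python sketch of that idea, assuming plain (begin, end) character offsets and a hypothetical function name, clips each sentence to every container it overlaps. It is not the cTAKES implementation.

```python
def split_at_boundaries(sentences, containers):
    """Split sentence spans so that none crosses a container boundary.

    sentences and containers are lists of (begin, end) offsets in document order.
    Containers would be paragraphs or list entries.
    """
    fixed = []
    for s_begin, s_end in sentences:
        pieces = []
        for c_begin, c_end in containers:
            # The overlap of the sentence with this container, if any.
            begin, end = max(s_begin, c_begin), min(s_end, c_end)
            if begin < end:
                pieces.append((begin, end))
        # A sentence that overlaps no container is kept unchanged.
        fixed.extend(pieces or [(s_begin, s_end)])
    return fixed


paragraphs = [(0, 20), (22, 50)]
sentences = [(0, 30), (32, 50)]
fixed = split_at_boundaries(sentences, paragraphs)
print(fixed)
```

Running the fixers before tokenization, as the pipeline does, means tokens are produced from the adjusted sentence spans.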

Ts Default Tokenizer Pipeline

Commands and parameters for a small thread-safe tokenization pipeline.

$\textcolor{gray}{\textsf{// Commands and parameters for a small thread-safe tokenization pipeline. }}$

$\textcolor{green}{\textbf{add}}$ SimpleSegmentAnnotator
$\textcolor{green}{\textbf{add}}$ $\textcolor{blue}{\textsf{concurrent.ThreadSafeSentenceDetector}}$
$\textcolor{green}{\textbf{add}}$ TokenizerAnnotatorPTB

Ts Full Tokenizer Pipeline

Commands and parameters for a small thread-safe tokenization pipeline with sections, paragraphs and lists.

$\textcolor{gray}{\textsf{// Commands and parameters for a small thread-safe tokenization pipeline with sections, paragraphs and lists. }}$

$\textcolor{gray}{\textsf{// Annotate sections by known regex }}$
$\textcolor{green}{\textbf{add}}$ BsvRegexSectionizer

$\textcolor{gray}{\textsf{// The sentence detector needs our custom model path, otherwise default values are used. }}$
$\textcolor{gray}{\textsf{//add concurrent.ThreadSafeSentenceDetectorBio classifierJarPath=/org/apache/ctakes/core/models/sentdetect/model.jar }}$

$\textcolor{gray}{\textsf{// The SentenceDetectorAnnotatorBIO is a "lumper" that works well for notes in which end of line does not indicate a sentence. }}$
$\textcolor{gray}{\textsf{// If that is not your case, then you may get better results using the more standard SentenceDetector }}$
$\textcolor{green}{\textbf{add}}$ $\textcolor{blue}{\textsf{concurrent.ThreadSafeSentenceDetector}}$

$\textcolor{gray}{\textsf{// By default, paragraphs are parsed using empty lines as separators and Part \#: }}$
$\textcolor{green}{\textbf{add}}$ ParagraphAnnotator
$\textcolor{gray}{\textsf{// Fix sentences so that no sentence spans across two or more paragraphs. }}$
$\textcolor{green}{\textbf{add}}$ ParagraphSentenceFixer

$\textcolor{gray}{\textsf{// Use regular expressions created for the Pitt notes to discover formatted lists and tables. }}$
$\textcolor{green}{\textbf{add}}$ ListAnnotator
$\textcolor{gray}{\textsf{// Fix sentences so that no sentence spans across two or more list entries. }}$
$\textcolor{green}{\textbf{add}}$ ListSentenceFixer

$\textcolor{gray}{\textsf{// Now we can finally tokenize, tag parts of speech and chunk using adjusted sentences. }}$
$\textcolor{green}{\textbf{add}}$ TokenizerAnnotatorPTB
