ctakes core - apache/ctakes GitHub Wiki

Contains code and resources required by all or most other cTAKES modules.

Collection Readers
Annotation Engines
Output Writers
Utilities
Piper Files

Collection Readers

File Tree Reader

Reads document texts from text files in a directory tree.

Source class: FileTreeReader
Source package: org.apache.ctakes.core.cr
Parent class: org.apache.ctakes.core.cr.AbstractFileTreeReader
Products: Document Id, Document Id Prefix

Parameter	Description	Class	Required	Default
InputDirectory	Directory for all input files.	String	Yes
CRtoSpace	Change Windows-format CR + LF character sequences to LF + space.	boolean	No
Encoding	The character encoding used by the input files.	String	No
Extensions	The extensions of the files that the collection reader will read.	String[]	No	*
KeepCR	Keep Windows-format carriage return characters at line endings. This only keeps existing characters; it does not add them.	boolean	No
PatientLevel	The level in the directory hierarchy at which patient identifiers exist. The default value is 1, meaning directly under the root input directory.	int	No
StripQuotes	Replace document-enclosing quote characters with space characters.	boolean	No
WriteBanner	Write a large banner at each major step of the pipeline.	String	No	no
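
The interaction of KeepCR and CRtoSpace can be sketched as follows. This is a minimal illustration in Python, not the reader's actual code: the helper name `normalize_eol` and the rule that KeepCR takes precedence over CRtoSpace are assumptions.

```python
def normalize_eol(text: str, keep_cr: bool = False, cr_to_space: bool = False) -> str:
    """Hypothetical helper mirroring the KeepCR and CRtoSpace descriptions."""
    if keep_cr:
        # Assumption: KeepCR wins and the text is left untouched.
        return text
    if cr_to_space:
        # CR + LF becomes LF + space, so the text keeps its original length.
        return text.replace("\r\n", "\n ")
    # Otherwise drop the carriage return and keep only the line feed.
    return text.replace("\r\n", "\n")


windows_text = "BP 120/80\r\nHR 72\r\n"
converted = normalize_eol(windows_text, cr_to_space=True)
print(repr(converted))
print(len(converted) == len(windows_text))
```

Replacing two characters with two characters leaves every later character at the same offset, which matters when offsets are compared against the original file.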

Files in Dir Cycle Reader

Reads document texts from text files in a directory, repeating for a number of iterations.

Source class: FilesInDirectoryCollectionCyclicalReads
Source package: org.apache.ctakes.core.cr
Parent class: org.apache.ctakes.core.cr.FilesInDirectoryCollectionReader
Products: Document Id

No available configuration parameters.

Files in Dir Reader

Reads document texts from text files in a directory.

Source class: FilesInDirectoryCollectionReader
Source package: org.apache.ctakes.core.cr
Parent class: org.apache.uima.collection.CollectionReader_ImplBase
Products: Document Id

No available configuration parameters.

JDBC Note Table Reader

Reads document texts from the fields of a database table.

Source class: JdbcNotesReader
Source package: org.apache.ctakes.core.cr.jdbc
Parent class: org.apache.uima.fit.component.JCasCollectionReader_ImplBase
Products: Document Id

Parameter	Description	Class	Required
DbDriver	JDBC driver ClassName.	String	Yes
DbPass	Password for database authentication.	String	Yes
DbUrl	JDBC URL that specifies database network location and name.	String	Yes
DbUser	Username for database authentication.	String	Yes
DocColumn	Name of column that contains the document text.	String	Yes
SqlStatement	SQL statement to retrieve the document.	String	Yes
BirthColumn	Name of column that contains the patient birth date.	String	No
DateColumn	Name of column that contains the document original date.	String	No
DbDecryptor	JDBC decryptor ClassName.	String	No
DeathColumn	Name of column that contains the patient death date.	String	No
DecryptPass	Password for text decryption.	String	No
EncounterIdColumn	Name of column that contains the encounter id.	String	No
FirstNameColumn	Name of column that contains the patient first name.	String	No
FirstSoundexColumn	Name of column that contains the patient first name soundex.	String	No
GenderColumn	Name of column that contains the patient gender.	String	No
IdColumns	Specifies column names that will be used to form a document ID.	String[]	No
IdDelimiter	Specifies delimiter used when document ID is built.	String	No
InstanceIdColumn	Name of column that contains the document instance id.	String	No
InstituteColumn	Name of column that contains the source institution.	String	No
KeepAlive	Flag that determines whether to keep JDBC connection open no matter what.	String	No
LastNameColumn	Name of column that contains the patient last name.	String	No
LastSoundexColumn	Name of column that contains the patient last name soundex.	String	No
MiddleNameColumn	Name of column that contains the patient middle name.	String	No
NoteSubtypeColumn	Name of column that contains the note subtype.	String	No
NoteTypeColumn	Name of column that contains the note type.	String	No
PatientColumn	Name of column that contains the patient identifier.	String	No
PatientIdColumn	Name of column that contains the patient id.	String	No
RevisionColumn	Name of column that contains the document revision number.	String	No
RevisionDateColumn	Name of column that contains the document revision date.	String	No
SpecialtyColumn	Name of column that contains the author specialty.	String	No
StandardColumn	Name of column that contains the document standard.	String	No
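
How IdColumns and IdDelimiter could combine into a document ID is sketched below. The function name, the sample column names and the `_` default delimiter are assumptions made for illustration; the table above does not state a default.

```python
from typing import Mapping, Sequence


def build_document_id(row: Mapping[str, object],
                      id_columns: Sequence[str],
                      delimiter: str = "_") -> str:
    """Join the values of the ID columns, in order, with the delimiter."""
    return delimiter.join(str(row[column]) for column in id_columns)


# Hypothetical result-set row; the column names are made up for illustration.
row = {"PATIENT_NUM": 1042, "ENCOUNTER_NUM": 77, "NOTE_TEXT": "Patient seen today."}
print(build_document_id(row, ["PATIENT_NUM", "ENCOUNTER_NUM"]))
```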

JDBC Reader

Reads document texts from database text fields.

Source class: JdbcCollectionReader
Source package: org.apache.ctakes.core.cr
Parent class: org.apache.uima.fit.component.JCasCollectionReader_ImplBase
Products: Document Id

Parameter	Description	Class	Required
DbConnResrcName	Name of external resource for database connection.	String	Yes
DocTextColName	Name of column from resultset that contains the document text.	String	Yes
SqlStatement	SQL statement to retrieve the document.	String	Yes
DocIdColNames	Specifies column names that will be used to form a document ID.	String[]	No
DocIdDelimiter	Specifies delimiter used when document ID is built.	String	No
ValueFileResrcName	Name of external resource for prepared statement value file.	String	No

Lines in File Reader

Reads document texts from a single text file, treating each line as a document.

Source class: LinesFromFileCollectionReader
Source package: org.apache.ctakes.core.cr
Parent class: org.apache.uima.collection.CollectionReader_ImplBase
Products: Document Id

No available configuration parameters.
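
The line-per-document idea can be sketched in a few lines of Python. This is not the reader's implementation; in particular, using the 1-based line number as the document ID is an assumption.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_line_documents(source: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield one (document id, document text) pair per line of the source."""
    for line_number, line in enumerate(source, start=1):
        text = line.rstrip("\r\n")
        # Assumption: the document ID is simply the 1-based line number.
        yield str(line_number), text


sample = io.StringIO("First note.\nSecond note.\n")
for doc_id, text in read_line_documents(sample):
    print(doc_id, text)
```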

Lucene Field Reader

Reads document texts from Lucene text fields.

Source class: LuceneCollectionReader
Source package: org.apache.ctakes.core.cr
Parent class: org.apache.uima.fit.component.CasCollectionReader_ImplBase
Products: Document Id

Parameter	Description	Class	Required
IndexDirectory	Location of the Lucene index	String	Yes
FieldName	Field to look in for document text	String	No
MaxWords	Maximum number of words to process (approximate -- actually based on characters)	int	No

Text Files Reader

Reads document texts from text files specified in a provided list.

Source class: TextReader
Source package: org.apache.ctakes.core.cr
Parent class: org.apache.uima.fit.component.JCasCollectionReader_ImplBase
Products: Document Id

Parameter	Description	Class	Required	Default
files	The text files to be loaded	List	Yes

XMI Reader (1)

Reads document texts and annotations from XMI files specified in a provided list.

Source class: XMIReader
Source package: org.apache.ctakes.core.cr
Parent class: org.apache.uima.fit.component.JCasCollectionReader_ImplBase
Products: Document Id

Parameter	Description	Class	Required	Default
files	The XMI files to be loaded	List	Yes

XMI Tree Reader

Reads document texts and annotations from XMI files in a directory tree.

Source class: XmiTreeReader
Source package: org.apache.ctakes.core.cr
Parent class: org.apache.ctakes.core.cr.AbstractFileTreeReader
Products: Document Id

Parameter	Description	Class	Required	Default
InputDirectory	Directory for all input files.	String	Yes
CRtoSpace	Change Windows-format CR + LF character sequences to LF + space.	boolean	No
Encoding	The character encoding used by the input files.	String	No
Extensions	The extensions of the files that the collection reader will read.	String[]	No	*
KeepCR	Keep Windows-format carriage return characters at line endings. This only keeps existing characters; it does not add them.	boolean	No
PatientLevel	The level in the directory hierarchy at which patient identifiers exist. The default value is 1, meaning directly under the root input directory.	int	No
StripQuotes	Replace document-enclosing quote characters with space characters.	boolean	No
WriteBanner	Write a large banner at each major step of the pipeline.	String	No	no
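
One way to read the PatientLevel description is as an index into the directories between the root input directory and the file. The sketch below follows that reading; the function name and the example paths are hypothetical, and the real reader may resolve patient identifiers differently.

```python
from pathlib import PurePosixPath
from typing import Optional


def patient_id(root: str, file_path: str, patient_level: int = 1) -> Optional[str]:
    """Return the directory name found patient_level levels below root."""
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(root))
    directories = relative.parts[:-1]  # drop the file name itself
    if 1 <= patient_level <= len(directories):
        return directories[patient_level - 1]
    return None  # no directory exists at the requested level


print(patient_id("/data/notes", "/data/notes/patientA/2020/note1.xmi"))
print(patient_id("/data/notes", "/data/notes/siteX/patientB/note2.xmi", patient_level=2))
```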

XMI in Dir Reader (1)

Reads document texts and annotations from XMI files in a directory.

Source class: XmiCollectionReaderCtakes
Source package: org.apache.ctakes.core.cr
Parent class: org.apache.uima.collection.CollectionReader_ImplBase
Products: Document Id

No available configuration parameters.

Annotation Engines

CCDA Sectionizer

Annotates Document Sections by detecting Section Headers using Regular Expressions provided in a File.

Source class: CDASegmentAnnotator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Document Id
Products: Section

Parameter	Description	Class	Required	Default
sections_file	Path to File that contains the section header mappings	String	No	src/user/resources/org/apache/ctakes/core/sections/ccda_sections.txt

End of Line Sentence Splitter

Re-annotates Sentences based upon short lines, preventing a Sentence from spanning over an intentional line break.

Source class: EolSentenceFixer
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Sentence

No available configuration parameters.

LabValueFinder

Associates Lab Mentions with values.

Source class: LabValueFinder
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Section, Base Token, Identified Annotation
Products: Generic Relation

Parameter	Description	Class	Required	Default
labTUIs	TUIs indicating lab measurements	String[]	Yes
allSections	Use all Annotatable sections. This ignores the value of sections	String	No	true
excludeCUIs	CUIs not indicating specific lab measurements	String[]	No
maxLineCount	Maximum newlines between lab and value	int	No
sections	Annotatable sections	String[]	No
useDrugs	Use Medications in addition to Labs.	String	No	false
valueWords	Words indicating values	String[]	No
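
The maxLineCount limit can be pictured as a count of newline characters between the end of a lab mention and the start of a candidate value. The following Python sketch assumes the value follows the lab in the text and uses a hypothetical function name; it is not the annotator's own logic.

```python
def within_line_limit(text: str, lab_end: int, value_begin: int,
                      max_line_count: int) -> bool:
    """True if at most max_line_count newlines separate the lab from the value."""
    # Assumption: the value starts after the lab mention ends.
    newlines_between = text.count("\n", lab_end, value_begin)
    return newlines_between <= max_line_count


note = "Glucose\n\n105 mg/dL"
lab_end = len("Glucose")          # offset just past the lab mention
value_begin = note.index("105")   # offset where the value starts
print(within_line_limit(note, lab_end, value_begin, max_line_count=1))
print(within_line_limit(note, lab_end, value_begin, max_line_count=2))
```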

List Annotator

Annotates formatted List Sections by detecting them using Regular Expressions provided in an input File.

Source class: ListAnnotator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Section
Products: List

Parameter	Description	Class	Required	Default
LIST_TYPES_PATH	path to a file containing a list of regular expressions and corresponding list types.	String	Yes	org/apache/ctakes/core/list/DefaultListRegex.bsv

List Entry Negator

Checks List Entries for negation, which may be exhibited differently from unstructured negation.

Source class: ListEntryNegator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: List, Identified Annotation

No available configuration parameters.

List Paragraph Fixer

Re-annotates Paragraphs based upon existing Lists, preventing a Paragraph from spanning more than one List.

Source class: ListParagraphFixer
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: List, Sentence

No available configuration parameters.

List Sentence Splitter

Re-annotates Sentences based upon existing List Entries, preventing a Sentence from spanning more than one List Entry.

Source class: ListSentenceFixer
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: List, Sentence

No available configuration parameters.

PTB Tokenizer

Annotates Document Penn TreeBank Tokens.

Source class: TokenizerAnnotatorPTB
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Section, Sentence
Products: Base Token

Parameter	Description	Class	Required	Default
SegmentsToSkip	Set of segments that can be skipped	String[]	No

Paragraph Annotator

Annotates Paragraphs by detecting them using Regular Expressions provided in an input File or by empty text lines.

Source class: ParagraphAnnotator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Section
Products: Paragraph

Parameter	Description	Class	Required	Default
PARAGRAPH_TYPES_PATH	path to a file containing a list of regular expressions and corresponding paragraph types.	String	No

Paragraph Sentence Splitter

Re-annotates Sentences based upon existing Paragraphs, preventing a Sentence from spanning more than one Paragraph.

Source class: ParagraphSentenceFixer
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Paragraph, Sentence

No available configuration parameters.
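
The effect of this kind of fixer can be sketched on plain character spans: each sentence is clipped to the paragraphs it overlaps, so no resulting sentence crosses a paragraph boundary. The span representation and function name below are assumptions for illustration, not the cTAKES type system.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets, end exclusive


def split_sentences_by_paragraphs(sentences: List[Span],
                                  paragraphs: List[Span]) -> List[Span]:
    """Clip every sentence to the paragraphs it overlaps."""
    fixed: List[Span] = []
    for s_begin, s_end in sentences:
        pieces = []
        for p_begin, p_end in paragraphs:
            begin, end = max(s_begin, p_begin), min(s_end, p_end)
            if begin < end:
                pieces.append((begin, end))
        # A sentence that overlaps no paragraph is kept unchanged.
        fixed.extend(pieces or [(s_begin, s_end)])
    return fixed


# One sentence spanning two paragraphs becomes two sentences.
print(split_sentences_by_paragraphs([(0, 30)], [(0, 12), (14, 30)]))
```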

Prose Sentence Detector

Sentence detector that uses BIO (Begin, Inside, Outside) tagging to determine sentence boundaries. Useful for documents in which newlines may not indicate sentence boundaries.

Source class: SentenceDetectorAnnotatorBIO
Source package: org.apache.ctakes.core.ae
Parent class: org.cleartk.ml.CleartkAnnotator
Dependencies: Section
Products: Sentence

Parameter	Description	Class	Required	Default
classifierFactoryClassName	provides the full name of the ClassifierFactory class to be used.	String	No	org.cleartk.ml.jar.JarClassifierFactory
dataWriterFactoryClassName	provides the full name of the DataWriterFactory class to be used.	String	No	org.cleartk.ml.jar.DefaultDataWriterFactory
FeatureConfiguration		FEAT_CONFIG	No
isTraining	determines whether this annotator is writing training data or using a classifier to annotate. Normally inferred automatically based on whether or not a DataWriterFactory class has been set.	Boolean	No
TokenFilename		String	No

Regex Sectionizer

Annotates Document Sections by detecting Section Headers using Regular Expressions provided in a Bar-Separated-Value (BSV) File.

Source class: BsvRegexSectionizer
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.ctakes.core.ae.RegexSectionizer
Products: Section

Parameter	Description	Class	Required	Default
SectionsBsv	path to a BSV file containing a list of regular expressions and corresponding section types.	String	Yes	org/apache/ctakes/core/sections/DefaultSectionRegex.bsv
TagDividers	True if lines of divider characters ____ , ---- , === should divide sections	boolean	No	true
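
What TagDividers does to a document can be approximated with a regular expression over divider lines. The sketch below assumes that a divider is a line consisting only of three or more `_`, `-` or `=` characters; the actual pattern used by the sectionizer is not given here and may differ.

```python
import re
from typing import List

# Assumption: a divider line holds only three or more '_', '-' or '=' characters.
DIVIDER_LINE = re.compile(r"^[ \t]*(?:_{3,}|-{3,}|={3,})[ \t]*$", re.MULTILINE)


def split_on_dividers(text: str) -> List[str]:
    """Split text into chunks wherever a divider line occurs."""
    return [part.strip() for part in DIVIDER_LINE.split(text) if part.strip()]


note = "HISTORY\nNo complaints.\n-----\nPLAN\nFollow up in 2 weeks.\n"
for section_text in split_on_dividers(note):
    print(repr(section_text))
```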

Sectionizer

Annotates Document Sections by detecting Section Headers in template.

Source class: SectionSegmentAnnotator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.analysis_component.JCasAnnotator_ImplBase
Products: Section

No available configuration parameters.

Sentence Detector

Annotates Sentences based upon an OpenNLP model.

Source class: SentenceDetector
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Section
Products: Sentence

Parameter	Description	Class	Required	Default
SentenceModelFile	Path to sentence detector model file	String	Yes	org/apache/ctakes/core/models/sentdetect/sd-med-model.zip
SegmentsToSkip	Set of segments that can be skipped	String[]	No

Single Sectionizer

Annotates Document as a single Section.

Source class: SimpleSegmentAnnotator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Products: Section

Parameter	Description	Class	Required	Default
SegmentID	Name to give to all segments	String	No	SIMPLE_SEGMENT

Tag Sectionizer

Annotates Document Sections by detecting start and end Section Tags.

Source class: SimpleSegmentWithTagsAnnotator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.analysis_component.JCasAnnotator_ImplBase
Products: Section

No available configuration parameters.

Thread Safe Sentence Detector

Annotates Sentences based upon an OpenNLP model.

Source class: ThreadSafeSentenceDetector
Source package: org.apache.ctakes.core.concurrent
Parent class: org.apache.ctakes.core.ae.SentenceDetector
Dependencies: Section
Products: Sentence

Parameter	Description	Class	Required	Default
SentenceModelFile	Path to sentence detector model file	String	Yes	org/apache/ctakes/core/models/sentdetect/sd-med-model.zip
SegmentsToSkip	Set of segments that can be skipped	String[]	No

Thread Safe Sentence Detector BIO

Thread-safe sentence detector that uses BIO (Begin, Inside, Outside) tagging to determine sentence boundaries. Useful for documents in which newlines may not indicate sentence boundaries.

Source class: ThreadSafeSentenceDetectorBio
Source package: org.apache.ctakes.core.concurrent
Parent class: org.apache.ctakes.core.ae.SentenceDetectorAnnotatorBIO
Dependencies: Section
Products: Sentence

Parameter	Description	Class	Required	Default
classifierFactoryClassName	provides the full name of the ClassifierFactory class to be used.	String	No	org.cleartk.ml.jar.JarClassifierFactory
dataWriterFactoryClassName	provides the full name of the DataWriterFactory class to be used.	String	No	org.cleartk.ml.jar.DefaultDataWriterFactory
FeatureConfiguration		FEAT_CONFIG	No
isTraining	determines whether this annotator is writing training data or using a classifier to annotate. Normally inferred automatically based on whether or not a DataWriterFactory class has been set.	Boolean	No
TokenFilename		String	No

Tokenizer

Annotates Document Tokens.

Source class: TokenizerAnnotator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.analysis_component.JCasAnnotator_ImplBase
Dependencies: Section
Products: Base Token

No available configuration parameters.

Output Writers

CUI Count Writer

Writes a two-column BSV file containing CUIs and their total counts in a document.

Source class: CuiCountFileWriter
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.uima.fit.component.CasConsumer_ImplBase
Dependencies: Document Id, Identified Annotation

Parameter	Description	Class	Required	Default
OutputDirectory	Directory for all output files.	String	No
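
The two-column output described above can be sketched with a counter over the CUIs found in one document. The function name and the sorted output order are assumptions of this sketch, not documented behavior of the writer.

```python
from collections import Counter
from typing import Iterable, List


def cui_count_lines(cuis: Iterable[str]) -> List[str]:
    """Return one 'CUI|count' line per distinct CUI."""
    counts = Counter(cuis)
    # Sorting by CUI keeps the output deterministic (an assumption of this sketch).
    return [f"{cui}|{count}" for cui, count in sorted(counts.items())]


# Example CUIs as they might be collected from one document.
document_cuis = ["C0011849", "C0020538", "C0011849"]
print("\n".join(cui_count_lines(document_cuis)))
```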

CUI List Writer

Writes a list of CUIs, covered text and preferred text to files.

Source class: CuiListFileWriter
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.ctakes.core.cc.AbstractJCasFileWriter
Dependencies: Document Id, Sentence, Base Token
Usables: Document Id Prefix, Identified Annotation, Event, Timex, Temporal Relation

Parameter	Description	Class	Required	Default
OutputDirectory	Directory for all output files.	File	Yes
SubDirectory	SubDirectory for files.	String	No

Document Text Writer

Writes Text files with original text from the document.

Source class: FilesInDirectoryCasConsumer
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.uima.collection.CasConsumer_ImplBase
Dependencies: Document Id

No available configuration parameters.

Document Text Writer (Dir)

Writes Text files with original text from the document in a specified directory.

Source class: NormalizedFilesInDirectoryCasConsumer
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.uima.collection.CasConsumer_ImplBase
Dependencies: Document Id, Base Token

No available configuration parameters.

HTML Table Writer

Writes HTML files with a Table representation of extracted information.

Source class: HtmlTableCasConsumer
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.uima.collection.CasConsumer_ImplBase
Dependencies: Base Token

No available configuration parameters.

HTML Writer

Writes HTML files with document text and simple markups (Semantic Group, CUI, Negation).

Source class: HtmlTextWriter
Source package: org.apache.ctakes.core.cc.html
Parent class: org.apache.ctakes.core.cc.AbstractJCasFileWriter
Dependencies: Document Id, Sentence, Base Token
Usables: Document Id Prefix, Identified Annotation, Event, Timex, Temporal Relation

Parameter	Description	Class	Required	Default
OutputDirectory	Directory for all output files.	File	Yes
SubDirectory	SubDirectory for files.	String	No

HTML Writer

Writes HTML files with document text and simple markups (Semantic Group, CUI, Negation).

Source class: HtmlTextWriter
Source package: org.apache.ctakes.core.cc.pretty.html
Parent class: org.apache.ctakes.core.cc.AbstractJCasFileWriter
Dependencies: Document Id, Sentence, Base Token
Usables: Document Id Prefix, Identified Annotation, Event, Timex, Temporal Relation

Parameter	Description	Class	Required	Default
OutputDirectory	Directory for all output files.	File	Yes
SubDirectory	SubDirectory for files.	String	No

I2b2JdbcWriter

Writes UMLS Concepts to a standard I2B2 Observation_Fact table.

Source class: I2b2JdbcWriter
Source package: org.apache.ctakes.core.cc.jdbc.i2b2
Parent class: org.apache.ctakes.core.cc.jdbc.AbstractJCasJdbcWriter
Dependencies: Identified Annotation

Parameter	Description	Class	Required
DbDriver	JDBC driver ClassName.	String	Yes
DbPass	Password for database authentication.	String	Yes
DbUrl	JDBC URL that specifies database network location and name.	String	Yes
DbUser	Username for database authentication.	String	Yes
FactOutputTable	Name of the Observation_Fact table for writing output.	String	Yes
BatchSize	Number of statements to use in a batch. 0 or 1 denotes that batches should not be used.	String	No
KeepAlive	Flag that determines whether to keep JDBC connection open no matter what.	String	No
RepeatCuis	Repeat Concepts with the same Cui but possibly different Semantic Type or Preferred Text.	boolean	No

JDBC Writer (Template)

Stores extracted information and document metadata in a database.

Source class: JdbcWriterTemplate
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.ctakes.core.cc.AbstractJdbcWriter
Dependencies: Document Id, Identified Annotation

No available configuration parameters.

Medication Table Writer

Writes a table of Medication information to file, sorted by character index.

Source class: MedicationTableFileWriter
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.ctakes.core.cc.AbstractTableFileWriter
Dependencies: Document Id, Identified Annotation
Usables: Document Id Prefix

Parameter	Description	Class	Required
OutputDirectory	Directory for all output files.	File	Yes
SubDirectory	SubDirectory for files.	String	No
TableType	Type of Table to write to File. Possible values are: BSV, CSV, HTML, TAB	String	No
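
The four TableType values differ mainly in how the cells of a row are delimited. The sketch below shows one plausible rendering per type; the function name is hypothetical, and details such as CSV quoting or the exact HTML markup produced by the writer are not covered by this page.

```python
import html
from typing import Sequence


def format_row(cells: Sequence[str], table_type: str = "BSV") -> str:
    """Render one table row in the requested TableType."""
    kind = table_type.upper()
    if kind == "BSV":
        return "|".join(cells)       # bar-separated values
    if kind == "CSV":
        return ",".join(cells)       # assumes no cell contains a comma
    if kind == "TAB":
        return "\t".join(cells)
    if kind == "HTML":
        columns = "".join(f"<td>{html.escape(cell)}</td>" for cell in cells)
        return f"<tr>{columns}</tr>"
    raise ValueError(f"Unknown table type: {table_type}")


row = ["aspirin", "81 mg", "daily"]
for table_type in ("BSV", "CSV", "TAB", "HTML"):
    print(format_row(row, table_type))
```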

Pretty Text Writer

Writes text files with document text and simple markups (POS, Semantic Group, CUI, Negation).

Source class: PrettyTextWriterFit
Source package: org.apache.ctakes.core.cc.pretty.plaintext
Parent class: org.apache.ctakes.core.cc.AbstractJCasFileWriter
Dependencies: Document Id, Sentence, Base Token
Usables: Document Id Prefix, Identified Annotation, Event, Timex, Temporal Relation

Parameter	Description	Class	Required	Default
OutputDirectory	Directory for all output files.	File	Yes
SubDirectory	SubDirectory for files.	String	No

Pretty Text Writer (UIMA)

Writes text files with document text and simple markups (POS, Semantic Group, CUI, Negation).

Source class: PrettyTextWriterUima
Source package: org.apache.ctakes.core.cc.pretty.plaintext
Parent class: org.apache.uima.collection.CasConsumer_ImplBase
Dependencies: Document Id, Sentence, Base Token
Usables: Identified Annotation, Event, Timex, Temporal Relation

No available configuration parameters.

Property Text Writer

Writes text files with lists of annotations and properties (POS, Semantic Group, CUI, Negation).

Source class: PropertyTextWriterFit
Source package: org.apache.ctakes.core.cc.property.plaintext
Parent class: org.apache.uima.fit.component.CasConsumer_ImplBase
Dependencies: Document Id, Sentence, Identified Annotation

Parameter	Description	Class	Required	Default
OutputDirectory	Directory for all output files.	String	No

Property Text Writer (UIMA)

Writes text files with lists of annotations and properties (POS, Semantic Group, CUI, Negation).

Source class: PropertyTextWriterUima
Source package: org.apache.ctakes.core.cc.property.plaintext
Parent class: org.apache.uima.collection.CasConsumer_ImplBase
Dependencies: Document Id, Sentence, Identified Annotation

No available configuration parameters.

Semantic Table Writer

Writes a table of Annotation information to file, grouped by Semantic Type.

Source class: SemanticTableFileWriter
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.ctakes.core.cc.AbstractTableFileWriter
Dependencies: Document Id, Identified Annotation
Usables: Document Id Prefix

Parameter	Description	Class	Required
OutputDirectory	Directory for all output files.	File	Yes
SubDirectory	SubDirectory for files.	String	No
TableType	Type of Table to write to File. Possible values are: BSV, CSV, HTML, TAB	String	No

Sentences Writer

Writes Text files with original text from the document, sentence by sentence.

Source class: SentenceTokensPrinter
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.uima.collection.CasConsumer_ImplBase
Dependencies: Document Id, Sentence, Base Token

No available configuration parameters.

Text Span Writer

Writes BSV files with original text for extracted annotations and their span offsets.

Source class: TextSpanWriter
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.uima.fit.component.CasConsumer_ImplBase
Dependencies: Identified Annotation

Parameter	Description	Class	Required	Default
OutputDirectory	Directory for all output files.	String	No

Token Offset Writer

Writes a two-column BSV file containing Begin and End offsets of tokens in a document.

Source class: TokenOffsetsCasConsumer
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.uima.collection.CasConsumer_ImplBase
Dependencies: Document Id, Base Token

No available configuration parameters.
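
The shape of the output can be illustrated with a stand-in tokenizer. The sketch below treats every run of non-whitespace characters as a token, which is an assumption made only for illustration; in cTAKES the offsets come from Base Token annotations produced earlier in the pipeline.

```python
import re
from typing import List


def token_offset_lines(text: str) -> List[str]:
    """Return one 'begin|end' line per token, with end offsets exclusive."""
    # Assumption: a token is any run of non-whitespace characters.
    return [f"{match.start()}|{match.end()}" for match in re.finditer(r"\S+", text)]


print("\n".join(token_offset_lines("No fever.")))
```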

Token Table Writer

Writes a table of base tokens and their spans in a directory tree.

Source class: TokenTableFileWriter
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.ctakes.core.cc.AbstractTableFileWriter
Usables: Document Id Prefix, Base Token

Parameter	Description	Class	Required
OutputDirectory	Directory for all output files.	File	Yes
SubDirectory	SubDirectory for files.	String	No
TableType	Type of Table to write to File. Possible values are: BSV, CSV, HTML, TAB	String	No

Word Count Writer

Writes a two-column BSV file containing Words and their total counts in a document.

Source class: TokenFreqCasConsumer
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.uima.collection.CasConsumer_ImplBase
Dependencies: Base Token

No available configuration parameters.

XMI Writer

Writes XMI files with full representation of input text and all extracted information.

Source class: XmiWriterCasConsumerCtakes
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.uima.fit.component.CasConsumer_ImplBase
Dependencies: Document Id

Parameter	Description	Class	Required	Default
OutputDirectory	Output directory to write xmi files	File	Yes

XMI Writer (Dir Tree)

Writes XMI files with full representation of input text and all extracted information.

Source class: FileTreeXmiWriter
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.ctakes.core.cc.AbstractJCasFileWriter
Dependencies: Document Id
Usables: Document Id Prefix

Parameter	Description	Class	Required	Default
OutputDirectory	Directory for all output files.	File	Yes
SubDirectory	SubDirectory for files.	String	No

XMI Writer 2

Writes XMI files with full representation of input text and all extracted information.

Source class: CasConsumer
Source package: org.apache.ctakes.core.cc
Parent class: org.apache.uima.collection.CasConsumer_ImplBase
Dependencies: Document Id

No available configuration parameters.

Utilities

Annotation Remover

Removes annotations of a given type from the JCas.

Source class: FilterAnnotator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.analysis_component.JCasAnnotator_ImplBase
Dependencies: Base Token

No available configuration parameters.

CommandRunner

Runs an external process.

Source class: CommandRunner
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.ctakes.core.ae.AbstractCommandRunner

Parameter	Description	Class	Required	Default
OutputDirectory	Directory for all output files.	File	Yes
Command	A full command line to be executed. Make sure to quote.	String	No
CommandDir	The Command Executable's directory.	String	No
Log	A name for the streaming logger. Default is the Command.	String	No
LogFile	File to which cTAKES output should be sent.	String	No
Pause	Pause for some seconds. Default is 0	int	No
PerDoc	yes to run the command once per document. Default is no.	String	No	no
SetJavaHome	Set JAVA_HOME to the Java running cTAKES. Default is yes.	String	No	yes
Wait	Wait for the process to finish. Default is no.	String	No	no
WorkingDir	The working directory.	String	No
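
The roles of Command, WorkingDir, Wait and Pause can be pictured with a small process launcher. This Python sketch is an analogy rather than the CommandRunner implementation: the function name is hypothetical, and pausing after the process starts is an assumption.

```python
import shlex
import subprocess
import sys
import time
from typing import Optional


def run_command(command: str, working_dir: Optional[str] = None,
                wait: bool = False, pause: int = 0) -> Optional[int]:
    """Start an external process; return its exit code only when waiting."""
    args = shlex.split(command)  # quoted arguments stay together
    process = subprocess.Popen(args, cwd=working_dir)
    exit_code = process.wait() if wait else None
    # Assumption: the pause happens after the process has been started.
    time.sleep(pause)
    return exit_code


# The current Python interpreter stands in for any external command.
code = run_command(f'"{sys.executable}" -c "print(42)"', wait=True)
print(code)
```

Splitting the command line with `shlex.split` shows why the parameter description asks for quoting: an argument containing spaces stays a single argument only when it is quoted.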

CtakesRunner

Starts a new instance of cTAKES with the given piper parameters.

Source class: CtakesRunner
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.ctakes.core.ae.PausableFileLoggerAE

Parameter	Description	Class	Required	Default
OutputDirectory	Directory for all output files.	File	Yes
Pipeline	Piper parameters. Make sure to quote.	String	Yes
LogFile	File to which cTAKES output should be sent.	String	No
Pause	Pause for some seconds. Default is 0	int	No
Wait	Wait for the process to finish. Default is no.	String	No	no

Deprecated Finished Logger

Deprecated. Use the FinishedLogger in the log subpackage (org.apache.ctakes.core.util.log) instead.

Source class: FinishedLogger
Source package: org.apache.ctakes.core.util
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase

No available configuration parameters.

Document ID Printer

Logs the Document ID to Log4j and Standard Output.

Source class: DocumentIdPrinterAnalysisEngine
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Document Id

No available configuration parameters.

ExitForcer

Forcibly Exits cTAKES. Use only at the end of a pipeline.

Source class: ExitForcer
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.ctakes.core.ae.inert.PausableAE

Parameter	Description	Class	Required	Default
ForceExit	Forcibly exits the system when the value is yes. Yes by default.	String	No	yes
Pause	Pause for some seconds. Default is 0	int	No
Wait	Wait for the process to finish. Default is no.	String	No	no

Finished Logger

Writes a banner message COMPLETE to the log when all processing is finished.

Source class: FinishedLogger
Source package: org.apache.ctakes.core.util.log
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase

No available configuration parameters.

JCas Copy Annotator

Copies document text and all annotations into a new JCas.

Source class: CopyAnnotator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase

Parameter	Description	Class	Required	Default
dataBindMap	Mapping between source methods and destination methods in a bar ("|") separated format	String[]	Yes
destObjClass	Name of destination class	String	Yes
srcObjClass	Name of source class	String	Yes
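
A bar-separated dataBindMap entry pairs one source method with one destination method. The sketch below parses such entries into a dictionary; the function name and the getter and setter names in the example are hypothetical, and the order (source first, destination second) follows the wording of the description.

```python
from typing import Dict, Iterable


def parse_data_bind_map(entries: Iterable[str]) -> Dict[str, str]:
    """Map each source method name to its destination method name."""
    mapping: Dict[str, str] = {}
    for entry in entries:
        source_method, destination_method = entry.split("|", 1)
        mapping[source_method.strip()] = destination_method.strip()
    return mapping


# Hypothetical method names, used only to show the format.
bindings = parse_data_bind_map(["getBegin|setBegin", "getEnd|setEnd"])
print(bindings)
```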

Knowtator XML Reader (SHARP)

Reads annotations from SHARP schema Knowtator XML files in a directory.

Source class: SHARPKnowtatorXMLReader
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Products: Identified Annotation, Event, Timex, Location Relation, Degree Relation, Temporal Relation

Parameter	Description	Class	Required	Default
SetDefaults	whether or not to set default attribute values if no annotation is present	boolean	Yes
TextDirectory	directory containing the text files (if DocumentIDs are just filenames); defaults to assuming that DocumentIDs are full file paths	File	No

MrsDrSentenceJoiner

Joins Sentences that SentenceDetectorBIO has split after a personal title (Mr., Mrs., Dr.).

Source class: MrsDrSentenceJoiner
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Sentence

No available configuration parameters.
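
The joining rule can be sketched on plain strings: when a sentence ends with one of the titles, the next sentence is appended to it. This is a simplified illustration with a hypothetical function name; the annotator itself works on Sentence annotations rather than strings.

```python
from typing import List

TITLES = ("Mr.", "Mrs.", "Dr.")


def join_title_sentences(sentences: List[str]) -> List[str]:
    """Merge a sentence into the previous one when that one ends with a title."""
    joined: List[str] = []
    for sentence in sentences:
        if joined and joined[-1].endswith(TITLES):
            joined[-1] = joined[-1] + " " + sentence
        else:
            joined.append(sentence)
    return joined


print(join_title_sentences(["Seen by Dr.", "Smith today.", "Stable."]))
```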

Null Annotator

Does absolutely nothing.

Source class: NullAnnotator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.analysis_component.JCasAnnotator_ImplBase

No available configuration parameters.

Overlap Annotator

Removes or modifies annotations that overlap.

Source class: OverlapAnnotator
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.analysis_component.JCasAnnotator_ImplBase
Dependencies: Base Token

No available configuration parameters.
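
The page does not describe which overlaps are removed or how annotations are modified. Purely as an illustration of one possible policy, the sketch below drops any span that is fully covered by a longer span. The classes OverlapSketch and Span are hypothetical stand-ins for annotations with character offsets, and the policy itself is an assumption rather than the documented behavior of OverlapAnnotator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of one overlap policy; not the OverlapAnnotator implementation.
class OverlapSketch {

    // Minimal stand-in for an annotation: character offsets [begin, end).
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        // True if this span lies entirely inside the other span and is strictly shorter.
        boolean isCoveredBy(Span other) {
            return other.begin <= begin && end <= other.end
                    && (other.end - other.begin) > (end - begin);
        }
    }

    // Keeps only the spans that are not covered by a longer span.
    static List<Span> removeCovered(List<Span> spans) {
        List<Span> kept = new ArrayList<>();
        for (Span candidate : spans) {
            boolean covered = false;
            for (Span other : spans) {
                if (candidate.isCoveredBy(other)) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

Spans that only partially overlap are left untouched in this sketch, since neither one covers the other.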

PatientNoteCollector

Caches each Document JCas in a Patient JCas as a View.

Source class: PatientNoteCollector
Source package: org.apache.ctakes.core.patient
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase

No available configuration parameters.

PiperFileRunEngine

Analysis Engine that executes the PiperFileRunner. A workaround for running piper files from descriptor files (CPE).

Source class: PiperFileRunEngine
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase

Parameter	Description	Class	Required	Default
PiperParams	Command Line Parameters normally used to run a piper file.	String	Yes

PythonPipper

Installs a specified Python package using pip.

Source class: PythonPipper
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.ctakes.core.ae.PythonRunner

Parameter	Description	Class	Required	Default
OutputDirectory	Directory for all output files.	File	Yes
PipPackage	Path of the Python package to install with pip.	String	Yes
Command	A full command line to be executed. Make sure to quote.	String	No
CommandDir	The Command Executable's directory.	String	No
Log	A name for the streaming logger. Default is the Command.	String	No
LogFile	File to which cTAKES output should be sent.	String	No
Pause	Pause for some seconds. Default is 0	int	No
PerDoc	yes to run the command once per document. Default is no.	String	No	no
VirtualEnv	Path to Python virtual environment.	String	No
Wait	Wait for the process to finish. Default is no.	String	No	no
WorkingDir	The working directory.	String	No

PythonRunner

Starts a Python process with the given parameters.

Source class: PythonRunner
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.ctakes.core.ae.AbstractCommandRunner

Parameter	Description	Class	Required	Default
OutputDirectory	Directory for all output files.	File	Yes
Command	A full command line to be executed. Make sure to quote.	String	No
CommandDir	The Command Executable's directory.	String	No
Log	A name for the streaming logger. Default is the Command.	String	No
LogFile	File to which cTAKES output should be sent.	String	No
Pause	Pause for some seconds. Default is 0	int	No
PerDoc	yes to run the command once per document. Default is no.	String	No	no
VirtualEnv	Path to Python virtual environment.	String	No
Wait	Wait for the process to finish. Default is no.	String	No	no
WorkingDir	The working directory.	String	No

Start or Finish Logger

Simple annotator to place before and after other annotators that do not log their own start and finish.

Source class: StartFinishLogger
Source package: org.apache.ctakes.core.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase

Parameter	Description	Class	Required	Default
LOGGER_NAME	provides the full name of the Annotator Engine for which start / end logging should be done.	String	Yes	StartEndProgressLogger
IS_START	indicates whether this should log a start.	Boolean	No
LOGGER_TASK	provides the descriptive purpose of the Annotator Engine for which start / end logging should be done.	String	No	Processing ...

Piper Files

Default Tokenizer Pipeline

Commands and parameters for a small tokenization pipeline.

$\textcolor{gray}{\textsf{// Commands and parameters for a small tokenization pipeline. }}$

$\textcolor{green}{\textbf{add}}$ SimpleSegmentAnnotator
$\textcolor{green}{\textbf{add}}$ SentenceDetector
$\textcolor{green}{\textbf{add}}$ TokenizerAnnotatorPTB
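
The listing above uses two kinds of lines: comments that start with // and add commands that name a component. A minimal sketch of reading such lines is given below. The class PiperLinesSketch is hypothetical and handles only these two line types, so it is an assumption for illustration and not the cTAKES piper file parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for comment and "add" lines only; not the cTAKES piper parser.
class PiperLinesSketch {

    // Returns the component names of all "add" lines, skipping comments and blank lines.
    static List<String> addedComponents(List<String> lines) {
        List<String> components = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("//")) {
                continue;
            }
            // Split into command, component name, and any trailing parameters.
            String[] tokens = trimmed.split("\\s+");
            if (tokens.length >= 2 && tokens[0].equals("add")) {
                components.add(tokens[1]);
            }
        }
        return components;
    }
}
```

Applied to the Default Tokenizer Pipeline, this would yield SimpleSegmentAnnotator, SentenceDetector, and TokenizerAnnotatorPTB in that order. A commented-out line such as //add SentenceDetectorAnnotatorBIO is skipped because it begins with //.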

Full Tokenizer Pipeline

Commands and parameters for a small tokenization pipeline with sections, paragraphs and lists.

$\textcolor{gray}{\textsf{// Commands and parameters for a small tokenization pipeline with sections, paragraphs and lists. }}$

$\textcolor{gray}{\textsf{// Annotate sections by known regex }}$
$\textcolor{green}{\textbf{add}}$ BsvRegexSectionizer

$\textcolor{gray}{\textsf{// The sentence detector needs our custom model path, otherwise default values are used. }}$
$\textcolor{gray}{\textsf{//add SentenceDetectorAnnotatorBIO classifierJarPath=/org/apache/ctakes/core/models/sentdetect/model.jar }}$

$\textcolor{gray}{\textsf{// The SentenceDetectorAnnotatorBIO is a "lumper" that works well for notes in which end of line does not indicate a sentence. }}$
$\textcolor{gray}{\textsf{// If that is not your case, then you may get better results using the more standard SentenceDetector }}$
$\textcolor{green}{\textbf{add}}$ SentenceDetector

$\textcolor{gray}{\textsf{// By default, paragraphs are parsed using empty lines as separators and Part \#: }}$
$\textcolor{green}{\textbf{add}}$ ParagraphAnnotator
$\textcolor{gray}{\textsf{// Fix sentences so that no sentence spans across two or more paragraphs. }}$
$\textcolor{green}{\textbf{add}}$ ParagraphSentenceFixer

$\textcolor{gray}{\textsf{// Use regular expressions created for the Pitt notes to discover formatted lists and tables. }}$
$\textcolor{green}{\textbf{add}}$ ListAnnotator
$\textcolor{gray}{\textsf{// Fix sentences so that no sentence spans across two or more list entries. }}$
$\textcolor{green}{\textbf{add}}$ ListSentenceFixer

$\textcolor{gray}{\textsf{// Now we can finally tokenize, tag parts of speech and chunk using adjusted sentences. }}$
$\textcolor{green}{\textbf{add}}$ TokenizerAnnotatorPTB

Ts Default Tokenizer Pipeline

Commands and parameters for a small thread-safe tokenization pipeline.

$\textcolor{gray}{\textsf{// Commands and parameters for a small thread-safe tokenization pipeline. }}$

$\textcolor{green}{\textbf{add}}$ SimpleSegmentAnnotator
$\textcolor{green}{\textbf{add}}$ $\textcolor{blue}{\textsf{concurrent.ThreadSafeSentenceDetector}}$
$\textcolor{green}{\textbf{add}}$ TokenizerAnnotatorPTB

Ts Full Tokenizer Pipeline

Commands and parameters for a small thread-safe tokenization pipeline with sections, paragraphs and lists.

$\textcolor{gray}{\textsf{// Commands and parameters for a small thread-safe tokenization pipeline with sections, paragraphs and lists. }}$

$\textcolor{gray}{\textsf{// Annotate sections by known regex }}$
$\textcolor{green}{\textbf{add}}$ BsvRegexSectionizer

$\textcolor{gray}{\textsf{// The sentence detector needs our custom model path, otherwise default values are used. }}$
$\textcolor{gray}{\textsf{//add concurrent.ThreadSafeSentenceDetectorBio classifierJarPath=/org/apache/ctakes/core/models/sentdetect/model.jar }}$

$\textcolor{gray}{\textsf{// The SentenceDetectorAnnotatorBIO is a "lumper" that works well for notes in which end of line does not indicate a sentence. }}$
$\textcolor{gray}{\textsf{// If that is not your case, then you may get better results using the more standard SentenceDetector }}$
$\textcolor{green}{\textbf{add}}$ $\textcolor{blue}{\textsf{concurrent.ThreadSafeSentenceDetector}}$

$\textcolor{gray}{\textsf{// By default, paragraphs are parsed using empty lines as separators and Part \#: }}$
$\textcolor{green}{\textbf{add}}$ ParagraphAnnotator
$\textcolor{gray}{\textsf{// Fix sentences so that no sentence spans across two or more paragraphs. }}$
$\textcolor{green}{\textbf{add}}$ ParagraphSentenceFixer

$\textcolor{gray}{\textsf{// Use regular expressions created for the Pitt notes to discover formatted lists and tables. }}$
$\textcolor{green}{\textbf{add}}$ ListAnnotator
$\textcolor{gray}{\textsf{// Fix sentences so that no sentence spans across two or more list entries. }}$
$\textcolor{green}{\textbf{add}}$ ListSentenceFixer

$\textcolor{gray}{\textsf{// Now we can finally tokenize, tag parts of speech and chunk using adjusted sentences. }}$
$\textcolor{green}{\textbf{add}}$ TokenizerAnnotatorPTB