ctakes dictionary lookup fast - apache/ctakes GitHub Wiki

The fast dictionary lookup annotator identifies terms in text and normalizes them to codes in an ontology: UMLS CUI, SNOMED CT, RxNorm, etc. The module ships with several pre-packaged configurations and is also customizable and extensible.

Process Overview

The Fast Dictionary Lookup module has six basic processes performed by three components, as well as a parser that can configure the actual Dictionaries.

A. Parse Dictionary Descriptor file
B. Create Dictionaries and Concept Factories

1. Get Lookup Windows from CAS
2. For each Lookup Window, get candidate Lookup Tokens
3. For each Lookup Token, get matches in Dictionary Index
4. For each Token match, check Lookup Window for Full Text match
5. For each Full Text match, create Concepts
6. Store appropriate Concepts in CAS as Annotations
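
The lookup steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the cTAKES implementation: the dictionary contents, the codes, and the function names `build_index` and `lookup` are all hypothetical. It also assumes that each term is indexed under its least frequent token, which the `rareWordConsumer` name hints at but this page does not state.

```python
from collections import defaultdict

# Hypothetical toy dictionary: term text -> concept code (codes are made up).
TERMS = {
    "urine test": "C0000001",
    "blood test": "C0000002",
    "lung cancer": "C0000003",
    "cancer": "C0000004",
}

def build_index(terms):
    """Index each term under one of its tokens (assumption: the rarest one)."""
    counts = defaultdict(int)
    for text in terms:
        for tok in text.split():
            counts[tok] += 1
    index = defaultdict(list)
    for text, code in terms.items():
        toks = text.split()
        # Position of the least frequent token; ties go to the first token.
        key_pos = min(range(len(toks)), key=lambda i: counts[toks[i]])
        index[toks[key_pos]].append((toks, key_pos, code))
    return index

def lookup(window_tokens, index):
    """Return (start, end, code) for each term whose full text matches."""
    found = []
    for i, tok in enumerate(window_tokens):              # candidate lookup tokens
        for toks, key_pos, code in index.get(tok, []):   # matches in the index
            start = i - key_pos
            end = start + len(toks)
            if start >= 0 and window_tokens[start:end] == toks:
                found.append((start, end, code))         # full text match
    return sorted(found)

INDEX = build_index(TERMS)
print(lookup("history of lung cancer".split(), INDEX))
# [(2, 4, 'C0000003'), (3, 4, 'C0000004')]
```

Only tokens that are index keys trigger a full text comparison, so most tokens in a sentence-sized window cost a single dictionary probe.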

Structure Diagram

Configuration

There are options available to change the type of term matching used as well as the persistence of terms. Changes in configuration are made in two places:

The main descriptor ...-fast/desc/analysis_engine/UmlsLookupAnnotator.xml
The resource (dictionary) configuration file resources/.../dictionary/lookup/fast/sno_rx_16ab.xml (The file name might be different if you created your own custom dictionary)

Text Exact Match

Because the UMLS dictionary contains rows with different combinations of lexical elements for each term, a direct string match between the text in the note and the text of the term is a valid approach to term matching. This differs from the complex mechanism in the current (first word) lookup, and makes for simpler code and greater accuracy. This precise matching (along with improved lookup speed) makes it practical to use an entire sentence as a lookup window rather than just a noun phrase. Using Sentence as the lookup window allows all possible tokens to be used not only as lookup keys, but also for term matching. For proper accuracy, custom dictionaries should also contain multiple entries covering variations of term syntax. Note that term matching is attempted using both the actual text in the note and per-token cTAKES-generated lexical variants of that text. This is the behavior of the ```DefaultJCasTermAnnotator``` class, which is the one used in the ```UmlsLookupAnnotator.xml``` descriptor.
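
As a rough illustration of matching on either the note text or a per-token lexical variant, consider the sketch below. It is not the `DefaultJCasTermAnnotator` code. The `(text, variant)` token representation and the function names are assumptions made for this example.

```python
def token_matches(token, term_word):
    """A token matches if its surface text or its lexical variant equals the word."""
    text, variant = token
    return term_word in (text, variant)

def exact_match_at(window, start, term):
    """True if `term` matches the window tokens contiguously, beginning at `start`."""
    end = start + len(term)
    if start < 0 or end > len(window):
        return False
    return all(token_matches(tok, word) for tok, word in zip(window[start:end], term))

# Hypothetical window: each token is (text in note, lexical variant).
window = [("two", "two"), ("urine", "urine"), ("tests", "test")]
print(exact_match_at(window, 1, ["urine", "test"]))   # True, via the variant "test"
print(exact_match_at(window, 0, ["urine", "test"]))   # False
```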

Text Overlap Match

To better approximate the original lookup annotator, one lookup method finds overlapping terms in addition to exact matching terms. This allows matches on discontiguous spans. For instance, for the text “blood, urine test” the exact match will find only one procedure: “urine test”. The overlap match will find both “urine test” and “blood test”. This is the behavior of the ```OverlapJCasTermAnnotator``` class, which is the one used in the ```UmlsOverlapLookupAnnotator.xml``` descriptor.
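
The “blood, urine test” example can be reproduced with a small sketch of skip-tolerant matching. This is a simplified illustration, not the `OverlapJCasTermAnnotator` algorithm. The function name and the default limits are hypothetical; the two limits loosely mirror the `consecutiveSkips` and `totalTokenSkips` parameters, and the sketch assumes that commas count toward neither limit.

```python
def overlap_match(window, term, consecutive_skips=2, total_skips=4):
    """Return window indices matching `term` in order, allowing skips, else None."""
    for start in range(len(window)):
        if window[start] != term[0]:
            continue
        matched, pos, total = [start], start + 1, 0
        for want in term[1:]:
            run = 0  # non-comma tokens skipped since the last matched token
            while pos < len(window) and window[pos] != want:
                if window[pos] != ",":
                    run += 1
                    total += 1
                pos += 1
            if pos >= len(window) or run > consecutive_skips or total > total_skips:
                matched = None
                break
            matched.append(pos)
            pos += 1
        if matched is not None:
            return matched
    return None

window = ["blood", ",", "urine", "test"]
print(overlap_match(window, ["urine", "test"]))   # [2, 3]  contiguous span
print(overlap_match(window, ["blood", "test"]))   # [0, 3]  discontiguous span
print(overlap_match(window, ["blood", "test"], consecutive_skips=0))  # None
```

With both limits set to zero the function behaves like a contiguous match that ignores commas, which is why the exact matcher finds only “urine test” here.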

All Terms Persistence

All terms discovered by the matchers can be stored in the CAS by a consumer, regardless of any property of the term. This means that for the text “lung cancer” both the specific disease term “lung cancer” and the broader term “cancer” are stored. This can be useful for future searches on general concepts, e.g. searching via the CUI for “cancer” and getting all instances of “cancer” found in the texts “lung cancer”, “skin cancer”, “stomach cancer”, etc. This is the behavior of the ```DefaultTermConsumer``` class.

Most Precise Terms Persistence

Alternatively, only the term with the longest overlapping span discovered for each semantic group can be stored. This keeps, for instance, the disease “lung cancer” but not “cancer”. Using semantic groups means that both the disease “lung cancer” and the anatomical site “lung” are persisted even though the spans overlap. When using the overlap matching method, any discontiguous spans are accounted for. So, for “blood, urine test” both the discontiguous spanned term “blood test” and the contiguous spanned term “urine test” are valid. To persist only the most precise terms, edit the xml configuration file for your dictionary (default is ```sno_rx_16ab.xml```). In the ```rareWordConsumer``` section, change the selected implementation: by default it is ```DefaultTermConsumer```, but you will want to use the commented-out ```PrecisionTermConsumer```.
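
The difference between the two persistence strategies can be illustrated with a short sketch. It is not the `PrecisionTermConsumer` code: the tuple layout, the group names, and the function name `most_precise` are assumptions, and only contiguous spans are handled.

```python
def most_precise(terms):
    """Drop any term whose span lies inside a longer span of the same semantic group."""
    kept = []
    for term in terms:
        start, end, group, _ = term
        subsumed = any(
            o_group == group
            and o_start <= start and end <= o_end
            and (o_end - o_start) > (end - start)
            for o_start, o_end, o_group, _ in terms
        )
        if not subsumed:
            kept.append(term)
    return kept

# Hypothetical matches for the text "lung cancer": (start, end, group, text).
found = [
    (0, 11, "disorder", "lung cancer"),
    (5, 11, "disorder", "cancer"),
    (0, 4, "anatomy", "lung"),
]
print(most_precise(found))
# [(0, 11, 'disorder', 'lung cancer'), (0, 4, 'anatomy', 'lung')]
```

Storing all terms corresponds to keeping `found` unchanged; the filter removes “cancer” because a longer term of the same group covers it, while “lung” survives because it belongs to a different group.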

Dictionary Stores

The default configuration uses a dictionary that contains a subset of the UMLS in an hsql database. Custom dictionaries can be added using another hsql database, or using a bar-separated value (BSV) (a.k.a. pipe-separated) flat file. If you use a BSV file you do not need to tokenize the terms. Tokenization will be done automatically at runtime.
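
To show what runtime tokenization of a BSV file could look like, here is a minimal sketch. The two-column `code|text` layout, the sample codes, and the function name `load_bsv` are assumptions for illustration; check the column layout your cTAKES version expects before building a real file.

```python
def load_bsv(lines):
    """Parse 'code|text' rows; tokenize the text by lowercasing and splitting on whitespace."""
    entries = []
    for raw in lines:
        row = raw.strip()
        if not row:
            continue  # skip blank lines
        code, text = [col.strip() for col in row.split("|", 1)]
        entries.append((code, text.lower().split()))
    return entries

# Hypothetical rows: untokenized term text, with made-up codes.
sample = ["X0001|Urine Test", "", "X0002|blood test "]
print(load_bsv(sample))
# [('X0001', ['urine', 'test']), ('X0002', ['blood', 'test'])]
```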

Lookup Window

By default the new lookup uses Sentence as the lookup window. The primary reasons for this are:

Not all terms are within Noun Phrases
Some Noun Phrases overlapped, causing repeated lookups (in my 3.0 candidate trials)
Not all cTAKES Noun Phrases are accurate

Because the lookup is fast, using a full Sentence for lookup doesn't seem to hurt much. However, you can always switch it back to see if precision is increased enough to warrant the decrease in recall. This is changed in UmlsLookupAnnotator.xml.

Annotation Engines

CasedAnnotationFinder

Finds all-uppercase or normal terms in text.

Source class: CasedAnnotationFinder
Source package: org.apache.ctakes.dictionary.cased.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Base Token, Sentence
Products: Identified Annotation

Parameter	Description	Class	Required	Default
dictionaries	Dictionaries to use for lookup.	String[]	Yes
encoders	Term Encoders with schemas and schema codes.	String[]	Yes
allowWordSkips	Terms may include words that do not match. So-called loose matching.	String	No
consecutiveSkips	Number of consecutive non-comma tokens that can be skipped.	int	No
lookupAdjectives	Use Adjective parts of speech for lookup.	String	No
lookupAdverbs	Use Adverb parts of speech for lookup.	String	No
lookupNouns	Use Noun parts of speech for lookup.	String	No
lookupVerbs	Use Verb parts of speech for lookup.	String	No
minimumSpan	Minimum number of characters for a term.	int	No
otherLookups	List of other parts of speech for lookup.	String[]	No
reassignSemantics	Reassign Semantic Types (TUIs) to non-default Semantic Groups.	String[]	No
subsume	Subsume contained terms of the same semantic group.	String	No	yes
subsumeSemantics	Subsume contained terms of the same and certain other semantic groups.	String	No	yes
totalSkips	Number of total tokens that can be skipped.	int	No

Dictionary Lookup (Default)

Annotates clinically-relevant terms. Terms must match dictionary entries exactly.

Source class: DefaultJCasTermAnnotator
Source package: org.apache.ctakes.dictionary.lookup2.ae
Parent class: org.apache.ctakes.dictionary.lookup2.ae.AbstractJCasTermAnnotator
Dependencies: Sentence, Base Token
Products: Identified Annotation

Parameter	Description	Class	Required	Default
DictionaryDescriptor	Path to Dictionary spec xml	String	No	org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab.xml
exclusionTags	Set of exclusion POS tags	String	No	VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,PRP,PRP$,RP,TO,WDT,WP,WPS,WRB
LookupXml	Path to the xml file containing information for dictionary lookup configuration.	String	No	org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab.xml
minimumSpan	Minimum number of characters for a term	int	No
windowAnnotations	Type of Lookup window to use	String	No	org.apache.ctakes.typesystem.type.textspan.Sentence

Dictionary Lookup (Overlap)

Annotates clinically-relevant terms. Terms can overlap dictionary entries.

Source class: OverlapJCasTermAnnotator
Source package: org.apache.ctakes.dictionary.lookup2.ae
Parent class: org.apache.ctakes.dictionary.lookup2.ae.AbstractJCasTermAnnotator
Dependencies: Sentence, Base Token
Products: Identified Annotation

Parameter	Description	Class	Required	Default
consecutiveSkips	Number of consecutive non-comma tokens that can be skipped	int	No
DictionaryDescriptor	Path to Dictionary spec xml	String	No	org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab.xml
exclusionTags	Set of exclusion POS tags	String	No	VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,PRP,PRP$,RP,TO,WDT,WP,WPS,WRB
LookupXml	Path to the xml file containing information for dictionary lookup configuration.	String	No	org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab.xml
minimumSpan	Minimum number of characters for a term	int	No
totalTokenSkips	Number of total tokens that can be skipped	int	No
windowAnnotations	Type of Lookup window to use	String	No	org.apache.ctakes.typesystem.type.textspan.Sentence

Thread safe Dictionary Lookup (Default)

Annotates clinically-relevant terms. Terms must match dictionary entries exactly.

Source class: ThreadSafeFastLookup
Source package: org.apache.ctakes.dictionary.lookup2.concurrent
Parent class: org.apache.ctakes.dictionary.lookup2.ae.DefaultJCasTermAnnotator
Dependencies: Sentence, Base Token
Products: Identified Annotation

Parameter	Description	Class	Required	Default
DictionaryDescriptor	Path to Dictionary spec xml	String	No	org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab.xml
exclusionTags	Set of exclusion POS tags	String	No	VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,PRP,PRP$,RP,TO,WDT,WP,WPS,WRB
LookupXml	Path to the xml file containing information for dictionary lookup configuration.	String	No	org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab.xml
minimumSpan	Minimum number of characters for a term	int	No
windowAnnotations	Type of Lookup window to use	String	No	org.apache.ctakes.typesystem.type.textspan.Sentence

Piper Files

Dictionary Sub Pipe

Commands and parameters to create a dictionary lookup sub-pipeline.

    // Commands and parameters to create a dictionary lookup sub-pipeline.
    // This is not a full pipeline.

    // path to the xml file containing information for dictionary lookup configuration.
    cli LookupXml=l
    // umls credentials
    cli umlsKey=key

    // Default fast dictionary lookup
    add DefaultJCasTermAnnotator

Ts Dictionary Sub Pipe

Commands and parameters to create a default dictionary lookup sub-pipeline.

    // Commands and parameters to create a default dictionary lookup sub-pipeline.
    // This is not a full pipeline.

    // path to the xml file containing information for dictionary lookup configuration.
    cli LookupXml=l
    // umls credentials
    cli umlsKey=key

    // Default fast dictionary lookup
    add concurrent.ThreadSafeFastLookup
