ctakes dictionary lookup fast - apache/ctakes GitHub Wiki

The fast dictionary lookup annotator identifies terms in text and normalizes them to codes in an ontology: UMLS CUI, SNOMED CT, RxNorm, etc. The module comes with several pre-packaged configurations and is also customizable and extendable.

Process Overview

The Fast Dictionary Lookup module has six basic processes performed by three components, as well as a parser that can configure the actual Dictionaries.
  • A Parse Dictionary Descriptor file
  • B Create Dictionaries and Concept Factories
  1. Get Lookup Windows from CAS
  2. For each Lookup window, get candidate Lookup Tokens
  3. For each Lookup Token, get matches in Dictionary Index
  4. For each Token match, check Lookup Window for Full Text match
  5. For each Full Text match, create Concepts
  6. Store appropriate Concepts in CAS as Annotations
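
Steps 2 through 5 above can be sketched roughly as follows. This is a minimal illustration in Python, not the module's Java implementation: the function names, the sample codes, and the choice of index key (the least frequent token of each term) are assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_index(entries):
    """Index every term under one key token.

    entries: iterable of (code, text) pairs with non-empty text.
    Assumption: the key is the term's least frequent token in the dictionary.
    """
    token_lists = [(code, tuple(text.lower().split())) for code, text in entries]
    counts = Counter(tok for _, toks in token_lists for tok in toks)
    index = defaultdict(list)
    for code, toks in token_lists:
        key_pos = min(range(len(toks)), key=lambda i: counts[toks[i]])
        index[toks[key_pos]].append((code, toks, key_pos))
    return index


def lookup(window_tokens, index):
    """Return sorted (start, end, code) for each term found in the window."""
    tokens = [t.lower() for t in window_tokens]
    found = []
    for i, tok in enumerate(tokens):                    # step 2: lookup tokens
        for code, toks, key_pos in index.get(tok, ()):  # step 3: index matches
            start = i - key_pos
            end = start + len(toks)
            # step 4: check the window for a full text match
            if start >= 0 and end <= len(tokens) and tuple(tokens[start:end]) == toks:
                found.append((start, end, code))        # step 5: create concept
    return sorted(found)


# Hypothetical dictionary rows; the codes are placeholders, not real CUIs.
index = build_index([("CODE_A", "lung cancer"), ("CODE_B", "cancer"), ("CODE_C", "urine test")])
matches = lookup("patient has lung cancer".split(), index)
```

Indexing each term under a single token keeps the per-token lookup cheap: only terms that share the key token are ever compared against the window.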

Structure Diagram

Configuration

There are options available to change the type of term matching used as well as the persistence of terms. Changes in configuration are made in two places:
  1. The main descriptor ...-fast/desc/analysis_engine/UmlsLookupAnnotator.xml
  2. The resource (dictionary) configuration file resources/.../dictionary/lookup/fast/sno_rx_16ab.xml (The file name might be different if you created your own custom dictionary)

Text Exact Match

Because the UMLS dictionary contains rows with different combinations of lexical elements per term, a direct string match between the text in the note and the text of a term is a valid approach to term matching. This differs from the complex mechanism in the first-word lookup, and makes for simpler code and greater accuracy. This precise matching, together with the improved lookup speed, makes it practical to use an entire sentence as the lookup window rather than just a noun phrase. With a Sentence as the lookup window, every token is available both as a lookup key and for term matching. For good accuracy, custom dictionaries should also contain multiple entries covering the variations of term syntax. Note that term matching is attempted using both the actual text in the note and per-token cTAKES-generated lexical variants of that text. This is the behavior of the ```DefaultJCasTermAnnotator``` class, which is the one used in the ```UmlsLookupAnnotator.xml``` descriptor.

Text Overlap Match

To better approximate the original lookup annotator, one lookup method finds overlapping terms in addition to exactly matching terms. This allows matches on discontiguous spans. For instance, for the text “blood, urine test” the exact match finds only one procedure: “urine test”. The overlap match finds both “urine test” and “blood test”. This is the behavior of the ```OverlapJCasTermAnnotator``` class, which is the one used in the ```UmlsOverlapLookupAnnotator.xml``` descriptor.
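
The “blood, urine test” example can be sketched as a match that tolerates skipped tokens. This is a simplified greedy illustration, not the ```OverlapJCasTermAnnotator``` algorithm itself: the function name is hypothetical, and the skip limits of 2 and 4 are placeholder values standing in for the `consecutiveSkips` and `totalTokenSkips` parameters.

```python
def overlap_match(window, term, consecutive_skips=2, total_skips=4):
    """Return window indices that match `term` in order, or None.

    Tokens between matched words may be skipped. Commas are not counted
    as skips. `term` must be non-empty. The limits are placeholder values.
    """
    for start in range(len(window)):
        if window[start] != term[0]:
            continue
        hits = [start]
        pos = start + 1
        total = 0
        matched = True
        for wanted in term[1:]:
            run = 0
            # Skip forward until the next wanted token is found.
            while pos < len(window) and window[pos] != wanted:
                if window[pos] != ",":
                    run += 1
                    total += 1
                pos += 1
            if pos == len(window) or run > consecutive_skips or total > total_skips:
                matched = False
                break
            hits.append(pos)
            pos += 1
        if matched:
            return hits
    return None


window = ["blood", ",", "urine", "test"]
blood_test = overlap_match(window, ["blood", "test"])   # discontiguous span
urine_test = overlap_match(window, ["urine", "test"])   # contiguous span
exact_only = overlap_match(window, ["blood", "test"], consecutive_skips=0)
```

With no skips allowed the function behaves like an exact match, so “blood test” is no longer found while “urine test” still is.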

All Terms Persistence

All terms discovered by the matchers can be stored in the CAS by a consumer, regardless of any property of the term. This means that for the text “lung cancer” both the specific disease term “lung cancer” and the broader term “cancer” are stored. This can be useful for future searches on general concepts, e.g. searching via the CUI for “cancer” and getting all instances of “cancer” found in texts “lung cancer”, “skin cancer”, “stomach cancer”, etc. This is the behavior of the ```DefaultTermConsumer``` class.

Most Precise Terms Persistence

Alternatively, only the term with the longest overlapping span discovered for each semantic group can be stored. This keeps, for instance, the disease “lung cancer” but not “cancer”. Because the comparison is made per semantic group, both the disease “lung cancer” and the anatomical site “lung” are persisted even though their spans overlap. When using the overlap matching method, discontiguous spans are accounted for. So, for “blood, urine test” both the discontiguous term “blood test” and the contiguous term “urine test” are valid. To persist only the most precise terms, edit the xml configuration file for your dictionary (default is ```sno_rx_16ab.xml```) and change the selected implementation within the ```rareWordConsumer``` section. By default it is ```DefaultTermConsumer```; to keep only the most precise terms, use the commented-out ```PrecisionTermConsumer``` instead.
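
The difference between the two persistence modes can be illustrated with a small filter. This is a sketch under simplifying assumptions, not the ```PrecisionTermConsumer``` code: the function name and the tuple layout are hypothetical, and "overlap" is reduced to plain containment of one contiguous span in a longer one.

```python
def most_precise(terms):
    """Drop terms contained in a longer span of the same semantic group.

    terms: list of (begin, end, semantic_group, text) tuples.
    Assumption: containment stands in for the overlap test.
    """
    kept = []
    for term in terms:
        begin, end, group, _ = term
        subsumed = any(
            other is not term
            and other[2] == group
            and other[0] <= begin
            and end <= other[1]
            and (other[1] - other[0]) > (end - begin)
            for other in terms
        )
        if not subsumed:
            kept.append(term)
    return kept


# Character offsets within the text "lung cancer"; group names are illustrative.
all_terms = [
    (0, 11, "disorder", "lung cancer"),
    (5, 11, "disorder", "cancer"),
    (0, 4, "anatomy", "lung"),
]
precise_terms = most_precise(all_terms)
```

Storing `all_terms` unchanged corresponds to all-terms persistence, while `precise_terms` drops “cancer” but keeps “lung” because it belongs to a different semantic group.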

Dictionary Stores

The default configuration uses a dictionary that contains a subset of the UMLS in an HSQLDB database. Custom dictionaries can be added using another HSQLDB database or a bar-separated value (BSV, a.k.a. pipe-separated) flat file. If you use a BSV file you do not need to tokenize the terms; tokenization is done automatically at runtime.
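
As a rough illustration of why BSV terms need no pre-tokenization, the sketch below reads rows and tokenizes the term text on load. Everything here is an assumption for the example: the `CUI|TUI|Text` column layout, the `//` comment prefix, the placeholder codes, and whitespace splitting as a stand-in for the cTAKES tokenizer. Check the column layout your dictionary configuration actually expects.

```python
import io


def read_bsv(lines):
    """Yield (cui, tui, tokens) for each data row.

    Assumed row layout: CUI|TUI|Text. Blank lines and lines starting
    with "//" are skipped. Whitespace splitting is only a stand-in for
    the runtime tokenizer.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("//"):
            continue
        cui, tui, text = (col.strip() for col in line.split("|", 2))
        yield cui, tui, tuple(text.lower().split())


# Placeholder rows; the CUIs are made up for illustration.
sample = io.StringIO(
    "// custom terms\n"
    "C0000001|T047|Lung Cancer\n"
    "\n"
    "C0000002|T059|urine test\n"
)
rows = list(read_bsv(sample))
```

The terms are written as ordinary text in the file, and the token sequences used for matching are produced only when the file is read.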

Lookup Window

By default the new lookup uses Sentence as the lookup window. The primary reasons for this are:
  1. Not all terms are within Noun Phrases.
  2. Some Noun Phrases overlapped, causing repeated lookups (in my 3.0 candidate trials).
  3. Not all cTAKES Noun Phrases are accurate.

Because the lookup is fast, using a full Sentence as the lookup window does not seem to cost much. However, you can always switch it back to see whether precision increases enough to warrant the decrease in recall. The lookup window is set in ```UmlsLookupAnnotator.xml```.

Annotation Engines

CasedAnnotationFinder

Finds all-uppercase or normal terms in text.

Source class: CasedAnnotationFinder
Source package: org.apache.ctakes.dictionary.cased.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Base Token, Sentence
Products: Identified Annotation

| Parameter | Description | Class | Required | Default |
|---|---|---|---|---|
| dictionaries | Dictionaries to use for lookup. | String[] | Yes | |
| encoders | Term Encoders with schemas and schema codes. | String[] | Yes | |
| allowWordSkips | Terms may include words that do not match. So-called loose matching. | String | No | |
| consecutiveSkips | Number of consecutive non-comma tokens that can be skipped. | int | No | |
| lookupAdjectives | Use Adjective parts of speech for lookup. | String | No | |
| lookupAdverbs | Use Adverb parts of speech for lookup. | String | No | |
| lookupNouns | Use Noun parts of speech for lookup. | String | No | |
| lookupVerbs | Use Verb parts of speech for lookup. | String | No | |
| minimumSpan | Minimum number of characters for a term. | int | No | |
| otherLookups | List of other parts of speech for lookup. | String[] | No | |
| reassignSemantics | Reassign Semantic Types (TUIs) to non-default Semantic Groups. | String[] | No | |
| subsume | Subsume contained terms of the same semantic group. | String | No | yes |
| subsumeSemantics | Subsume contained terms of the same and certain other semantic groups. | String | No | yes |
| totalSkips | Number of total tokens that can be skipped. | int | No | |

Dictionary Lookup (Default)

Annotates clinically-relevant terms. Terms must match dictionary entries exactly.

Source class: DefaultJCasTermAnnotator
Source package: org.apache.ctakes.dictionary.lookup2.ae
Parent class: org.apache.ctakes.dictionary.lookup2.ae.AbstractJCasTermAnnotator
Dependencies: Sentence, Base Token
Products: Identified Annotation

| Parameter | Description | Class | Required | Default |
|---|---|---|---|---|
| DictionaryDescriptor | Path to Dictionary spec xml | String | No | org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab.xml |
| exclusionTags | Set of exclusion POS tags | String | No | VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,PRP,PRP$,RP,TO,WDT,WP,WPS,WRB |
| LookupXml | Path to the xml file containing information for dictionary lookup configuration. | String | No | org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab.xml |
| minimumSpan | Minimum number of characters for a term | int | No | |
| windowAnnotations | Type of Lookup window to use | String | No | org.apache.ctakes.typesystem.type.textspan.Sentence |
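
The `exclusionTags` and `minimumSpan` parameters can be read as a filter on which tokens are used as lookup keys. The sketch below is only an illustration of that idea: the function name is hypothetical, the minimum span of 3 is a placeholder (the table lists no default), and applying the minimum span to token text is an assumption.

```python
# Default exclusion tags copied from the parameter table above.
DEFAULT_EXCLUSION_TAGS = frozenset(
    "VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,"
    "PP,PP$,PRP,PRP$,RP,TO,WDT,WP,WPS,WRB".split(",")
)


def candidate_tokens(tagged_tokens, exclusion_tags=DEFAULT_EXCLUSION_TAGS, minimum_span=3):
    """Return (index, word) for tokens that may serve as lookup keys.

    tagged_tokens: list of (word, part_of_speech) pairs.
    minimum_span=3 is a placeholder value, not a documented default.
    """
    return [
        (i, word)
        for i, (word, pos) in enumerate(tagged_tokens)
        if pos not in exclusion_tags and len(word) >= minimum_span
    ]


sentence = [("The", "DT"), ("patient", "NN"), ("has", "VBZ"), ("lung", "NN"), ("cancer", "NN")]
keys = candidate_tokens(sentence)
```

Excluding determiners, verbs, prepositions and similar tags keeps very common words from being used as lookup keys, while the remaining tokens are still available for full text matching.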

Dictionary Lookup (Overlap)

Annotates clinically-relevant terms. Terms can overlap dictionary entries.

Source class: OverlapJCasTermAnnotator
Source package: org.apache.ctakes.dictionary.lookup2.ae
Parent class: org.apache.ctakes.dictionary.lookup2.ae.AbstractJCasTermAnnotator
Dependencies: Sentence, Base Token
Products: Identified Annotation

| Parameter | Description | Class | Required | Default |
|---|---|---|---|---|
| consecutiveSkips | Number of consecutive non-comma tokens that can be skipped | int | No | |
| DictionaryDescriptor | Path to Dictionary spec xml | String | No | org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab.xml |
| exclusionTags | Set of exclusion POS tags | String | No | VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,PRP,PRP$,RP,TO,WDT,WP,WPS,WRB |
| LookupXml | Path to the xml file containing information for dictionary lookup configuration. | String | No | org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab.xml |
| minimumSpan | Minimum number of characters for a term | int | No | |
| totalTokenSkips | Number of total tokens that can be skipped | int | No | |
| windowAnnotations | Type of Lookup window to use | String | No | org.apache.ctakes.typesystem.type.textspan.Sentence |

Thread safe Dictionary Lookup (Default)

Annotates clinically-relevant terms. Terms must match dictionary entries exactly.

Source class: ThreadSafeFastLookup
Source package: org.apache.ctakes.dictionary.lookup2.concurrent
Parent class: org.apache.ctakes.dictionary.lookup2.ae.DefaultJCasTermAnnotator
Dependencies: Sentence, Base Token
Products: Identified Annotation

| Parameter | Description | Class | Required | Default |
|---|---|---|---|---|
| DictionaryDescriptor | Path to Dictionary spec xml | String | No | org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab.xml |
| exclusionTags | Set of exclusion POS tags | String | No | VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,PRP,PRP$,RP,TO,WDT,WP,WPS,WRB |
| LookupXml | Path to the xml file containing information for dictionary lookup configuration. | String | No | org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab.xml |
| minimumSpan | Minimum number of characters for a term | int | No | |
| windowAnnotations | Type of Lookup window to use | String | No | org.apache.ctakes.typesystem.type.textspan.Sentence |

Piper Files

Dictionary Sub Pipe

Commands and parameters to create a dictionary lookup sub-pipeline.

$\textcolor{gray}{\textsf{// Commands and parameters to create a dictionary lookup sub-pipeline. }}$
$\textcolor{gray}{\textsf{// This is not a full pipeline. }}$

$\textcolor{gray}{\textsf{// path to the xml file containing information for dictionary lookup configuration. }}$
$\textcolor{brown}{\textbf{cli}}$ $\textcolor{purple}{\textbf{LookupXml}}$= $\textcolor{violet}{\textsf{l}}$
$\textcolor{gray}{\textsf{// umls credentials }}$
$\textcolor{brown}{\textbf{cli}}$ $\textcolor{purple}{\textbf{umlsKey}}$= $\textcolor{violet}{\textsf{key}}$

$\textcolor{gray}{\textsf{// Default fast dictionary lookup }}$
$\textcolor{green}{\textbf{add}}$ DefaultJCasTermAnnotator

Ts Dictionary Sub Pipe

Commands and parameters to create a default dictionary lookup sub-pipeline.

$\textcolor{gray}{\textsf{// Commands and parameters to create a default dictionary lookup sub-pipeline. }}$
$\textcolor{gray}{\textsf{// This is not a full pipeline. }}$

$\textcolor{gray}{\textsf{// path to the xml file containing information for dictionary lookup configuration. }}$
$\textcolor{brown}{\textbf{cli}}$ $\textcolor{purple}{\textbf{LookupXml}}$= $\textcolor{violet}{\textsf{l}}$
$\textcolor{gray}{\textsf{// umls credentials }}$
$\textcolor{brown}{\textbf{cli}}$ $\textcolor{purple}{\textbf{umlsKey}}$= $\textcolor{violet}{\textsf{key}}$

$\textcolor{gray}{\textsf{// Default fast dictionary lookup }}$
$\textcolor{green}{\textbf{add}}$ $\textcolor{blue}{\textsf{concurrent.ThreadSafeFastLookup}}$
