ctakes dictionary lookup fast - apache/ctakes GitHub Wiki
The fast dictionary lookup annotator identifies terms in text and normalizes them to codes in an ontology: UMLS CUI, SNOMED CT, RxNorm, etc. The fast dictionary lookup module comes with several prepackaged configurations and is also customizable and extendable.
Process Overview
The Fast Dictionary Lookup module has six basic processes performed by three components, as well as a parser that can configure the actual Dictionaries.
- A Parse Dictionary Descriptor file
- B Create Dictionaries and Concept Factories
- Get Lookup Windows from CAS
- For each Lookup window, get candidate Lookup Tokens
- For each Lookup Token, get matches in Dictionary Index
- For each Token match, check Lookup Window for Full Text match
- For each Full Text match, create Concepts
- Store appropriate Concepts in CAS as Annotations
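The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the cTAKES implementation: the toy dictionary, its codes, and the helper names `build_index` and `lookup` are hypothetical, and each term is indexed under its longest token purely as a stand-in for whatever key the real dictionary index uses.

```python
from collections import defaultdict

# Hypothetical toy dictionary: term text -> concept code (codes are made up).
TERMS = {
    "lung cancer": "C0000001",
    "cancer": "C0000002",
    "urine test": "C0000003",
}


def build_index(terms):
    """Index each term under one of its tokens (here simply the longest one)."""
    index = defaultdict(list)
    for text, code in terms.items():
        tokens = text.split()
        key = max(tokens, key=len)
        # Remember where the key token sits inside the term.
        index[key].append((tokens, tokens.index(key), code))
    return index


def lookup(window_tokens, index):
    """Return (start, end, code) for every full-text match in one lookup window."""
    found = []
    for i, token in enumerate(window_tokens):
        # Token match: which terms are indexed under this token?
        for term_tokens, key_pos, code in index.get(token, []):
            start = i - key_pos
            end = start + len(term_tokens)
            if start < 0 or end > len(window_tokens):
                continue
            # Full-text match: compare the whole term against the window.
            if window_tokens[start:end] == term_tokens:
                found.append((start, end, code))
    return sorted(found)


index = build_index(TERMS)
window = "patient has lung cancer".split()
matches = lookup(window, index)
print(matches)
```

Indexing by a single token keeps the index small; the more expensive full-text comparison only runs for terms whose key token actually occurs in the window.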
Configuration
There are options available to change the type of term matching used as well as the persistence of terms. Changes in configuration are made in two places:
- The main descriptor ...
-fast/desc/analysis_engine/UmlsLookupAnnotator.xml
- The resource (dictionary) configuration file
resources/.../dictionary/lookup/fast/sno_rx_16ab.xml
(The file name might be different if you created your own custom dictionary)
Text Exact Match
Because the UMLS dictionary contains rows with different combinations of lexical elements per term, a direct string match between the text in the note and the text of a term is a valid approach to term matching. This differs from the complex mechanism in the current (first word) lookup, and makes for simpler code and greater accuracy. This precise specification (and improved lookup speed) enables the use of an entire sentence as a lookup window rather than just a noun phrase. Using Sentence as the lookup window allows all possible tokens to be used not only as lookup keys but also for term matching. For proper accuracy, custom dictionaries should also contain multiple entries for variations of term syntax. Note that term matching is attempted using the actual text in the note and also per-token cTAKES-generated lexical variants of the text in the note. This is the behavior of the ```DefaultJCasTermAnnotator``` class, which is the one used in the ```UmlsLookupAnnotator.xml``` descriptor.
Text Overlap Match
To better approximate the original lookup annotator, one lookup method finds overlapping terms in addition to exactly matching terms. This allows matches on discontiguous spans. For instance, for the text “blood, urine test” the exact match will find only one procedure: “urine test”. The overlap match will find both “urine test” and “blood test”. This is the behavior of the ```OverlapJCasTermAnnotator``` class, which is the one used in the ```UmlsOverlapLookupAnnotator.xml``` descriptor.
All Terms Persistence
All terms discovered by the matchers can be stored in the CAS by a consumer, regardless of any property of the term. This means that for the text “lung cancer” both the specific disease term “lung cancer” and the broader term “cancer” are stored. This can be useful for future searches on general concepts, e.g. searching via the CUI for “cancer” and getting all instances of “cancer” found in texts “lung cancer”, “skin cancer”, “stomach cancer”, etc. This is the behavior of the ```DefaultTermConsumer``` class.
Most Precise Terms Persistence
Alternatively, only the longest overlapping span discovered for each semantic group can be stored. This keeps, for instance, the disease “lung cancer” but not “cancer”. Using semantic groups means that both the disease “lung cancer” and the anatomical site “lung” are persisted even though the spans overlap. When using the overlap matching method, any discontiguous spans are accounted for. So, for “blood, urine test” both the discontiguous spanned term “blood test” and the contiguous spanned term “urine test” are valid. To persist only the most precise terms, edit the xml configuration file for your dictionary (default is ```sno_rx_16ab.xml```): in the rareWordConsumer section, change the selected implementation. By default it is ```DefaultTermConsumer```, but you will want to use the commented-out ```PrecisionTermConsumer```.
Dictionary Stores
The default configuration uses a dictionary that contains a subset of the UMLS in an hsql database. Custom dictionaries can be added using another hsql database, or using a bar-separated value (BSV) (a.k.a. pipe-separated) flat file. If you use a BSV file you do not need to tokenize the terms. Tokenization will be done automatically at runtime.
Lookup Window
By default the new lookup uses Sentence as the lookup window. The primary reasons for this are:
- Not all terms are within Noun Phrases
- Some Noun Phrases overlapped, causing repeated lookups (in my 3.0 candidate trials)
- Not all cTAKES Noun Phrases are accurate.
Because the lookup is fast, using a full Sentence for lookup doesn't seem to hurt much.
However, you can always switch it back to see if precision is increased enough to warrant the decrease in recall.
This is changed in ```UmlsLookupAnnotator.xml```.
Annotation Engines
Piper Files
Finds all-uppercase or normal terms in text.
Source class: CasedAnnotationFinder
Source package: org.apache.ctakes.dictionary.cased.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Dependencies: Base Token, Sentence
Products: Identified Annotation
Parameter | Description | Class | Required | Default |
---|---|---|---|---|
dictionaries | Dictionaries to use for lookup. | String[] | Yes | |
encoders | Term Encoders with schemas and schema codes. | String[] | Yes | |
allowWordSkips | Terms may include words that do not match. So-called loose matching. | String | No | |
consecutiveSkips | Number of consecutive non-comma tokens that can be skipped. | int | No | |
lookupAdjectives | Use Adjective parts of speech for lookup. | String | No | |
lookupAdverbs | Use Adverb parts of speech for lookup. | String | No | |
lookupNouns | Use Noun parts of speech for lookup. | String | No | |
lookupVerbs | Use Verb parts of speech for lookup. | String | No | |
minimumSpan | Minimum number of characters for a term. | int | No | |
otherLookups | List of other parts of speech for lookup. | String[] | No | |
reassignSemantics | Reassign Semantic Types (TUIs) to non-default Semantic Groups. | String[] | No | |
subsume | Subsume contained terms of the same semantic group. | String | No | yes |
subsumeSemantics | Subsume contained terms of the same and certain other semantic groups. | String | No | yes |
totalSkips | Number of total tokens that can be skipped. | int | No | |
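The `subsume` option reads like the "most precise terms" idea described earlier: a term contained in a longer term of the same semantic group is dropped, while terms from other groups survive. The sketch below illustrates only that idea. It is an assumption about the behavior, not the `CasedAnnotationFinder` code, and the function name `subsume`, the tuple layout, and the group labels are hypothetical.

```python
def subsume(annotations):
    """Drop any annotation contained in a longer one of the same semantic group.

    Each annotation is a (begin, end, group, text) tuple of character offsets.
    """
    kept = []
    for a in annotations:
        contained = any(
            b is not a
            and b[2] == a[2]                      # same semantic group
            and b[0] <= a[0] and a[1] <= b[1]     # b covers a
            and (b[1] - b[0]) > (a[1] - a[0])     # b is strictly longer
            for b in annotations
        )
        if not contained:
            kept.append(a)
    return kept


# Matches found in the text "lung cancer" (group labels are illustrative).
found = [
    (0, 11, "disorder", "lung cancer"),
    (5, 11, "disorder", "cancer"),
    (0, 4, "anatomy", "lung"),
]
precise = subsume(found)
print(precise)
```

Here “cancer” is dropped because “lung cancer” covers it within the same group, while “lung” is kept because it belongs to a different group.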
Annotates clinically-relevant terms. Terms must match dictionary entries exactly.
Source class: DefaultJCasTermAnnotator
Source package: org.apache.ctakes.dictionary.lookup2.ae
Parent class: org.apache.ctakes.dictionary.lookup2.ae.AbstractJCasTermAnnotator
Dependencies: Sentence, Base Token
Products: Identified Annotation
Parameter | Description | Class | Required | Default |
---|---|---|---|---|
DictionaryDescriptor | Path to Dictionary spec xml | String | No | org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab.xml |
exclusionTags | Set of exclusion POS tags | String | No | VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,PRP,PRP$,RP,TO,WDT,WP,WPS,WRB |
LookupXml | Path to the xml file containing information for dictionary lookup configuration. | String | No | org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab.xml |
minimumSpan | Minimum number of characters for a term | int | No | |
windowAnnotations | Type of Lookup window to use | String | No | org.apache.ctakes.typesystem.type.textspan.Sentence |
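To make the effect of `exclusionTags` and `minimumSpan` concrete, the following sketch filters part-of-speech tagged tokens before they would be used as lookup keys. It is an illustration only: the function name `candidate_tokens` is hypothetical, the value 3 for the minimum span is chosen for the example (the table lists no default), and applying the minimum to single tokens is a simplification.

```python
# Default exclusion tags copied from the parameter table above.
DEFAULT_EXCLUSION_TAGS = set(
    "VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,"
    "PRP,PRP$,RP,TO,WDT,WP,WPS,WRB".split(",")
)


def candidate_tokens(tagged_tokens, exclusion_tags=DEFAULT_EXCLUSION_TAGS,
                     minimum_span=3):
    """Keep tokens whose POS tag is not excluded and that are long enough."""
    return [
        word
        for word, tag in tagged_tokens
        if tag not in exclusion_tags and len(word) >= minimum_span
    ]


tagged = [
    ("the", "DT"), ("patient", "NN"), ("has", "VBZ"), ("a", "DT"),
    ("history", "NN"), ("of", "IN"), ("lung", "NN"), ("cancer", "NN"),
]
keys = candidate_tokens(tagged)
print(keys)
```

Dropping determiners, prepositions, and verbs up front reduces the number of dictionary index probes per lookup window.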
Annotates clinically-relevant terms. Terms can overlap dictionary entries.
Source class: OverlapJCasTermAnnotator
Source package: org.apache.ctakes.dictionary.lookup2.ae
Parent class: org.apache.ctakes.dictionary.lookup2.ae.AbstractJCasTermAnnotator
Dependencies: Sentence, Base Token
Products: Identified Annotation
Parameter | Description | Class | Required | Default |
---|---|---|---|---|
consecutiveSkips | Number of consecutive non-comma tokens that can be skipped | int | No | |
DictionaryDescriptor | Path to Dictionary spec xml | String | No | org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab.xml |
exclusionTags | Set of exclusion POS tags | String | No | VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,PRP,PRP$,RP,TO,WDT,WP,WPS,WRB |
LookupXml | Path to the xml file containing information for dictionary lookup configuration. | String | No | org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab.xml |
minimumSpan | Minimum number of characters for a term | int | No | |
totalTokenSkips | Number of total tokens that can be skipped | int | No | |
windowAnnotations | Type of Lookup window to use | String | No | org.apache.ctakes.typesystem.type.textspan.Sentence |
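The interplay of `consecutiveSkips` and `totalTokenSkips` can be pictured with the “blood, urine test” example from the overlap matching description. The sketch below is a simplified greedy matcher written for illustration, not the `OverlapJCasTermAnnotator` algorithm. The function name `overlap_match` is hypothetical, the limits 2 and 4 are example values rather than documented defaults, and counting commas toward the total but not toward the consecutive run is an assumption.

```python
def overlap_match(window, term, consecutive_skips=2, total_skips=4):
    """Return the window indices matching `term` in order, or None.

    Tokens between matched term tokens may be skipped, subject to the limits.
    """
    for start, token in enumerate(window):
        if token != term[0]:
            continue
        indices = [start]
        run = total = 0          # consecutive non-comma skips, total skips
        next_term = 1
        pos = start + 1
        while next_term < len(term) and pos < len(window):
            if window[pos] == term[next_term]:
                indices.append(pos)
                next_term += 1
                run = 0
            else:
                total += 1
                if window[pos] != ",":
                    run += 1
                if run > consecutive_skips or total > total_skips:
                    break
            pos += 1
        if next_term == len(term):
            return indices
    return None


window = ["blood", ",", "urine", "test"]
blood_test = overlap_match(window, ["blood", "test"])
urine_test = overlap_match(window, ["urine", "test"])
strict = overlap_match(window, ["blood", "test"], consecutive_skips=0)
print(blood_test, urine_test, strict)
```

With skips allowed, “blood test” is found as a discontiguous span over the first and last tokens; with no consecutive skips allowed, only the contiguous “urine test” remains.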
Annotates clinically-relevant terms. Terms must match dictionary entries exactly.
Source class: ThreadSafeFastLookup
Source package: org.apache.ctakes.dictionary.lookup2.concurrent
Parent class: org.apache.ctakes.dictionary.lookup2.ae.DefaultJCasTermAnnotator
Dependencies: Sentence, Base Token
Products: Identified Annotation
Parameter | Description | Class | Required | Default |
---|---|---|---|---|
DictionaryDescriptor | Path to Dictionary spec xml | String | No | org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab.xml |
exclusionTags | Set of exclusion POS tags | String | No | VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,PRP,PRP$,RP,TO,WDT,WP,WPS,WRB |
LookupXml | Path to the xml file containing information for dictionary lookup configuration. | String | No | org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab.xml |
minimumSpan | Minimum number of characters for a term | int | No | |
windowAnnotations | Type of Lookup window to use | String | No | org.apache.ctakes.typesystem.type.textspan.Sentence |
Commands and parameters to create a dictionary lookup sub-pipeline.
// Commands and parameters to create a dictionary lookup sub-pipeline.
// This is not a full pipeline.
// path to the xml file containing information for dictionary lookup configuration.
cli LookupXml=l
// umls credentials
cli umlsKey=key
// Default fast dictionary lookup
add DefaultJCasTermAnnotator
Commands and parameters to create a default dictionary lookup sub-pipeline.
// Commands and parameters to create a default dictionary lookup sub-pipeline.
// This is not a full pipeline.
// path to the xml file containing information for dictionary lookup configuration.
cli LookupXml=l
// umls credentials
cli umlsKey=key
// Default fast dictionary lookup
add concurrent.ThreadSafeFastLookup