Dictionary Creator - apache/ctakes GitHub Wiki

The default configuration of Dictionary Lookup uses an hsqldb database containing terms and normalized codes (CUIs).  Dictionary databases containing the information most commonly wanted from the UMLS are available at SourceForge.

However, there may be cases for which the standard dictionaries are not applicable.  For this reason, cTAKES has a GUI that can assist in the creation of custom dictionaries.  The GUI currently allows only the most basic customization: desired source vocabularies, semantic types, and additional vocabulary codes of interest.

    * Greater customization is available, but requires the editing of property files and is outside the scope of this document.

UMLS Installation

The Dictionary Creator GUI requires a local installation of UMLS. UMLS releases can be downloaded from the NLM website as zip files. Each zip file includes a utility called MetamorphoSys that allows you to choose which UMLS vocabularies to install on your machine. Any vocabulary you want to include in your custom dictionary must be selected as part of this installation. You can also select from recommended default vocabulary lists. Note that synonyms are drawn from every installed vocabulary: if a CUI appears in a vocabulary you choose in the Source column of the Dictionary Creator GUI, all synonyms for that CUI from all installed vocabularies will be included in the cTAKES dictionary. This is worth considering because you might not want synonyms from all UMLS vocabularies. For example, some vocabularies include slang or abbreviations that you may or may not want depending on your use case.
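Because every installed vocabulary can contribute synonyms, it can help to check which vocabularies a MetamorphoSys installation actually contains before building a dictionary. The following is a minimal sketch, not part of cTAKES: it assumes the standard pipe-delimited MRCONSO.RRF layout, in which the source abbreviation (SAB) is the twelfth field, and the function name `count_installed_vocabularies` is hypothetical.

```python
from collections import Counter
from pathlib import Path

# Zero-based position of the source-vocabulary abbreviation (SAB)
# in a pipe-delimited MRCONSO.RRF row.
SAB_COLUMN = 11


def count_installed_vocabularies(umls_dir):
    """Count MRCONSO.RRF rows per source vocabulary (SAB).

    umls_dir is the directory that contains the META/ subdirectory.
    """
    counts = Counter()
    mrconso = Path(umls_dir) / "META" / "MRCONSO.RRF"
    with open(mrconso, encoding="utf-8") as rrf:
        for line in rrf:
            fields = line.rstrip("\n").split("|")
            # Skip blank or truncated lines instead of failing on them.
            if len(fields) > SAB_COLUMN:
                counts[fields[SAB_COLUMN]] += 1
    return counts
```

Calling `count_installed_vocabularies("/path/to/umls").most_common()` (the path is a placeholder) lists the installed vocabularies by row count, which makes unexpected sources of synonyms easy to spot.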

Step-by-step guide

  1. From a command line in the cTAKES root directory, execute:   bin\runDictionaryCreator
      

  2. Select a cTAKES installation directory.  The default directory should be correct.

  3. Select a UMLS installation directory.  This is the directory containing the META/ subdirectory (which contains RRF files). Specifically, the MRCONSO.RRF and MRSTY.RRF files are used.
    After selecting the UMLS installation directory, the available vocabularies are gathered.
      

  4. Select Source Vocabularies.  Source vocabularies contain CUIs that interest you. Selecting a vocabulary will include all CUIs that exist in that source, and all the synonyms for those CUIs from all installed vocabularies. 

  5. Select Target Vocabularies.  The dictionary will contain target vocabulary-specific codes for any vocabularies selected here. The most likely scenario is that you want to choose a vocabulary as both Source and Target. This includes the strings associated with the terms from that vocabulary as well as that vocabulary's specific codes. If the vocabulary is not included as a Target, the Dictionary Lookup will only be able to populate the results with the UMLS CUIs, rather than the specific codes from that vocabulary.

  6. Select Semantic Types.  The standard cTAKES types are selected by default. Each UMLS term is assigned a semantic type, designated by a TUI. Not all semantic types are clinical in nature. The TUIs you choose will limit the terms from Source vocabularies that will be included in your cTAKES dictionary. Semantic types outside the defaults may be of interest depending on your use case. The Dictionary Lookup uses the semantic type of a UMLS term to determine which type of mention in the cTAKES type system to assign to an instance of that term found in the text. If a TUI does not have a mapping to a specific cTAKES mention type, the term will be assigned a type of EntityMention.

  7. Type a Dictionary Name.  Use all lower case.

  8. Click Build Dictionary.
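The effect of the Source, Target, and Semantic Type selections in steps 4 through 6 can be illustrated with a short script over the two RRF files named in step 3. This is only a sketch of the selection rules as described above, not the Dictionary Creator's actual implementation: it ignores language filtering and the conflict-resolution heuristics the GUI applies, it assumes the standard MRCONSO.RRF and MRSTY.RRF column layouts, and the names `read_rrf` and `select_terms` are hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# Zero-based column positions in the pipe-delimited RRF files.
CUI, SAB, CODE, STR = 0, 11, 13, 14   # MRCONSO.RRF
STY_CUI, STY_TUI = 0, 1               # MRSTY.RRF


def read_rrf(path, min_fields):
    """Yield each RRF row as a list of fields, skipping short lines."""
    with open(path, encoding="utf-8") as rrf:
        for line in rrf:
            fields = line.rstrip("\n").split("|")
            if len(fields) >= min_fields:
                yield fields


def select_terms(umls_dir, sources, targets, tuis):
    """Return {cui: {"strings": set of str, "codes": set of (vocabulary, code)}}."""
    meta = Path(umls_dir) / "META"
    mrconso = meta / "MRCONSO.RRF"

    # Step 6: keep only CUIs that carry at least one selected TUI.
    typed = {row[STY_CUI]
             for row in read_rrf(meta / "MRSTY.RRF", STY_TUI + 1)
             if row[STY_TUI] in tuis}

    # Step 4: keep only CUIs that occur in at least one Source vocabulary.
    wanted = {row[CUI]
              for row in read_rrf(mrconso, STR + 1)
              if row[SAB] in sources and row[CUI] in typed}

    # Second pass over MRCONSO.RRF: gather strings and target codes.
    terms = defaultdict(lambda: {"strings": set(), "codes": set()})
    for row in read_rrf(mrconso, STR + 1):
        if row[CUI] not in wanted:
            continue
        # Synonyms come from every installed vocabulary, not only the sources.
        terms[row[CUI]]["strings"].add(row[STR])
        # Step 5: vocabulary-specific codes are kept only for Targets.
        if row[SAB] in targets:
            terms[row[CUI]]["codes"].add((row[SAB], row[CODE]))
    return dict(terms)
```

For example, `select_terms(umls_dir, {"SNOMEDCT_US"}, {"SNOMEDCT_US"}, {"T047"})` would keep the CUIs found in SNOMEDCT_US with semantic type T047, attach every installed synonym for them, and record only SNOMEDCT_US codes. MRCONSO.RRF is read twice so that the whole file never has to be held in memory.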

The dictionary will be created and stored in CTAKES_HOME/resources/org/apache/ctakes/dictionary/lookup/fast/DictionaryName/. The main file is a .script file that includes the SQL commands used to automatically populate the hsqldb database at runtime. There will also be an XML descriptor file in CTAKES_HOME/resources/org/apache/ctakes/dictionary/lookup/fast/DictionaryName.xml that describes the dictionary and contains some additional settings that can determine how the Dictionary Lookup behaves. See the Fast Dictionary Lookup Component page for more details.
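After a build, a quick check that the expected files are in place can save a confusing pipeline failure later. The sketch below only follows the locations given in the paragraph above; the function name `locate_dictionary` and the shape of its return value are hypothetical, and it makes no assumption about the exact name of the .script file.

```python
from pathlib import Path

# Resource path of the fast dictionary lookup package, as given above.
FAST_PACKAGE = "org/apache/ctakes/dictionary/lookup/fast"


def locate_dictionary(ctakes_home, name):
    """Report where a built dictionary's files are expected and what exists."""
    package_dir = Path(ctakes_home) / "resources" / FAST_PACKAGE
    descriptor = package_dir / f"{name}.xml"
    dictionary_dir = package_dir / name
    scripts = []
    if dictionary_dir.is_dir():
        scripts = sorted(p.name for p in dictionary_dir.glob("*.script"))
    return {
        # Resource-relative descriptor path, as opposed to a filesystem path.
        "lookup_xml": f"{FAST_PACKAGE}/{name}.xml",
        "descriptor_exists": descriptor.is_file(),
        "script_files": scripts,
    }
```

If `descriptor_exists` is false or `script_files` is empty for the name typed in step 7, the build probably did not finish or was written under a different cTAKES installation directory.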

The Dictionary Creator GUI performs some heuristics during dictionary creation to attempt to resolve conflicts for strings that resolve to multiple terms. This means the vocabularies installed within your UMLS installation and the semantic types selected can have subtle effects on the resulting dictionary as the Dictionary Creator GUI attempts to assign strings to specific CUIs.

Once a new dictionary has been built, point cTAKES to it in one of two ways:

Set the fast dictionary parameter LookupXml to org/apache/ctakes/dictionary/lookup/fast/DictionaryName.xml. You can change this in the piper file: 

add DefaultJCasTermAnnotator LookupXml=org/apache/ctakes/dictionary/lookup/fast/DictionaryName.xml 

or

Set the runClinicalPipeline or runPiperFile command-line parameter -l to org/apache/ctakes/dictionary/lookup/fast/DictionaryName.xml

UMLS License

Please ensure that you comply with the UMLS License.
