General Structure, Download and Use - OHDSI/Vocabulary-v5.0 GitHub Wiki

Authors: Christian Reich, Alexander Davydov, Katy Sadowski, Anna Ostropolets

Version: 2.0

Date last updated: 12/09/23

The OHDSI Standardized Vocabularies combine a number of different vocabularies that are used for different aspects of recording healthcare information. Because they serve different purposes, these vocabularies differ in format, quality, comprehensiveness, coverage, and life cycle. To bring some order into this variability, a number of structural elements were introduced and applied across the vocabularies; they are described below.

Note: Even though the content of the vocabularies is left intact, the OMOP CDM format forces a different representation of each vocabulary compared to its native form. If you need a vocabulary in its native form, do not use the OHDSI Standardized Vocabularies. They are explicitly meant to be used for building an OMOP CDM, and in some cases (such as for SNOMED) specific license restrictions limit their use to within an OMOP CDM.

The Vocabularies should not be used for purposes of individual patient healthcare.

Availability, License, Download

Downloading the Vocabularies

Please visit https://athena.ohdsi.org and create an account. The site will validate your email address. A work email is preferred, and it is required if you want to request licensed vocabularies.

Once your account is created you can click on the “Download” button, which will transfer you to the page listing the vocabularies to choose from. Vocabularies with standard concepts are pre-selected. Mandatory vocabularies needed for the functioning of the CDM cannot be unselected and are grayed out. You should select the vocabularies you need for your OMOP CDM instance.

Licensed (proprietary) vocabularies cannot be selected until you show proof of a current active license. You cannot get a license from OHDSI; you need to obtain it from the license holder directly. If you need help contacting these organizations, contact us at [email protected]. Once you have obtained a license, or if you or your organization already have one, click on the key symbol. The Vocabulary Team will contact you for proof of the license (usually within a week, but sometimes it takes longer). Once you are cleared, the vocabulary will no longer show the key symbol and can be selected.

Preparing the files and uploading them into OMOP CDM

Within a few hours of clicking “Submit”, you will receive an email with a link to a zip file. Click on the provided link to download the zip file. Typical file sizes, depending on the number of vocabularies you selected, are between 30 and 250 MB.

  • Unzip the file
  • Reconstitute the CPT-4 vocabulary

The latter requires some explanation. With the exception of CPT-4, vocabularies are fully represented in the downloaded zip file. However, OHDSI does not have a distribution license to ship CPT-4 codes together with their descriptions. You therefore need to get the CPT-4 descriptions from an organization that can distribute them, such as the National Library of Medicine (NLM) through the UMLS. We are sorry about that inconvenience (which it clearly is), but that is what we have to abide by. To make it easy for you, we provide a utility that does all of that for you: it contacts the UMLS API using your UMLS key, downloads the descriptions for all CPT-4 concepts, and merges them in with everything else.

To run the reconstitution script:

  • Open a command line in the directory you unpacked all the files into.
  • Run java -Dumls-apikey=xxx -jar cpt4.jar 5

Replace "xxx" with your UMLS API key. If you don't have a UMLS API key, you can sign up and request one from the UMLS login page. It may take up to 3 business days to receive your key.

Note: The reconstitution script may take multiple hours to run due to the way the UMLS API handles calls. If the script fails, try restarting it (multiple attempts may be needed). While the Vocabulary Team works on ways to improve this experience, you may post on the forums if you are experiencing issues.
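Since several restarts may be needed, the retry can be automated. The following is a minimal sketch, not part of the OHDSI tooling: the function name `run_with_retries`, the attempt count, and the wait time are illustrative, and it assumes the utility exits with a non-zero code when it fails.

```python
import subprocess
import time


def run_with_retries(cmd, max_attempts=5, wait_seconds=60,
                     runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with code 0 or max_attempts is reached.

    Returns the 1-based number of the attempt that succeeded.
    Assumption: a failed run is signalled by a non-zero exit code.
    """
    for attempt in range(1, max_attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            sleep(wait_seconds)  # pause before the next attempt
    raise RuntimeError(f"command failed after {max_attempts} attempts")


# Example call (replace xxx with your UMLS API key), run from the
# directory that contains the unpacked files:
# run_with_retries(["java", "-Dumls-apikey=xxx", "-jar", "cpt4.jar", "5"])
```

The `runner` and `sleep` parameters exist only so the function can be exercised without launching Java or waiting.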

After the script has finished successfully, you are ready to load the data into your database.

  • Create the OMOP CDM vocabulary tables in your target database (they may already be there from creating the instance). If you have not created them yet, find the scripts and instructions for creating OMOP CDM tables here.
  • Load the unpacked and reconstituted vocabulary files into the OMOP CDM tables. Scripts for importing the vocabulary csv files into your OMOP CDM vocabulary tables can be found here. They are provided for all supported SQL dialects in the respective folders (e.g. Oracle/, PostgreSQL/, SQL Server/), each in the subfolder VocabImport/. If you have a non-supported SQL dialect, you may have to adapt one of the scripts yourself or ask the community for help. Chances are somebody has done it already.

Alternatively, the ETL-Synthea R package contains a function LoadVocabFromCsv which can be used for the same purpose.
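For a quick local experiment, the load step can also be sketched in Python with SQLite. This is an illustration only, not one of the provided import scripts. It assumes that the files are tab-delimited with a lowercase header row and unquoted fields, and that empty fields stand for NULL; check your own download before relying on this. The sample rows and the vocabulary name "ExampleVocab" are made up.

```python
import csv
import io
import sqlite3

CONCEPT_COLUMNS = [
    "concept_id", "concept_name", "domain_id", "vocabulary_id",
    "concept_class_id", "standard_concept", "concept_code",
    "valid_start_date", "valid_end_date", "invalid_reason",
]


def load_concepts(conn, lines):
    """Load tab-delimited CONCEPT rows (header included) into SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS concept ("
        "concept_id INTEGER, concept_name TEXT, domain_id TEXT, "
        "vocabulary_id TEXT, concept_class_id TEXT, standard_concept TEXT, "
        "concept_code TEXT, valid_start_date TEXT, valid_end_date TEXT, "
        "invalid_reason TEXT)"
    )
    # QUOTE_NONE keeps quote characters inside names as literal text.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    count = 0
    for row in reader:
        values = [row[col] if row[col] != "" else None
                  for col in CONCEPT_COLUMNS]
        values[0] = int(values[0])
        conn.execute("INSERT INTO concept VALUES (?,?,?,?,?,?,?,?,?,?)",
                     values)
        count += 1
    conn.commit()
    return count


# Made-up sample rows; with a real file you would instead use
# open("CONCEPT.csv", newline="", encoding="utf-8").
SAMPLE = (
    "concept_id\tconcept_name\tdomain_id\tvocabulary_id\tconcept_class_id\t"
    "standard_concept\tconcept_code\tvalid_start_date\tvalid_end_date\t"
    "invalid_reason\n"
    "900001\tExample \"quoted\" drug\tDrug\tExampleVocab\tIngredient\tS\t"
    "X1\t19700101\t20991231\t\n"
    "900002\tExample source code\tCondition\tExampleVocab\tExample Class\t\t"
    "Y2\t19700101\t20991231\t\n"
)

demo_conn = sqlite3.connect(":memory:")
loaded = load_concepts(demo_conn, io.StringIO(SAMPLE))
```

For production loads, the dialect-specific import scripts remain the better choice, since they also handle data types and constraints.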

Domain and Vocabularies

The Standardized Vocabularies are organized into Domains and Vocabularies. A Domain refers to the nature or type of a clinical entity. It also defines the CDM data table where a data record is to be stored. Vocabularies are sets of concepts imported from an existing external national or international standard, or created de novo by the Vocabulary Team if no suitable standard is available.

Note: There is no one-to-one relationship between domains and vocabularies. Some vocabularies are very broad, such as SNOMED or Read, and contain concepts of all medical domains. Other vocabularies are specific to a certain domain, such as RxNorm for Drugs or ICD9Proc for Procedures. In many cases, Vocabularies are generally assumed in the community to be of a single domain, when in fact they are not. For example, CPT4 and HCPCS are expected by their name to contain Procedure codes only, but in reality contain Observation, Condition, Device, and Drug concepts.
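To see how the concepts of each vocabulary spread over domains in your own download, a simple count per vocabulary and domain is enough. The sketch below runs such a query against an in-memory SQLite table; the rows are made up for illustration and the function name `domains_per_vocabulary` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE concept (concept_id INTEGER, vocabulary_id TEXT, "
    "domain_id TEXT)"
)
# Made-up rows for illustration only; real counts come from your own
# CONCEPT table.
conn.executemany("INSERT INTO concept VALUES (?, ?, ?)", [
    (1, "CPT4", "Procedure"), (2, "CPT4", "Observation"),
    (3, "CPT4", "Drug"), (4, "RxNorm", "Drug"), (5, "RxNorm", "Drug"),
])


def domains_per_vocabulary(conn):
    """Return (vocabulary_id, domain_id, concept count) tuples."""
    return conn.execute(
        "SELECT vocabulary_id, domain_id, COUNT(*) FROM concept "
        "GROUP BY vocabulary_id, domain_id "
        "ORDER BY vocabulary_id, domain_id"
    ).fetchall()


counts = domains_per_vocabulary(conn)
```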

[Figure: Domains and Vocabs]

Standard, Classification, and Source Concepts

Within a domain, codes come from a number of vocabularies, and these codes often have identical or overlapping meanings. To bring order to this situation, each concept is assigned one of three designations:

Standard Concept (standard_concept = 'S')

Standard Concepts are the “official” Concepts that are to be used to represent a unique clinical entity in the standardized clinical data tables of the OMOP CDM. Their concept ID is written to the respective *_concept_id field. Usually, Standard Concepts are sourced from well-established vocabularies that have comprehensive coverage of the domain and whose concepts are well-defined. For example, in the Condition domain this is achieved through the SNOMED vocabulary. If no comprehensive list of entities is available in a certain domain, such as the Device domain, Standard Concepts come from a variety of different Vocabularies. The same is true for the Procedure and Visit domains.

Classification Concepts (standard_concept = 'C')

These have a hierarchical relationship to Standard Concepts and can therefore be used to query for Standard Concepts using the records of the CONCEPT_ANCESTOR table. However, they themselves cannot appear in the data tables. For example, the concept 4283987 “ANTICOAGULANTS” of the Vocabulary “VA Class” cannot appear in the DRUG_EXPOSURE or DRUG_ERA tables, but its descendant concepts that have the Concept Class “Ingredient”, “Clinical Drug” or “Branded Drug” can.

Classification Concepts may be sourced from different Vocabularies than the Standard Concepts. Note that Classification Concepts are not unique. For example, there are Concepts for the Drug Class “Anticoagulants” coming from the NDFRT, VA Class, ETC, and ATC vocabularies. Also, note that the membership depends on the Vocabulary. In most cases, the membership list of equivalent Classification Concepts is similar or identical, but medical science does not provide a generally agreed upon standard definition of these classes.
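Querying through a Classification Concept can be sketched as a join between CONCEPT_ANCESTOR and CONCEPT. The example below uses an in-memory SQLite database. The concept 4283987 is taken from the text above; the descendant concept IDs 900001 and 900002, their names, and the function name `standard_descendants` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE concept (concept_id INTEGER, concept_name TEXT, "
    "vocabulary_id TEXT, concept_class_id TEXT, standard_concept TEXT)"
)
conn.execute(
    "CREATE TABLE concept_ancestor (ancestor_concept_id INTEGER, "
    "descendant_concept_id INTEGER, min_levels_of_separation INTEGER, "
    "max_levels_of_separation INTEGER)"
)
conn.executemany("INSERT INTO concept VALUES (?, ?, ?, ?, ?)", [
    (4283987, "ANTICOAGULANTS", "VA Class", "VA Class", "C"),
    # Hypothetical descendants, not real concept IDs:
    (900001, "example ingredient", "RxNorm", "Ingredient", "S"),
    (900002, "example clinical drug", "RxNorm", "Clinical Drug", "S"),
])
conn.executemany("INSERT INTO concept_ancestor VALUES (?, ?, ?, ?)", [
    (4283987, 4283987, 0, 0),
    (4283987, 900001, 1, 1),
    (4283987, 900002, 2, 2),
])


def standard_descendants(conn, ancestor_concept_id):
    """Return (concept_id, concept_class_id) of Standard descendants."""
    return conn.execute(
        "SELECT c.concept_id, c.concept_class_id "
        "FROM concept_ancestor ca "
        "JOIN concept c ON c.concept_id = ca.descendant_concept_id "
        "WHERE ca.ancestor_concept_id = ? AND c.standard_concept = 'S' "
        "ORDER BY c.concept_id",
        (ancestor_concept_id,),
    ).fetchall()


descendants = standard_descendants(conn, 4283987)
```

The resulting concept IDs are the ones to look for in drug_concept_id of DRUG_EXPOSURE or DRUG_ERA. The Classification Concept itself is filtered out by the standard_concept = 'S' condition.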

Source Concepts (standard_concept = NULL)

These are all remaining Concepts that are not Standard or Classification Concepts. Note that Concepts can change their designation over time: if they are deprecated (valid_end_date is less than 2099-12-31 and invalid_reason = 'D' or 'U'), formerly Standard or Classification Concepts will turn into Source Concepts.

Source Concepts can only appear in the *_source_concept_id fields of the data tables. They represent the code in the source data. Each Source Concept is mapped to one or more Standard Concepts during the ETL process. If no mapping is available, the Standard Concept with the concept_id = 0 is written into the *_concept_id field. See the OMOP CDM Conventions for details.
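These rules lend themselves to a simple data quality check: any non-zero value in a *_concept_id field should point to a Standard Concept. The sketch below shows one way to find violations for CONDITION_OCCURRENCE in an in-memory SQLite database. All concept IDs and rows are made up, and the function name `find_nonstandard_condition_concepts` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE concept (concept_id INTEGER, standard_concept TEXT)"
)
conn.execute(
    "CREATE TABLE condition_occurrence (condition_occurrence_id INTEGER, "
    "condition_concept_id INTEGER)"
)
# Made-up concepts: one Standard, one Source, one Classification.
conn.executemany("INSERT INTO concept VALUES (?, ?)", [
    (2001, "S"), (1001, None), (3001, "C"),
])
conn.executemany("INSERT INTO condition_occurrence VALUES (?, ?)", [
    (1, 2001),  # fine: Standard Concept
    (2, 1001),  # violation: Source Concept in *_concept_id
    (3, 0),     # fine: no mapping available
    (4, 3001),  # violation: Classification Concept
])


def find_nonstandard_condition_concepts(conn):
    """Return rows whose condition_concept_id is neither 0 nor Standard."""
    return conn.execute(
        "SELECT co.condition_occurrence_id, co.condition_concept_id "
        "FROM condition_occurrence co "
        "LEFT JOIN concept c ON c.concept_id = co.condition_concept_id "
        "WHERE co.condition_concept_id <> 0 "
        "AND (c.standard_concept IS NULL OR c.standard_concept <> 'S') "
        "ORDER BY co.condition_occurrence_id"
    ).fetchall()


violations = find_nonstandard_condition_concepts(conn)
```

Because of the LEFT JOIN, concept IDs that are missing from the CONCEPT table altogether are reported as well.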

For all Concepts in a domain, this creates the following logical structure:

[Figure: Concept Structure]

Source Concepts are mapped to Standard Concepts, and these have relationships of various semantic natures to each other. In addition, they have hierarchical relationships to each other and to the Classification Concepts, which in this case are derived from the vocabularies A, B, and C. Hierarchical relationships amongst Classification Concepts generally only exist between Concepts of the same vocabulary.

All concepts are stored in the CONCEPT table. Find more details about the vocabulary tables in the CDM documentation.

The designation of a concept as Standard, Classification, or Source depends on the particular concepts in a vocabulary. See the specifications for more details on each vocabulary.

Relationships

We map the non-standard source concepts to standard ones (relationship_id "Maps to"). Source concepts without clear semantic content or outside the realm of observational research are not mapped. We adopt mappings from external sources or create them de novo. For successful standardization we aim at comprehensive coverage of all source concepts, which is a substantial task of importing, reviewing, modifying, and validating the maps.

We connect standard and classification concepts through polyhierarchies, defined as hierarchical trees allowing for more than one parent per concept. Non-standard concepts are not included in hierarchies, even though they may have a hierarchy in their source vocabulary. For example, ICD10 concepts, all of which are non-standard, come with a simple hierarchy that is not included in our polyhierarchy. On the other hand, standard SNOMED-CT concepts already have an internal hierarchy, to which we append ICDO3 concepts, forming a common hierarchical structure for the Condition domain. As with mapping relationships, we aim at building a comprehensive hierarchical structure, which requires substantial generation, review, and validation of hierarchical relationships.

Non-hierarchical (often called “part-of”) relationships are not curated by OHDSI but may be imported for convenience if they are available from the source vocabulary. We make no attempt to create a comprehensive semantic knowledge base of non-mapping or hierarchical relationships between concepts. All relationships are stored in the CONCEPT_RELATIONSHIP table, and all generation-spanning hierarchical relationships are stored in the CONCEPT_ANCESTOR table, which we build based on the content of the CONCEPT_RELATIONSHIP table.
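The idea of deriving generation-spanning rows from direct parent-child relationships can be sketched as follows. This is a simplified illustration and not the procedure used by the Vocabulary Team: it assumes that only rows with relationship_id 'Subsumes' (concept_id_1 is the parent of concept_id_2) define the hierarchy, that the hierarchy contains no cycles, and that every concept also gets a row pointing to itself with zero levels of separation. The function name `build_ancestor_rows` is hypothetical.

```python
from collections import defaultdict


def build_ancestor_rows(relationships):
    """Derive ancestor rows from direct parent-child relationships.

    relationships: iterable of (concept_id_1, relationship_id, concept_id_2).
    Returns a sorted list of tuples
    (ancestor, descendant, min_levels_of_separation, max_levels_of_separation).
    Assumes the 'Subsumes' graph is acyclic.
    """
    children = defaultdict(set)
    nodes = set()
    for concept_id_1, relationship_id, concept_id_2 in relationships:
        if relationship_id == "Subsumes":
            children[concept_id_1].add(concept_id_2)
            nodes.update((concept_id_1, concept_id_2))

    rows = []
    for ancestor in nodes:
        levels = {ancestor: [0, 0]}  # descendant -> [min, max] depth
        stack = [(ancestor, 0)]
        while stack:
            # Every path is followed, so a concept reached through several
            # parents gets both its shortest and its longest distance.
            node, depth = stack.pop()
            for child in children.get(node, ()):
                d = depth + 1
                if child in levels:
                    levels[child][0] = min(levels[child][0], d)
                    levels[child][1] = max(levels[child][1], d)
                else:
                    levels[child] = [d, d]
                stack.append((child, d))
        rows.extend((ancestor, desc, lo, hi)
                    for desc, (lo, hi) in levels.items())
    return sorted(rows)
```

Following every path is exponential in the worst case, which is acceptable for a small illustration but not for a full vocabulary load.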

Data ETL

The most important impact the Standardized Vocabularies have on the ETL process from raw to CDM-formatted data is the Domain of each Concept. Irrespective of which source table a record comes from, or what coding scheme it is represented by, the destination table is determined by the domain_id of the respective Concept. Any ETL will have to apply the following logic to process every record in the source data:

  • Retrieve the source code from the record.
  • Find the Concept that corresponds to the source code. In most cases, that is done by looking up the source code in the concept_code field of the CONCEPT table, restricted to the correct value of the vocabulary_id field. Sometimes, the source code needs to be manipulated to find the right match. For example, ICD-9-CM codes have a dot after the 2nd or 3rd character, but in source data they are often stored without the dot. Similarly, NDC codes come in a variety of formats (with dashes and asterisks or without, 9 or 11 digit), making the mapping process more complicated. Look into the specification of each Vocabulary for specific recommendations on the lookup process.
  • Map to a Standard Concept (standard_concept='S') by retrieving all the active (invalid_reason field should be NULL) records from the CONCEPT_RELATIONSHIP table. Use concept_id_1 for the Concept you want to map and concept_id_2 for the destination Concept, with relationship_id='Maps to'. If the source code is a local code and you have a SOURCE_TO_CONCEPT_MAP table for these, determine the destination Concept from the target_concept_id field. The destination Concept could be the Concept itself if it happens to be a Standard Concept, or a Concept in another vocabulary. In most cases, a source code maps to a single destination Standard Concept, but in some cases it could be two or three.
  • Write a record in the corresponding CDM table for each destination Standard Concept based on the content of the domain_id field. Place the ID of the Standard Concept into the <domain>_concept_id, the Source Concept into the <domain>_source_concept_id and the source code into the <domain>_source_value field. Most data tables require a start_date and a Type Concept, and some have more fields that need consideration. See the details of each table in the CDM Specifications. The corresponding table/field combination for each Domain is as follows:
| domain_id | CDM table | Field | Comment |
| --- | --- | --- | --- |
| Generic | Any | Any | Generic Concepts can be in any field that ends in concept_id. |
| Gender | PERSON | gender_concept_id | |
| Race | PERSON | race_concept_id | |
| Ethnicity | PERSON | ethnicity_concept_id | |
| Visit | VISIT_OCCURRENCE | visit_concept_id | |
| Procedure | PROCEDURE_OCCURRENCE | procedure_concept_id | |
| Modifier | PROCEDURE_OCCURRENCE | modifier_concept_id | |
| Drug | DRUG_EXPOSURE | drug_concept_id | |
| Route | DRUG_EXPOSURE | route_concept_id | |
| Unit | MEASUREMENT or OBSERVATION or SPECIMEN | unit_concept_id | Units are used in different contexts. * |
| Device | DEVICE_EXPOSURE | device_concept_id | |
| Condition | CONDITION_OCCURRENCE | condition_concept_id | |
| Measurement | MEASUREMENT | measurement_concept_id | |
| Meas Value Operator | MEASUREMENT | operator_concept_id | |
| Meas Value | MEASUREMENT | value_as_concept_id | |
| Observation | OBSERVATION | observation_concept_id | |
| Relationship | FACT_RELATIONSHIP | relationship_concept_id | |
| Place of Service | CARE_SITE | place_of_service_concept_id | |
| Provider Specialty | PROVIDER | specialty_concept_id | |
| Currency | MEASUREMENT or OBSERVATION or SPECIMEN | currency_concept_id | Currency values appear in any of the *_COST tables. * |
| Revenue Code | PROCEDURE_COST | revenue_code_concept_id | |
| Specimen | SPECIMEN | specimen_concept_id | |
| Spec Anatomic Site | SPECIMEN or MEASUREMENT or OBSERVATION | anatomic_site_concept_id or value_as_concept_id or value_as_concept_id | Anatomical Site Concepts are used to characterize the origin of a Specimen, but also the result of a Measurement or Observation. * |
| Spec Disease Status | SPECIMEN | disease_status_concept_id | |
  • If there is more than one potential destination table (the rows marked with * above), the ETL needs to identify the context in which a Standard Concept is used and select the right destination from the table above.

  • For some Source Concepts there is no mapping to a Standard Concept (there is no record in the CONCEPT_RELATIONSHIP or SOURCE_TO_CONCEPT_MAP tables), usually because the Source Concept is too generic or otherwise ill-defined. In these cases, the domain_id field of the Source Concept itself should be used to place the record in the right table. Some Source Concepts have combination Domains, such as “Device/Procedure”. In these cases, write a record into each of the tables of the combination Domain (in this case the DEVICE_EXPOSURE and PROCEDURE_OCCURRENCE tables). As the Standard Concept, write Concept 0 (concept_id=0) into the respective <domain>_concept_id field.
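
The steps above can be sketched in Python against an in-memory SQLite database. This is an illustration under stated assumptions, not a reference ETL: the function name `route_source_code`, all concept IDs, and the HCPCS code "X0000" are made up, only a subset of the domain table is included, the lookup simply ignores dots when comparing codes, and local codes handled through SOURCE_TO_CONCEPT_MAP are left out.

```python
import sqlite3

# Subset of the domain-to-table routing listed above.
DOMAIN_TARGET = {
    "Condition": ("CONDITION_OCCURRENCE", "condition_concept_id"),
    "Procedure": ("PROCEDURE_OCCURRENCE", "procedure_concept_id"),
    "Drug": ("DRUG_EXPOSURE", "drug_concept_id"),
    "Device": ("DEVICE_EXPOSURE", "device_concept_id"),
    "Measurement": ("MEASUREMENT", "measurement_concept_id"),
    "Observation": ("OBSERVATION", "observation_concept_id"),
}


def route_source_code(conn, vocabulary_id, source_code):
    """Return one record description per destination table."""
    # Find the Source Concept, ignoring dots in the code.
    src = conn.execute(
        "SELECT concept_id, domain_id FROM concept "
        "WHERE vocabulary_id = ? "
        "AND REPLACE(concept_code, '.', '') = REPLACE(?, '.', '')",
        (vocabulary_id, source_code),
    ).fetchone()
    if src is None:
        return []  # unknown code: left to the caller in this sketch
    source_concept_id, source_domain = src

    # Follow active 'Maps to' relationships to Standard Concepts.
    targets = conn.execute(
        "SELECT c.concept_id, c.domain_id FROM concept_relationship cr "
        "JOIN concept c ON c.concept_id = cr.concept_id_2 "
        "WHERE cr.concept_id_1 = ? AND cr.relationship_id = 'Maps to' "
        "AND cr.invalid_reason IS NULL AND c.standard_concept = 'S' "
        "ORDER BY c.concept_id",
        (source_concept_id,),
    ).fetchall()

    # No mapping: use the Source Concept's own domain(s) and concept 0.
    if not targets:
        targets = [(0, d) for d in source_domain.split("/")]

    records = []
    for concept_id, domain_id in targets:
        table, field = DOMAIN_TARGET[domain_id]  # KeyError if not listed
        records.append({
            "table": table,
            "concept_field": field,
            "concept_id": concept_id,
            "source_concept_id": source_concept_id,
            "source_value": source_code,
        })
    return records


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE concept (concept_id INTEGER, vocabulary_id TEXT, "
    "concept_code TEXT, domain_id TEXT, standard_concept TEXT)"
)
conn.execute(
    "CREATE TABLE concept_relationship (concept_id_1 INTEGER, "
    "concept_id_2 INTEGER, relationship_id TEXT, invalid_reason TEXT)"
)
# Made-up rows for illustration only.
conn.executemany("INSERT INTO concept VALUES (?, ?, ?, ?, ?)", [
    (1001, "ICD9CM", "250.00", "Condition", None),
    (2001, "SNOMED", "S-1", "Condition", "S"),
    (1002, "HCPCS", "X0000", "Device/Procedure", None),
])
conn.execute(
    "INSERT INTO concept_relationship VALUES (1001, 2001, 'Maps to', NULL)"
)

mapped = route_source_code(conn, "ICD9CM", "25000")
unmapped = route_source_code(conn, "HCPCS", "X0000")
```

In this sketch the ICD9CM code yields a single CONDITION_OCCURRENCE record pointing to the Standard Concept, while the unmapped “Device/Procedure” code yields one record each for DEVICE_EXPOSURE and PROCEDURE_OCCURRENCE with concept_id 0. Dates, Type Concepts, and the remaining fields of each table still have to be filled in by the ETL.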

If a Concept of the “Domain” or “Metadata” Domains is retrieved in the mapping process, an error has snuck into the mapping table. Please report those cases in the CDM-Builder.

The same is true if a Type Concept Domain, like “Obs Period Type”, “Death Type”, “Visit Type”, “Procedure Type”, etc. is produced. Though these Type Concepts are valid concepts and have to be placed into the <domain>_type_concept_id field of the respective CDM tables, they cannot be the result of the mapping process. They denote the origin of the record and the selection of Type Concepts should be hard-wired into the ETL process.
