Quality assurance and control - OHDSI/Vocabulary-v5.0 GitHub Wiki
This document summarizes the current quality assurance and control process for vocabulary.
Authors: Christian Reich, Anna Ostropolets, Alexander Davydov
Version: 1.2
Date last updated: 10/22/24
Vocabularies enter the overall Standardized Vocabulary system through a process called "refresh". Even the initial introduction of a vocabulary is treated as a refresh, except that it runs against an empty prior version. The refresh introduces new records or modifies existing records in the CONCEPT, CONCEPT_SYNONYM and CONCEPT_RELATIONSHIP tables. Other vocabulary tables such as VOCABULARY, RELATIONSHIP, CONCEPT_CLASS etc. are not handled through this process; they are maintained manually.
Refreshing the Vocabularies and producing a new version consists of three steps: staging, integration and release (more details can be found here). QC is performed throughout this process.
Currently, quality is managed through five distinct mechanisms:
- Quality issue tracking and resolution. If users in the OHDSI community find an issue with the vocabularies or have a question, they post it either in the Vocabulary category of the OHDSI Forum or in the GitHub Issue section of the Vocabulary repository. Both methods are equally welcome: the Forum has greater visibility, but GitHub is used as the mechanism for discussing and closing issues. Both resources are monitored, and the team responds to users with feedback and clarification. Team members take turns on monitoring duty.
- Quality control scripts. These scripts contain tests to detect deviations from the rules of how the content of the vocabulary tables is expected to behave. Currently, these are part of the intermediate check (check_stage_table) and final check (get_checks) scripts (see below). Intermediate checks are mostly conformance checks. Some of the drug input table checks are also logic (integrity) checks. Final checks are logic checks.
- Quality enforcing scripts. These scripts enforce rules and overwrite content if it deviates. These are realized in the generic_update script (see below).
- Manual content check. Before the release, the semantic content and integrity of the refresh are compared to the current version of the Vocabularies, and the delta is checked by the team against a number of criteria and conventions.
- Assessment of compliance with principles. We assess comprehensive coverage, predefined domains, unique standard concept and other Vocabularies' principles.
Intermediate checks are performed while the source vocabularies are being processed (normalized). They are conducted on stage tables (such as CONCEPT_STAGE and CONCEPT_RELATIONSHIP_STAGE). Additionally, drug vocabularies undergo tests at the input table stage (on tables such as DRUG_CONCEPT_STAGE and INTERNAL_RELATIONSHIP_STAGE).
Input table checks include the following scripts:
- input_QA_integratable_E.sql (errors, should be fixed; query must return nothing)
- input_QA_integratable_W.sql (warnings, should be inspected)
- drug_stage_tables_QA.sql (contains some useful checks, should be inspected)
These queries check the conformance and integrity of the input, flagging issues such as:
- Dosages without units.
- Ingredients without doses.
- Doses above 1 mg/mg or 1000 mg/mL.
- Solid forms (e.g. tablets and capsules) containing liquid doses.
- Ingredients used as brand names (Clopidogrel/Aspirin or Ibuprofen and Codeine).
- Undermapping of attributes leading to duplications.
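As an illustration only, a few of the dosage checks listed above can be sketched in Python. The `DoseRow` record and its field names are hypothetical and deliberately simplified; they do not reproduce the layout of the real input tables or the logic of the SQL scripts. Only the thresholds (1 mg/mg and 1000 mg/mL) come from the list above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DoseRow:
    """Hypothetical, simplified dosage record (not the real input table layout)."""
    drug_code: str
    ingredient: str
    value: Optional[float]                   # numeric dose, None if absent
    unit: Optional[str]                      # e.g. "mg", None if absent
    denominator_unit: Optional[str] = None   # e.g. "mg" or "mL" for concentrations


def dose_issues(row: DoseRow) -> List[str]:
    """Return the problems found in one dosage record."""
    if row.value is None:
        return ["ingredient without dose"]
    if not row.unit:
        return ["dosage without unit"]
    issues = []
    if row.unit == "mg" and row.denominator_unit == "mg" and row.value > 1:
        issues.append("dose above 1 mg/mg")
    if row.unit == "mg" and row.denominator_unit == "mL" and row.value > 1000:
        issues.append("dose above 1000 mg/mL")
    return issues
```

A record such as `DoseRow("D1", "ibuprofen", 1500.0, "mg", "mL")` would be reported as exceeding the 1000 mg/mL limit, while 500 mg/mL would pass.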
The check_stage_table script has 46 conformance checks, each of which must pass before moving to the next step:
Table CONCEPT_RELATIONSHIP_STAGE
- A vocabulary_id is used that is not defined by the "Set Latest Update" function.
- Valid_start_date is missing.
- Valid_end_date is missing.
- Invalid_reason logic doesn't match the actual valid_end_date (2 conditions are controlled: concept valid and invalidated).
- Valid_end_date < valid_start_date logical error.
- Valid_start_date is not converted to a day precision.
- Valid_end_date is not converted to a day precision.
- Wrong (other than "D") value for invalid_reason.
- Concept_code_1 is empty.
- Concept_code_2 is empty.
- The concept (concept_code_1 / vocabulary_id_1 pair) is missing from the basic/stage concept table.
- The concept (concept_code_2 / vocabulary_id_2 pair) is missing from the basic/stage concept table.
- Vocabulary_id_1 is missing from the reference table.
- Vocabulary_id_2 is missing from the reference table.
- Relationship_id is missing from the reference table.
- Valid_start_date is greater than the current date.
- Valid_start_date is before a lower limit (1900 Jan 1st).
- Duplicated rows of the same source and target concepts linked by the same relationship type (dates are ignored).
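A minimal Python sketch of some of these row-level checks follows. It is not the check_stage_table implementation: the `RelStageRow` class and the function names are hypothetical, and the rule that a valid row ends on 2099-12-31 while an invalidated row ends earlier is an assumption taken from the final checks described later in this document.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple

OPEN_END = date(2099, 12, 31)    # assumed end date of currently valid records
LOWER_LIMIT = date(1900, 1, 1)   # earliest accepted valid_start_date


@dataclass(frozen=True)
class RelStageRow:
    """Hypothetical in-memory stand-in for a CONCEPT_RELATIONSHIP_STAGE row."""
    concept_code_1: str
    vocabulary_id_1: str
    concept_code_2: str
    vocabulary_id_2: str
    relationship_id: str
    valid_start_date: date
    valid_end_date: date
    invalid_reason: Optional[str] = None


def row_issues(row: RelStageRow, today: date) -> List[str]:
    """Return the conformance problems found in a single row."""
    issues = []
    if row.invalid_reason not in (None, "D"):
        issues.append("wrong invalid_reason")
    # A valid row is open-ended; an invalidated row has a real end date.
    if (row.invalid_reason is None) != (row.valid_end_date == OPEN_END):
        issues.append("invalid_reason does not match valid_end_date")
    if row.valid_end_date < row.valid_start_date:
        issues.append("valid_end_date < valid_start_date")
    if row.valid_start_date > today:
        issues.append("valid_start_date in the future")
    if row.valid_start_date < LOWER_LIMIT:
        issues.append("valid_start_date before 1900-01-01")
    return issues


def duplicate_links(rows: List[RelStageRow]) -> List[Tuple[str, str, str, str, str]]:
    """Return source/target/relationship keys that occur more than once (dates ignored)."""
    counts = Counter(
        (r.concept_code_1, r.vocabulary_id_1, r.concept_code_2,
         r.vocabulary_id_2, r.relationship_id)
        for r in rows
    )
    return [key for key, n in counts.items() if n > 1]
```

Passing `today` explicitly keeps the sketch deterministic; a real check would compare against the current date.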
Table CONCEPT_STAGE
- A vocabulary_id is used that is not defined by the "Set Latest Update" function.
- Vocabulary_id is missing from the reference table.
- Valid_end_date < valid_start_date logical error that won't be addressed by the "Generic update" function by the latest_update variable rule.
- Wrong (other than "D" or "U") value for invalid_reason.
- Valid_start_date is not converted to a day precision.
- Valid_end_date is not converted to a day precision.
- Invalid_reason logic doesn't match the actual valid_end_date.
- Domain_id is missing from the reference table.
- Concept_class_id is missing from the reference table.
- Wrong (other than "S" or "C") value for standard_concept.
- Valid_start_date is missing.
- Valid_end_date is missing.
- Valid_start_date is before a lower limit (1900 Jan 1st).
- Concept_name is missing.
- Concept_code is missing.
- Concept_name is not trimmed for excessive characters.
- Concept_code is not trimmed for excessive characters.
- Duplicated rows of the same concept_code and vocabulary_id.
Table CONCEPT_SYNONYM_STAGE
- A vocabulary_id is used that is not defined by the "Set Latest Update" function.
- Vocabulary_id is missing from the reference table.
- Synonym_name is missing.
- Synonym_concept_code is missing.
- The concept (synonym_concept_code / synonym_vocabulary_id pair) is missing from the basic/stage concept table.
- Synonym_name is not trimmed for excessive characters.
- Synonym_concept_code is not trimmed for excessive characters.
- Language_concept_id is missing from the concept table.
Table PACK_CONTENT_STAGE
- Duplicated rows of the same concept_code, pack_vocabulary_id, drug_concept_code, drug_vocabulary_id and amount.
Table DRUG_STRENGTH_STAGE
- Duplicated rows of the same drug_concept_code, vocabulary_id_1, ingredient_concept_code, vocabulary_id_2 and amount_value.
Final checks are performed before the basic tables (CONCEPT, CONCEPT_RELATIONSHIP, etc.) are released, and they must pass for content to be transferred to the production schema and released.
The get_checks script contains conformance and logic (integrity) checks. It detects the following issues:
- Relationship cycle between two concepts (for "Maps to" it covers cycle between any number of concepts).
- Opposing relationships between same pair of concepts.
- Relationships without their reverse counterpart.
- "Maps to" to invalid concept.
- Replacement relationships to deprecated concept.
- Direct and reverse mappings are not the same.
- Valid_end_date = '12/31/2099' and invalid_reason is not null for relationships and concepts.
- Invalid_reason is null and valid_end_date!='12/31/2099' for relationships and concepts.
- Valid_start_date > current_date for relationships and more than 15 years from the date of source vocabulary release for concepts (to accommodate NDC and HCPCS).
- One concept has multiple replacements.
- Concept_code = 'OMOP generated' for OMOP-built concepts (should be OMOP+sequence).
- Duplicate 'OMOP generated' concepts.
- More than one standard concept with the same name in 'OMOP Extension' vocabulary.
One check is disabled (more than one standard concept with the same name in RxNorm and RxNorm Extension).
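Two of the logic checks above, the "Maps to" cycle and the missing reverse counterpart, can be sketched in Python as follows. This is an illustration under stated assumptions, not the get_checks code: relationships are represented as hypothetical `(concept_id_1, relationship_id, concept_id_2)` tuples, and the `REVERSE` map lists only two relationship pairs for brevity. A "Maps to" link from a concept to itself is not treated as a cycle here, since Standard concepts are expected to map to themselves.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, str, int]  # (concept_id_1, relationship_id, concept_id_2)

# Partial, illustrative map of relationships to their reverse counterparts.
REVERSE = {
    "Maps to": "Mapped from", "Mapped from": "Maps to",
    "Is a": "Subsumes", "Subsumes": "Is a",
}


def missing_reverse(edges: List[Edge]) -> List[Edge]:
    """Return edges whose reverse counterpart is absent."""
    present = set(edges)
    return [
        (a, rel, b) for a, rel, b in edges
        if rel in REVERSE and (b, REVERSE[rel], a) not in present
    ]


def maps_to_cycles(edges: List[Edge]) -> Set[int]:
    """Return concept ids lying on a 'Maps to' cycle of two or more concepts."""
    graph: Dict[int, List[int]] = {}
    for a, rel, b in edges:
        if rel == "Maps to" and a != b:   # ignore "Maps to" itself
            graph.setdefault(a, []).append(b)

    def reaches_itself(start: int) -> bool:
        seen: Set[int] = set()
        todo = list(graph.get(start, []))
        while todo:
            node = todo.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                todo.extend(graph.get(node, []))
        return False

    return {concept for concept in graph if reaches_itself(concept)}
```

The traversal restarts from every concept, which is simple but quadratic in the worst case; it is meant to show the idea, not to scale to the full CONCEPT_RELATIONSHIP table.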
The generic_update script, apart from integrating staged vocabularies into the existing base, enforces rules and conventions, potentially overwriting incorrect information. As content is moved, it carries out the following functions:
- Clean up the concept_id in the stage tables (concept_stage, concept_relationship_stage, concept_synonym_stage).
- Force concepts with valid_end_date of 2099-12-31 to invalid status.
- Force invalid concepts to non-Standard status.
- Copy concept_id from basic to stage concept table for already existing concepts.
- Clean up concept_name from double space, carriage return, newline, vertical tab, form feed, long dash, trailing escape.
- Clean up synonym_name from double space, carriage return, newline, vertical tab, form feed, long dash, trailing escape.
- Update (overwrite) concept details of existing concepts from the concept_stage table for concept_name, domain_id, concept_class_id, standard_concept, valid_end_date, invalid_reason fields, and the valid_start_date unless there is a latest_update variable placeholder.
- For full replacement vocabularies, deprecate (make non-Standard and set valid_end_date) concepts that are missing from the concept_stage table.
- Set a valid_end_date for concepts that are missing from the CONCEPT_STAGE table (only applied for deprecated but still standard, so-called "zombie", concepts).
- Transfer content from stage to basic tables in DevV5 (concepts, relationships and synonyms).
- Generate new concept_id for new records.
- Deprecate missing concepts and relationships.
- Deprecate relationship records to deprecated concepts.
- Create "Maps to" itself for Standard concepts.
- Create reverse relationship records unless they exist.
- Update valid_end_date: place today's date or the date from latest_update, or 2099-12-31 to active concepts.
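The name clean-up step in the list above can be illustrated with a short Python sketch. The source does not state what each unwanted character is replaced with, so the choices below (control characters become a space, long dashes become a hyphen, trailing backslashes are dropped) are assumptions, and `clean_name` is a hypothetical helper rather than part of the generic_update script.

```python
import re


def clean_name(name: str) -> str:
    """Normalize a concept or synonym name (illustrative rules only)."""
    # Carriage return, newline, vertical tab and form feed become a space.
    name = re.sub(r"[\r\n\v\f]", " ", name)
    # En and em dashes become a plain hyphen.
    name = re.sub("[\u2013\u2014]", "-", name)
    # Drop trailing backslashes (the "trailing escape").
    name = re.sub(r"\\+$", "", name.rstrip())
    # Collapse runs of spaces and trim both ends.
    return re.sub(r" {2,}", " ", name).strip()
```

Collapsing spaces last matters: replacing control characters first can itself create double spaces.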
Manual content checks require subject matter expertise and ensure semantic quality. There are 17 general checks and a number of vocabulary-specific checks (52 for the ICD10 family, 16 for RxNorm, 11 for SNOMED and 3 for LOINC). For efficiency, the corpus of concepts is sorted by the fields standard_concept, concept_class_id, vocabulary_id and domain_id, as these often represent logical groups and analytical use cases, and is checked for correct assignment to these attributes.
- Duplicate concepts.
This check retrieves the list of concepts with the same name as an existing one. This may indicate that the source vocabulary is wrongly processed or has duplicate content.
- Concepts changing domain.
In this step the domain_id of concepts is checked for correctness and compliance with current conventions and approaches. Domain changes may be justified by multiple reasons:
- Based on domain of the target concept,
- Source hierarchy change,
- Manual curation of the content by the vocabulary team,
- A change in the domain-assigning script or its unexpected behavior.
- Domain of newly added concepts.
These are checked for plausibility and compliance with conventions and approaches. Domain assignments may be based on:
- Domain of the target concept and script logic on top of that,
- Source hierarchy,
- Manual curation of the content by the vocabulary folks,
- Hardcoded.
- Concepts changing name.
Name changes are sorted by similarity between the old and new names to prioritize the more significant changes and, depending on the volume of content, to define a review threshold. Significant or structural changes in concept semantics are not allowed and may indicate code reuse by the source or a flaw in source name processing. Minor changes and alterations in precision are allowed.
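The similarity-based prioritization can be sketched with Python's standard `difflib` module. This is only one possible similarity measure; the document does not say which one the team uses, and `rank_name_changes` with its `threshold` parameter is a hypothetical helper.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Change = Tuple[str, str, str]  # (concept_code, old_name, new_name)


def rank_name_changes(changes: List[Change],
                      threshold: float = 1.0) -> List[Tuple[float, str, str, str]]:
    """Return (similarity, code, old, new) rows, least similar first.

    Only rows with similarity strictly below `threshold` are kept, so
    lowering the threshold shrinks the review list for large deltas.
    """
    scored = [
        (SequenceMatcher(None, old.lower(), new.lower()).ratio(), code, old, new)
        for code, old, new in changes
    ]
    scored.sort(key=lambda row: row[0])
    return [row for row in scored if row[0] < threshold]
```

With this ordering, a change such as "Fracture of femur" to "Malignant neoplasm of lung" surfaces before a minor rewording such as "Fracture of femur" to "Fracture of the femur".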
- Concepts changing synonyms.
As with the concept_name, significant changes in synonym_name semantics are not allowed and may indicate code reuse by the source. Structural changes or significant changes in the content volume (synonyms in an additional language or of an additional property) may indicate a flaw in synonym processing. Minor changes and alterations in precision are allowed. This check also makes sure concepts do not change their concept_code or concept_id while keeping the concept_synonym_name largely the same.
- Concepts changing concept_code and concept_id.
This check makes sure concepts do not change their concept_code or concept_id, while keeping the concept_name or concept_synonym_name largely the same.
- New concepts lacking mapping.
This check reviews new concepts that lack "Maps to" links to Standard concepts, which may represent multiple scenarios:
- Some concepts do not require "Maps to" links because they generally have no standard representation, such as drug brand names, drug forms, etc.,
- Some concepts are not yet represented in their standard counterparts, such as new drugs or vaccines where concepts from an authoritative source such as RxNorm or CVX are awaiting an update,
- Newly deprecated concepts that are OMOP-generated need no replacement,
- Ill-designed concepts from authoritative sources such as SNOMED cannot be explicitly mapped to any other standard target.
- New concepts and their mapping ("Maps to", "Maps to value").
This check focusses on the completeness of content and alignment of the use cases and mapping scenario:
- Mapping to self,
- Mapping within vocabulary or to concepts in other vocabularies.
- Concepts changing their mapping ("Maps to", "Maps to value").
This could include possible scenarios:
- The mapping has changed,
- The mapping was present before, but is missing now,
- A mapping is present in one version but absent in the other,
- Multiple "Maps to" or "Maps to value" links,
- Frequent target concept.
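The first few scenarios above amount to comparing the mapping targets of each concept between two versions. A minimal Python sketch, assuming mappings are held as a hypothetical dictionary from source concept code to the set of its "Maps to" target codes:

```python
from typing import Dict, Set

Mapping = Dict[str, Set[str]]  # source concept code -> "Maps to" target codes


def classify_mapping_delta(old: Mapping, new: Mapping) -> Dict[str, str]:
    """Label every source concept whose set of targets differs between versions."""
    result: Dict[str, str] = {}
    for code in sorted(old.keys() | new.keys()):
        before = old.get(code, set())
        after = new.get(code, set())
        if before == after:
            continue                      # unchanged, nothing to review
        if before and not after:
            result[code] = "mapping lost"
        elif after and not before:
            result[code] = "mapping added"
        else:
            result[code] = "mapping changed"
    return result
```

The labels only route concepts to the right reviewer question; whether a given change is justified remains a manual decision.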
- Concepts lacking hierarchical "Is a" relationships.
There could be multiple scenarios:
- Non-standard concepts,
- Standard concepts of a source vocabulary that does not provide hierarchical links and there is no manual hierarchy construction program, such as for devices or procedures,
- Concepts of concept classes that cannot be hierarchically linked, such as units, methods, scales,
- Top level concepts.
- Concepts changing their hierarchical "Is a" relationships.
This could include different scenarios:
- Ancestor(s) have changed,
- Ancestor(s) are present in one version, absent in another,
- Ancestor(s) were present before, but are missing now,
- Multiple "Is a" links,
- Too frequent ancestor concept.
- Concepts with 1-to-many "Maps to" mappings.
This could include multiple scenarios:
- Source complex concepts are split up and mapped over to multiple targets,
- Oxygen-containing devices are mapped to themselves and to the oxygen ingredient.
- Concepts became non-standard with no replacement mapping.
This could include multiple scenarios:
- Source vocabularies generally do not provide update concepts for deprecated ones,
- Concepts that are deprecated but still valid (so-called "zombie" concepts), where this is tolerated,
- Concepts that were previously wrongly designed by the source (e.g. SNOMED) are now deprecated and no useful replacement is available.
- Concepts are present in CONCEPT_RELATIONSHIP_MANUAL (CRM) with a "Maps to" link but end up with no valid "Maps to" in basic tables.
This check controls that concepts that are manually mapped within the CONCEPT_RELATIONSHIP_MANUAL table have Standard target concepts, and links are properly processed by the vocabulary machinery.
- Mapping of vaccines.
Because the mapping of vaccines is complex, a full manual review is needed. Vaccines have attributes other drugs do not have, such as causal agent, specific strains or year.
- Mapping of COVID-19 concepts.
This check retrieves the mapping of COVID-19 concepts to Standard targets.
Because these mappings are complex, and depending on the way they were produced, a full manual review may be needed.
- Concepts have replacement links but miss "Maps to" link.
This check verifies that every replacement link is accompanied by a "Maps to" link, since "Maps to" is the link used in ETL.
This is currently not resolved in SNOMED and some other places and requires additional attention. Review p.5 of the "What's New" chapter here.
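The replacement-link check can be sketched in Python as a set difference. The set of replacement relationship ids below is an assumption made for illustration and should be adjusted to the vocabulary at hand; relationships are again hypothetical `(concept_id_1, relationship_id, concept_id_2)` tuples.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[int, str, int]  # (concept_id_1, relationship_id, concept_id_2)

# Assumed replacement relationship ids (illustrative, not exhaustive).
REPLACEMENT_RELS = {
    "Concept replaced by",
    "Concept same_as to",
    "Concept alt_to to",
    "Concept was_a to",
}


def replaced_without_maps_to(edges: Iterable[Edge]) -> Set[int]:
    """Return concepts that have a replacement link but no 'Maps to' link."""
    edge_list = list(edges)
    has_maps_to = {a for a, rel, _ in edge_list if rel == "Maps to"}
    replaced = {a for a, rel, _ in edge_list if rel in REPLACEMENT_RELS}
    return replaced - has_maps_to
```

This sketch does not check that the "Maps to" target equals the replacement target, nor that the links are still valid; both would be natural extensions.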
Along with quality assurance and control, we assess compliance of the OHDSI Vocabularies with the set of predefined principles. We publish an associated report with each release. More details can be found here.