Mitigate Concept ID Misalignment in the OMOP CDM data ingestion and harmonization pipeline and Resolve Terminology Drift - National-Clinical-Cohort-Collaborative/Data-Ingestion-and-Harmonization GitHub Wiki
N3C Clinical and COVID Data Pipeline - OMOP CDM Data Ingestion and Terminology Translation
October 2023
The N3C Clinical Enclave and COVID data pipelines are implemented as a series of data transformation steps that extract the submitted data, transform it, and load it into an OMOP Common Data Model (CDM) instance. Terminology translation of the data partners’ datasets uses a unified version of the OMOP vocabulary tables in the Enclave.
The data partners submit a zip file to the pipeline in one of the predefined CDM formats used by the CTSA consortium (i.e., PCORnet, ACT, TriNetX, PEDSnet, and OMOP CDM). Currently there are five distinct ingest pipeline implementations, one for each CDM, as shown in Figure 1.
Figure 1.
Non-OMOP CDM Data Ingestion Harmonization pipeline
When data is submitted in a non-OMOP CDM format, it undergoes multiple ingestion steps, including a terminology translation step that maps source-coded data to OMOP vocabulary concepts. This process ensures that research data from disparate healthcare systems is standardized using a unified version of the OMOP vocabulary tables.
By applying a single OMOP vocabulary version during translation, the characteristics of clinical codes in the source data are consistently and uniformly mapped, minimizing variability and ensuring alignment with OMOP standards. As a result, the “meaning” behind the coded clinical data is preserved, with each concept interpreted according to the standardized terminology assignment in the OMOP vocabulary. This single-version translation approach is applied uniformly across all submitted datasets, ensuring consistent mapping of OMOP concept IDs across sites.
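As a rough illustration of the translation step described above, the following Python sketch looks up a source code in a toy vocabulary and follows its "Maps to" relationship to a standard concept. The `Concept` fields mirror columns of the OMOP `concept` table, but the rows, the concept IDs, and the function name `translate_source_code` are made up for this example and are not taken from the N3C pipeline code. Returning concept ID 0 for unmapped codes follows the usual OMOP convention and is an assumption about how the pipeline behaves.

```python
from typing import NamedTuple, Optional


class Concept(NamedTuple):
    concept_id: int
    vocabulary_id: str
    concept_code: str
    standard_concept: Optional[str]  # 'S' = standard, None = non-standard


# Toy vocabulary rows; the IDs and codes are illustrative, not real OMOP content.
CONCEPTS = [
    Concept(1001, "ICD10CM", "X00.0", None),
    Concept(2001, "SNOMED", "111", "S"),
]
# (concept_id_1, relationship_id, concept_id_2), as in concept_relationship.
RELATIONSHIPS = [(1001, "Maps to", 2001)]


def translate_source_code(vocabulary_id, code,
                          concepts=CONCEPTS, relationships=RELATIONSHIPS):
    """Return the standard concept IDs for a source code, or [0] if none map."""
    source_ids = {
        c.concept_id for c in concepts
        if c.vocabulary_id == vocabulary_id and c.concept_code == code
    }
    standard_ids = {c.concept_id for c in concepts if c.standard_concept == "S"}
    targets = sorted(
        target for source, rel, target in relationships
        if rel == "Maps to" and source in source_ids and target in standard_ids
    )
    return targets or [0]  # 0 = "no matching concept" (assumed convention)


print(translate_source_code("ICD10CM", "X00.0"))
print(translate_source_code("ICD10CM", "UNKNOWN"))
```

Because every site's codes are resolved against the same `CONCEPTS` and `RELATIONSHIPS` tables, the same source code always yields the same concept ID, which is the property the single-version approach relies on.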
OMOP-CDM Data Ingestion Harmonization pipeline
When site data is submitted in the OMOP CDM format, the terminology translation step in the ingestion pipeline is skipped. This is because the data has already undergone translation at the data partner site, where the records are already coded with OMOP concept IDs in the domain concept ID columns (_domain_concept_id, for example condition_concept_id).
However, semantic drift occurs when sites use different versions of the OMOP vocabulary tables to code their data before submission. Below is a list of the OMOP vocabulary versions used by the sites:
Figure 2.
Handling Terminology Drift in OMOP CDM Data Submission
This discrepancy leads to data quality issues in the N3C Enclave, where the latest unified version of the OMOP vocabulary tables is used. Older vocabulary versions may contain deprecated, deleted, updated, or newly added concepts compared to the latest released version. As a result, OMOP CDM data processed with outdated vocabulary versions may exhibit terminology drift, affecting concept sets used in phenotype definitions.
The impact of this terminology drift increases as the version gap widens, because more concepts may become non-standard with each vocabulary update. This misalignment poses challenges for downstream analyses that rely on consistent, standardized terminologies. Over time, the semantic drift grows as an increasing number of standard concepts become non-standard.
Figure 3 - data_partner_ids are hidden on the y-axis on purpose
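One simple way to quantify this kind of drift for a single submission is the share of records whose concept ID is no longer standard in the Enclave's vocabulary version. The sketch below assumes that definition. The function name `non_standard_share` and the sample IDs are hypothetical, and this is not necessarily the metric plotted in the figure.

```python
def non_standard_share(record_concept_ids, enclave_standard_ids):
    """Fraction of records whose concept ID is not standard in the Enclave vocabulary."""
    if not record_concept_ids:
        return 0.0
    non_standard = sum(
        1 for concept_id in record_concept_ids
        if concept_id not in enclave_standard_ids
    )
    return non_standard / len(record_concept_ids)


# Hypothetical data: the site coded four records with an older vocabulary,
# and concept 4 has since become non-standard in the Enclave's version.
site_records = [1, 2, 3, 4]
enclave_standard = {1, 2, 3}
print(non_standard_share(site_records, enclave_standard))
```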
To address terminology drift, we have updated the OMOP pipeline to re-align terminology translation with the unified version of the OMOP vocabulary tables used in the Enclave. This ensures consistency with the vocabulary version applied to non-OMOP CDM data.
Figure 4.
The overall process is fairly simple: collect the non-standard concepts found in every domain, re-map them to standard concepts through the “Maps to” relationship, and re-insert the data under the standard concept. Concepts that are still non-standard after re-mapping are inserted back into their respective domains so that no data submitted by the site is lost. The steps are as follows:
- Identify Non-Standard Concepts – Collect all non-standard concepts across all domains.
- Re-map to Standard Concepts – Use the "Maps to" relationship in the OMOP vocabulary to find the corresponding standard concept.
- Re-insert Data with Standard Concepts – Replace non-standard concepts with their mapped standard equivalents.
- Preserve Unmapped Concepts – If any non-standard concepts remain unmapped, retain them within their respective domains to ensure no submitted data is lost.
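The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation: rows are plain dictionaries, `maps_to` is a hypothetical lookup built from the "Maps to" rows of the Enclave vocabulary, and the `original_concept_id` field is invented here to show that the submitted value can be kept alongside the re-mapped one. Moving rows between domains when the target concept belongs to a different domain is left out.

```python
def realign_rows(rows, standard_ids, maps_to):
    """Re-map non-standard concept IDs to standard ones, keeping unmapped rows.

    rows         -- list of dicts, each with a 'concept_id' key
    standard_ids -- set of concept IDs that are standard in the Enclave vocabulary
    maps_to      -- dict: non-standard concept ID -> list of 'Maps to' target IDs
    """
    realigned = []
    for row in rows:
        concept_id = row["concept_id"]
        if concept_id in standard_ids:
            targets = [concept_id]  # already standard, nothing to do
        else:
            # Steps 1-2: identify the non-standard concept and look up its targets.
            mapped = [t for t in maps_to.get(concept_id, []) if t in standard_ids]
            # Step 4: with no standard target, keep the submitted concept as is.
            targets = mapped or [concept_id]
        # Step 3: re-insert one row per target, recording the submitted ID.
        for target in targets:
            realigned.append(
                {**row, "concept_id": target, "original_concept_id": concept_id}
            )
    return realigned


# Hypothetical input: concept 20 maps to 21; concept 30 has no mapping.
rows = [
    {"row_id": 1, "concept_id": 10},
    {"row_id": 2, "concept_id": 20},
    {"row_id": 3, "concept_id": 30},
]
result = realign_rows(rows, standard_ids={10, 21}, maps_to={20: [21]})
for r in result:
    print(r)
```

A concept with several "Maps to" targets produces one output row per target in this sketch. Whether the real pipeline duplicates rows in that case is not stated in this page, so treat that behavior as an assumption.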
Figure 5 - Improved OMOP CDM ingestion and harmonization pipeline overview.
The OMOP pipeline has been updated to collect all non-standard concepts and re-translate them using the "Maps to" relationship in the OMOP vocabulary. Any concepts that cannot be mapped through this process are retained in their original source domain to ensure data completeness.
This approach preserves data integrity while aligning terminology with the unified version of the OMOP vocabulary tables in the Enclave, ensuring consistency across datasets.