Common Data Environment (CDE) for OMOP Vocabularies mapping, harmonization and reuse - OHDSI/Vocabulary-v5.0 GitHub Wiki

Observational health data often contain similar information from various sources, such as controlled medical terminologies or free-text sources, leading to mapping discrepancies within OMOP vocabularies and across different datasets (Fig. 1).

Fig. 1 – Example of mapping discrepancies between different ICD vocabularies

Curation of mappings can be a time-consuming task that requires expertise in both medical and terminology domains. By optimizing mapping strategies, efficiency and accuracy can be greatly improved.

The OHDSI Vocabulary team has been working on developing automated processes for updating mapping targets and creating a common data environment (CDE) for custom data sets. This project aims to further enhance our efforts by creating a universal CDE that can accommodate all types of source data.

The main focus is to create a framework for organizing various source data, storing mapping candidates from different sources, and facilitating decisions on the best mappings to use. To address these challenges, we created the universal CDE and applied it to the refresh of the ICD family vocabularies during the ICD overhaul (February 2024 Vocabulary release [2]) and in the August 2024 release.

The CDE is a universal way of organizing data for vocabularies, groups of vocabularies (like ICD), and ETL data, irrespective of their origin. It is used to create more precise and effective mappings by reusing existing mappings for close semantic matches and by detecting and resolving mapping discrepancies (Fig. 2).

Fig. 2 – The common data environment work- and data flow

The idea behind the CDE is to create groups of concepts that share semantic entities (clinical facts). Some of them are used to produce mappings, while others are used to support mapping control:

  • Strict group – concepts with the same or insignificantly different (according to OMOP use cases) meaning. Each strict group has a name that reflects its meaning as precisely as possible; this group name is used for mapping.
  • Medium group – concepts with close but not identical meanings. Medium groups are used to align mappings for the related concepts.
  • Broad group – concepts grouped by other criteria (e.g. hierarchical ICD categories or other classifications).

The CDE contains concept codes and concept names, together with mappings, their sources, and metadata. Mapping candidates originate from automated and manual mapping tables, automatic mapping replacements, external mapping sources (such as SNOMED to ICD10CM equivalence tables), and community contributions. The metadata format in the CDE is SSSOM-compatible [3, 4] and includes predicate_id (‘Maps to equivalent’, ‘Maps to uphill’, ‘Maps to downhill’, etc.), mapping_tool, and mapping_justification. The CDE structure is shown in Table 1.

Table 1. CDE structure

| Field name | Content |
|---|---|
| concept_code | source_code or its equivalent |
| concept_name | source_code_description or its equivalent |
| hierarchy | ancestors of the source concept, NULLable |
| vocabulary_id | source_vocabulary_id or its equivalent |
| concept_class_id | concept_class_id, NULLable |
| domain_id | domain_id, NULLable |
| counts | used only for custom data sets, NULLable |
| group_id | strict_group_id assigned during semantic grouping of concepts |
| group_name | strict_group_name assigned during semantic grouping, which is used for mappings |
| group_code | array of vocabulary_id:source_code pairs |
| medium_group_id | medium_group_id assigned during concept grouping |
| broad_group_id | broad_group_id assigned during concept grouping |
| relationship_id | relationship_id (‘Maps to’, ‘Maps to value’) |
| relationship_predicate_id | relationship_id predicate (‘equivalent’, ‘uphill’, ‘downhill’) |
| confidence | how confident the reviewer is in the mapping (0.9, 1.0, etc.) |
| decision | ‘1’ if the mapping candidate was chosen as suitable, NULL if it was rejected |
| decision_date | date of the decision about mapping candidate suitability |
| target_concept_id | concept_id of the Standard target concept |
| target_concept_code | concept_code of the Standard target concept |
| target_concept_name | concept_name of the Standard target concept |
| target_concept_class_id | concept_class_id of the Standard target concept |
| target_standard_concept | standard_concept field of the Standard target concept |
| target_invalid_reason | invalid_reason field of the Standard target concept |
| target_domain_id | domain_id of the Standard target concept |
| target_vocabulary_id | vocabulary_id of the Standard target concept |
| mappings_origin | information about the mapping sources |
| mapping_valid_start_date | relationship valid_start_date |
| mapping_valid_end_date | relationship valid_end_date |
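
To make the row layout easier to picture, here is a minimal Python sketch of one CDE row, restricted to a subset of the Table 1 fields. The class name `CdeRow`, the helper `is_accepted`, the Python types, and all example values are assumptions made for illustration; the actual CDE is a database table and may be typed differently.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CdeRow:
    """One mapping candidate for one source concept (subset of Table 1 fields)."""

    concept_code: str
    concept_name: str
    vocabulary_id: str
    group_id: int                                    # strict group id
    group_name: str                                  # strict group name, used for mapping
    medium_group_id: Optional[int] = None
    broad_group_id: Optional[int] = None
    relationship_id: str = "Maps to"                 # or 'Maps to value'
    relationship_predicate_id: Optional[str] = None  # 'equivalent', 'uphill', 'downhill'
    confidence: Optional[float] = None
    decision: Optional[int] = None                   # 1 = chosen, None = rejected (Table 1)
    decision_date: Optional[date] = None
    target_concept_id: Optional[int] = None
    mappings_origin: Optional[str] = None

    def is_accepted(self) -> bool:
        """Hypothetical helper: True when the candidate was chosen as suitable."""
        return self.decision == 1


# Made-up values, not real codes or concept ids.
row = CdeRow(
    concept_code="X1",
    concept_name="Example disease",
    vocabulary_id="ICD10",
    group_id=1,
    group_name="Example disease",
    relationship_predicate_id="equivalent",
    confidence=1.0,
    decision=1,
    target_concept_id=111,
    mappings_origin="manual",
)
print(row.is_accepted())
```

Keeping the reviewer's decision on the candidate row itself means rejected candidates stay available for later review instead of being deleted.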

We are currently implementing the developed CDE for the mapping refresh and harmonization across ICD family vocabularies.

  1. We gathered the ICD10, ICD10CM, and ICD9CM data from the automated and manual mapping tables. For local versions such as ICD10GM, ICD10CN, KCD7, and CIM10, only data from manual tables was inserted, because most mappings for local versions are reused directly from the ICD10 vocabulary. Unique pairs of source_code and target_concept_id were inserted, with the mapping origin preserved in the ‘mappings_origin’ field.
  2. For the mapping refresh, we additionally included potential replacement mappings for source codes mapped to non-standard or invalid concepts. These mappings were proposed based on valid ‘Concept poss_eq to’ relationships or other suitable links.
  3. The concepts were then grouped into strict groups based on criteria such as identical source_code_description or identical mappings for unprocessed source codes. The script both groups the concepts and assigns a group name.
  4. Once the concepts were grouped, we transferred the group_name, group_code, and mappings to table editor software for mapping curation and decision-making (Table 2). Medical terminologists could regroup the concepts if needed during manual curation.
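
Steps 1 and 3 above can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's actual script: the function names, the dictionary-based rows, keying sources by vocabulary as well as code, and the case- and whitespace-insensitive comparison of descriptions are all hypothetical choices, and only the "identical description" criterion is covered. The example codes and concept ids are made up.

```python
from collections import defaultdict


def collect_candidates(rows):
    """Step 1: keep unique source/target pairs and preserve every origin."""
    origins = defaultdict(set)
    names = {}
    for r in rows:
        source = (r["vocabulary_id"], r["concept_code"])
        origins[source + (r["target_concept_id"],)].add(r["mappings_origin"])
        names[source] = r["concept_name"]
    return [
        {
            "vocabulary_id": vocab,
            "concept_code": code,
            "concept_name": names[(vocab, code)],
            "target_concept_id": target,
            "mappings_origin": "; ".join(sorted(found_in)),
        }
        for (vocab, code, target), found_in in origins.items()
    ]


def assign_strict_groups(candidates):
    """Step 3: group concepts whose descriptions are identical after normalization."""
    group_ids = {}               # normalized description -> group_id
    group_names = {}             # group_id -> first description seen for the group
    members = defaultdict(set)   # group_id -> {"vocabulary_id:source_code", ...}
    for c in candidates:
        key = " ".join(c["concept_name"].lower().split())  # assumed normalization
        gid = group_ids.setdefault(key, len(group_ids) + 1)
        group_names.setdefault(gid, c["concept_name"])
        members[gid].add(f'{c["vocabulary_id"]}:{c["concept_code"]}')
        c["group_id"] = gid
    for c in candidates:
        c["group_name"] = group_names[c["group_id"]]
        c["group_code"] = sorted(members[c["group_id"]])
    return candidates


# Made-up rows: the first two repeat the same mapping from different origins.
rows = [
    {"vocabulary_id": "ICD10", "concept_code": "X1", "concept_name": "Example disease",
     "target_concept_id": 111, "mappings_origin": "manual"},
    {"vocabulary_id": "ICD10", "concept_code": "X1", "concept_name": "Example disease",
     "target_concept_id": 111, "mappings_origin": "automated"},
    {"vocabulary_id": "ICD10CM", "concept_code": "X1", "concept_name": "Example  Disease",
     "target_concept_id": 222, "mappings_origin": "manual"},
    {"vocabulary_id": "ICD9CM", "concept_code": "Z9", "concept_name": "Other example",
     "target_concept_id": 333, "mappings_origin": "manual"},
]

candidates = assign_strict_groups(collect_candidates(rows))
for c in candidates:
    print(c["group_id"], c["group_name"], c["group_code"], c["mappings_origin"])
```

In this toy data the ICD10 and ICD10CM rows land in the same strict group but carry different targets, which is the kind of discrepancy the curation step is meant to surface.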

Table 2. Example of the CDE at the manual work step

After the initial grouping and mapping curation, all members of a group were assigned the same target_concept_id, which was then implemented through each vocabulary's load_stage (Fig. 3). During the ICD overhaul, this process created 116,705 semantic groups across the ICD vocabularies and resolved 3,032 mapping conflicts. A total of 79 groups contained mappings incorporated from community contributions, which harmonized the mappings of 350 concepts.
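
The conflict detection and harmonization described above might look like the following Python sketch. It rests on several assumptions: the function names and dictionary-based rows are hypothetical, the curator's choices are passed in as a plain `decisions` mapping, and each group is simplified to a single target even though a source concept can legitimately have more than one mapping. All codes and concept ids are made up.

```python
from collections import defaultdict


def find_conflicts(candidates):
    """Return {group_id: targets} for groups with more than one distinct target."""
    targets = defaultdict(set)
    for c in candidates:
        targets[c["group_id"]].add(c["target_concept_id"])
    return {gid: found for gid, found in targets.items() if len(found) > 1}


def harmonize(candidates, decisions):
    """Assign one target per group and return it for every group member.

    `decisions` maps group_id to the target chosen during manual curation.
    Groups with a single candidate target need no decision.
    """
    targets = defaultdict(set)
    for c in candidates:
        targets[c["group_id"]].add(c["target_concept_id"])

    resolved = {}
    for gid, found in targets.items():
        if len(found) == 1:
            resolved[gid] = next(iter(found))
        elif gid in decisions:
            resolved[gid] = decisions[gid]
        else:
            raise ValueError(f"group {gid} has an unresolved conflict: {sorted(found)}")

    return {
        (c["vocabulary_id"], c["concept_code"]): resolved[c["group_id"]]
        for c in candidates
    }


# Made-up candidates: group 1 has two competing targets, group 2 has one.
candidates = [
    {"vocabulary_id": "ICD10", "concept_code": "X1", "group_id": 1, "target_concept_id": 111},
    {"vocabulary_id": "ICD10CM", "concept_code": "X1", "group_id": 1, "target_concept_id": 222},
    {"vocabulary_id": "ICD9CM", "concept_code": "Z9", "group_id": 2, "target_concept_id": 333},
]

conflicts = find_conflicts(candidates)
harmonized = harmonize(candidates, decisions={1: 111})
print(conflicts)
print(harmonized)
```

Raising an error for undecided conflicts, rather than picking a target silently, keeps the choice with the medical terminologist.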

Fig. 3 – Representation of the common data environment implementation result

Applying the Common Data Environment approach to source data processing can greatly improve the efficiency of repetitive mapping tasks. The approach works for both vocabularies and custom data. It improves the accuracy of data processing, which in turn supports higher-quality OMOP CDM instances and more consistent, reliable research outcomes.

References

  1. I. Zherko et al. Common data environment for source vocabularies mapping. OHDSI European Symposium 2022.
  2. https://github.com/OHDSI/Vocabulary-v5.0/releases/tag/v20240229_1709217174.000000
  3. O. Zhuk et al. Contribution to the OHDSI Vocabularies, User-Level QC and a New Entity Mapping System SSSOM. OHDSI European Symposium 2023.
  4. N. Matentzoglu et al. A Simple Standard for Sharing Ontological Mappings (SSSOM). Database, Volume 2022, 2022, baac035, https://doi.org/10.1093/database/baac035