Common Data Environment (CDE) for OMOP Vocabularies mapping, harmonization and reuse - OHDSI/Vocabulary-v5.0 GitHub Wiki
Observational health data often contain similar information from various sources, such as controlled medical terminologies or free-text sources, leading to mapping discrepancies within OMOP vocabularies and across different datasets (Fig.1).
Fig. 1 – Example of mapping discrepancies between different ICD vocabularies
Curation of mappings can be a time-consuming task that requires expertise in both medical and terminology domains. By optimizing mapping strategies, efficiency and accuracy can be greatly improved.
The OHDSI Vocabulary team has been working on developing automated processes for updating mapping targets and creating a common data environment (CDE) for custom data sets. This project aims to further enhance our efforts by creating a universal CDE that can accommodate all types of source data.
The main focus is to create a framework for organizing various source data, storing mapping candidates from different sources, and facilitating decision-making on the best mappings to use. To address these challenges, we have created the universal CDE and applied it to the refresh of the ICD family vocabularies during the ICD overhaul (February 2024 Vocabulary release [2]) and the August 2024 release.
The CDE is a universal way of organizing data for vocabularies, groups of vocabularies (like ICD), and ETL data, irrespective of their origin. The CDE is used to create more precise and effective mappings by reusing existing mappings for close semantic matches and by detecting and resolving mapping discrepancies (Fig. 2).
Fig. 2 – The common data environment work- and data flow
The idea behind the CDE is to create groups of concepts that share semantic entities (clinical facts). Some of them are used to produce mappings, while others are used to support mapping control:
- Strict group – concepts with the same or insignificantly different (according to OMOP use cases) meaning. This kind of group has a name that reflects the meaning of the group most precisely. This group name is used for mapping.
- Medium group – concepts with close but not identical meanings. Medium groups are used to align mappings for the related concepts.
- Broad group – concepts grouped by other criteria (e.g. hierarchical ICD categories or other classifications).
The CDE contains concept codes and concept names, together with mappings, their sources, and metadata. Mapping candidates originate from automated and manual mapping tables, automatic mapping replacements, external mapping sources such as SNOMED to ICD10CM equivalence tables, and community contributions. The metadata format in the CDE is SSSOM-compatible [3, 4] and includes predicate_id (‘Maps to equivalent’, ‘Maps to uphill’, ‘Maps to downhill’, etc.), mapping_tool, and mapping_justification. The CDE structure is shown in Table 1.
Table 1. CDE structure
| Field name | Content |
| --- | --- |
| concept_code | source_code or its equivalent |
| concept_name | source_code_description or its equivalent |
| hierarchy | ancestors of the source concept, NULLable |
| vocabulary_id | source_vocabulary_id or its equivalent |
| concept_class_id | concept_class_id, NULLable |
| domain_id | domain_id, NULLable |
| counts | used only for custom data sets, NULLable |
| group_id | strict_group_id assigned during semantic grouping of concepts |
| group_name | strict_group_name assigned during semantic grouping, which is used for mappings |
| group_code | array of vocabulary_id:source_code pairs |
| medium_group_id | medium_group_id assigned during concept grouping |
| broad_group_id | broad_group_id assigned during concept grouping |
| relationship_id | relationship_id (‘Maps to’, ‘Maps to value’) |
| relationship_predicate_id | relationship_id predicate (‘equivalent’, ‘uphill’, ‘downhill’) |
| confidence | how confident the reviewer is in the mapping (0.9, 1.0, etc.) |
| decision | ‘1’ if the mapping candidate was chosen as suitable, NULL if the mapping candidate was rejected |
| decision_date | date of the decision about mapping candidate suitability |
| target_concept_id | concept_id of the Standard target concept |
| target_concept_code | concept_code of the Standard target concept |
| target_concept_name | concept_name of the Standard target concept |
| target_concept_class_id | concept_class_id of the Standard target concept |
| target_standard_concept | standard_concept field of the Standard target concept |
| target_invalid_reason | invalid_reason field of the Standard target concept |
| target_domain_id | domain_id of the Standard target concept |
| target_vocabulary_id | vocabulary_id of the Standard target concept |
| mappings_origin | information about the mapping sources |
| mapping_valid_start_date | relationship valid_start_date |
| mapping_valid_end_date | relationship valid_end_date |
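As a rough illustration of Table 1, one CDE row (one mapping candidate for one source concept) could be modeled as the record below. This is only a sketch: the class name `CdeRow`, the method `is_accepted`, the Python types, and the subset of fields shown are assumptions made for illustration. Only the field names and the meaning of `decision` come from the table.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class CdeRow:
    """One mapping candidate for one source concept (subset of Table 1)."""

    # Source concept
    concept_code: str
    concept_name: str
    vocabulary_id: str
    # Strict group assigned during semantic grouping
    group_id: int
    group_name: str
    # Entries look like "vocabulary_id:source_code"
    group_code: List[str] = field(default_factory=list)
    # Left optional here for brevity (an assumption of this sketch)
    medium_group_id: Optional[int] = None
    broad_group_id: Optional[int] = None
    # Mapping candidate and its metadata
    relationship_id: Optional[str] = None            # 'Maps to', 'Maps to value'
    relationship_predicate_id: Optional[str] = None  # 'equivalent', 'uphill', 'downhill'
    target_concept_id: Optional[int] = None
    mappings_origin: Optional[str] = None
    confidence: Optional[float] = None
    # Curation outcome: '1' if chosen, None (NULL) if rejected
    decision: Optional[str] = None
    decision_date: Optional[date] = None

    def is_accepted(self) -> bool:
        """True when a reviewer marked this candidate as suitable."""
        return self.decision == "1"
```

Keeping the decision and its date on the same row as the candidate mirrors the table: rejected candidates stay in the CDE with a NULL decision rather than being deleted.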
We are currently implementing the developed CDE for the mapping refresh and harmonization across ICD family vocabularies.
- We gathered the ICD10, ICD10CM, and ICD9CM data from the automated and manual mapping tables. For local versions such as ICD10GM, ICD10CN, KCD7, and CIM10, only data from manual tables was inserted, as the majority of mappings for local versions are directly reused from ICD10 vocabulary. Unique pairs of source_code and target_concept_id were inserted, with the mapping origin preserved in the ‘mappings_origin’ field.
- For the mapping refresh, we additionally included potential replacement mappings for source codes mapped to non-standard or invalid concepts. These mappings were proposed based on valid ‘Concept poss_eq to’ relationships or other suitable links.
- The concepts were then grouped into strict groups based on criteria such as identical source_code_description or identical mappings for unprocessed source codes. The script both groups the concepts and assigns a group name.
- Once the concepts were grouped, we transferred the group_name, group_code, and mappings to table editor software for mapping curation and decision-making (Table 2). Medical terminologists can regroup the concepts if needed during manual curation.
Table 2. Example of the CDE at the manual work step
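The candidate collection and strict grouping steps described above can be sketched roughly as follows. This is not the project's script: the function names, the tuple layout, the lowercase and whitespace normalization of descriptions, and the choice of the first member's description as group name are all assumptions. The target ids `111` and `222` are placeholders, not real concept_ids.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (vocabulary_id, source_code, source_code_description, target_concept_id, origin)
Candidate = Tuple[str, str, str, int, str]


def merge_candidates(rows: List[Candidate]) -> Dict[Tuple[str, str, int], Set[str]]:
    """Keep one entry per (vocabulary, source_code, target) and collect its origins."""
    merged: Dict[Tuple[str, str, int], Set[str]] = defaultdict(set)
    for vocabulary_id, code, _name, target_id, origin in rows:
        merged[(vocabulary_id, code, target_id)].add(origin)
    return dict(merged)


def strict_groups(rows: List[Candidate]) -> List[dict]:
    """Group source concepts whose descriptions match after simple normalization."""
    by_name: Dict[str, Dict[str, str]] = defaultdict(dict)
    for vocabulary_id, code, name, _target_id, _origin in rows:
        key = " ".join(name.lower().split())  # ignore case and extra whitespace
        by_name[key].setdefault(f"{vocabulary_id}:{code}", name)

    groups = []
    for group_id, (_key, members) in enumerate(sorted(by_name.items()), start=1):
        codes = sorted(members)
        groups.append({
            "group_id": group_id,
            "group_name": members[codes[0]],  # first member's description
            "group_code": codes,              # "vocabulary_id:source_code" entries
        })
    return groups


rows = [
    ("ICD10", "A00", "Cholera", 111, "manual"),
    ("ICD10CM", "A00", "Cholera", 111, "automated"),
    ("ICD10CM", "A00", "Cholera", 111, "manual"),
    ("ICD9CM", "001", "Cholera", 111, "manual"),
    ("ICD10", "A01", "Typhoid and paratyphoid fevers", 222, "manual"),
]
merged = merge_candidates(rows)
groups = strict_groups(rows)
```

In this toy input the three "Cholera" codes end up in one strict group, and the duplicated ICD10CM candidate is kept once with both of its origins recorded.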
After the initial grouping and mapping curation, all members of a group were assigned the same target_concept_id, which was then implemented through the load_stage of each vocabulary (Fig. 3). During the ICD overhaul, this process resulted in the creation of 116,705 semantic groups across ICD vocabularies and the resolution of 3,032 mapping conflicts. 79 groups contained mappings incorporated from community contributions, resulting in the harmonization of mappings for 350 concepts.
Fig. 3 – Representation of the common data environment implementation result
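Conflict detection and the propagation of curated mappings to every member of a group can be illustrated with a small sketch. The function names `find_conflicts` and `harmonize`, the tuple layout, and the definition of a conflict as "more than one distinct target among a group's candidates" are assumptions made for this example. The target ids are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple

# (group_id, "vocabulary_id:source_code", target_concept_id, decision)
Row = Tuple[int, str, int, Optional[str]]


def find_conflicts(rows: Iterable[Row]) -> Dict[int, Set[int]]:
    """Return groups whose candidates point at more than one target."""
    targets: Dict[int, Set[int]] = defaultdict(set)
    for group_id, _member, target_id, _decision in rows:
        targets[group_id].add(target_id)
    return {g: t for g, t in targets.items() if len(t) > 1}


def harmonize(rows: Iterable[Row]) -> Dict[str, List[int]]:
    """Give every member of a group the targets accepted for that group."""
    accepted: Dict[int, Set[int]] = defaultdict(set)
    members: Dict[int, Set[str]] = defaultdict(set)
    for group_id, member, target_id, decision in rows:
        members[group_id].add(member)
        if decision == "1":  # '1' marks a candidate chosen by the reviewer
            accepted[group_id].add(target_id)
    return {
        member: sorted(accepted[group_id])
        for group_id, group_members in members.items()
        for member in group_members
        if accepted[group_id]  # skip groups with no accepted candidate yet
    }


cde_rows = [
    (1, "ICD10:A00", 111, "1"),
    (1, "ICD10CM:A00", 111, "1"),
    (1, "ICD9CM:001", 999, None),  # rejected candidate with a different target
]
conflicts = find_conflicts(cde_rows)
final_mappings = harmonize(cde_rows)
```

Here the ICD9CM code initially disagrees with the rest of its group. After curation it inherits the accepted target, so all three codes map to the same concept.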
Application of the Common Data Environment approach in source data processing can greatly improve efficiency in repetitive processes of data mapping. This approach can be efficiently implemented for both vocabularies and custom data, enhancing the accuracy of data processing, and ultimately leading to the creation of higher-quality OMOP CDM instances and offering greater consistency and reliability in research outcomes.
References
1. I. Zherko et al. Common data environment for source vocabularies mapping. OHDSI European Symposium 2022.
2. https://github.com/OHDSI/Vocabulary-v5.0/releases/tag/v20240229_1709217174.000000
3. O. Zhuk et al. Contribution to the OHDSI Vocabularies, User-Level QC and a New Entity Mapping System SSSOM. OHDSI European Symposium 2023.
4. N. Matentzoglu et al. A Simple Standard for Sharing Ontological Mappings (SSSOM). Database, Volume 2022, 2022, baac035. https://doi.org/10.1093/database/baac035