Introduction - OHDSI/ETL--PulmonaryHypertensionRegistries GitHub Wiki

Ideally, an ETL translates the source data into the target data model as accurately as possible, without any data loss or alteration. In practice this is not fully achievable, because some data cleaning always takes place.

Generally, we recommend that the ETL drop only records that a researcher would have dropped anyway (according to the SAP, the protocol, etc., or because of CDM restrictions, e.g., the absence of a patient identifier) and keep all others. We do not correct errors in lab results (this is delegated to the researcher), except for blunders and incomplete results, which are replaced with NULLs. A NULL preserves the fact that a measurement was taken even though its result is not available. To sum up, the data cleaning principle is to stay transparent for the researcher and to avoid significant alteration of the source data.
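
The cleaning principle above can be illustrated with a minimal Python sketch. The record layout (`patient_id`, `test`, `value`) and the blunder rule (a negative result) are assumptions made for this example only; a real registry would define its own fields and plausibility checks.

```python
def clean_lab_records(records, is_blunder):
    """Split source lab records into kept and dropped lists.

    Records without a patient identifier are dropped (CDM restriction).
    Blunders and incomplete results are kept, but their value becomes None
    (NULL), so the fact that a measurement was taken is preserved.
    """
    kept, dropped = [], []
    for rec in records:
        if not rec.get("patient_id"):
            dropped.append(rec)  # kept aside so the drop stays transparent
            continue
        cleaned = dict(rec)  # never modify the source record in place
        if cleaned.get("value") is None or is_blunder(cleaned):
            cleaned["value"] = None
        kept.append(cleaned)
    return kept, dropped


def is_negative_result(rec):
    # Hypothetical blunder rule: a lab result cannot be negative.
    return rec["value"] < 0


source_records = [
    {"patient_id": "P1", "test": "NT-proBNP", "value": 350.0},
    {"patient_id": None, "test": "NT-proBNP", "value": 410.0},
    {"patient_id": "P2", "test": "NT-proBNP", "value": -5.0},
    {"patient_id": "P3", "test": "NT-proBNP", "value": None},
]
kept, dropped = clean_lab_records(source_records, is_negative_result)
```

Returning the dropped records alongside the kept ones makes it easy to report to the researcher exactly what the ETL removed.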

Background

The data available on PH subgroups are limited; databases of patients with PH and its subgroups are often small and come from disparate regions, and their data cannot be readily pooled. The following paragraphs explain why raw PH data should be converted from its original structure to the OMOP CDM.

First of all, the ultimate goal for a researcher is to draw more robust conclusions from real-world data. As the law of large numbers suggests, the larger the sample, the more likely its characteristics are to be close to those of the population. In other words, it is worth expanding the data. A cost-effective way to achieve this is to combine all available data sources [1].

However, it is difficult to merge data without losing information, since each dataset has its own purpose, objectives, structure, and terminology. A logical solution to this problem is to store the data in a standardized format.

Moreover, data harmonization and federation have become a new paradigm for the analysis of health data and are considered the future of research collaboration and real-world evidence generation in rare diseases. Successful examples of such federated data networks are the European Health Data Evidence Network (EHDEN) and PHederation, the federated network of PH registries, both of which adopted the OMOP CDM as their core data model. PHederation is the main motivating factor for sharing PH registry mapping conventions in a GitHub repository.

Four key advantages of standardizing to the OMOP CDM are:

The CDM is a patient-centric model.
The CDM is data agnostic, so users do not need to understand all database-specific schema details.
It relies on tightly controlled terminology, which makes datasets comparable and concepts accessible to researchers.
Its fixed structure simplifies application development (a replication analysis reported that up to 80% less programming time was required with the OMOP CDM than with the raw data [2]).

Therefore, the mapping of PH registry databases to the OMOP CDM allows researchers to use a broader base to generate more robust conclusions, as they can perform analyses across the many disparate real-world data assets available in OMOP CDM format [3].

CDM version

Currently, three CDM versions (5.3, 5.4, and 6.0) are available as candidates for conversion. However, v6.0 is not fully supported by the OHDSI tools and is not ready for mainstream use.

Whether you use v5.3 or v5.4 depends on the version of the target harmonization dataset; e.g., if you already have several datasets converted to v5.3, it makes little sense to convert a new one to v5.4, and vice versa.

If you are just starting to build your harmonization project, we recommend picking v5.4, since:

it is the latest version
changes between these two versions are not huge (they can be found here)
a downgrade converter from v5.4 to v5.3 is not difficult to create, for example by using views
if a source PH registry is multinational, v5.4 is convenient because it has country fields in the location table
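
The view-based downgrade mentioned in the list above can be sketched as follows. This is only an illustration under stated assumptions: it uses an in-memory SQLite database through Python's `sqlite3` module, covers only the location table, and the view name `location_v53` and the sample row are hypothetical. The view hides the four columns that v5.4 adds to location (`country_concept_id`, `country_source_value`, `latitude`, `longitude`); other tables that differ between the versions would need analogous views.

```python
import sqlite3

# Columns of the location table as defined in CDM v5.3.
V53_LOCATION_COLUMNS = [
    "location_id", "address_1", "address_2", "city", "state", "zip",
    "county", "location_source_value",
]

conn = sqlite3.connect(":memory:")

# A v5.4-style location table (simplified column types for SQLite).
conn.execute("""
    CREATE TABLE location (
        location_id INTEGER PRIMARY KEY,
        address_1 TEXT, address_2 TEXT, city TEXT, state TEXT, zip TEXT,
        county TEXT, location_source_value TEXT,
        country_concept_id INTEGER, country_source_value TEXT,
        latitude REAL, longitude REAL
    )
""")

# Hypothetical sample row; 0 stands for an unmapped country concept.
conn.execute(
    "INSERT INTO location (location_id, city, country_concept_id, "
    "country_source_value) VALUES (1, 'Basel', 0, 'CH')"
)

# The downgrade view exposes only the v5.3 columns.
conn.execute(
    "CREATE VIEW location_v53 AS SELECT "
    + ", ".join(V53_LOCATION_COLUMNS)
    + " FROM location"
)

cursor = conn.execute("SELECT * FROM location_v53")
view_columns = [column[0] for column in cursor.description]
rows = cursor.fetchall()
```

Because a view stores no data, the v5.4 tables remain the single source of truth and the v5.3 shape is derived on demand.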

Type concepts

Type concepts indicate the provenance of a record in the OMOP CDM. 32809 - Case Report Form and 32879 - Registry can be used as type concepts for registry data.

Pulmonary Hypertension - General Information

The clinical complexity of pulmonary hypertension has led to the development of several schemas, called classifications, that recognize and reflect the relevant clinical entities. Design principles and internal ontological crosswalks for the most prominent classifications are described in detail in the respective chapter. Note that when working with PH registries, you should be prepared to deal with classification concepts and free text as coded entities rather than with the vocabularies common in electronic health records (EHR) or claims datasets. More on this topic is here.

Registry Data - General Considerations

As mentioned above, pulmonary hypertension datasets rely on classifications and free text, and so do registries in general.

Multiple sources

Sometimes a registry is formed from several sources that use different case report forms (CRFs). In such cases, each CRF should be handled separately during the ETL so that no peculiarity is missed.

Causality and linked facts

Registry data (as well as clinical trial data) abound in qualifiers and interlinked facts, such as seriousness and causality for adverse events, pre- and post-exercise lab data, etc. To find out how to store such data, see the How to Link Facts chapter.

Negative and unknown events

In clinical trials and registries, it is common to store 'negative' events, i.e., events that have not been done or have not occurred. This is not the case for the OMOP CDM, which handles 'negative' events differently because of the observation period: if there is no record of an event during this period, the event is considered absent. This assumption, however, should not be extrapolated to 'negative' events outside the observation period. Although such an approach is well suited to observational studies, it cannot satisfy the needs of researchers who work with registry and clinical trial data. Thankfully, the concepts available in Athena allow us to store 'negative' events.

All 'negative' events, regardless of their source domain, are stored in the Observation table. If there is a valid standard concept that summarizes the absence of an event, it should be used, e.g., 44811030 - No history of deep vein thrombosis, 45763681 - No history of depression, etc. If there is no such concept, the event itself is stored in value_as_concept_id, and observation_concept_id takes one of the following: 40481925 - No history of clinical finding in subject, 4166732 - No history of procedure, or 4032324 - No history of.
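
The rule above can be sketched in Python as follows. The function name, the dictionary-based record, the lookup table of pre-coordinated concepts, and the choice of generic concept by domain are assumptions made for illustration; the concept IDs are the ones listed above, 32879 - Registry is used as the type concept, and 1234567 is a placeholder rather than a real concept ID.

```python
NO_HISTORY_OF_CLINICAL_FINDING = 40481925
NO_HISTORY_OF_PROCEDURE = 4166732
NO_HISTORY_OF = 4032324
REGISTRY_TYPE_CONCEPT = 32879

# Standard concepts that already summarize the absence of an event.
# Keyed by a hypothetical source event name.
PRECOORDINATED_ABSENCE = {
    "deep vein thrombosis": 44811030,
    "depression": 45763681,
}

# Assumed mapping from the event's domain to a generic "no history" concept.
GENERIC_ABSENCE_BY_DOMAIN = {
    "Condition": NO_HISTORY_OF_CLINICAL_FINDING,
    "Procedure": NO_HISTORY_OF_PROCEDURE,
}


def negative_event_to_observation(person_id, event_name, event_concept_id,
                                  domain, observation_date):
    """Build an Observation-style record for an event that did not occur."""
    precoordinated = PRECOORDINATED_ABSENCE.get(event_name)
    if precoordinated is not None:
        # A single concept expresses the absence; no value is needed.
        observation_concept_id = precoordinated
        value_as_concept_id = None
    else:
        # Generic "no history" concept plus the event itself as the value.
        observation_concept_id = GENERIC_ABSENCE_BY_DOMAIN.get(
            domain, NO_HISTORY_OF
        )
        value_as_concept_id = event_concept_id
    return {
        "person_id": person_id,
        "observation_concept_id": observation_concept_id,
        "value_as_concept_id": value_as_concept_id,
        "observation_date": observation_date,
        "observation_type_concept_id": REGISTRY_TYPE_CONCEPT,
    }


no_dvt = negative_event_to_observation(
    1, "deep vein thrombosis", None, "Condition", "2020-01-15"
)
# 1234567 is a placeholder for the standard concept of the absent event.
no_other_condition = negative_event_to_observation(
    1, "some other condition", 1234567, "Condition", "2020-01-15"
)
```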

The same idea applies to 'Unknown' events. Among the concepts Athena offers for observation_concept_id, the following are worth considering: 4270455 - Procedure status unknown and 4287024 - Medical history unknown.

Having such information in the CDM might be helpful if you want to know how many patients have been tested for something, how many have not, and for what percentage of the population it is unknown whether they have been tested.
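
As a rough illustration of such a summary, the sketch below counts, for one procedure, how many patients had it done, how many did not, and how many have an unknown status. It assumes rows are available as Python dictionaries keyed by the CDM column names, that performed procedures sit in procedure rows, and that 'negative' and 'unknown' statuses were stored as observations with the procedure in value_as_concept_id. The function name, the status labels, and the placeholder concept ID 1234567 are hypothetical.

```python
from collections import Counter

NO_HISTORY_OF_PROCEDURE = 4166732
PROCEDURE_STATUS_UNKNOWN = 4270455


def summarize_procedure_status(person_ids, procedure_rows, observation_rows,
                               procedure_concept_id):
    """Return {status: (count, percent)} for one procedure concept."""
    if not person_ids:
        return {}
    done = {
        row["person_id"]
        for row in procedure_rows
        if row["procedure_concept_id"] == procedure_concept_id
    }
    status_by_person = {}
    for row in observation_rows:
        if row["value_as_concept_id"] != procedure_concept_id:
            continue
        if row["observation_concept_id"] == NO_HISTORY_OF_PROCEDURE:
            status_by_person[row["person_id"]] = "not done"
        elif row["observation_concept_id"] == PROCEDURE_STATUS_UNKNOWN:
            status_by_person[row["person_id"]] = "unknown"
    counts = Counter()
    for person_id in person_ids:
        if person_id in done:
            counts["done"] += 1
        else:
            counts[status_by_person.get(person_id, "no record")] += 1
    total = len(person_ids)
    return {
        status: (count, 100.0 * count / total)
        for status, count in counts.items()
    }


# 1234567 is a placeholder for the concept ID of the procedure of interest.
summary = summarize_procedure_status(
    person_ids=[1, 2, 3, 4],
    procedure_rows=[{"person_id": 1, "procedure_concept_id": 1234567}],
    observation_rows=[
        {"person_id": 2, "observation_concept_id": NO_HISTORY_OF_PROCEDURE,
         "value_as_concept_id": 1234567},
        {"person_id": 3, "observation_concept_id": PROCEDURE_STATUS_UNKNOWN,
         "value_as_concept_id": 1234567},
    ],
    procedure_concept_id=1234567,
)
```

Patients with no record at all are reported separately from those whose status was explicitly recorded as unknown, which is the distinction that storing 'negative' and 'unknown' events makes possible.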

References

[1] Cheng HG, Phillips MR. Secondary analysis of existing data: opportunities and implementation. Shanghai Arch Psychiatry. 2014 Dec;26(6):371-5. doi: 10.11919/j.issn.1002-0829.214171. PMID: 25642115; PMCID: PMC4311114.

[2] Matcho A, Ryan P, Fife D, Reich C. Fidelity assessment of a clinical practice research datalink conversion to the OMOP common data model. Drug Saf. 2014 Nov;37(11):945-59. doi: 10.1007/s40264-014-0214-3. PMID: 25187016; PMCID: PMC4206771.

[3] Biedermann P, Ong R, Davydov A, Orlova A, Solovyev P, Sun H, Wetherill G, Brand M, Didden EM. Standardizing registry data to the OMOP Common Data Model: experience from three pulmonary hypertension databases. BMC Med Res Methodol. 2021 Nov 2;21(1):238. doi: 10.1186/s12874-021-01434-3. PMID: 34727871; PMCID: PMC8565035.