General Thoughts - OHDSI/ETL--PulmonaryHypertensionRegistries Wiki

CDM version

Currently, three CDM versions, 5.3, 5.4, and 6.0, are available as candidates for conversion. However, v6.0 is not fully supported by the OHDSI tools, and it is not ready for mainstream use.

Which version you use (v5.3 or v5.4) depends on the version of the target harmonization dataset. For example, if you already have several datasets converted to v5.3, it makes no sense to convert a new one to v5.4, and vice versa.

If you are just starting your harmonization project, we would recommend picking v5.4, since:

Multiple sources

Sometimes a registry is formed from several sources that come from different case report forms (CRFs). In such cases, the different CRFs should be handled separately during ETL so that no peculiarity is missed.

Type concepts

Type concepts indicate the provenance of a record in the OMOP CDM. 32809 - Case Report Form and 32879 - Registry can be used as type concepts for registry data.


The CDM is a patient-centric model, so the person table is essential. Here are some aspects worth paying attention to.

Anyone working on harmonization projects should always keep in mind that person_source_value must be unique across all aggregated datasets. If source datasets are in SDTM, unique subject identifiers (USUBJID) are preferred over plain subject identifiers (SUBJID). For other types of sources, the idea of USUBJID can be borrowed: use a combination of a study identifier and a unique patient identifier within the study as the person_source_value.
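The USUBJID-style identifier described above can be sketched as follows. This is a minimal illustration, not part of the registry ETL itself: the function name, the separator, and the example identifiers are assumptions, and the length check assumes the varchar(50) definition of person_source_value in the CDM v5.x DDL.

```python
MAX_SOURCE_VALUE_LENGTH = 50  # assumed from the CDM v5.x DDL (varchar(50))


def make_person_source_value(study_id: str, subject_id: str, sep: str = "-") -> str:
    """Combine a study identifier and a within-study patient identifier.

    The result is unique across studies as long as study identifiers are
    unique and subject identifiers are unique within each study.
    """
    study = study_id.strip()
    subject = subject_id.strip()
    if not study or not subject:
        raise ValueError("both study_id and subject_id must be non-empty")
    value = f"{study}{sep}{subject}"
    if len(value) > MAX_SOURCE_VALUE_LENGTH:
        raise ValueError(f"person_source_value too long: {value!r}")
    return value


# Hypothetical identifiers, for illustration only.
print(make_person_source_value("REG01", " 0042 "))
```

Raising an error on an over-long value, rather than truncating it, avoids silently producing duplicate identifiers.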

Among the mandatory person attributes, year of birth should be the top priority. However, sometimes the source data lacks an explicit year of birth and provides age instead. In such cases, the year of birth should be calculated from the age and some reference date. This reference date can be when informed consent was signed, the first dose of a study drug was administered, a baseline visit occurred, etc. If you are dealing with SDTM, you can use the explicit subject reference date (RFSTDTC) if it is present in the Demographics domain.
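A minimal sketch of this calculation is shown below. The function name is hypothetical, and the code assumes the reference date is an ISO 8601 string (as RFSTDTC is in SDTM), possibly partial, so only the leading four-digit year is used. Note that the result is an estimate: if the patient's birthday had not yet occurred in the reference year, the true year of birth is one year earlier.

```python
def estimate_year_of_birth(age_years: int, reference_date_iso: str) -> int:
    """Estimate year of birth as (reference year - age in completed years).

    reference_date_iso is an ISO 8601 date or datetime string; partial
    dates such as "2015" or "2015-03" are accepted because only the year
    component is read.
    """
    if age_years < 0:
        raise ValueError("age must be non-negative")
    year_part = reference_date_iso.strip()[:4]
    if len(year_part) != 4 or not year_part.isdigit():
        raise ValueError(f"cannot read a year from {reference_date_iso!r}")
    return int(year_part) - age_years


# Illustrative values: a patient aged 60 at the reference date.
print(estimate_year_of_birth(60, "2015-03-10T08:30"))
```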

We recommend substituting original unmappable values for race and ethnicity with generic ones. This will allow you to do some analysis based on source values rather than concept ids, since the latter are 0. For example, one dataset can have 'Mixed' or 'Mixed race' as a value for race, another '2 or more races', etc. This type of race is non-standard and thus cannot be populated in race_concept_id. In this case, it makes sense to substitute the original values with a single one, let us say 'MIXED', and put it into race_source_value. This will allow you to pick out all mixed-race patients from all studies without writing complex regular expressions. In the same manner, 'OTHER' and 'UNSPECIFIED' categories for race and ethnicity can be created if they add value for analysis.
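The substitution could be implemented with a simple lookup, as in the sketch below. The 'Mixed' variants come from the example above; the remaining entries, the table itself, and the function name are illustrative assumptions and should be replaced with the values actually found in your sources.

```python
import re
from typing import Optional

# Unmappable source values -> one generic race_source_value.
# Entries other than the 'Mixed' variants are hypothetical examples.
GENERIC_RACE_SOURCE_VALUES = {
    "mixed": "MIXED",
    "mixed race": "MIXED",
    "2 or more races": "MIXED",
    "other": "OTHER",
    "not specified": "UNSPECIFIED",
}


def harmonize_race_source_value(raw: Optional[str]) -> Optional[str]:
    """Return a generic value for known unmappable races, else the input.

    Matching ignores case and collapses repeated whitespace.
    """
    if raw is None:
        return None
    cleaned = raw.strip()
    key = re.sub(r"\s+", " ", cleaned.lower())
    return GENERIC_RACE_SOURCE_VALUES.get(key, cleaned)


for value in ["Mixed  Race", "2 or more races", "Asian"]:
    print(value, "->", harmonize_race_source_value(value))
```

Values that do map to a standard concept pass through unchanged, so the function is only relevant for records whose race_concept_id ends up as 0.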

Negative and unknown events

In clinical trials and registries, it is common to store 'negative' events, that is, events that were not done or did not occur. This is not the case for the OMOP CDM, which handles 'negative' events implicitly through the observation period: if there is no record of an event during this time, the event is assumed not to have happened. This assumption, though, should not be extrapolated to events outside of the observation period. Such an approach works well for the observational studies the CDM was initially created for, but it sometimes cannot satisfy the needs of researchers who work with registries and clinical trial data. Thankfully, the concepts available in Athena allow us to store 'negative' events if there is a strong use case for it.

All 'negative' events, regardless of their source domains, are stored in the Observation table. If there is a valid standard concept summarizing the absence of an event, it should be used, e.g., 44811030 - No history of deep vein thrombosis, 45763681 - No history of depression, etc. If there is no such concept, the event itself is stored in value_as_concept_id, with observation_concept_id being one of the following: 40481925 - No history of clinical finding in subject, 4166732 - No history of procedure, or 4032324 - No history of.

For 'Unknown' events, the same approach applies. Among the Athena offerings for observation_concept_id, the following concepts are worth considering: 4270455 - Procedure status unknown, 4287024 - Medical history unknown.
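The rule for 'negative' events can be sketched as a small function that builds an Observation row. The concept ids are the ones listed above; the function name, the dictionary layout, the choice of 32809 - Case Report Form as the type concept, and in particular the mapping from the event's domain to a generic 'No history of' concept are assumptions made for illustration.

```python
from datetime import date
from typing import Optional

NO_HISTORY_OF_CLINICAL_FINDING = 40481925
NO_HISTORY_OF_PROCEDURE = 4166732
NO_HISTORY_OF = 4032324
TYPE_CASE_REPORT_FORM = 32809

# Assumed mapping from the event's domain to a generic absence concept.
GENERIC_ABSENCE_BY_DOMAIN = {
    "Condition": NO_HISTORY_OF_CLINICAL_FINDING,
    "Procedure": NO_HISTORY_OF_PROCEDURE,
}


def build_negative_observation(
    person_id: int,
    event_concept_id: int,
    event_domain: str,
    observation_date: date,
    absence_concept_id: Optional[int] = None,
) -> dict:
    """Build an Observation row for an event that did not occur.

    If a dedicated standard concept for the absence exists, pass it as
    absence_concept_id and it is used directly. Otherwise a generic
    'No history of' concept is used and the event itself goes into
    value_as_concept_id.
    """
    if absence_concept_id is not None:
        observation_concept_id = absence_concept_id
        value_as_concept_id = None
    else:
        observation_concept_id = GENERIC_ABSENCE_BY_DOMAIN.get(
            event_domain, NO_HISTORY_OF
        )
        value_as_concept_id = event_concept_id
    return {
        "person_id": person_id,
        "observation_concept_id": observation_concept_id,
        "observation_date": observation_date,
        "observation_type_concept_id": TYPE_CASE_REPORT_FORM,
        "value_as_concept_id": value_as_concept_id,
    }


# 1234567 is a placeholder, not a real concept id.
print(build_negative_observation(1, 1234567, "Condition", date(2020, 1, 15)))
```

For 'Unknown' events, the same structure would apply with the status-unknown concepts in place of the 'No history of' ones.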

PH classification


For PH registries, it is crucial to know whether death was related to the study disease, in our case pulmonary hypertension. This fact can be captured in the source dataset along with other causes of death, often as a flag value (e.g., PHREL or PAHREL equal to 'Y' or 1), whereas the granular cause of death is stored in a different column or table. Generally, it is worth carefully inspecting all the source tables for causes of death, since they can be scattered across several tables. For example, in SDTM, the cause of death can be found in the Death Details domain and/or in Disposition Events with an appropriate DSTERM, in Adverse Events with AEOUT = 'FATAL', and in Supplementary domains, e.g., SUPPDS.

So, in the extreme case, a patient can have more than one cause of death (e.g., a relation to PH and a granular condition resulting in death). Here we recommend adding as many records to the death table for a person as needed to keep all the details. The date of death, in this case, should be the same for all records for this patient.

There is no reason to store a source record that indicates only the fact of death, without a cause, if there is also a record with the cause. In such cases, only the latter should go into the CDM. However, if the only information available is that a particular patient died, this record should still be loaded into the CDM.
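The rules above for assembling death records can be sketched as follows. This is an illustration under stated assumptions: the function name, the row layout, the use of 32809 - Case Report Form as the default type concept, and leaving cause_concept_id empty when no cause is known are all choices made for the example, and the cause concept ids in the usage line are placeholders.

```python
from datetime import date
from typing import List, Optional


def build_death_records(
    person_id: int,
    death_date: date,
    cause_concept_ids: List[Optional[int]],
    death_type_concept_id: int = 32809,
) -> List[dict]:
    """Build one death row per distinct cause, all with the same date.

    Source records without a cause (None) are dropped when at least one
    cause is known. If no cause is known at all, a single row recording
    only the fact of death is returned.
    """
    causes: List[Optional[int]] = []
    for cause in cause_concept_ids:
        if cause is not None and cause not in causes:
            causes.append(cause)
    if not causes:
        causes = [None]  # fact of death only; empty cause is an assumption
    return [
        {
            "person_id": person_id,
            "death_date": death_date,
            "death_type_concept_id": death_type_concept_id,
            "cause_concept_id": cause,
        }
        for cause in causes
    ]


# 111 and 222 are placeholder concept ids (e.g., a PH-related flag and a
# granular cause); None stands for a source record with no cause given.
for row in build_death_records(1, date(2020, 5, 1), [None, 111, 222, 111]):
    print(row)
```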