CDISC - OHDSI/Vocabulary-v5.0 GitHub Wiki
The Clinical Data Interchange Standards Consortium (CDISC) is an open, non-profit organization that develops and supports global data standards to improve the quality and interoperability of data in medical research and healthcare. CDISC partners with NCI Enterprise Vocabulary Services (EVS) to develop and support controlled terminology for all CDISC standards initiatives.
Currently, only the CDISC Controlled Terminology supporting the Study Data Tabulation Model (SDTM) is represented as an OMOP vocabulary. NCI maintains it as part of the NCI Metathesaurus (NCIm).
All CDISC sources were obtained from the NCI Metathesaurus.
Information about concepts can be found in the mrconso table.
Table 1 – Fields from NCIm mrconso table used for CDISC integration as OMOP vocabulary
mrconso field name | mrconso field definition | The way it was used during the CDISC OMOP conversion |
cui | Unique concept identifier | Used during mappings integration. Can potentially be used to obtain concept hierarchy |
scui | The source asserted concept identifier | Represented as concept_code |
code | Most useful source asserted identifier (if the source vocabulary has more than one identifier), or a Metathesaurus-generated source entry identifier (if the source vocabulary has none) | Used to identify the second part of the concept_code, which is represented in the ‘str’ field |
str | The set of all concept names within one scui | Used to obtain concept_names, second parts of concept_codes, and synonyms |
The procedures for transforming Concepts from the source to the OMOP Standard Vocabularies can be found on the OHDSI GitHub.
The characteristics of a single concept (concept_code, concept_name, concept_synonym) are represented within one scui in the mrconso table. The concept_name was chosen from the values in the ‘str’ field according to the prioritization rules below. All other ‘str’ values within the same scui having sab = ‘CDISC’ were taken as synonyms.
Rules to define a concept_name:
- When a row with code LIKE ‘%CD’ exists: take the ‘str’ of the companion row whose code lacks the ‘CD’ suffix (for example, SDTM-AUTEST for SDTM-AUTESTCD) as the concept_name (Table 2)
- When no code LIKE ‘%CD’ is present: take the ‘str’ where ‘mrconso.tty’ = ‘PT’ and ‘mrconso.sab’ = ‘CDISC’ (not NCI) as the concept_name (Table 3)
- When no ‘PT’ with sab = ‘CDISC’ exists: take the ‘str’ with tty = ‘SY’ and ispref = ‘Y’ as the concept_name (Table 4)
In most cases, ‘scui’ was taken as the concept_code. Where a row with ‘code’ LIKE ‘%CD’ exists, the concept_code is composite: the ‘scui’ concatenated with the ‘str’ of that row, separated by a hyphen (for example, C198232-AIRPRSSR). The value carried by the ‘%CD’ row is considered to be the real source code; however, not all concepts have one.
See examples below.
Table 2 – code like ‘%CD’ exists
scui | tty | sab | code | str | CDISC OMOP attribute |
C198232 | SY | CDISC | C198232 | Air Pressure | concept_synonym |
C198232 | PT | NCI | C198232 | Air Pressure | - |
C198232 | PT | CDISC | SDTM-AUTEST | Air Pressure | concept_name |
C198232 | PT | CDISC | SDTM-AUTESTCD | AIRPRSSR | 2nd part of the concept_code |
In the example above, AIRPRSSR was assigned as the second part of the concept_code (C198232-AIRPRSSR) and Air Pressure as the concept_name. The remaining CDISC ‘str’ values are synonyms; the NCI ‘str’ is not used.
Table 3 – code like ‘%CD’ does not exist
scui | tty | sab | code | str | CDISC OMOP attribute |
C40407 | SY | CDISC | C40407 | Wilms Tumor of the Kidney | concept_synonym |
C40407 | SY | CDISC | C40407 | Wilms' Tumor of the Kidney | concept_synonym |
C40407 | SY | CDISC | C40407 | Embryonal Nephroma | concept_synonym |
C40407 | SY | CDISC | C40407 | Nephroblastoma | concept_synonym |
C40407 | SY | CDISC | C40407 | Renal Wilms' Tumor | concept_synonym |
C40407 | PT | CDISC | C40407 | NEPHROBLASTOMA, MALIGNANT | concept_name |
C40407 | PT | NCI | C40407 | Kidney Wilms Tumor | - |
In the example above, C40407 was assigned as the concept_code and NEPHROBLASTOMA, MALIGNANT as the concept_name. The remaining CDISC ‘str’ values are synonyms; the NCI ‘str’ is not used.
Table 4 – no ‘PT’ with sab = ‘CDISC’ exists
scui | tty | ispref | sab | code | str | CDISC OMOP attribute |
C62017 | SY | Y | CDISC | C62017 | Type 1 2nd degree AV Block | concept_name |
C62017 | PT | Y | NCI | C62017 | AV Block Second Degree Mobitz Type I | - |
C62017 | PT | Y | NCI | C62017 | Mobitz Type I Second Degree AV Block | - |
C62017 | PT | Y | NCI | C62017 | AV Block Second Degree Möbitz Type I | - |
In the example above, C62017 was assigned as the concept_code and Type 1 2nd degree AV Block as the concept_name.
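The naming and coding rules illustrated in Tables 2 to 4 can be sketched in Python as follows. This is a minimal illustration, not the actual OHDSI load script: the `Row` type and `build_concept` function are hypothetical names, the "companion row" reading of the first rule is an interpretation of Table 2, and excluding the ‘%CD’ row's ‘str’ from the synonyms is an assumption based on how Table 2 labels that row.

```python
from typing import NamedTuple, Optional, Sequence


class Row(NamedTuple):
    """One mrconso row, reduced to the fields shown in Tables 2-4 (hypothetical type)."""
    scui: str
    tty: str
    sab: str
    code: str
    str_: str
    ispref: str = "Y"


def build_concept(rows: Sequence[Row]) -> dict:
    """Derive concept_code, concept_name and synonyms from all rows sharing one scui."""
    # Only rows asserted by CDISC take part; NCI rows are ignored.
    cdisc = [r for r in rows if r.sab == "CDISC"]
    if not cdisc:
        raise ValueError("no CDISC rows for this scui")
    scui = cdisc[0].scui

    # Row whose code matches LIKE '%CD', if any.
    cd_row: Optional[Row] = next((r for r in cdisc if r.code.endswith("CD")), None)

    name_row: Optional[Row] = None
    if cd_row is not None:
        # Rule 1 (as read from Table 2): the companion row has the same code
        # without the trailing 'CD'.
        base_code = cd_row.code[:-2]
        name_row = next((r for r in cdisc if r.code == base_code), None)
    if name_row is None:
        # Rule 2: a preferred term (PT) asserted by CDISC.
        name_row = next((r for r in cdisc if r.tty == "PT" and r is not cd_row), None)
    if name_row is None:
        # Rule 3: a CDISC synonym flagged as preferred.
        name_row = next((r for r in cdisc if r.tty == "SY" and r.ispref == "Y"), None)
    if name_row is None:
        raise ValueError(f"no concept_name candidate for {scui}")

    # Composite code only when a '%CD' row exists.
    concept_code = scui if cd_row is None else f"{scui}-{cd_row.str_}"
    # Assumption: the '%CD' row feeds the code only and is not kept as a synonym.
    synonyms = [r.str_ for r in cdisc if r is not name_row and r is not cd_row]
    return {
        "concept_code": concept_code,
        "concept_name": name_row.str_,
        "concept_synonyms": synonyms,
    }


# Rows copied from Table 2 and Table 4.
table2 = build_concept([
    Row("C198232", "SY", "CDISC", "C198232", "Air Pressure"),
    Row("C198232", "PT", "NCI", "C198232", "Air Pressure"),
    Row("C198232", "PT", "CDISC", "SDTM-AUTEST", "Air Pressure"),
    Row("C198232", "PT", "CDISC", "SDTM-AUTESTCD", "AIRPRSSR"),
])
table4 = build_concept([
    Row("C62017", "SY", "CDISC", "C62017", "Type 1 2nd degree AV Block", "Y"),
    Row("C62017", "PT", "NCI", "C62017", "AV Block Second Degree Mobitz Type I", "Y"),
])
print(table2)
print(table4)
```

Run on the Table 2 rows, the sketch yields the composite code C198232-AIRPRSSR with the name Air Pressure; on the Table 4 rows it falls through to the third rule and keeps the plain scui as the code.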
All CDISC concepts are non-Standard.
Concept_class_id and domain_id are obtained by mapping Attributes from the NCI mrsty table to SNOMED concept classes and domains. When one concept has several (2 or more) attributes and, as a result, several classes and/or domains, the default class and domain are assigned (concept_class_id = ‘Observable Entity’, domain_id = ‘Observation’).
Mappings of CDISC attributes to SNOMED concept classes and domains are available from here.
domain_id | count |
Observation | 16897 |
Measurement | 8158 |
Spec Anatomic Site | 1792 |
Procedure | 620 |
Unit | 571 |
Condition | 455 |
Geography | 276 |
Device | 208 |
Provider | 41 |
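The default class and domain rule described above can be sketched as follows. The entries in `ATTRIBUTE_MAP` are purely illustrative placeholders and are not taken from the actual attribute mapping; the function name is hypothetical, and returning the default when no attribute is mapped at all is an assumption of this sketch, since the source does not cover that case.

```python
from typing import Dict, Iterable, Tuple

ClassDomain = Tuple[str, str]  # (concept_class_id, domain_id)

# Default named in the text for concepts with conflicting attributes.
DEFAULT_CLASS_DOMAIN: ClassDomain = ("Observable Entity", "Observation")

# Hypothetical excerpt of an attribute -> (class, domain) mapping.
# These pairs are illustrative only, not the real CDISC mapping.
ATTRIBUTE_MAP: Dict[str, ClassDomain] = {
    "Disease or Syndrome": ("Disorder", "Condition"),
    "Laboratory Procedure": ("Procedure", "Measurement"),
    "Body Part, Organ, or Organ Component": ("Body Structure", "Spec Anatomic Site"),
}


def assign_class_and_domain(attributes: Iterable[str]) -> ClassDomain:
    """Return the single mapped (class, domain) pair, or the default on conflict."""
    candidates = {ATTRIBUTE_MAP[a] for a in attributes if a in ATTRIBUTE_MAP}
    if len(candidates) == 1:
        return next(iter(candidates))
    # Several distinct classes/domains (or, by assumption, none): use the default.
    return DEFAULT_CLASS_DOMAIN


single = assign_class_and_domain(["Disease or Syndrome"])
conflict = assign_class_and_domain(["Disease or Syndrome", "Laboratory Procedure"])
print(single)
print(conflict)
```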
The only relationships introduced at the time of vocabulary integration were mappings: ‘Maps to’ and ‘Maps to value’. The majority of these relationships are uncurated, source-derived mappings. However, a portion of pre-selected codes was curated manually.
More mapping information (i.e. provenance and directionality) can be found in the concept_relationship_metadata table. No other lateral (intra-vocabulary) relationships were introduced.
CDISC Concepts are non-Standard Concepts and therefore do not participate in the hierarchy of the CONCEPT_ANCESTOR table. No other hierarchical (intra-vocabulary) relationships were introduced.
All the CDISC concepts are non-Standard. That means they have to be mapped to the corresponding Standard Concepts using the CONCEPT_RELATIONSHIP table ("Maps to" and occasionally "Maps to value" records). Most of them are mapped to single Concepts, generating one-to-one records, but some of them create multiple records or have mappings to other domains.
From the ETL perspective, sequential joins are necessary to obtain proper mappings. Try the joins in the order listed.
When your data contains CDISC names, the querying scenarios are:
- JOIN to concept_name
- or JOIN to concept_synonym
When your data contains CDISC codes, the querying scenarios are:
- JOIN to the second part of the concept_code
- or JOIN to the first part of the concept_code
- or JOIN to concept_name
- or JOIN to concept_synonym
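The sequential lookups listed above can be sketched in Python against small in-memory stand-ins for the vocabulary tables. Everything here is an assumption made for illustration: the concept_id values and mapping targets are placeholders, the helper names are hypothetical, matching is exact and case-sensitive, and each lookup stops at the first scenario that returns a match. In a real ETL the same sequence would typically be expressed as SQL joins against CONCEPT, CONCEPT_SYNONYM and CONCEPT_RELATIONSHIP.

```python
from typing import Callable, List, Sequence, Tuple

# Placeholder CDISC concepts; concept_id values are invented for this sketch.
CONCEPTS = [
    {"concept_id": 1, "concept_code": "C198232-AIRPRSSR",
     "concept_name": "Air Pressure", "synonyms": ["Air Pressure"]},
    {"concept_id": 2, "concept_code": "C40407",
     "concept_name": "NEPHROBLASTOMA, MALIGNANT",
     "synonyms": ["Nephroblastoma", "Embryonal Nephroma"]},
]

# Placeholder (source concept_id, relationship_id, target concept_id) rows.
RELATIONSHIPS = [
    (1, "Maps to", 9001),
    (2, "Maps to", 9002),
]

Matcher = Callable[[dict, str], bool]


def by_second_part(concept: dict, value: str) -> bool:
    # Only composite codes ("scui-code") have a second part.
    code = concept["concept_code"]
    return "-" in code and code.partition("-")[2] == value


def by_first_part(concept: dict, value: str) -> bool:
    return concept["concept_code"].partition("-")[0] == value


def by_name(concept: dict, value: str) -> bool:
    return concept["concept_name"] == value


def by_synonym(concept: dict, value: str) -> bool:
    return value in concept["synonyms"]


def first_match(value: str, matchers: Sequence[Matcher]) -> List[int]:
    """Try each matcher in order; return concept_ids from the first one that hits."""
    for matcher in matchers:
        hits = [c["concept_id"] for c in CONCEPTS if matcher(c, value)]
        if hits:
            return hits
    return []


def find_by_name(value: str) -> List[int]:
    return first_match(value, [by_name, by_synonym])


def find_by_code(value: str) -> List[int]:
    return first_match(value, [by_second_part, by_first_part, by_name, by_synonym])


def standard_targets(concept_id: int) -> List[Tuple[str, int]]:
    """Follow 'Maps to' / 'Maps to value' records; may return several rows."""
    return [(rel, target) for source, rel, target in RELATIONSHIPS
            if source == concept_id and rel in ("Maps to", "Maps to value")]


for source_value in ("AIRPRSSR", "C40407"):
    for cid in find_by_code(source_value):
        print(source_value, cid, standard_targets(cid))
```

Returning lists rather than single values keeps the one-to-many cases visible: a source value can resolve to a concept with several ‘Maps to’ or ‘Maps to value’ records.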