Community contribution: drug vocabularies - OHDSI/Vocabulary-v5.0 GitHub Wiki
This document is intended to be used for custom drug vocabulary development, refresh or using machinery for mapping.
Authors: Anna Ostropolets, Alexander Davydov, Oleg Zhuk, Timur Vakhitov and Christian Reich
Version: 1.2
Date of last modification: 01/06/2024
Acknowledgement: We want to thank the Vocabulary Committee and community (especially Tom Seinen) for feedback on this document.
Scope and content of this document
This document outlines a process of incorporation of a new drug vocabulary into the OHDSI Standardized Vocabularies. It involves creating new vocabulary with its codes and mapping it to a standard and creating new standard terms and relationships if drugs are absent in the OHDSI Vocabularies. Repository of added drugs including all hierarchical categories and fully integrated into RxNorm is called RxNorm Extension and is built and maintained within OHDSI.
The task of incorporating and standardizing (and maintaining that status) a source vocabulary into RxNorm Extension involves several aspects:
- Mapping drug codes to existing concepts in RxNorm or RxNorm Extension based on attribute equivalence.
- Creating new RxNorm Extension concepts (Drug Products, Dose Forms, Ingredients, Brand Names, etc.) if no valid equivalent is available. If new drugs are added, the hierarchical concepts above the new node in the hierarchy will be added as well.
- Rearranging RxNorm Extension when a new update of RxNorm happens to overwrite or otherwise affect existing RxNorm Extension concepts (e.g., when a drug is approved by the FDA after it already had market approval elsewhere) or make changes in the hierarchy.
The process involves (i) an initial vocabulary-specific step transforming drugs and their attributes into a standard representation ("input tables"), (ii) an integration step incorporating these into RxNorm Extension (often called the "boiler") which produces intermediate stage tables, and (iii) creation of the final basic tables that are then released through Athena.
In Step I, input tables have to be prepared from the source vocabulary-specific information: drug_concept_stage for concept information, ds_stage for drug strength information, internal_relationship_stage for relationships between the stage concepts, pc_stage for packs, relationship_to_concept for mapping ingredient, brand, dose form, supplier and unit concepts to standard RxNorm and RxNorm Extension.
In Step II, the input files are used to map the incoming drugs using their attributes ("boiled"), rather than using lexical methods. It manages the ambiguity of the input attribute mapping described above, but also considers the amount of confidence in a target concept (often dubbed "social life of a concept"). For example, a drug that exists in many markets in the same combination of ingredient, dose form and strength, but may differ in the lower hierarchy such as branding, suppliers and packaging, will be preferred in a fuzzy mapping situation over a drug that has little manifestation in any of the various drug markets.
Figure 1. Mapping or de-novo creation of a concept (vertical axis) according to its attributes (horizontal axis).
As a result, the quality of the product heavily depends on the quality and precision of the attribute mappings fed into the system and the accuracy of the existing target RxNorm/RxNorm Extension concepts and hierarchy. If the latter deteriorates, no level of input quality can prevent the generation of issues.
The "boiler", i.e., the integration into RxNorm Extension, can be also run in a"mapping only"_ mode without creating new concepts (please refer to Step 2.2). This is useful if a source vocabulary is of low quality (e.g., drug codes are only provided as abbreviated descriptions), but a high-quality vocabulary is already integrated into the system for the respective country or drug market.
Finally, Step III is a vocabulary-agnostic step that includes building hierarchies, assigning concept_id, and otherwise preparing the tables for release.
In this document, we will describe the steps for a) creating a new drug vocabulary and incorporating it into the OHDSI Vocabularies, b) mapping your vocabulary to the standard terms without creating new RxNorm Extension concepts ("mapping only" mode) and c) refreshing existing drug vocabulary.
Initial development process and usage for mapping to RxNorm and RxE
Prerequisites
- PostgreSQL database
- Postgres extensions needed for the functions
- plpython3u
- plpgsql
- Vocabularies functions (described later throughout the document)
Step 0: Schema preparation
In this step, you need to prepare a schema that contains the input tables and the basic Vocabularies tables. Subsequent scripts assume that both these groups of tables are in the same schema. The script for input table DDL can be found here.
If you are using "mapping only" mode, you may relax the requirements such as extending concept_code varchar(50)to varchar(255). This is not applicable to the vocabularies that are planned to be incorporated into the OHDSI Standardized Vocabularies.
- Create standard Vocabularies tables (hereon, "basic tables") and populate them from Athena.
You can use DDL from this GitHub file.
You then will upload the corresponding tables from the Vocabulary bundle downloaded from Athena (or re-use the Vocabularies you have already uploaded). You need to have a full copy of RxNorm and RxNorm Extension at the very minimum.
- Add your vocabulary
You need to add your vocabulary to the vocabulary and concept tables (the name should match the vocabulary_id you used in drug_concept_stage and elsewhere).
INSERT INTO concept
(concept_id, concept_name, domain_id, vocabulary_id, concept_class_id, standard_concept, concept_code, valid_start_date,
valid_end_date, invalid_reason)
SELECT 100 (an incremenetal value that does not exist in Vocabulary table),
'YourVocabName',
'Drug',
'Vocabulary',
'Vocabulary',
NULL,
'OMOP generated',
TO_DATE_('19700101', 'yyyymmdd'),
TO_DATE_('20991231', 'yyyymmdd'),
NULL;
INSERT INTO vocabulary
(vocabulary_id, vocabulary_name, vocabulary_concept_id, vocabulary_reference)
SELECT 'YourVocabName',
'YourVocabName',
100, -- (an incremenetal value that does not exist in Vocabulary table)
'Stub';
Alternatively, you can create this standard function, which also adds the attributes required for a release in Athena, and execute the command below. This will require adding a row into vocabulary_conversion table as well (vocabulary_id_v4 can be any integer higher than 155).
You will need to set latest update date for the vocabulary in order for scripts to work.
We use the function here. Once you execute the function, set Latest Update for both your vocabulary and RxNorm Extension:
DO $_$
BEGIN
PERFORM VOCABULARY_PACK.SetLatestUpdate(
pVocabularyName => 'YourVocabName',
pVocabularyDate =>current_date,
pVocabularyVersion => 'YourVocabName||current_date',
pVocabularyDevSchema => 'YourSchemaName'
);
PERFORM VOCABULARY_PACK.SetLatestUpdate(
pVocabularyName => 'RxNorm Extension',
pVocabularyDate =>current_date,
pVocabularyVersion => 'RxNorm Extension||current_date',
pVocabularyDevSchema => 'YourSchemaName ',
pAppendVocabulary => TRUE
);
END $_$;
Step 1: Building input tables
Step 1.1: Data preparation
Drug vocabularies oftentimes contain the codes for what the OHDSI considers to be drugs and devices. You can incorporate both of them: drugs will become non-standard and will map to RxNorm/RxNorm Extension, devices will become standard.
Data preparation step includes identifying devices in your vocabulary.
Concepts that belong to the source vocabulary, but do not belong to Drug domain by OMOP CDM conventions, should be classified as 'Device'. A Drug is a biochemical substance formulated in such a way that when administered to a Person it will exert a certain physiological or biochemical effect. The following products are not considered Drugs, but Devices:
- Diagnostic radiopharmaceuticals.
- Contrast material for imaging (barium sulfate, gadolinium, etc.).
- Nutritional products and supplements, including infant feeding. In reality that results in the slightly arbitrary and in some cases difficult to ascertain situation that solutions of salts are Drugs (hydrating patients and maintaining ionic balance), while the addition of nutrients such as glucose or vitamins makes them devices (feeding patients).
- Parenteral nutrition (aminoacids and/or lipid mixes).
- Solution for dialysis, catheter maintenance etc.
- Products directly derived from blood (erythrocytes, plasma, serum, thrombocytes), while immune globin or albumin preparations, monoclonal antibodies, coagulation factors and homeopathy derived from blood are Drugs.
- Transplants of any kind.
- Artificial saliva, artificial tears, and lubricants.
- Non-medicated cosmetics (creams, ointments, soaps, deodorants, shampoos), including SPF sunscreens.
- Glucometric strips, needles, tubes with anticoagulants, etc.
- Surgical materials like bone cement.
- Hemostatics materials (cellulose, flour, collagen, etc.).
- Disinfectants for inert surfaces.
Animal drugs can be handled as Drugs or Devices, depending on what their role in patient data can be expected to be. Note that only concepts from Drug domain can have attributes.
Aside from devices, there are several categories of drugs that are more complex and less standardized:
- Vaccines
- Insulins
- Combination of drugs, such as oral contraceptive pills, some antibiotics
These types of drugs require additional scrutiny when identifying their appropriate ingredients and/or dose forms. If you are using "mapping only" mode, these drugs may be easier to map manually than to let them go through the machinery.
Step 1.2: Parsing source data
To correctly implement a vocabulary in CDM and find or build a counterpart for each source drug concept the following attributes must be extracted:
- Ingredients : active substance(-s) in pharmacological preparation.
Example: Aspirin, Trastuzumab, Ibuprofen.
- Dose form : form in which drug is administered to patient.
Example: Oral Tablet, Transdermal System, Injectable Suspension.
- Brand (marketed) name
Example: Nurofen, Kadcyla, Prevenar 13.
- Supplier (manufacturer) name
Example: Reckitt Benckiser, TEVA, Ratiopharm etc). Generally, local branch and general name of international entity are considered same ("Teva Europe" is the same as "Teva").
-
Dosage/strength of active substance, total volume of liquid drug or number of actuations for dosed inhalers or applicators
-
Information about combination pack contents
Example: contraceptives calendar packs with varying dosages and ingredient combinations.
- Box size : number of unique package units (like bottles, blisters or syringes) in a packaged box.
If the source does not provide separate attributes, you would need to parse the drug strings to subsequently map attributes to RxNorm/RxNorm Extension attributes. Every drug must have at least one ingredient, other attributes are optional.
Step 1.3: Mapping attributes
Mapping of attributes to attributes in RxNorm/RxNorm Extension/UCUM (do not use SNOMED or other vocabularies) should be stored in relationship_to_concept table. If no appropriate mapping exists, you do not need to enter such attributes in the table. You must ensure that no mappings exist as such attributes will be created as new standard attributes in RxNorm Extension.
You will populate:
relationship_to_concept.concept_code_1 with the codes in your source vocabulary, relationship_to_concept.vocabulary_id_1 with the name/id of your vocabulary, relationship_to_concept.concept_id_2 with the concept_id of the target mappings in RxNorm/RxNorm Extension/UCUM.
relationship_to_concept.precedenceshould be filled for all rows and reflects your confidence level. In case mapping is 100% correct (one to one match), precedence = 1. In case when you are unsure about mapping (different salts, biological species are available), use incrementing precedence with a separate row per a mapping. This is more like 'probability of mapping' to this concept. Later BuildRxE will combine mappings in accordance with mapping precedence.
relationship_to_concept.conversion_factor is used to transform your units into standard units (for example, place 1000 for mappings of 'gram' to the standard UCUM unit of 'mg').
Ingredients are mapped to Standard Valid RxNorm or RxNorm Extension concepts with concept_class_id = 'Ingredient' one to one. If ingredient is given as a mix (e.g., co-dried gel of Magnesium Carbonate and Aluminium Hydroxide), it should be split in multiple entities with distinct new codes; each component of the mix must be mapped to standard ingredient.
Several mappings with incremental precedence should be used if:
- Source ingredient is an ion (like calcium, iron, zinc, etc.), which should be mapped to all its salts.
- Source ingredient is an herbal extract, which should be mapped to all suitable standard concepts;
- Target vocabularies contain logical duplicates among standard ingredients. This is rare. Example: RxNorm contains both 19026739 Pantothenic Acid and 19088079 pantothenate as separate standard ingredients (as of April 2023).
Dose Forms are commonly mapped to multiple Valid RxNorm or RxNorm Extension concepts with concept_class_id = 'Dose Form' with precedence. Modified release forms should be first mapped to corresponding forms in RxNorm vocabulary (like Delayed Release Oral Capsule), and then to more generic forms (Oral Capsule) with lower precedence.
Brand Names are mapped to Valid RxNorm or RxNorm Extension concepts with concept_class_id = 'Brand Name' using precedence if needed.
Units should be mapped to Standard Valid UCUM concept with concept_class_id = 'Unit'. Weight units should be converted to milligram, volume units should be mapped to milliliter, molar - to millimole with the right conversion factor. It is highly desirable to only use units that are in use by Standard native RxNorm concepts. Query drug_strength table for a distinct list.
Ingredients, brand names, suppliers, dose forms must be mapped to RxNorm or RxNorm Extension vocabularies only. Units must be mapped to UCUM vocabulary.
Step 1.4: Fill in input tables
- Fill drug_concept_stage
drug_concept_stage is an analogue of concept table and should have the information about source Drug Products (all REAL source drugs which have ingredient(s), including drug packs) and source drug attributes (ingredients, brand names, dose forms, units, suppliers).
All names must be in English.
The fields are populated in the following manner:
-
drug_concept_stage.domain_id: 'Drug' or 'Device'
-
drug_concept_stage.concept_class_id
-
Drug Product (Branded Drug, Clinical Drug, Marketed Product etc.)
-
Dose Form
-
Brand Name
-
Ingredient
-
Supplier
-
Device (for source concepts falling outside of Drug category)
It may contain Branded or Clinical Drug Forms or Components, but if not, they will be derived. Note that units should not necessarily have an entry in the drug_concept_stage table. Instead, they should be used as verbatim. If the precise Concept Class of a Drug Product is relevant, it can be preserved in source_concept_class_id field.
Brand Names that are simple combinations of generic international name of active substance and manufacturer name (e.g. "Aspirin Bayer") should not appear as attributes for Drug Products. Manufacturer information should be stored as a concept with Supplier class.
- drug_concept_stage.vocabulary_id: insert name of your vocabulary
- drug_concept_stage.concept_code: insert source codes or newly created codes
- drug_concept_stage.standard_concept: NULL (unless there are >=2 similar attributes and we need to map them to one)
- drug_concept_stage.valid_start_date: date of source vocab release or current date
- drug_concept_stage.valid_end_date: 2099-12-31
- Fill internal_relationship_stage ,
You need to populate internal_relationship_stage with the links from a drug product to:
- a respective 'Ingredient' (drugs without ingredients will not be processed).
- 'Brand Name' if available (no more than one)
- 'Supplier' if available (no more than one)
- 'Dose Form' if available (no more than one)
All other relationships will be derived and ignored if they exist in the table. The relationships don't need to be symmetrical, only the one initiating from the Drug Concept is required.
If a drug product does not have an Ingredient attribute, it will be non-standard (as all source concepts) and not have any standard mapping target after processing. Supplier attribute will not be considered for concepts withoutds_stage orpc_stageentry since Marketed Product concepts can't exist without dosage information.
concept_code_1 for drug products and concept_code_2 for attributes you use here must match those in drug_concept_stage.
- Fill ds_stage :
This table contains the dose of each ingredient in each drug, as well as the box size. For drugs which have no strength information or have only for some of the containing ingredients, the ds_stage record must be omitted. '0' values in ds_stage are only allowed for inert drugs.
Drug ingredients should match those in internal_relationship_stage. If ingredients are mapped to the same one in relationship_to_concept their dosages should be summed up as for a single ingredient before processing. A drug should not contain ingredients in solid (amount) and liquid (numerator/denominator) form. This might be caused be either source data aberration or drug pack, which must be split into separate drug products and processed in pc_stage table. If denominator value is given, quantified drug will be created with given denominator value and unit as total volume.
- Inhalers, enemas or sprays that release certain dosage of active ingredient per "puff" should also be stored in numerator/denominator form with total number of actuations as denominator (e.g. X MG / Y ACTUAT).
- All drugs with fixed amount must have dosage in amount fields and all solutions must have dosage filled in numerator and denominator fields. When liquid drugs in data contain concentration information without volume, denominator_value field is left empty.
- Gases for inhalation must be put in numerator fields with % in unit field without filling denominator fields. It is the only acceptable use of percents in ds_stage. Make sure to convert everything else to mg/ml or mg/mg.
- Patches, drug implants and other forms that release molecules over a period of time (even extended release tablets or capsules) may also be stored in numerator/denominator form (e.g. X MG / Y HOUR).
- If dose form for the source concept is given as a soluble powder without a solvent (except powder inhalers), dosage is stored in amount field.
- For drugs that are administered in a form of oral liquid (solution, suspension, syrup), denominator value of 5 ML should be kept only when we are certain that the dosage is not given "per tbsp."; if 5 ML is not an actual fixed administered dose (e.g., a sachet or vial), it should be treated as concentration (denominator_value = NULL).
- Similarly, if you do not have information about actual volume (vial or bottle), do not put 1 ml in denominator. Instead, set denominator_value = NULL and denominator_unit = 'mL'. All drugs that have denominator_value will end up being Quantified Clinical/Branded Drug.
- Box size equal to 1 should be simply stored as NULL.
- Drugs cannot have differing information for denominators among different ingredients, skip dosage for some ingredients or have same ingredient with different dosages
This table contains the composition of a Clinical or Branded Pack (records for individual drugs): The Clinical or Branded Drug and, number of doses in each box and number of boxes in each pack. If it is a boxed Pack, it will also contain the box size since packs have no records in ds_stage like the other drug products. Packs are allowed to have branded drugs as components, although usually Brand Name is only attributed to the pack as a whole. Supplier may only be attributed to the pack as a whole. For example, for oral contraceptive pack 1592624 {21 (desogestrel 0.15 MG / ethinyl estradiol 0.03 MG Oral Tablet) / 7 (inert ingredients 1 MG Oral Tablet) } Pack [Isibloom 28 Day] you would have three separate entries in ds_stage: two for 'desogestrel 0.15 MG / ethinyl estradiol 0.03 MG Oral Tablet' and one for 'inert ingredients 1 MG Oral Tablet'. Pack itself will not have entries in ds_stage.
- Fill pc_stage
pc_stage is filled for every pack and shows the content of such where pack_concept_codeis filled with a drug_concept_stage.concept_code for a pack anddrug_concept_code is filled with a set of drug_concept_stage.concept_coderepresenting its components, where the latter are individual drugs.
For example, for oral contraceptive pack 1592624 {21 (desogestrel 0.15 MG / ethinyl estradiol 0.03 MG Oral Tablet) / 7 (inert ingredients 1 MG Oral Tablet) } Pack [Isibloom 28 Day] you would have 2 entries for two drugs with the amounts 21 and 7 respectively.
- Fill relationship_to_concept
Fill relationship_to_concept according to the rules in Step 1.3
Step 1.5: Perform input tables checks
Run the following checks:
- input_QA_integratable_E.sql (errors, should be fixed;query must return nothing)
- input_QA_integratable_W.sql (warnings, should be inspected)
- drug_stage_tables_QA.sql(contain some of the useful checks, should be inspected)
They are checking the integrity of the input, such as:
- Dosages without units
- Some ingredients with doses and some without in one drug
- Doses above 1 mg/mg or 1000 mg/mL
- Solid forms (e.g. tablets and capsules) containing liquid doses
- Ingredients used as brand names (Clopidogrel/Aspirin or Ibuprofen and Codeine)
- Undermapping of attributes leading to duplications
Step 2: Building intermediate stage tables (the boiler)
Step 2.1: Run the boiler
Run the BuildRxE script here (time of execution depends on the volume of concepts and can be 1h+). It is formatted as a procedure, but you can opt for running it step-by-step. If you are using "mapping only" mode, comment last block that drops tables.
BuildRxE or "boiler" takes input tables and transforms them into intermediate tables (concept_stage, concept_relationship_stage, drug_strength_stage, etc.) with your source vocabulary, mappings, and new RxNorm Extension concepts (if created).
Briefly, the script manages the ambiguity of the input attribute mapping by using 10% corridor for matching dosages, uses precedence of mappings for attributes you provided in relationship_to_concept and selects the set of attributes that are closest to the existing drugs within the groups we specified in the input tables (dose form, ingredient, supplier etc.).
When matching, it prefers a drug that exists in many markets in the same combination of ingredient, dose form and strength, but may differ in the lower hierarchy such as branding, suppliers and packaging over a drug that has little manifestation in any of the various drug markets.
If the input tables contain errors or logical discrepancies the execution will be halted. At the moment, you need to reverse engineer the script/tables to identify the error; please consult the Vocabulary Team.
Step 2.2 (optional): Run MapDrugVocab for mapping your vocabulary to RxNorm/RxE
If you are using "mapping only" mode, run the boiler and skip dropping the tables at the end of the script. Then follow by running MapDrugVocab here, which will automatically produce mappings from your source concepts to the most precise counterpart in RxNorm or RxNorm Extension.
The results of the script are automatically stored in concept_relationship_stage with relationship_id = 'Maps to'. Once you inspect the mappings (visually or programmatically), you do not need to perform any other steps; this is the final table.
Step 2.3: Perform intermediate table checks.
Since the subsequent steps are well-formalized, we have a set of functions that perform QA checks. Please take this script to create them. [as of June 2023 you will need to comment out lines 382, 683, 684.]
Then, run the following checks:
SELECT * FROM qa_tests.check_stage_tables (); -- query must return nothing
Step 3: Building basic tables (optional)
This is the final step that creates the snapshot of the standard basic Vocabularies tables as you download them from Athena. Here, the concept_ids are assigned, and hierarchies are created.
You do not have to perform these steps but can choose to do so if you want to examine the final output. First, run generic update script here. For simplicity, you may run the queries one by one or use the procedure as is. This script will update all the basic tables but concept_ancestor. [as of June 2023 you will need to comment out lines 221, 230, 247 and 248.]
To create hierarchy, you can create a function as here and run the following query [as of June 2023 you will need to comment out lines 23, 138, 146.]:
DO $_$
BEGIN
PERFORM vocabulary_pack.pManualConceptAncestor(
pVocabularies => 'RxNorm, RxNorm Extension');
END $_$;
Second, you may want to run vocabulary-agnostic scripts to get statistics and summary of changes. For that, you will need a stable copy of your original basic tables (pre-generic update) in a different schema. You need to execute the following queries (take some time and return rows related to the whole vocabulary bundle):
select * from qa_tests.get_checks();
select * from qa_tests.get_summary (table_name=>'concept',pCompareWith=>'YourStableVocabularySchemaName');
select * from qa_tests.get_summary (table_name=>'concept_relationship',pCompareWith=>'YourStableVocabularySchemaName');
select * from qa_tests.get_domain_changes(pCompareWith=>'YourStableVocabularySchemaName');
select * from qa_tests.get_newly_concepts(pCompareWith=>'YourStableVocabularySchemaName');
select * from qa_tests.get_standard_concept_changes(pCompareWith=>'YourStableVocabularyShemaName');
select * from qa_tests.get_newly_concepts_standard_concept_status(pCompareWith=>'YourStableVocabularySchemaName');
Final steps
If you use the scripts to propose incorporation of your drug vocabulary into the OHDSI Vocabularies, please submit to the Vocabulary Team your:
- link to the vocabulary or table containing the vocabulary
- your code (can be submitted as pull request)
- copy of input tables
- results of QA/QC and any additional notes on your review process
Vocabulary refresh
If you want to refresh your vocabulary, you will need to repeat steps Step 1-3 with new version of the vocabulary.
If you do not have your schema set up, you should set it up following instructions in Step 0 (you do not need to add a row into the vocabulary table as the vocabulary is already present, please use corresponding vocabulary_id for your input tables.
If you built input tables for this vocabulary before, you should re-use it, fix any bugs that have been discovered and add new content (drugs themselves, their attributes, relationships from drug to attributes and attribute mappings. Do not add extra rows if content is already present. If you didn't build the input tables, contact the Vocabulary Team.