Community contribution: non‐drug vocabularies p.II - OHDSI/Vocabulary-v5.0 GitHub Wiki
This document is intended to be used for adding new complex vocabularies to the OHDSI Standardized Vocabularies or refreshing existing ones.
Authors: Oleg Zhuk, Anna Ostropolets.
Version: 0.1
Date of last modification: 03/12/2024
Adding a vocabulary
Community contribution part II is currently work in progress. If you would like to add a new complex vocabulary or refresh an existing one, please contact the Vocabulary Team or come to the Vocabulary WG meeting.
Refreshing a vocabulary
If you would like to refresh an existing vocabulary in the OHDSI Standardized Vocabularies (one that is not on the roadmap), you should first familiarize yourself with existing code in a corresponding folder on GitHub.
As of March 2024, you will be required to run the load stage on the new version of the vocabulary locally and perform QA/QC prior to forwarding it to the Vocabulary Team and initiating the official refresh.
Step 0: prerequisites
You will need a PostgreSQL database populated with up-to-date snapshot of the vocabularies dowloaded from Athena. An instruction on how to prepare a local PostgreSQL database can be found here Preparing a Local Environment for the Vocabulary Development Process. Vocabulary bundle depends on the vocabulary you want to refresh and generally includes a) the vocabulary itself and b) the vocabularies your vocabulary has relationships to (such as those your vocabulary maps to).
PostgreSQL database is required as processing scripts and QA/QC scripts are written for this dialect. It is a good idea to also install the following extensions:
- plpython3u
- plpgsql
Schemas preparation
Once you have the database set up, you will need to create three schemas:
- Schema for your source vocabulary (new version) downloaded from an external source (in Vocabulary server and scripts we use name "sources"). Put your source vocabulary here.
- Development schema (or so called work dir) where you will do the work (in Vocabulary server and scripts we use name such as "dev_name_of_your_vocabulary"). Run DDL to create empty tables. Put your copy of vocabularies downloaded from Athena here.
- Reference schema that has the copy of the vocabularies downloaded from Athena (in Vocabulary server and scripts we use "devv5" name). Run DDL to create empty tables. Put your copy of vocabularies downloaded from Athena here (two schemas will contain the copy of vocabularies at this point).
Step 1: load_stage
Execute load_stage from the corresponding GitHub folder in the development schema. If load_stage references "devv5" or "sources.nameofyourvocabulary" replace them with the names of your schemas.
Load_stage commonly contains the scripts to stage the vocabulary into the common format and references two types of PostreSQL procedures that you will need to create locally.
- VOCABULARY_PACK.SetLatestUpdate
We use the function here to set the date of the update and indicate that this vocabulary should be processed. Without creating this function and executing it you will not be able to proceed.
Please create this function and run it (the script below is also usually in your load_stage on GitHub):
DO $_$
BEGIN
PERFORM VOCABULARY_PACK.SetLatestUpdate(
pVocabularyName => 'vocabulary_id of your vocabulary as in Vocabulary table',
pVocabularyDate => 'date of new vocabulary version',
pVocabularyVersion => 'name of the vocabulary version, if none use date of the version',
pVocabularyDevSchema => 'name of your development schema'
);
END $_$;
- Supporting functions to stage tables
These usually include:
VOCABULARY_PACK.ProcessManualRelationships()
VOCABULARY_PACK.AddFreshMAPSTO()
VOCABULARY_PACK.DeprecateWrongMAPSTO()
VOCABULARY_PACK.DeleteAmbiguousMAPSTO()
You can find these functions in this folder, create them and execute at the end of the load_stage to ensure conformance with the vocabulary rules (such as no mapping to non-standard concept).
- Adding new concepts and mappings
Aside from these functions, your load_stage should incorporate all new content and possible changes, including manual work. For example, if you need to add mappings, add them to the concept_relationship_stage.
Example of mapping handling:
CONCEPT_STAGE
concept_id | concept_name | domain_id | vocabulary_id | concept_class_id | standard_concept | concept_code | valid_start_date | valid_end_date | invalid_reason |
---|---|---|---|---|---|---|---|---|---|
(keep null) | Diabetes type II (example of name of the concept from your vocabulary) | Condition (domain is determined based on the domain of the target mapping for non-standard concepts) | Your vocabulary (vocabulary_id from VOCABULARY table) | Condtion | (keep null since non-standard) | 12345 (code from your vocabulary) | 03/12/24 (current date) | 12/31/2099 (always this date for all valid concepts) | (keep null) |
CONCEPT_RELATIONSHIP_STAGE
concept_id_1 | concept_id_2 | concept_code_1 | concept_code_2 | vocabulary_id_1 | vocabulary_id_2 | relationship_id | valid_start_date | valid_end_date | invalid_reason |
---|---|---|---|---|---|---|---|---|---|
(keep null) | (keep null) | 12345 (code from your vocabulary) | 44054006 (concept_code of your target mapping, here corresponds to SNOMED 'Type 2 diabetes mellitus') | Your vocabulary (vocabulary_id from VOCABULARY table) | SNOMED (vocabulary_id of your target mapping) | Maps to | 03/12/24 (current date) | 12/31/2099 (always this date for all valid concepts) | (keep null) |
Similarly, if you need to add new non-standard concepts from the new version of the vocabulary, you will need to add them to concept_stage as shown above and add mappings to concept_relationship_stage.
Step 2: QA/QC p.I
As the result of the previous step, you will populate stage tables (concept_relationship_stage, concept_stage, etc.) that mimic basic tables (concept_relationship, concept, etc.). Once you run the load_stage from the corresponding folder, you will need to create and execute QA/QC procedures for stage tables.
Within the file, you will navigate to qa_tests.check_stage_tables () and create the function and execute it. The function performs conformance checks and should return no errors. The checks may be familiar to you if you used templates for community contribution and include things like:
- if the concept is valid, check if it valid_end_date = 12/31/2099
- field length does not exceed limits in the standard DDL
- there are no duplicates
- vocabulary_id existst in VOCABULARY table
and more.
Step 3: Generic Update
GenericUpdate function integrates the content of the _stage tables into basic tables (CONCEPT, CONCEPT_RELATIONSHIP, etc.) and assigns concept_ids. Running it enables examination of the content in the format you are familiar with (closer to what gets distributed to the Athena users).
Once you create this function and execute it, you will be able to inspect CONCEPT and CONCEPT_RELATIONSHIP. If at this point you need to modify something in your scripts or stage tables, you will need to clean stage tables and revert basic tables to their native state.
We use this function to erase all changes and get a copy of native basic tables from the reference schema. You can use it as well or erase tables manually.
Step 4: QA/QC p.II
This part of QA is done semi-automatically:
- execute get_checks function to automatically check the compliance of the resulting basic tables to the OMOP rules. The execution time may vary, it is a heavy query
- check high-level statistics (other functions from this folder)
- run manual_checks_after_generic to review everything manually.