Vocabulary Development Process - OHDSI/Vocabulary-v5.0 GitHub Wiki
Vocabularies enter the overall Standardized Vocabulary system through a process called "Refresh". Even the initial introduction of a vocabulary is treated as a Refresh, only against an empty prior version. The Refresh adds new records to, or modifies existing records in, the CONCEPT, CONCEPT_SYNONYM and CONCEPT_RELATIONSHIP tables. Other vocabulary tables, such as VOCABULARY, RELATIONSHIP and CONCEPT_CLASS, are not handled through this process; they are maintained manually. Every Refresh is accompanied by quality assurance and quality control procedures.
Refreshing the Vocabularies and producing a new version consists of three steps:
I. Staging
Any number of new or existing concepts, synonyms or relationship records, or entire vocabularies, are sourced from the authoring institutions or developed manually and placed into a set of vocabulary staging tables. Each vocabulary has its own development schema for this step (e.g., "dev_icd10cm"). These tables are named like their production counterparts but with the ending "_STAGE". For example, the CONCEPT_STAGE table contains all new or modified concepts for the CONCEPT table. In addition, each schema also holds a fresh copy of the current vocabulary tables from DevV5 (the main development environment that has the current snapshot of vocabularies), called the "base tables", so that mapping and other relationships can be staged to existing content.
The content is created manually or downloaded from the source of the vocabulary and loaded into a vocabulary-specific database model. The script load_stage performs an ETL from the original format to the OMOP CDM. Each vocabulary has its own load_stage, and a given vocabulary can have one such process or several. For example, CPT4 and HCPCS are released annually by the AMA and CMS, respectively. These annual releases are downloaded and converted using one load_stage. Between these releases, however, new codes are sometimes published as urgent changes (e.g., during the COVID-19 pandemic), which are handled by a separate load_stage and process.
After the load_stage run, the quality control (QC) script check_stage_table is executed. Any problems are fixed by revising and rerunning load_stage. Only when QC passes with no errors is the vocabulary ready for the next step. There may be additional vocabulary-specific QA/QC checks located in the corresponding vocabulary's GitHub folder.
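The load-then-check loop acts as a gate: a vocabulary moves on only when the QC pass reports nothing. The Python sketch below illustrates that idea on in-memory rows. The two checks shown (an empty concept_name and a duplicated code within one vocabulary) and the function names are assumptions made for illustration; they are not the actual contents of check_stage_table, and the "DEMO" vocabulary is hypothetical.

```python
from collections import Counter

def check_concept_stage(rows):
    """Return a list of readable QC errors for staged concept rows.

    The checks are hypothetical examples, not the real check_stage_table logic.
    """
    errors = []
    # Check 1: every staged concept needs a non-empty name.
    for row in rows:
        if not (row.get("concept_name") or "").strip():
            errors.append(f"empty concept_name for code {row['concept_code']}")
    # Check 2: a code may appear only once per vocabulary.
    counts = Counter((r["vocabulary_id"], r["concept_code"]) for r in rows)
    for (vocab, code), n in counts.items():
        if n > 1:
            errors.append(f"duplicate code {code} in {vocab} ({n} rows)")
    return errors

def ready_for_integration(rows):
    """The vocabulary moves on only when QC reports no errors."""
    return not check_concept_stage(rows)

staged = [
    {"vocabulary_id": "DEMO", "concept_code": "X1", "concept_name": "First concept"},
    {"vocabulary_id": "DEMO", "concept_code": "X1", "concept_name": "First concept"},
    {"vocabulary_id": "DEMO", "concept_code": "X2", "concept_name": " "},
]
print(check_concept_stage(staged))
```

Returning a list of errors rather than a boolean keeps the report usable for the fix-and-rerun cycle described above.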
II. Integration
Staging is followed by integration, which has the following steps:
- Each vocabulary is again staged and quality checked, but this time in DevV5. For that, the raw vocabulary is once more loaded from the source or copied from the stage schema, and load_stage is run to re-create the staging tables.
- The script generic_update performs the same function as in the Staging step, except now in DevV5. Generic_update transfers the information from the stage tables to the base tables and sets the valid_start and valid_end dates, using either today's date or a vocabulary-specific date. That date is defined through the LATEST_UPDATE parameter, which is added to the VOCABULARY table by the SetLatestUpdate function called in the load_stage script. This allows life cycle data from the vocabulary source to be integrated.
- Generic_update includes a number of steps that enforce rules, potentially overwriting incorrect information from the staging tables:
  - Cleans up concept_id in the stage table: copies the concept_id from the base table to the stage concept table for already existing concepts and generates a new concept_id for new records.
  - Cleans concept_name and concept_synonym_name of double spaces, carriage returns, newlines, vertical tabs, form feeds, long dashes and trailing escape characters.
  - Updates (overwrites) the details of existing concepts from the concept_stage table for the concept_name, domain_id, concept_class_id, standard_concept, valid_end_date and invalid_reason fields, and for valid_start_date unless there is a latest_update variable placeholder.
  - Forces concepts with a valid_end_date of 2099-12-31 to valid status.
  - Forces invalid concepts to non-standard status.
  - For full replacement vocabularies, deprecates (makes non-standard and sets valid_end_date for) concepts that are missing from the CONCEPT_STAGE table and relationships that are missing from the CONCEPT_RELATIONSHIP_STAGE table.
  - Sets a valid_end_date for concepts that are missing from the concept_stage table (applied only to deprecated but still standard, so-called "zombie", concepts).
  - Creates a "Maps to" relationship to itself for each standard concept and creates reverse relationships unless they already exist.
  - Updates valid_end_date: sets it to today's date or the date from latest_update, or to 2099-12-31 for active concepts.
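Several of the rules in this list can be pictured as small transformations over staged rows. The Python sketch below is only an illustration under stated assumptions: generic_update itself runs against database tables, the function names here are hypothetical, the exact character replacements in clean_name are guesses (an em dash becomes a hyphen, a trailing backslash is dropped), and new ids are simply numbered after the current maximum. The "DEMO" vocabulary and its rows are made up.

```python
import re
from datetime import date

OPEN_END = date(2099, 12, 31)

def clean_name(name):
    """Normalise whitespace and dashes; the exact replacements are assumptions."""
    name = re.sub(r"[\r\n\v\f]", " ", name)   # control whitespace -> space
    name = name.replace("\u2014", "-")        # long (em) dash -> hyphen
    name = re.sub(r" {2,}", " ", name)        # collapse runs of spaces
    return name.strip().rstrip("\\").strip()  # drop a trailing backslash

def assign_concept_ids(stage, base):
    """Reuse ids of existing concepts; number new ones after the current maximum."""
    known = {(c["vocabulary_id"], c["concept_code"]): c["concept_id"] for c in base}
    next_id = max(known.values(), default=0) + 1
    for row in stage:
        key = (row["vocabulary_id"], row["concept_code"])
        if key in known:
            row["concept_id"] = known[key]
        else:
            row["concept_id"] = next_id
            next_id += 1

def enforce_status(row):
    """Open-ended concepts are valid; invalid concepts cannot be standard."""
    if row["valid_end_date"] == OPEN_END:
        row["invalid_reason"] = None
    if row["invalid_reason"] is not None:
        row["standard_concept"] = None

def self_maps(stage):
    """'Maps to' itself plus the reverse 'Mapped from' for standard concepts."""
    rels = []
    for row in stage:
        if row["standard_concept"] == "S":
            cid = row["concept_id"]
            rels.append((cid, "Maps to", cid))
            rels.append((cid, "Mapped from", cid))
    return rels

base = [{"vocabulary_id": "DEMO", "concept_code": "X1", "concept_id": 101}]
stage = [
    {"vocabulary_id": "DEMO", "concept_code": "X1", "concept_name": "Old  name\r\n",
     "standard_concept": "S", "valid_end_date": date(2020, 1, 1), "invalid_reason": "D"},
    {"vocabulary_id": "DEMO", "concept_code": "X2", "concept_name": "New \u2014 concept\\",
     "standard_concept": "S", "valid_end_date": OPEN_END, "invalid_reason": "D"},
]

for row in stage:
    row["concept_name"] = clean_name(row["concept_name"])
    enforce_status(row)
assign_concept_ids(stage, base)
relationships = self_maps(stage)
```

In this example the first row keeps its existing id but loses its standard status because it is invalid, while the second row has its wrongly staged invalid_reason cleared because its end date is 2099-12-31.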
The final integration step is running the get_checks QA/QC script, which checks the integrity of the output, followed by a manual examination of the content (such as concepts changing domain, mappings of newly added concepts and changes in standard status).
III. Release
After all vocabularies have been successfully integrated into the DevV5 environment, the release process is triggered. The CONCEPT_ANCESTOR table is calculated from the renewed CONCEPT and CONCEPT_RELATIONSHIP tables, and metadata are written. Differences from the previous version are calculated, and release notes are produced and published.
When all tables are ready, they are copied to the ProdV5 environment, an extract is provided to the Athena server, and the release notes are published to GitHub.
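Conceptually, deriving CONCEPT_ANCESTOR is a transitive closure over the hierarchical relationships. The Python sketch below shows that idea on a toy hierarchy, recording the shortest and longest path between each ancestor and descendant (the min_levels_of_separation and max_levels_of_separation columns of CONCEPT_ANCESTOR). It assumes that only parent-to-child pairs are passed in and that the hierarchy is acyclic; the function name is hypothetical, and the production calculation is not reproduced here.

```python
from collections import defaultdict

def build_ancestor(edges):
    """edges: iterable of (parent_id, child_id) pairs forming an acyclic hierarchy.

    Returns {(ancestor, descendant): (min_levels, max_levels)}, including
    each concept as its own ancestor at distance 0.
    """
    children = defaultdict(set)
    nodes = set()
    for parent, child in edges:
        children[parent].add(child)
        nodes.update((parent, child))

    result = {}
    for ancestor in nodes:
        frontier, depth = {ancestor}, 0
        while frontier:
            for node in frontier:
                # Keep the first depth seen as the minimum, the latest as the maximum.
                lo, _ = result.get((ancestor, node), (depth, depth))
                result[(ancestor, node)] = (lo, depth)
            # Every node reachable by a path of exactly depth + 1 steps.
            frontier = {c for n in frontier for c in children.get(n, ())}
            depth += 1
    return result

# Concept 3 is reachable from 1 directly and also through 2.
ancestor_rows = build_ancestor([(1, 2), (2, 3), (1, 3)])
```

Here the pair (1, 3) gets a minimum of 1 level and a maximum of 2, because both a direct and an indirect path exist. The loop terminates only because the input is assumed to contain no cycles.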
The current release notes contain summary information that is collected automatically, such as domain, concept and relationship changes.
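As a rough picture of how such a summary could be collected, the Python sketch below compares two snapshots of concept rows and counts new concepts, domain changes and changes in standard status. The snapshot layout, the function name and the choice of counters are assumptions for illustration; relationship changes are left out, and the actual release tooling is not shown.

```python
def summarize_changes(previous, current):
    """Compare two {concept_id: row} snapshots and count the differences."""
    added = current.keys() - previous.keys()
    shared = current.keys() & previous.keys()
    return {
        "new_concepts": len(added),
        "domain_changed": sum(
            previous[i]["domain_id"] != current[i]["domain_id"] for i in shared
        ),
        "standard_status_changed": sum(
            previous[i]["standard_concept"] != current[i]["standard_concept"]
            for i in shared
        ),
    }

previous = {
    1: {"domain_id": "Condition", "standard_concept": "S"},
    2: {"domain_id": "Drug", "standard_concept": "S"},
}
current = {
    1: {"domain_id": "Observation", "standard_concept": "S"},
    2: {"domain_id": "Drug", "standard_concept": None},
    3: {"domain_id": "Drug", "standard_concept": "S"},
}
summary = summarize_changes(previous, current)
print(summary)
```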