Data Extraction - dbmi-pitt/np-terminology-imports GitHub Wiki

Overview of the use cases

The standardized natural products vocabulary will support several use cases:

Obtain all Latin binomials, common names, and synonyms of a natural product term.
Obtain variations of natural product names used in spontaneous reporting systems (such as FAERS).
Provide standardized terms that can be used to generate pharmacovigilance signals at different levels of granularity, including the botanical natural product name, specific genus or species name, constituent(s), and combination products.
Obtain all natural products that are part of combination products containing one or more natural products.

The end results are concept records representing natural products, added to the OMOP/OHDSI standard vocabulary, and the application of those concepts within pharmacovigilance workflows.

Overview of string mapping and vocabulary construction

Obtain the list of substances from the FDA UNII files. Substances in the UNII list correspond to the UUIDs of structurally diverse substances in the Global Substance Registration System (GSRS).
Natural products (NP) and NP constituents are extracted from GSRS
NP common names manually curated from GSRS and Health Canada
NP spelling variations mapped from FAERS original drug strings to natural product names (manually and automatically).
Create OMOP/OHDSI standard vocabulary with new 'napdi' concepts
NP vocabulary concepts are mapped to RxNorm concepts used in FAERS
(Update 2023) - New FAERS 'HERBALS' strings added to vocabulary after manual mapping
(Update 2023) - Combination products included in vocabulary and mapped to preferred NP terms
Concepts and relationships in latest vocabulary version

Notes about Natural products in the FDA Substance Registration System (SRS)

The Global Substance Registration System (G-SRS), developed by the Ginas Project, is software that helps agencies register and document information about substances found in medicines. It contains information about natural products (NPs) and their constituents. The database is available for download and local installation.

In this project, we download and install G-SRS as a PostgreSQL database. G-SRS contains 6 types of substances referenced in the ISO 11238 standard: chemicals, mixtures, polymers, proteins, nucleic acids, and structurally diverse substances. Latin binomial names of NPs (such as Mitragyna speciosa (Kratom) and Cinnamomum verum (Cinnamon)) are structurally diverse substances. We extract the substances using their Latin binomial names, parent substances, and parts of the substance (i.e. substances with Latin binomial names as parents).

Common names acquisition through Health Canada

Health Canada databases such as the Licensed Natural Health Products Database (LNHPD) contain natural product names and synonyms for the Latin binomial names of natural products.

Code: https://github.com/dbmi-pitt/NaPDI-pv/tree/master/np-terminology-imports/common-names
Input: a comprehensive list of Latin binomials to search
Procedure: webprod_to_local_HTML.py obtains HTML output from the site by searching with the Latin binomials. local_HTML_to_common_name.py parses the HTML to output a JSON file with clean Latin binomial to common name mappings. The JSON file is converted to a TSV and loaded into the GSRS database in the same schema as the tables that pull NP data (see above). Currently, the table is named lb_to_common_names_tsv. The file can then be manually edited to add common names, or the names can be added when the JSON is converted to TSV using convertToTsv.py.
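
The JSON-to-TSV step can be sketched roughly as follows. This is not the repository's convertToTsv.py: the JSON shape (each Latin binomial mapped to a list of common names), the column names, the function name json_to_tsv, and the extra_names parameter are all assumptions made for illustration.

```python
import csv
import io
import json


def json_to_tsv(json_text, extra_names=None):
    """Flatten {latin_binomial: [common_name, ...]} into TSV text.

    The input shape and the column names are assumptions, not the actual
    format produced by local_HTML_to_common_name.py.
    """
    mapping = json.loads(json_text)
    # Merge in manually curated names (hypothetical parameter).
    for binomial, names in (extra_names or {}).items():
        mapping.setdefault(binomial, []).extend(names)

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["latin_binomial", "common_name"])
    for binomial in sorted(mapping):
        seen = set()
        for name in mapping[binomial]:
            key = name.strip().lower()
            if key and key not in seen:  # skip blanks and case-insensitive duplicates
                seen.add(key)
                writer.writerow([binomial, name.strip()])
    return out.getvalue()


sample = '{"Mitragyna speciosa": ["Kratom", "kratom "], "Cinnamomum verum": ["Cinnamon"]}'
tsv_text = json_to_tsv(sample, extra_names={"Cinnamomum verum": ["True cinnamon"]})
print(tsv_text)
```

Each output row pairs one Latin binomial with one common name, which is the shape a two-column table such as lb_to_common_names_tsv would plausibly expect.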

RxNorm mappings

RxNorm is generally used for normalized names for clinical drugs and mapping drug names in spontaneous reports to standardized codes. RxNorm also contains some natural product ingredients and drug forms that are used in this vocabulary to map natural product terms and identify spontaneous reports with standardized codes.

Create table np_to_rxnorm with exact and substring matches for napdi vocabulary concepts (including constituents) to RxNorm terms - https://github.com/dbmi-pitt/np-terminology-imports/blob/main/scratch/np-vocabulary-mappings-rxnorm.sql.
Filter by concepts that are used in FAERS reports, then manually annotate combination products (more below).
Create table np_to_rxnorm_annotated and include in vocabulary workflow.
Include RxNorm mappings with relationships 'napdi_np_maps_to' and 'napdi_const_maps_to'.
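
The idea behind the exact and substring matching can be sketched in Python. The project does this step in the SQL script linked above, so the function below is only an illustration under stated assumptions: the function name is hypothetical, matching is assumed to be case-insensitive, and the sample strings are made up rather than taken from RxNorm.

```python
def match_np_to_rxnorm(np_terms, rxnorm_terms):
    """Return (np_term, rxnorm_term, match_type) tuples.

    'exact' means the strings are equal ignoring case and surrounding
    whitespace; 'substring' means the NP term occurs inside a longer
    RxNorm string.
    """
    matches = []
    for np_term in np_terms:
        needle = np_term.strip().lower()
        if not needle:
            continue
        for rx_term in rxnorm_terms:
            haystack = rx_term.strip().lower()
            if haystack == needle:
                matches.append((np_term, rx_term, "exact"))
            elif needle in haystack:
                matches.append((np_term, rx_term, "substring"))
    return matches


# Illustrative strings only, not actual RxNorm records.
np_terms = ["Cinnamon", "Kratom"]
rxnorm_terms = ["cinnamon", "cinnamon bark extract", "garlic"]
np_to_rxnorm = match_np_to_rxnorm(np_terms, rxnorm_terms)
for row in np_to_rxnorm:
    print(row)
```

Substring matches tend to pull in unrelated or multi-ingredient strings, which is one reason the candidate matches are filtered and manually annotated before being loaded into np_to_rxnorm_annotated.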

Adding combination products

Many NPs in the reference set and RxNorm mappings are actually combinations of one or more NPs or NP ingredients and are included in the latest vocabulary version. Combination products refer to any product containing one or more natural products (e.g. cinnamon garlic).
All combination products are manually marked based on an annotation guide in the NP spelling variations, HERBALS strings, and RxNorm concepts.
These are included in the vocabulary with concept class ID 'NaPDI NP Combination Product' and relationships 'napdi_pt_to_combo' and 'napdi_combo_to_pt'.

Add to OMOP concept table for standardization

Update GSRS NPs, common names, and constituents in lb_to_common_names_tsv, test_srs_np, and test_srs_np_constituent (currently in the scratch_sanya_2023 schema of the GSRS database). See GSRS database query notes.
Update the manually curated NP spelling variations in np_faers_reference_set (currently in the scratch_sanya schema of the CEM database).
Log into the database in an admin role and drop all prior NAPDI vocabulary concepts, relationships, and concept relationship mappings (see above).
Run the SQL script NP terminology ETL.
Test that the vocabulary is working and makes sense using queries like the following:
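
As a minimal sketch of such checks, the example below builds a tiny in-memory SQLite stand-in for the OMOP concept and concept_relationship tables and runs two sanity queries. Everything here is an assumption for illustration: only a subset of the OMOP columns is modeled, the rows (including the spelling variation string and the concept IDs) are made up, and the direction of napdi_pt (natural product to preferred term) follows the relationship table on this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id INTEGER PRIMARY KEY,
    concept_name TEXT,
    concept_class_id TEXT
);
CREATE TABLE concept_relationship (
    concept_id_1 INTEGER,
    concept_id_2 INTEGER,
    relationship_id TEXT
);
""")
# Made-up rows for illustration only; real concept IDs and names will differ.
conn.executemany("INSERT INTO concept VALUES (?, ?, ?)", [
    (-1, "Mitragyna speciosa", "NaPDI Natural Product"),
    (-2, "Kratom", "NaPDI Preferred Term"),
    (-3, "KRATOM LEAF", "NaPDI NP Spelling Variation"),
])
conn.executemany("INSERT INTO concept_relationship VALUES (?, ?, ?)", [
    (-1, -2, "napdi_pt"),
    (-2, -1, "napdi_is_pt_of"),
    (-3, -1, "napdi_spell_vr"),
])

# Check 1: natural product concepts that lack a preferred term.
missing_pt = conn.execute("""
    SELECT c.concept_name
    FROM concept c
    LEFT JOIN concept_relationship cr
      ON cr.concept_id_1 = c.concept_id AND cr.relationship_id = 'napdi_pt'
    WHERE c.concept_class_id = 'NaPDI Natural Product'
      AND cr.concept_id_2 IS NULL
""").fetchall()

# Check 2: look up the preferred term for one Latin binomial.
pt_rows = conn.execute("""
    SELECT pt.concept_name
    FROM concept np
    JOIN concept_relationship cr
      ON cr.concept_id_1 = np.concept_id AND cr.relationship_id = 'napdi_pt'
    JOIN concept pt ON pt.concept_id = cr.concept_id_2
    WHERE np.concept_name = 'Mitragyna speciosa'
""").fetchall()
print(missing_pt, pt_rows)
```

Against the real database, the same two SELECT statements could be run directly, with the table names qualified by whichever schema holds the vocabulary.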

NOTE: If an NP has multiple species, the Latin binomials (LBs) for the species can be mapped to different preferred terms (PTs). For example, Glycyrrhiza uralensis, Glycyrrhiza glabra, and Glycyrrhiza inflata all map to different PTs. They are correct mappings in that they map to distinct common names found in GSRS. However, it means that our workflow will need to be as follows when we want to extract cases for a given NP with multiple species: list all of the LBs, query the vocab for the PT of each, and use the PT concept IDs as a concept set for the NP moving forward. That is not too bad and is similar to how we work with drugs.
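
The multi-species workflow in this note can be sketched as below. The function name is hypothetical and the concept IDs are placeholders; in practice the lookup would come from querying the vocabulary for each Latin binomial's preferred term.

```python
def build_concept_set(latin_binomials, lb_to_pt_id):
    """Collect preferred-term concept IDs for a list of Latin binomials.

    Returns the set of concept IDs plus any binomials with no mapping,
    so that gaps are visible instead of silently dropped.
    """
    concept_set = set()
    unmapped = []
    for lb in latin_binomials:
        pt_id = lb_to_pt_id.get(lb)
        if pt_id is None:
            unmapped.append(lb)
        else:
            concept_set.add(pt_id)
    return concept_set, unmapped


# Placeholder concept IDs, not real vocabulary values.
lb_to_pt_id = {
    "Glycyrrhiza uralensis": -101,
    "Glycyrrhiza glabra": -102,
    "Glycyrrhiza inflata": -103,
}
species = ["Glycyrrhiza uralensis", "Glycyrrhiza glabra", "Glycyrrhiza inflata"]
glycyrrhiza_set, unmapped = build_concept_set(species, lb_to_pt_id)
print(sorted(glycyrrhiza_set), unmapped)
```

The resulting set is then used as a single unit when extracting cases, in the same way a concept set is used for a drug.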

NOTE: NP constituent spelling variations do not currently exist in the vocabulary addition. So, for example, Cannabidiol is in the vocab but CBD is not. This means that any work with constituents will need to consider spelling variations and add those to the study workflow.

NOTE: Both constituents and spelling variations can match to multiple NP preferred names as per the reference set, so a user/programmer needs to make sure not to have duplicate counts when running queries that rely on either.
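
One way to guard against duplicate counts is to count distinct report identifiers rather than matched rows. The sketch below assumes query results arrive as (report_id, matched_term, preferred_term) tuples; that row shape, the function name, and the placeholder names and IDs are all hypothetical.

```python
from collections import defaultdict


def count_distinct_reports(rows):
    """Count distinct reports per preferred term.

    rows: iterable of (report_id, matched_term, preferred_term).
    A report is counted once per preferred term, however many
    constituents or spelling variations matched it.
    """
    reports_by_pt = defaultdict(set)
    for report_id, _matched_term, preferred_term in rows:
        reports_by_pt[preferred_term].add(report_id)
    return {pt: len(ids) for pt, ids in reports_by_pt.items()}


# Placeholder data: "constituent X" maps to two preferred terms.
rows = [
    (1001, "constituent X", "NP A"),
    (1001, "constituent X", "NP B"),
    (1002, "spelling variant Y", "NP A"),
    (1002, "constituent X", "NP A"),
    (1002, "constituent X", "NP B"),
]
per_pt = count_distinct_reports(rows)
naive_row_count = len(rows)
distinct_reports = len({report_id for report_id, _, _ in rows})
print(per_pt, naive_row_count, distinct_reports)
```

Here five matched rows come from only two reports, so summing rows, or summing the per-term counts across preferred terms, would overstate the number of cases.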

Vocabulary (version April 2023) relationships and concept classes

Unique concept classes in vocabulary

Concept Class	Description
NaPDI Natural Product	Custom natural product terms in vocabulary (concept IDs < 0)
NaPDI Preferred Term	Preferred term for each natural product
NaPDI NP Spelling Variation	Curated spelling variations for each natural product term from FAERS
NaPDI NP Constituent	Constituents of natural products extracted from GSRS
NaPDI NP Combination Product	Combination products contain one or more natural products in the same term

Unique relationships in vocabulary

Relationship	Domain	Range
napdi_pt	Natural Product	NP Preferred Term
napdi_is_pt_of	NP Preferred Term	Natural Product
napdi_has_const	Natural Product	NP Constituent
napdi_is_const_of	NP Constituent	Natural Product
napdi_spell_vr	NP Spelling Variation	Natural Product
napdi_is_spell_vr_of	Natural Product	NP Spelling Variation
napdi_np_maps_to	NP Preferred Term	RxNorm code
napdi_const_maps_to	NP Constituent	RxNorm code
napdi_pt_to_combo	NP Preferred Term	NP Combination Product
napdi_combo_to_pt	NP Combination Product	NP Preferred Term