Data Extraction - dbmi-pitt/np-terminology-imports GitHub Wiki

Overview of the use cases

The standardized natural products vocabulary will support several use cases:

  1. Obtain all Latin binomials, common names, and synonyms of a natural product term.
  2. Obtain variations of natural product names used in spontaneous reporting systems (such as FAERS).
  3. Provide standardized terms that can be used to generate pharmacovigilance signals at different levels of granularity, including the botanical natural product name, specific genus or species name, constituent(s), and combination products.
  4. Obtain all natural products that are part of combination products containing one or more natural products.
  • The end result is a set of concept records, added to the OMOP/OHDSI standard vocabulary, that represent natural products and support their use within pharmacovigilance workflows.

Overview of string mapping and vocabulary construction

  1. Obtain the list of substances from the FDA UNII files. Substances in the UNII list correspond to the UUIDs of structurally diverse substances in the Global Substance Registration System (GSRS).
  2. Extract natural products (NPs) and NP constituents from GSRS.
  3. Manually curate NP common names from GSRS and Health Canada.
  4. Map NP spelling variations from FAERS original drug strings to natural product names (manually and automatically).
  5. Create the OMOP/OHDSI standard vocabulary with new 'napdi' concepts.
  6. Map NP vocabulary concepts to RxNorm concepts used in FAERS.
  7. (Update 2023) - New FAERS 'HERBALS' strings added to vocabulary after manual mapping
  8. (Update 2023) - Combination products included in vocabulary and mapped to preferred NP terms
  9. Concepts and relationships in latest vocabulary version
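
Step 4 mentions automatic mapping of FAERS drug strings without naming a method. As one illustration only, and not necessarily the approach used in this project, approximate string matching with Python's standard difflib module can propose candidate NP names for manual review. The name list, the normalize and suggest_np helpers, and the 0.8 cutoff are all assumptions made for this sketch.

```python
import difflib

# Hypothetical NP names; the real list comes from the curated vocabulary.
NP_NAMES = ["KRATOM", "CINNAMON", "LICORICE", "GREEN TEA"]


def normalize(drug_string: str) -> str:
    """Uppercase the string and replace punctuation/extra whitespace with single spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in drug_string.upper())
    return " ".join(cleaned.split())


def suggest_np(drug_string: str, cutoff: float = 0.8):
    """Return the most similar NP name, or None if nothing reaches the cutoff."""
    matches = difflib.get_close_matches(
        normalize(drug_string), NP_NAMES, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


if __name__ == "__main__":
    for raw in ["Cinamon", "kratom.", "aspirin"]:
        print(raw, "->", suggest_np(raw))
```

A suggestion produced this way is only a candidate; the manual curation described in step 4 would still decide whether the mapping is accepted.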

Notes about Natural products in the FDA Substance Registration System (SRS)

The Global Substance Registration System (G-SRS), developed by the Ginas Project, is software that helps agencies register and document information about substances found in medicines. It contains information about natural products (NPs) and their constituents. The database is available for download and local installation.

In this project, we download and install G-SRS as a PostgreSQL database. G-SRS contains the 6 types of substances referenced in the ISO 11238 standard: chemicals, mixtures, polymers, proteins, nucleic acids, and structurally diverse substances. Latin binomial names of NPs, such as Mitragyna speciosa (Kratom) and Cinnamomum verum (Cinnamon), are structurally diverse substances. We extract the substances using their Latin binomial names, parent substances, and parts of the substance (i.e. substances with Latin binomial names as parents).

Common names acquisition through Health Canada

Health Canada databases such as the Licensed Natural Health Products Database (LNHPD) contain natural product names and synonyms for the Latin binomial names of natural products.

  • Code: https://github.com/dbmi-pitt/NaPDI-pv/tree/master/np-terminology-imports/common-names

  • Input: a comprehensive list of Latin binomials to search

  • Procedure: webprod_to_local_HTML.py obtains HTML output from the site by searching with the Latin binomials. local_HTML_to_common_name.py parses the HTML to output a JSON file with clean Latin binomial to common name mappings. The JSON file is converted to a TSV and loaded into the GSRS database in the same schema as the tables that pull NP data (see above). Currently, the table is named lb_to_common_names_tsv. The file can then be manually edited to add additional common names, or they can be added when the JSON is converted to TSV using convertToTsv.py.
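
The JSON-to-TSV step in the procedure above could look roughly like the following. This is a minimal sketch and not the contents of convertToTsv.py: the JSON shape (a Latin binomial mapped to a list of common names), the column names, and the json_to_tsv function are assumptions for illustration.

```python
import csv
import io
import json


def json_to_tsv(json_text: str) -> str:
    """Flatten {latin_binomial: [common_name, ...]} into one TSV row per pair."""
    mapping = json.loads(json_text)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["latin_binomial", "common_name"])  # assumed column names
    for latin_binomial in sorted(mapping):
        seen = set()  # case-insensitive de-duplication, first spelling wins
        for name in mapping[latin_binomial]:
            key = name.strip().lower()
            if key and key not in seen:
                seen.add(key)
                writer.writerow([latin_binomial, name.strip()])
    return out.getvalue()


# Example input using the two species named earlier on this page.
example = json.dumps({
    "Mitragyna speciosa": ["Kratom", "kratom "],
    "Cinnamomum verum": ["Cinnamon"],
})
tsv_text = json_to_tsv(example)

if __name__ == "__main__":
    print(tsv_text, end="")
```

Extra common names could be appended to the mapping before calling json_to_tsv, which mirrors the option of adding them at conversion time rather than editing the TSV by hand.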

RxNorm mappings

RxNorm provides normalized names for clinical drugs and is generally used to map drug names in spontaneous reports to standardized codes. RxNorm also contains some natural product ingredients and drug forms, which this vocabulary uses to map natural product terms and to identify spontaneous reports with standardized codes.

Adding combination products

  • Many NPs in the reference set and RxNorm mappings are actually combinations of one or more NPs or NP ingredients and are included in the latest vocabulary version. Combination products refer to any product containing one or more natural products (e.g. cinnamon garlic).
  • All combination products are manually marked based on an annotation guide in the NP spelling variations, HERBALS strings, and RxNorm concepts.
  • These are included in the vocabulary with concept class ID 'NaPDI NP Combination Product' and relationships 'napdi_pt_to_combo' and 'napdi_combo_to_pt'.
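
To make the two relationship directions concrete, the sketch below builds rows shaped like the OMOP concept_relationship columns (concept_id_1, concept_id_2, relationship_id) for a combination product and its preferred terms. The concept IDs and the combo_relationships helper are hypothetical; the real rows are produced by the project's ETL.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, int, str]  # (concept_id_1, concept_id_2, relationship_id)


def combo_relationships(combo_to_pts: Dict[int, List[int]]) -> List[Row]:
    """Emit both relationship directions for every (combination, preferred term) pair."""
    rows: List[Row] = []
    for combo_id, pt_ids in combo_to_pts.items():
        for pt_id in pt_ids:
            rows.append((pt_id, combo_id, "napdi_pt_to_combo"))
            rows.append((combo_id, pt_id, "napdi_combo_to_pt"))
    return rows


# Hypothetical IDs: -9001 stands for "cinnamon garlic",
# -101 and -102 for the cinnamon and garlic preferred terms.
rows = combo_relationships({-9001: [-101, -102]})

if __name__ == "__main__":
    for row in rows:
        print(row)
```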

Add to OMOP concept table for standardization

  1. Update GSRS NPs, common names, and constituents in lb_to_common_names_tsv, test_srs_np, and test_srs_np_constituent (currently in the scratch_sanya_2023 schema of the GSRS database). See GSRS database query notes.

  2. Update the manually curated NP spelling variations in np_faers_reference_set (currently in the scratch_sanya schema of the CEM database).

  3. Log into the database in an admin role and drop all prior NAPDI vocabulary concepts, relationships, and concept relationship mappings (see above).

  4. Run the SQL script NP terminology ETL.

  5. Test that the vocabulary is working and makes sense using queries like the following:

NOTE: If an NP has multiple species, the Latin binomials (LBs) for the species can map to different preferred terms (PTs). For example, Glycyrrhiza uralensis, Glycyrrhiza glabra, and Glycyrrhiza inflata all map to different PTs. These mappings are correct in that they map to distinct common names found in GSRS. However, it means that extracting cases for an NP with multiple species requires the following workflow: list all of the LBs, query the vocabulary for the PT of each, and use the PT concept IDs as a concept set for the NP moving forward. That is not too bad and is similar to how we work with drugs.
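
The multi-species workflow in the note above can be sketched against a toy copy of the vocabulary. This is an illustration under stated assumptions: in-memory SQLite stands in for the real PostgreSQL database, only a few OMOP-style columns are modeled, and the concept IDs, the PT_A/PT_B/PT_C names, and the pt_concept_set helper are placeholders rather than real vocabulary content.

```python
import sqlite3

# In-memory SQLite stands in for the real vocabulary database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE concept (
        concept_id INTEGER PRIMARY KEY,
        concept_name TEXT,
        concept_class_id TEXT
    );
    CREATE TABLE concept_relationship (
        concept_id_1 INTEGER,
        concept_id_2 INTEGER,
        relationship_id TEXT
    );
""")

# Placeholder IDs and PT names; real values come from the loaded vocabulary.
conn.executemany("INSERT INTO concept VALUES (?, ?, ?)", [
    (-1, "Glycyrrhiza uralensis", "NaPDI Natural Product"),
    (-2, "Glycyrrhiza glabra", "NaPDI Natural Product"),
    (-3, "Glycyrrhiza inflata", "NaPDI Natural Product"),
    (-11, "PT_A", "NaPDI Preferred Term"),
    (-12, "PT_B", "NaPDI Preferred Term"),
    (-13, "PT_C", "NaPDI Preferred Term"),
])
conn.executemany("INSERT INTO concept_relationship VALUES (?, ?, ?)", [
    (-1, -11, "napdi_pt"),
    (-2, -12, "napdi_pt"),
    (-3, -13, "napdi_pt"),
])


def pt_concept_set(db, latin_binomials):
    """Return the distinct PT concept IDs reachable via napdi_pt from the given LBs."""
    if not latin_binomials:
        return []
    placeholders = ",".join("?" for _ in latin_binomials)
    sql = f"""
        SELECT DISTINCT cr.concept_id_2
        FROM concept c
        JOIN concept_relationship cr ON cr.concept_id_1 = c.concept_id
        WHERE cr.relationship_id = 'napdi_pt'
          AND c.concept_name IN ({placeholders})
        ORDER BY cr.concept_id_2
    """
    return [row[0] for row in db.execute(sql, list(latin_binomials))]


concept_set = pt_concept_set(conn, [
    "Glycyrrhiza uralensis",
    "Glycyrrhiza glabra",
    "Glycyrrhiza inflata",
])

if __name__ == "__main__":
    print(concept_set)
```

The resulting list of PT concept IDs is what would be carried forward as the concept set for the NP.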

NOTE: NP constituent spelling variations do not currently exist in the vocabulary addition. For example, Cannabidiol is in the vocabulary but CBD is not. Any work with constituents will therefore need to consider spelling variations and add them to the study workflow.

NOTE: Both constituents and spelling variations can match multiple NP preferred names in the reference set, so a user/programmer needs to avoid duplicate counts when running queries that rely on either.
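
One way to guard against the duplicate counts described in the note above is to count distinct report IDs rather than joined rows. The sketch below uses made-up rows (report IDs, variation_x/y/z, PT_A/PT_B) and a hypothetical distinct_report_counts helper; it only illustrates the counting pattern.

```python
from collections import defaultdict

# Hypothetical joined rows: (report_id, matched_term, preferred_term).
joined_rows = [
    (101, "variation_x", "PT_A"),
    (101, "variation_x", "PT_B"),  # same string maps to a second PT
    (102, "variation_y", "PT_A"),
    (102, "variation_z", "PT_A"),  # same report matched by two strings
]


def distinct_report_counts(rows):
    """Count each report at most once per preferred term."""
    reports_by_pt = defaultdict(set)
    for report_id, _term, pt in rows:
        reports_by_pt[pt].add(report_id)
    return {pt: len(ids) for pt, ids in reports_by_pt.items()}


per_pt = distinct_report_counts(joined_rows)
total_reports = len({report_id for report_id, _term, _pt in joined_rows})

if __name__ == "__main__":
    print("rows:", len(joined_rows))
    print("distinct reports per PT:", per_pt)
    print("distinct reports overall:", total_reports)
```

In SQL, the same idea corresponds to counting distinct report identifiers instead of rows after joining to the vocabulary.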

Vocabulary (version April 2023) relationships and concept classes

Unique concept classes in vocabulary

| Concept Class | Description |
| --- | --- |
| NaPDI Natural Product | Custom natural product terms in vocabulary (concept IDs < 0) |
| NaPDI Preferred Term | Preferred term for each natural product |
| NaPDI NP Spelling Variation | Curated spelling variations for each natural product term from FAERS |
| NaPDI NP Constituent | Constituents of natural products extracted from GSRS |
| NaPDI NP Combination Product | Combination products contain one or more natural products in the same term |

Unique relationships in vocabulary

| Relationship | Domain | Range |
| --- | --- | --- |
| napdi_pt | Natural Product | NP Preferred Term |
| napdi_is_pt_of | NP Preferred Term | Natural Product |
| napdi_has_const | Natural Product | NP Constituent |
| napdi_is_const_of | NP Constituent | Natural Product |
| napdi_spell_vr | NP Spelling Variation | Natural Product |
| napdi_is_spell_vr_of | Natural Product | NP Spelling Variation |
| napdi_np_maps_to | NP Preferred Term | RxNorm code |
| napdi_const_maps_to | NP Constituent | RxNorm code |
| napdi_pt_to_combo | Preferred Term | NP Combination Product |
| napdi_combo_to_pt | NP Combination Product | Preferred Term |