Data Extraction - dbmi-pitt/np-terminology-imports GitHub Wiki

Overview of the use case

Two applications:

  1. for the data repository to organize data
  2. for pharmacovigilance and knowledge graph studies
  • For use case (1) the end results is concept records added to the OMOP/OHDSI standard vocabulary representing natural products

  • For use (2) the end results application of the natural products within the pharmacovigilance and knowledge graph workflow

Elements of the mapping

  • Global substance registration system

  • Common names acquisition through Health Canada

  • OMOP/OHDSI standard vocabulary

  • FAERS spontaneous reports and the 'drug' strings present in those reports

Global substance registration system

Notes about Natural products in the FDA Substance Registration System (SRS)

The Global Substance Registration System (G-SRS), developed by the Ginas Project, is a software to assist agencies in registering and documenting information about substances found in medicines. It contains information about natural products (NPs) and their constituents. The database is available for download and local installation here.

In this project, we download and install G-SRS as a PostreSQL database. G-SRS contains 6 types of substances referenced in the ISO 11238 standard – chemicals, mixtures, polymers, proteins, nucleic acids and structurally diverse substances. Latin binomial names of NPs (such as Mitragyna speciosa (Kratom) and Cinnamomum verum (Cinnamon) are structurally diverse substances. We extract the substances using their Latin binomial names, parent substances, and 'parts' of the substance (i.e. substances with Latin binomial names as parents). A (rough) subset of the database schema used for extracting information about NPs is presented below –


GSRS database query notes (refer to image above):

  • Using uuid of substance (in JSON content from API result), query ix_ginas_substance (PK: uuid). Join to ix_ginas_strucdiv (PK: uuid, FK: structurally_diverse_uuid) and ix_ginas_name (PK: uuid, FK: owner_uuid) to insert into test_srs_np.
  • Parent substance are queried using the ix_ginas_substanceref table (PK: uuid, FK: ix_ginas_strucdiv.parent_substance_uuid) where uuid of parent substance is ix_ginas_substanceref.refuuid. Parents are inserted in both test_srs_np and test_srs_np_parent.
  • Parts of substance (i.e. substances with Latin binomial as parent) need to be extracted as these constitute generic substances such as Green Tea Leaf Extract, Cinnamon etc. They are also queried with the ix_ginas_substanceref table where ix_ginas_substanceref.refuuid = 'Latin binomial uuid' and ix_ginas_substanceref.uuid = 'Part substance uuid' and then inserted into test_srs_np_part with joins to ix_ginas_name and ix_ginas_strucdiv. Limitation: this only works if part substance also has substanceClass=structurallyDiverse. If part substance has a different class (for example, mixture), this query needs to be modified to extract details from ix_ginas_mixture.
  • Relationships of substances are queried from ix_ginas_relationship (FK: owner_uuid = ix_ginas_substance.uuid)
  • Identifiers of most ix_ginas_* tables are uuid(s). Names can include display name or name with 'type' (cn-common name, sys-systematic name, of-official name, bn-brand name, and cd-code name).


  1. G-SRS Website
  2. Link to paper
  3. ER diagram (created automatically through DBeaver)

Extract information about NP

  1. Find the Latin binomials of the NPs and create a dictionary with the JSON content of each NP extracted from the GSRS API. If RERUN_SRS_API_CALLS = TRUE, the script will call the SRS API to save information about the NP substance in file SRS_API_CALL_RESULTS_FILE. Pickle dictionary contains the uuid, substanceClass, and parent_uuid (if exists).
  2. Substance is assumed to be structurallyDiverse at the Latin binomial level. The 'whole substance' is extracted from the pickle dictionary by string matching the NP substance name to the results from SRS.
  3. Set NP_DB_SCHEMA and NP_DB_TABLE_PREFIX variables. The script then creates five tables to store substances, substance parents, substance parts, and relationships of substances and parts.
  4. Insert data into the tables with mapping from Latin binomial to common name (eg. green tea, kratom). Last run on 03/31/2021 saved all data to scratch_sanya - need to run for schema scratch_u54
  5. The function run_dsld_query executes the dsld_query to map the substances in test_srs_np, test_srs_np_parent, and test_srs_np_part to DSLD codes and save in table NP_DB_SCHEMA.NP_DB_TABLE_PREFIX_dsld

(enhancement) extract details about substances involved in relationships with NP (join from related_substance_uuid in test_srs_np_rel and test_srs_np_part_rel tables to ix_ginas_*), for example, metabolites (chemical), enantiomers (chemical), salts (chemical) and others listed here. - see below

(possible issue) while we assume that the whole substance after string matching will extract the correct species (eg. cannabis sativa whole), in some cases this may not be the species we want or there may be additional species of interest (eg. cannabis sativa subsp. indica whole) in SRS.


  1. Export data for vocabulary standardization. ...

Common names acquisition through Health Canada

  • Code:

  • Input: a comprehensive list of latin binomials to search

  • Procedure: obtains HTML output from the site by searching with the latin binomials. parses the HTML to output a JSON file with clean latin binomial to common name mappings. The JSON file is converted to a TSV and loaded into the GSRS data base in the same schema as the tables that pull NP data (see above). Currently, the table is named lb_to_common_names_tsv. The file can then be manually edited to additional common names or they can be added when teh JSON is converted to TSV using

Extract constituents (and other relations) of NP

Goal: extract details about substances involved in relationships with NP (join from related_substance_uuid in test_srs_np_rel and test_srs_np_part_rel tables to ix_ginas_*), for example, constituents, metabolites (chemical), enantiomers (chemical), salts (chemical) and others - steps below for constituents

  1. Selected constituent relations from list of relationships in SRS -

(enhancement) include constituent types from ix_ginas_substance using the constituent_uuid

(enhancement) include all other details about the constituent from ix_ginas_*

add to OMOP concept table for standardization

  1. Update GSRS NPs, common names, and constituents in lb_to_common_names_tsv, test_srs_np, and test_srs_np_constituent (currently in the scratch_sanya schema of the GSRS database). See GSRS database query notes.

  2. update the manually curated NP spelling variations in np_faers_reference_set (currently in the scratch_sanya schema of the CEM database)

  3. log into the database in an admin role and drop all prior NAPDI vocabulary concepts, relationships, and concept relationship mappings (see above)

  4. run the SQL script napdi_np_vocab_workflow_feb2022.sql

  5. test that the vocabulary is working and makes sense using queries like the following:

NOTE: If an NP has multiple species, the L.B.s for the species can be mapped to different preferred terms. For example, Glycyrrhiza uralensis, Glycyrrhiza glabra, and Glycyrrhiza inflata all map to different P.T.s They are correct mappings in that they map to distinct common names found in GSRS. However, it means that our workflow will need to be as follows when we want to extract cases for a given LB with multiple species: list all of the L.Bs, query the vocab for the PT for each, use the PT concept ids as a concept set for the NP moving forward. That is not too bad and is similar to how we work with drugs.

NOTE: NP constituent spelling variations do not currently exist in the vocabulary addition. So, for example, Cannabidiol is in the vocab but CBD is not. This means that any work with constituents will need to consider spelling variations and add those to the study workflow

NOTE: both constituents and spelling variations can match to multiple NP preferred names as per the reference set and so a user/programmer needs to make sure not to have duplicate counts when running queries that reply on either.