Using QLever for PubChem - ad-freiburg/qlever GitHub Wiki

Setting up a SPARQL endpoint with QLever

Install the qlever script following the instructions at https://github.com/ad-freiburg/qlever-control (this takes only a few minutes; nothing needs to be compiled). Make sure that the qlever script is on your PATH and that you are in a fresh, empty directory. Then run:

qlever setup-config pubchem
qlever get-data
qlever index
qlever start
qlever ui

The get-data command downloads the data and fixes it (in several of the IRIs, forbidden characters are not properly percent-encoded). This takes around 5 hours on an AMD Ryzen 9 with 16 cores and requires about 250 GB of disk space. The index command builds the index data structures needed by QLever. This also takes around 5 hours and requires around 1.5 TB of disk space. The start command starts the server, which is then up in a matter of seconds. The ui command starts the UI, which looks just like the UI of the public QLever SPARQL endpoint for PubChem at https://qlever.cs.uni-freiburg.de/pubchem. See the Qleverfile (created by qlever setup-config pubchem) for a more detailed description of some of the peculiarities of the PubChem dataset.
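Given these disk requirements, it can be worth checking the available space before starting the download. The following is a minimal sketch, not part of the qlever script: has_free_space is a hypothetical helper, and the threshold of 1750 GB assumes that the 250 GB for the download and the 1.5 TB for the index add up.

```shell
# Hypothetical helper (not part of the qlever script): succeeds if the file
# system that holds directory $1 has at least $2 GB (counted as GiB) available.
has_free_space() {
  dir="$1"
  required_gb="$2"
  # With -P (POSIX format) and -k, column 4 of the second line is the
  # available space in units of 1024 bytes.
  available_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$available_kb" -ge $((required_gb * 1024 * 1024)) ]
}

# Assumption: 250 GB for get-data plus 1500 GB for index, taken as additive.
REQUIRED_GB=1750
if has_free_space . "$REQUIRED_GB"; then
  echo "At least $REQUIRED_GB GB available in $(pwd)."
else
  echo "Less than $REQUIRED_GB GB available in $(pwd)." >&2
fi
```

Run this in the directory where qlever get-data and qlever index will be executed, since df reports on the file system that holds the given directory.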

Additional ontologies

PubChem makes heavy use of alpha-numeric identifiers like sio:CHEMINF_000339 (molecular entity name) or obo:CHEBI_15365 (acetylsalicylic acid) for its predicates and entities. The labels for these identifiers are not part of the PubChem datasets. We recommend adding them to the data by downloading the respective ontologies. Here is a command to do that:

mkdir -p rdf.ontologies; cut -d, -f3,4 <<EOT | while IFS=, read -r URL NAME; do echo "Downloading $URL -> $NAME ..."; curl --location --silent --remote-time --output "rdf.ontologies/$NAME" "$URL"; done
BAO - BioAssay Ontology,bao,http://www.bioassayontology.org/bao/bao_complete.owl,bao.rdf
BFO - Basic Formal Ontology,bfo,http://purl.obolibrary.org/obo/bfo.owl,bfo.rdf
BioPAX - biological pathway data,bp,http://www.biopax.org/release/biopax-level3.owl,bio-pax.rdf
CHEMINF - Chemical Information Ontology,cheminf,http://purl.obolibrary.org/obo/cheminf.owl,cheminf.rdf
ChEBI - Chemical Entities of Biological Interest,chebi,http://purl.obolibrary.org/obo/chebi.owl,chebi.rdf
CiTO,cito,http://purl.org/spar/cito.nt,cito.nt
DCMI Terms,dcterms,https://www.dublincore.org/specifications/dublin-core/dcmi-terms/dublin_core_terms.nt,dcterms.nt
FaBiO,fabio,http://purl.org/spar/fabio.nt,fabio.nt
GO - Gene Ontology,go,http://purl.obolibrary.org/obo/go.owl,go.rdf
IAO - Information Artifact Ontology,iao,http://purl.obolibrary.org/obo/iao.owl,iao.rdf
NCIt,ncit,http://purl.obolibrary.org/obo/ncit.owl,ncit.rdf
NDF-RT,ndfrt,https://data.bioontology.org/ontologies/NDF-RT/submissions/1/download?apikey=8b5b7825-538d-40e0-9e9e-5ab9274a9aeb,ndfrt.rdf
OBI - Ontology for Biomedical Investigations,obi,http://purl.obolibrary.org/obo/obi.owl,obi.rdf
OWL,owl,http://www.w3.org/2002/07/owl,owl.ttl
PDBo,pdbo,http://rdf.wwpdb.org/schema/pdbx-v40.owl,pdbo.rdf
PR - PRotein Ontology (PRO),pr,http://purl.obolibrary.org/obo/pr.owl,pr.rdf
RDF Schema,rdfs,https://www.w3.org/2000/01/rdf-schema,rdf-schema.ttl,rdfs.ttl
RDF,rdf,http://www.w3.org/1999/02/22-rdf-syntax-ns,22-rdf-syntax-ns.ttl,rdf.ttl
RO - Relation Ontology,ro,http://purl.obolibrary.org/obo/ro.owl,ro.rdf
SIO - Semanticscience Integrated Ontology,sio,http://semanticscience.org/ontology/sio.owl,sio.rdf
SKOS,skos,http://www.w3.org/TR/skos-reference/skos.rdf,skos.rdf
SO - Sequence types and features ontology,so,http://purl.obolibrary.org/obo/so.owl,so.rdf
UO - Units of measurement ontology,uo,http://purl.obolibrary.org/obo/uo.owl,uo.rdf
EOT
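
Because the command above runs curl with --silent, a failed download is easy to miss. The following sketch reports how many files ended up in rdf.ontologies and which of them are empty. check_downloads is a hypothetical helper, and the check is deliberately simple: it does not notice files that were never created, nor an error page saved in place of an ontology, so compare the reported count against the 23 entries in the list.

```shell
# Hypothetical helper: count the files in directory $1 and report empty ones.
# Returns 0 only if the directory exists and none of its files is empty.
check_downloads() {
  dir="$1"
  count=0
  empty=0
  if [ ! -d "$dir" ]; then
    echo "No such directory: $dir" >&2
    return 2
  fi
  for file in "$dir"/*; do
    [ -e "$file" ] || continue   # the glob did not match anything
    count=$((count + 1))
    if [ ! -s "$file" ]; then
      echo "EMPTY: $file"
      empty=$((empty + 1))
    fi
  done
  echo "$count files checked, $empty empty"
  [ "$empty" -eq 0 ]
}

check_downloads rdf.ontologies || echo "Some downloads need another look."
```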

Basic properties and peculiarities of the PubChem RDF data

Compounds, substances, bioassays

The PubChem data is about three central kinds of entities:

  1. A compound is an abstract chemical structure, for example: compound:CID2244 (acetylsalicylic acid)
  2. A substance is a concrete materialization of a compound, for example: substance:SID24890623 (a particular edition of Aspirin)
  3. A bioassay is an analytical method for measuring the effect of a substance on living matter

Names of compounds and substances

TLDR: There is no "canonical" name, neither for compounds nor for substances; each compound can have many substances; each substance can have many different kinds of names; each substance can even have multiple names of the same kind; some compounds are related to entities from other ontologies

Compounds are related to substances via the predicate sio:CHEMINF_000477 (has normalized counterpart), for example: substance:SID24890623 sio:CHEMINF_000477 compound:CID2244.

For each substance, there are different kinds of names, for example, sio:CHEMINF_000339 (molecular entity name), sio:CHEMINF_000476 (chemical database identifier), or sio:CHEMINF_000561 (drug trade name). That way, even a single compound can have hundreds of names and synonyms, for example https://qlever.cs.uni-freiburg.de/pubchem/PAlJvI (all names/synonyms of Diclofenac) or https://qlever.cs.uni-freiburg.de/pubchem/7TwZLX (same, grouped by kind of name/synonym).

To get a particular kind of name of a particular substance, use a pattern like substance:SID24890623 sio:SIO_000008 [ rdf:type sio:CHEMINF_000339 ; sio:SIO_000300 ?name ], where the intermediate node is called a "synonym".
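
As a complete query, the pattern above could be sent to a SPARQL endpoint as sketched below. Several things here are assumptions and not taken from this page: the prefix IRIs (the ones commonly used for SIO and PubChem RDF), the API URL of the public endpoint, and the helper name run_query. For a local instance, set ENDPOINT to the host and port configured in the Qleverfile.

```shell
# Assumption: API URL of the public QLever endpoint for PubChem.
ENDPOINT="${ENDPOINT:-https://qlever.cs.uni-freiburg.de/api/pubchem}"

# All molecular entity names of substance SID24890623.
# The prefix declarations are assumptions (commonly used IRIs).
QUERY='PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX substance: <http://rdf.ncbi.nlm.nih.gov/pubchem/substance/>
SELECT ?name WHERE {
  substance:SID24890623 sio:SIO_000008 [
    rdf:type sio:CHEMINF_000339 ;
    sio:SIO_000300 ?name ]
}'

# Hypothetical helper: POST the query in $1 and print the result as TSV.
run_query() {
  curl --silent --fail --max-time 60 "$ENDPOINT" \
    -H "Accept: text/tab-separated-values" \
    -H "Content-type: application/sparql-query" \
    --data-binary "$1"
}
```

With network access, run_query "$QUERY" prints one name per line. Replacing sio:CHEMINF_000339 by another kind of name, such as sio:CHEMINF_000561, selects that kind instead.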

Some compounds are related to entities from other ontologies via rdf:type or skos:closeMatch. For example, compound:CID2244 rdf:type obo:CHEBI_15365 (where obo:CHEBI_15365 is the identifier for acetylsalicylic acid in ChEBI, short for Chemical Entities of Biological Interest) or compound:CID2244 skos:closeMatch wd:Q18216 (where wd:Q18216 is the identifier for Aspirin in Wikidata).

Properties

TLDR: Most properties in PubChem are not expressed via a single predicate, but via multiple predicates and entities

The various chemical properties of a compound are realized via the generic predicate sio:SIO_000008 (has attribute) and a mediator node. For example, molecular weight is realized as follows, using the specific type sio:CHEMINF_000334 (molecular weight) and the generic predicate sio:SIO_000300 (has value):

?compound sio:SIO_000008 [
  rdf:type sio:CHEMINF_000334 ;
  sio:SIO_000300 ?value ]
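
To turn this pattern into a query for one particular compound, a small function can fill in the compound identifier. This is only a sketch: molecular_weight_query is a hypothetical name, and the prefix IRIs and the endpoint URL in the commented example are assumptions (commonly used values), not taken from this page.

```shell
# Hypothetical helper: print a SPARQL query for the molecular weight of the
# compound with the given identifier, for example CID2244.
molecular_weight_query() {
  cat <<EOF
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX compound: <http://rdf.ncbi.nlm.nih.gov/pubchem/compound/>
SELECT ?value WHERE {
  compound:$1 sio:SIO_000008 [
    rdf:type sio:CHEMINF_000334 ;
    sio:SIO_000300 ?value ]
}
EOF
}

# Print the query for acetylsalicylic acid.
molecular_weight_query CID2244

# Example of sending it (needs network access; the URL is an assumption):
# molecular_weight_query CID2244 | curl --silent \
#   https://qlever.cs.uni-freiburg.de/api/pubchem \
#   -H "Accept: text/tab-separated-values" \
#   -H "Content-type: application/sparql-query" --data-binary @-
```

The same shape works for other properties: only the type of the mediator node (here sio:CHEMINF_000334) changes.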