Problem: The KNApSAcK database contains 50,899 metabolites and 109,820 species-metabolite pairs, so each metabolite is linked to only about two species on average (109,820 / 50,899 ≈ 2.2). We would like to know as many source organisms as possible for each metabolite (for mass production, etc.).
Solution: We will try to add information from annotated abstracts.
Extract papers that contain both chemical names and organisms using annotations.
Compute the correspondence between KNApSAcK chemical IDs and the MeSH/ChEBI IDs used in PubTator.
Manually check these papers (or automatic support?)
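The extraction step above can be sketched as a simple filter over annotation records. This is a minimal illustration, not the script used at the hackathon: the function name `papers_with_both`, the (PMID, entity type) input shape, and the labels "Chemical" and "Species" are assumptions about how the annotations are loaded, and the records are toy data.

```python
from collections import defaultdict


def papers_with_both(annotations, required=("Chemical", "Species")):
    """Return the PMIDs whose annotations include every type in `required`.

    `annotations` is an iterable of (pmid, entity_type) pairs.
    This input shape is an assumption made for the sketch.
    """
    types_by_pmid = defaultdict(set)
    for pmid, entity_type in annotations:
        types_by_pmid[pmid].add(entity_type)
    needed = set(required)
    # Keep a paper only if all required entity types were seen for it.
    return {pmid for pmid, types in types_by_pmid.items() if needed <= types}


# Toy records, not real PubTator output.
toy = [
    ("100", "Chemical"),
    ("100", "Species"),
    ("200", "Chemical"),
    ("300", "Species"),
    ("300", "Gene"),
]
both = papers_with_both(toy)
print(sorted(both))
```

Only paper "100" carries both annotation types, so it is the only candidate passed on for manual checking.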
TODO:
Estimate the coverage ratio of chemical compounds appearing in the annotations relative to those included in KNApSAcK (a goal of this hackathon).
1-1. How to link from KNApSAcK ID to MeSH/ChEBI?
Plan:
Convert KNApSAcK ID->KEGG compound ID->PubChem or ChEBI
Convert MeSH->PubChem
Compare the two resulting IDs.
Problem: Because of stereoisomers, ID conversion is not a one-to-one mapping.
Consider how to narrow down candidate papers (Future work).
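Because the ID conversion in the plan above is not one-to-one, each step is more safely modeled as a mapping from one ID to a set of IDs. The sketch below chains two such mappings and then pairs KNApSAcK and MeSH entries that share at least one target ID. All identifiers are hypothetical placeholders, and the helper names `compose` and `match` are introduced here for illustration only. Real conversion tables would have to be loaded from the respective databases.

```python
def compose(first, second):
    """Chain two one-to-many ID mappings (dict of str -> set of str)."""
    result = {}
    for source, mids in first.items():
        targets = set()
        for mid in mids:
            targets |= second.get(mid, set())
        if targets:  # drop sources that cannot be converted at all
            result[source] = targets
    return result


def match(knapsack_to_target, mesh_to_target):
    """Pair KNApSAcK and MeSH IDs that share at least one target ID."""
    pairs = set()
    for k_id, k_targets in knapsack_to_target.items():
        for m_id, m_targets in mesh_to_target.items():
            if k_targets & m_targets:
                pairs.add((k_id, m_id))
    return pairs


# Hypothetical placeholder identifiers, not real database IDs.
knapsack_to_kegg = {"K1": {"C_a"}, "K2": {"C_b"}}
kegg_to_pubchem = {"C_a": {"P1", "P2"}, "C_b": {"P3"}}  # C_a: two stereoisomers
mesh_to_pubchem = {"M1": {"P2"}, "M2": {"P3", "P4"}, "M3": {"P9"}}

knapsack_to_pubchem = compose(knapsack_to_kegg, kegg_to_pubchem)
pairs = match(knapsack_to_pubchem, mesh_to_pubchem)
print(sorted(pairs))
```

Matching on a non-empty set intersection is a deliberately loose criterion: it accepts a pair when any stereoisomer matches, so it may over-link compounds and the results would still need checking.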
What we did:
We analyzed the inclusion relation between the papers manually selected to construct KNApSAcK and the papers whose annotated abstracts include both chemical names and organism names.
1. The number of papers including chemical names: 9547412
2. The number of papers including organism names: 12599725
3. The intersection of 1 and 2: 6318259
4. The number of papers having a PubMed ID among 1000 reference papers of KNApSAcK: 158
5. The intersection of 3 and 4: 47 (a ratio of about 1/3, i.e. 47/158 ≈ 0.30, which seems not so good but not so bad...)
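The inclusion check behind these counts is a set intersection over PubMed IDs. The sketch below shows the idea on toy PMID sets and then reproduces the arithmetic from the counts reported above. The function name `inclusion` and the toy IDs are illustrative assumptions, and only the numbers 158 and 47 come from the notes.

```python
def inclusion(reference_pmids, candidate_pmids):
    """Return (found, missing, ratio) for reference papers among candidates."""
    found = reference_pmids & candidate_pmids
    total = len(reference_pmids)
    ratio = len(found) / total if total else 0.0
    return len(found), total - len(found), ratio


# Toy PMID sets, for illustration only.
reference = {"11", "12", "13", "14"}
candidates = {"12", "14", "99"}
print(inclusion(reference, candidates))

# Arithmetic on the counts reported in the notes.
with_pmid = 158  # KNApSAcK reference papers (out of 1000) that have a PubMed ID
in_both = 47     # of those, annotated with both chemical and organism names
coverage = in_both / with_pmid
missing = with_pmid - in_both
print(f"coverage = {coverage:.3f}, missing = {missing}")
```

With the reported counts this gives a coverage of about 0.297 and leaves 111 reference papers outside the annotated intersection.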
TODO:
Read the abstracts of the remaining 111 papers (158 - 47) to find out why the chemical names / organisms in these papers are not annotated by PubTator.
The chemical names/organisms might not be written in the abstract, or there may be another reason.
Analyze the inclusion relation between the organism-chemical pairs included in KNApSAcK and those obtained from annotated abstracts.
Reinforcement and supplementation of PRIDE data by adding references that may include the proteins.
Members: Shin Kawano
PRIDE: The PRIDE PRoteomics IDEntifications database is a centralized, standards compliant, public data repository for proteomics data, including protein and peptide identifications, post-translational modifications and supporting spectral evidence. http://www.ebi.ac.uk/pride/archive/
Problem: PRIDE provides only lists of detected proteins and peptides; the proteins have no annotations.
Solution: We will try to add information on cellular localization from annotated abstracts.