Applications - linkedannotation/blah2015 GitHub Wiki

  • Reinforcing and supplementing KNApSAcK data by adding references that may mention chemical compounds and species.
  • Members: Atsuko Yamaguchi, Toshiaki Tokimatsu
  • KNApSAcK: a database of metabolites and organisms (mainly plants and microbes).
    http://kanaya.naist.jp/knapsack_jsp/top.html
    • Problem: The KNApSAcK database contains 50,899 metabolites and 109,820 species-metabolite pairs, so each metabolite is linked to only about two species on average. We would like to know as many source organisms as possible for each metabolite (for mass production, etc.).
    • Solution: We will try to add information from annotated abstracts.
    • Annotation:
    • Outline of our method:
        1. Use the annotations to extract papers that contain both chemical names and organism names.
        2. Compute the correspondence between KNApSAcK chemical IDs and the MeSH/ChEBI IDs used in PubTator.
        3. Check these papers manually (or with automatic support?).
    • TODO:
        1. Estimate the coverage ratio of chemical compounds appearing in annotations relative to those included in KNApSAcK (a goal of this hackathon).
        • 1-1. How to link from KNApSAcK IDs to MeSH/ChEBI?
          • Plan:
            • Convert KNApSAcK ID -> KEGG compound ID -> PubChem or ChEBI.
            • Convert MeSH -> PubChem.
            • Compare the two resulting sets of IDs.
          • Problem: Because of stereoisomers, ID conversion is not a one-to-one mapping.
        2. Consider how to narrow down candidate papers (future work).
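
The ID-linking plan above could be sketched roughly as follows. This is only an illustration under stated assumptions: every mapping is represented as a dictionary from one ID to a set of IDs, so that one-to-many conversions caused by stereoisomers are kept rather than collapsed. The function names `compose` and `link` and all IDs in the toy data are hypothetical placeholders, not real KNApSAcK, KEGG, PubChem, or MeSH identifiers.

```python
def compose(first, second):
    """Chain two one-to-many mappings: a -> {b} and b -> {c} give a -> {c}."""
    result = {}
    for key, mids in first.items():
        targets = set()
        for mid in mids:
            targets |= second.get(mid, set())
        if targets:  # drop keys that cannot be converted at all
            result[key] = targets
    return result


def link(knapsack_to_pubchem, mesh_to_pubchem):
    """Pair each KNApSAcK ID with every MeSH ID that shares a PubChem ID."""
    pubchem_to_mesh = {}
    for mesh_id, cids in mesh_to_pubchem.items():
        for cid in cids:
            pubchem_to_mesh.setdefault(cid, set()).add(mesh_id)
    return compose(knapsack_to_pubchem, pubchem_to_mesh)


# Toy data with made-up IDs (assumption: real tables would be loaded from files).
knapsack_to_kegg = {"K1": {"C_a"}, "K2": {"C_b"}, "K3": {"C_c"}}
# "C_a" converts to two PubChem entries, e.g. two stereoisomers.
kegg_to_pubchem = {"C_a": {"P1", "P2"}, "C_b": {"P3"}}
mesh_to_pubchem = {"M1": {"P1"}, "M2": {"P2"}, "M3": {"P9"}}

knapsack_to_pubchem = compose(knapsack_to_kegg, kegg_to_pubchem)
links = link(knapsack_to_pubchem, mesh_to_pubchem)

# Share of KNApSAcK compounds that reach at least one MeSH ID.
coverage = len(links) / len(knapsack_to_kegg)
print(links, coverage)
```

Keeping sets on both sides means an ambiguous conversion shows up as a KNApSAcK ID linked to several MeSH IDs, which can then be reviewed by hand instead of being silently resolved.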
    • What we did:
      • We analyzed the inclusion relation between the papers manually selected to construct KNApSAcK and the papers whose annotated abstracts include both chemical names and organism names.
          1. Number of papers including chemical names: 9,547,412
          2. Number of papers including organism names: 12,599,725
          3. Intersection of 1 and 2: 6,318,259
          4. Number of papers with a PubMed ID among 1,000 reference papers of KNApSAcK: 158
          5. Intersection of 3 and 4: 47 (the ratio, 47/158 or roughly 1/3, seems neither very good nor very bad)
      • TODO:
        • Read the abstracts of the remaining 111 papers (158 - 47) to find out why the chemical names or organisms in them were not annotated by PubTator.
          • The chemical names or organisms might not appear in the abstract, or there may be another reason.
        • Analyze the inclusion relation between the organism-chemical pairs included in KNApSAcK and those obtained from annotated abstracts.
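
The inclusion analysis under "What we did" amounts to a few set operations on PubMed IDs. The sketch below is a minimal illustration, not the procedure actually used: it assumes the annotations are available as tab-separated lines whose first column is the PMID, and the helper names `read_pmids` and `inclusion` as well as the toy PMIDs are hypothetical.

```python
def read_pmids(lines):
    """Collect the PMID from the first tab-separated column of each line."""
    return {line.split("\t", 1)[0].strip() for line in lines if line.strip()}


def inclusion(reference_pmids, chemical_pmids, organism_pmids):
    """Split reference papers into those found and those missed by annotation."""
    both = chemical_pmids & organism_pmids  # papers with both annotation types
    found = reference_pmids & both
    missed = reference_pmids - both         # candidates for manual reading
    ratio = len(found) / len(reference_pmids) if reference_pmids else 0.0
    return found, missed, ratio


# Toy input (assumption: real input would be read from annotation dump files).
chemical_lines = ["100\tchem A", "101\tchem B", "102\tchem C"]
organism_lines = ["101\tsp X", "102\tsp Y", "103\tsp Z"]
reference = {"100", "102", "104"}

found, missed, ratio = inclusion(
    reference, read_pmids(chemical_lines), read_pmids(organism_lines)
)
print(sorted(found), sorted(missed), ratio)

# The figures reported above: 47 of 158 reference papers were found.
reported_missed = 158 - 47
reported_ratio = 47 / 158  # about 0.30
```

The `missed` set corresponds to the 111 papers whose abstracts still need to be read.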

  • Reinforcing and supplementing PRIDE data by adding references that may mention proteins.
  • Members: Shin Kawano
  • PRIDE: The PRIDE (PRoteomics IDEntifications) database is a centralized, standards-compliant, public data repository for proteomics data, including protein and peptide identifications, post-translational modifications, and supporting spectral evidence.
    http://www.ebi.ac.uk/pride/archive/