Applications - linkedannotation/blah2015 GitHub Wiki

Reinforcing and supplementing KNApSAcK data by adding references that may mention chemical compounds and species.
Members: Atsuko Yamaguchi, Toshiaki Tokimatsu
KNApSAcK: a database of metabolites and organisms (mainly plants and microbes).
http://kanaya.naist.jp/knapsack_jsp/top.html
- Problem: The KNApSAcK database contains 50,899 metabolites and 109,820 species-metabolite pairs, so each metabolite is linked to only about two species on average. We would like to know as many source organisms as possible for each metabolite (for mass production, etc.).
- Solution: We will try to add information from annotated abstracts.
- Annotation:
  - PubTator (http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/)
    - Sample data in pubannotation.org (http://pubannotation.org/projects/pubtator-sample)
  - tmChem (through PubTator)
- Outline of our method:
  - 1. Use the annotations to extract papers that mention both chemical names and organisms.
  - 2. Compute the correspondence between KNApSAcK chemical IDs and the MeSH/ChEBI IDs used in PubTator.
  - 3. Manually check these papers (or with automatic support?).
- TODO:
  - 1. Estimate the coverage ratio of chemical compounds appearing in annotations to those included in KNApSAcK (a goal of this hackathon).
    - 1-1. How to link from KNApSAcK ID to MeSH/ChEBI?
      - plan:
        
        Convert KNApSAcK ID -> KEGG compound ID -> PubChem or ChEBI
        
        Convert MeSH->PubChem
        
        Compare the two resulting IDs.
      - Problem: Because of stereoisomers, ID conversion is not a one-to-one mapping.
  - 2. Consider how to narrow down candidate papers (future work).
- What we did:
  - We analyzed the inclusion relation between the papers manually selected to construct KNApSAcK and the papers whose annotated abstracts include both chemical names and organism names.
    - 1. The number of papers including chemical names: 9,547,412
    - 2. The number of papers including organism names: 12,599,725
    - 3. The intersection of 1 and 2: 6,318,259
    - 4. The number of papers with a PubMed ID among 1,000 KNApSAcK reference papers: 158
    - 5. The intersection of 3 and 4: 47 (a ratio of roughly 1/3, 47/158 ≈ 30%, which is neither very good nor very bad)
  - TODO:
    - Read the abstracts of the remaining 111 papers to find out why the chemical names or organisms in them were not annotated by PubTator.
      - The chemical names or organisms might not be written in the abstract, or there may be another reason.
    - Analyze the inclusion relation between the organism-chemical pairs included in KNApSAcK and those obtained from annotated abstracts.
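
The two computational steps above (chaining ID conversions that are not one-to-one, and checking which reference papers are covered by the annotations) can be sketched in Python as follows. This is only a minimal illustration under our own assumptions: the function names are hypothetical, every mapping is assumed to be available as a plain dictionary of sets, and all IDs and PMIDs in the toy data are made-up placeholders rather than real KNApSAcK, KEGG, PubChem, or MeSH entries.

```python
def compose(first, second):
    """Chain two one-to-many ID mappings (a -> {b}, b -> {c}) into a -> {c}."""
    result = {}
    for source_id, middle_ids in first.items():
        targets = set()
        for middle_id in middle_ids:
            targets |= second.get(middle_id, set())
        if targets:  # drop IDs that could not be converted at all
            result[source_id] = targets
    return result


def link_ids(knapsack_to_pubchem, mesh_to_pubchem):
    """Pair each KNApSAcK ID with every MeSH ID sharing at least one PubChem ID.

    Sets are compared instead of single IDs because stereoisomers make the
    conversion one-to-many.
    """
    links = {}
    for knapsack_id, pubchem_ids in knapsack_to_pubchem.items():
        matches = {
            mesh_id
            for mesh_id, ids in mesh_to_pubchem.items()
            if ids & pubchem_ids
        }
        if matches:
            links[knapsack_id] = matches
    return links


def inclusion_stats(chemical_pmids, organism_pmids, reference_pmids):
    """Count reference papers whose abstracts carry both annotation types."""
    both = chemical_pmids & organism_pmids
    covered = reference_pmids & both
    return {
        "both": len(both),
        "covered": len(covered),
        "missing": len(reference_pmids - both),
        "ratio": len(covered) / len(reference_pmids) if reference_pmids else 0.0,
    }


# Toy data: placeholder IDs only, not real database entries.
knapsack_to_kegg = {"K1": {"C1"}, "K2": {"C2"}}
kegg_to_pubchem = {"C1": {"P1", "P2"}, "C2": {"P3"}}
mesh_to_pubchem = {"M1": {"P2"}, "M2": {"P9"}}

links = link_ids(compose(knapsack_to_kegg, kegg_to_pubchem), mesh_to_pubchem)
stats = inclusion_stats({1, 2, 3, 4}, {3, 4, 5}, {2, 3, 4, 6})
print(links)
print(stats)
```

With the real counts reported above, `inclusion_stats` would return 47 covered and 111 missing papers out of 158, i.e. a ratio of about 0.30.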

Reinforcing and supplementing PRIDE data by adding references that may mention proteins.
Members: Shin Kawano
PRIDE: The PRIDE PRoteomics IDEntifications database is a centralized, standards compliant, public data repository for proteomics data, including protein and peptide identifications, post-translational modifications and supporting spectral evidence.
http://www.ebi.ac.uk/pride/archive/
- Problem: PRIDE provides only lists of detected proteins and peptides; the proteins carry no annotations.
- Solution: We will try to add cellular localization information from annotated abstracts.
- Annotation:
  - LocText (https://www.tagtog.net/-corpora/loctext)
    - Sample data in pubannotation.org (http://pubannotation.dbcls.jp/projects/LocText)
- Outline of our method:
  1. Make a ProteinID-PMID correspondence table from the LocText annotations
  2. Retrieve the protein and peptide lists from the PRIDE API (http://wwwdev.ebi.ac.uk/pride/ws/archive/)
  3. Show summary and detailed protein pages, including the supporting abstracts and detected peptides
- TODO: *
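
Steps 1 and 3 of the outline above can be sketched in Python as follows. This is a rough illustration, not the actual implementation: it assumes the LocText annotations have already been reduced to records with hypothetical `pmid` and `protein_ids` fields, and that the PRIDE protein list is available as a plain list of IDs. It makes no call to the PRIDE API, and the protein IDs and PMIDs are made-up placeholders.

```python
from collections import defaultdict


def build_protein_pmid_table(annotations):
    """Build a ProteinID -> {PMID} table from simplified annotation records."""
    table = defaultdict(set)
    for record in annotations:
        for protein_id in record["protein_ids"]:
            table[protein_id].add(record["pmid"])
    return dict(table)


def evidence_for(detected_proteins, table):
    """Attach supporting PMIDs to each detected protein that has any."""
    return {
        protein_id: sorted(table[protein_id])
        for protein_id in detected_proteins
        if protein_id in table
    }


# Toy records: field names and IDs are assumptions for illustration only.
annotations = [
    {"pmid": "100", "protein_ids": ["PROT_A", "PROT_B"]},
    {"pmid": "200", "protein_ids": ["PROT_A"]},
]

table = build_protein_pmid_table(annotations)
evidence = evidence_for(["PROT_A", "PROT_C"], table)
print(evidence)
```

Proteins detected in PRIDE but absent from the table (here `PROT_C`) are simply left out, so a protein page would show abstracts only where annotated evidence exists.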