PubMedSaid - HealthHackAu2014/HealthHack2014 GitHub Wiki

Team name & bio

Team name: PubMedSaid

Problem presenters: Dr Fabian Buske, Quek Xiu Cheng and Kenneth Sabir

Team Members:

Amanda Wise: Project manager, user experience designer at ThoughtWorks.
Quek Xiu Cheng: PhD Scholar at Garvan Institute of Medical Research.
Andy Zeng: Data scientist at ThoughtWorks.
Chunxiao Lin: Software Engineer.
Manuel Paul Anil Kumar Joseph: Lead dev at ThoughtWorks.

The Problem

The rapidly growing body of scientific literature contains a wealth of information that is written specifically for a human audience. Reading through this information to extract links between data entities is, however, a time-consuming effort that requires expert knowledge.

In addition, new experimental techniques produce sets of hundreds of genes that require investigation, which can be overwhelming. We need to understand what these large gene sets have in common and how they link to diseases such as cancer and diabetes. We have processed more than 2 million full-text scientific articles on biological and medical sciences contained within PubMed Central, extracted meaningful keywords and associated these with categories such as diseases, genes, biological pathways, cellular functions and many more. The resulting 6 billion data points have been deposited in a document database (MongoDB) that is available through a Web API.

However, while we have reduced the information in each article to keywords, visualising these data remains a challenge. To begin with, the data are stored as identifiers and values that are not human readable. Furthermore, given the huge amount of data for any given gene or protein, it is still difficult to extract the information most relevant to the researcher.

The Solution

Conversion of the identifiers from the database into meaningful biological text and meta information for each article
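
A minimal sketch of this identifier conversion, assuming a simple lookup table from database identifiers to readable labels. The table contents, the record layout and the function name `to_readable` are illustrative assumptions, not the project's actual schema.

```python
from typing import Dict

# Illustrative lookup table (assumption): the real project would load
# labels from the knowledge databases rather than hard-code them.
ID_TO_LABEL: Dict[str, str] = {
    "GO:0005634": "nucleus",
    "GO:0006915": "apoptotic process",
}


def to_readable(record: Dict[str, int], labels: Dict[str, str]) -> Dict[str, int]:
    """Replace identifier keys with readable labels; keep unknown IDs as-is."""
    return {labels.get(identifier, identifier): value
            for identifier, value in record.items()}


# Hypothetical record: identifier -> number of occurrences.
raw = {"GO:0005634": 12, "GO:0006915": 3, "GO:9999999": 1}
readable = to_readable(raw, ID_TO_LABEL)
```

Unknown identifiers are passed through unchanged, so gaps in the lookup table stay visible instead of being silently dropped.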

A visual search tool that allows biomedical researchers to search for their gene or protein and retrieve all related literature from PubMed, represented by related biological terms from multiple knowledge databases. Query results will be ranked by relevance, measured by how often the terms appear in the literature.
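
One possible reading of this ranking, sketched under the assumption that each article is reduced to a list of terms and that a term's relevance is the number of articles mentioning it. The article identifiers, the terms and the function name `rank_terms` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def rank_terms(article_terms: Dict[str, Iterable[str]]) -> List[Tuple[str, int]]:
    """Rank terms by the number of articles that mention them."""
    counts: Counter = Counter()
    for terms in article_terms.values():
        counts.update(set(terms))  # count each term once per article
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical input: article identifier -> terms extracted from it.
articles = {
    "PMC1": ["breast cancer", "apoptosis"],
    "PMC2": ["breast cancer", "DNA repair"],
    "PMC3": ["breast cancer", "apoptosis", "apoptosis"],
}
ranking = rank_terms(articles)
```

Counting each term once per article keeps a single article that repeats a term many times from dominating the ranking. Counting raw occurrences instead would be an equally plausible choice.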

The prototype

We now have a working prototype that displays related literature and biological information extracted from the following databases

Application/Relevance

So far, the best way to search for research articles has been Google Scholar, where you enter keywords and get back a list of relevant articles. However, in some scenarios this does not work at all.

Imagine a very common scenario in bioinformatics research: researchers can easily end up with a group of genes that is not human readable at all, yet they need to make sense of it efficiently. They want to know what this group of genes is about and which related articles they should read. In this case, they need a smarter platform.

PubMedSaid addresses this special requirement of bioinformatics researchers. For a group of genes, PubMedSaid finds the group of terms that best describes it. For every term, researchers get a list of related articles ordered by relevance.

This will greatly improve their research efficiency by helping them find the related diseases, and articles for understanding those diseases, based purely on a group of genes that is not human readable at all.

Datasets

Currently the datasets used in the project include:

In-house full-text mined data
- 6 billion data-points amounting to 700GB in size
Disease
GO (Gene Ontology), from which we use three subsets:
- GO-CC
- GO-BP
- GO-MF
Wiki: related articles were downloaded and term frequencies were calculated.
PubMed: articles from PubMed were downloaded and term frequencies were calculated. The final search result of the project will be linked back to the relevant articles in PubMed.

We can easily expand to more databases in the near future.
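
The term-frequency calculation mentioned for the Wiki and PubMed datasets can be sketched as follows. This is a minimal version that assumes plain-text input and a naive lowercase word tokeniser. The project's actual tokenisation and term matching are not described here, and the function name `term_frequencies` is hypothetical.

```python
import re
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Count how often each lowercase alphanumeric token occurs in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Hypothetical example sentence.
freqs = term_frequencies("BRCA1 repairs DNA. DNA damage activates BRCA1.")
top_two = freqs.most_common(2)
```

A real pipeline would likely match multi-word terms (for example disease names) against a vocabulary instead of counting single tokens.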

Tech stack

MongoDB for holding the raw data
Python for backend processing of MongoDB queries and data from PubMed as well as berkeleybop
JSON for data representation
D3.js for data visualisation

Tradeoffs/analysis

What went well:

Data processing:
- It was clearly divided into four steps so that developers could work on the pieces simultaneously
- All the steps were merged and the data processing ran successfully
Front end visualisation:
- We selected the user interaction interface, layout and package that best suited the visualisation and made it work.
- Successful front end development
- Successful human resource management for front end developer.

What we would have done better:

Currently the data processing is based on a Python script. Given more time, a more reasonable solution would be to run it as backend services.
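
As a rough illustration of what such a backend service could look like, here is a minimal WSGI sketch using only the Python standard library. It assumes the script's processing step can be wrapped in a function. `lookup_terms`, the `gene` query parameter and the response layout are hypothetical stand-ins, not the project's actual interface.

```python
import json
from urllib.parse import parse_qs


def lookup_terms(gene_ids):
    """Hypothetical stand-in for the existing script's processing step."""
    return {gene_id: [] for gene_id in gene_ids}


def app(environ, start_response):
    """WSGI app: GET /?gene=ID&gene=ID returns a JSON document."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    body = json.dumps(lookup_terms(query.get("gene", []))).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


# To serve locally (not started here):
# from wsgiref.simple_server import make_server
# make_server("", 8000, app).serve_forever()
```

Keeping the processing in a plain function and the HTTP layer in a thin wrapper means the existing script logic could be reused largely unchanged.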

Future functionality

Visualisation of the comparison between two groups of GeneIDs
Refinement of the ranking of terms for each query, as well as the ranking of articles for each term, so that the user can get the most useful information