Developer Info - Rostlab/nalaf GitHub Wiki

Commit Messages

Scheme:

(#NR) (?major|minor) (?FIX) write short descriptive imperative message

#NR: Issue Number
(?major|minor): optional, importance of commit
FIX: optional, if the commit fixes the issue
body description: optional, describe sub-tasks in imperative voice
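
A minimal sketch of how a subject line could be checked against this scheme, for example from a commit hook. The regular expression and the name `parse_commit_subject` are illustrative assumptions, not part of the project, and the sketch assumes the parentheses in the scheme are notation rather than literal characters. The issue numbers in the example calls are made up.

```python
import re

# Assumed reading of the scheme: "#<issue> [major|minor] [FIX] <message>".
# The parentheses in the scheme are treated as notation, not literal text.
COMMIT_SUBJECT = re.compile(
    r"^#(?P<issue>\d+)"                   # issue number, e.g. #42
    r"(?: (?P<importance>major|minor))?"  # optional importance of the commit
    r"(?: (?P<fix>FIX))?"                 # optional marker: commit fixes the issue
    r" (?P<message>\S.*)$"                # short descriptive imperative message
)


def parse_commit_subject(subject):
    """Return the parts of a commit subject line, or None if it does not match."""
    match = COMMIT_SUBJECT.match(subject)
    if match is None:
        return None
    parts = match.groupdict()
    parts["issue"] = int(parts["issue"])
    parts["fix"] = parts["fix"] is not None
    return parts


print(parse_commit_subject("#42 minor FIX handle empty abstracts"))
print(parse_commit_subject("#7 add pubmed reader"))
print(parse_commit_subject("update stuff"))  # no issue number, so no match
```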

Branches

Use git-flow with master, developer and feature branches.

Dev Stack

We use Python 3 because:
- it will be better supported in the future
- it handles Unicode (UTF-8) by default
- it is difficult to write software that works on both Python 2 and 3

Data structure / Database

We store in a text file the list of PMIDs that were analyzed to get sentences for annotation (those with a high probability of including mutation mentions).
We store in the ann.json files who annotated what (either ml: or user: (manual)) and with what confidence. When an automatic annotation had to be manually reviewed, the list of who will be ml:..., user:...

(As for how to filter annotations by confidence, we either do it ourselves or use a possible tagtog feature.)
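
If the filtering is done on our side, it could look roughly like the sketch below. It assumes each ann.json file holds an "entities" list whose items carry a "confidence" object with a "who" list and a "prob" value; that layout, the function name `filter_entities`, and the annotator names in the example data are assumptions for illustration and should be checked against real exported files.

```python
# Assumed shape of one ann.json file: an "entities" list in which each entity
# carries a "confidence" object with "who" (list of annotators) and "prob".
# In practice this dict would come from json.load() on an ann.json file.
example_ann = {
    "entities": [
        {"offsets": [{"start": 10, "text": "V600E"}],
         "confidence": {"who": ["ml:tagger"], "prob": 0.62}},
        {"offsets": [{"start": 48, "text": "Arg117His"}],
         "confidence": {"who": ["ml:tagger", "user:reviewer"], "prob": 1.0}},
    ]
}


def filter_entities(ann, min_prob=0.0, reviewed_only=False):
    """Keep entities with prob >= min_prob, optionally only user-reviewed ones."""
    kept = []
    for entity in ann.get("entities", []):
        confidence = entity.get("confidence", {})
        if confidence.get("prob", 0.0) < min_prob:
            continue
        who = confidence.get("who", [])
        if reviewed_only and not any(w.startswith("user:") for w in who):
            continue
        kept.append(entity)
    return kept


confident = filter_entities(example_ann, min_prob=0.9)
reviewed = filter_entities(example_ann, reviewed_only=True)
print([e["offsets"][0]["text"] for e in confident])
print([e["offsets"][0]["text"] for e in reviewed])
```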

Resources

https://mutalyzer.nl/description-extractor

Clean (empty) environment test

Testing procedure for the nala package: installation and unit tests in a clean Anaconda environment.

conda create --no-default-packages -n cleanenv python setuptools
activate cleanenv
python setup.py install
python -m nala.download_corpora
python setup.py test
deactivate
conda env remove --name cleanenv

Bootstrapping Module

Diagram that explains the bootstrapping module


Folder Structure

root
	iteration_0
		base --> idp4
	iteration_1
		candidates
		reviewed
	iteration_2
		candidates
		reviewed
	iteration_3
		candidates
		reviewed
	stats.xls
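
A small sketch of how this layout could be addressed from code. It assumes, as in the algorithm for iteration N, that iteration N trains on the base folder of iteration 0 plus the reviewed folders of iterations 1..N-1; the function names are hypothetical.

```python
from pathlib import Path


def training_folders(root, n):
    """Folders whose data feed the model of iteration n (assumed convention)."""
    root = Path(root)
    folders = [root / "iteration_0" / "base"]
    folders += [root / f"iteration_{i}" / "reviewed" for i in range(1, n)]
    return folders


def candidates_folder(root, n):
    """Folder where the predictions of iteration n are saved for review."""
    return Path(root) / f"iteration_{n}" / "candidates"


for folder in training_folders("root", 3):
    print(folder)
print(candidates_folder("root", 3))
```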

Algorithm for Iteration N

base = read in base of iteration 0
for i in 1..(n-1):
	rev = read in reviewed of iteration i
	base.append(rev)
generate binary model by training with base
generate candidates
- using docselector to get filtered pubmedids
- retrieve html documents of pubmedids and import them into our dataset
- run tagger using binary model on retrieved articles
- save retrieved articles with predictions into candidate folder
do manual annotation by
- using the threshold module to divide predicted labels into confirmed and preselected annotations (predicted: threshold)
- importing the candidates into tagtog (an alternative could also be made available, e.g. an interactive command line)
- manually reviewing the imported data
- exporting from tagtog into anndoc format
- saving the export into the reviewed folder
do evaluation by
- defining dataset = current base (iteration 0) + reviewed iterations (iterations 1..n)
- do k-fold-cross-validation on the defined dataset
  - divide data into k-sets
  - repeat k times
    - train on the other k-1 sets
    - test on the held-out set
  - save performance (average over the k runs)
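
The evaluation step can be sketched in plain Python as follows. This is only an illustration of k-fold cross-validation, not the project's implementation: `train_fn` and `score_fn` are hypothetical callables standing in for model training and scoring, and the sketch averages one score per fold.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs; each of the k folds is the test set once."""
    if k < 2 or k > len(items):
        raise ValueError("k must be between 2 and the number of items")
    folds = [items[i::k] for i in range(k)]  # fold i takes every k-th item
    for i in range(k):
        test = folds[i]
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test


def cross_validate(items, k, train_fn, score_fn):
    """Train on k-1 folds, score on the held-out fold, return the mean score."""
    scores = []
    for train, test in k_fold_splits(items, k):
        model = train_fn(train)
        scores.append(score_fn(model, test))
    return sum(scores) / len(scores)


# Toy usage: the "model" is the mean of the training items and the score is
# the largest distance between that mean and any held-out item.
items = list(range(10))
mean_score = cross_validate(
    items,
    k=5,
    train_fn=lambda train: sum(train) / len(train),
    score_fn=lambda model, test: max(abs(x - model) for x in test),
)
print(mean_score)
```

In the bootstrapping setting, `items` would be the documents of the base plus all reviewed iterations.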

DocSelector - Selecting new viable documents

Algorithm DocSelector for Iteration N

Document selection with DocSelector to add new unknown documents

1. run UniProtDocumentSelector
	* input: given query (by default human Swiss-Prot proteins)
	* output: pubmed ids of docs that are likely to contain mutations

2. run a series of online Filters on the pubmed ids only
	* input: list of pubmed_ids
	* output: smaller list of pubmed_ids
	
	Instances:
		FilterByAlreadySeen
			* filter out all pubmed_ids used in iterations 1..N-1
	
3. run FromOnlinePubmedReader:
	* input: pubmed ids from step 2
	* download the abstracts for each article
	* output: Dataset object
	
4. run a series of offline Filters one after the other
	* input: Dataset object
	* output: Dataset object with fewer documents in it
	
	Instances:
		KeywordsFilter
			* only keeps articles that have a given set of keywords in their title and/or abstract
		Natural Language filter
	
return Dataset object
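
The flow above can be sketched as a small pipeline. The class names echo the ones used in this section, but the bodies, the `select_documents` function, the in-memory stand-in for the PubMed download, and the PubMed ids are all assumptions for illustration; the real classes may have different interfaces. The keyword filter here keeps a document if any keyword matches, which is one possible reading of "a given set of keywords".

```python
class FilterByAlreadySeen:
    """Filter on ids: drops PubMed ids already used in earlier iterations."""

    def __init__(self, seen_pmids):
        self.seen = set(seen_pmids)

    def filter(self, pmids):
        return (pmid for pmid in pmids if pmid not in self.seen)


class KeywordsFilter:
    """Filter on documents: keeps those mentioning any keyword (assumption)."""

    def __init__(self, keywords):
        self.keywords = [keyword.lower() for keyword in keywords]

    def filter(self, documents):
        for doc in documents:
            text = (doc["title"] + " " + doc["abstract"]).lower()
            if any(keyword in text for keyword in self.keywords):
                yield doc


def select_documents(pmids, pmid_filters, read_documents, document_filters):
    """Chain id filters, the reader, and document filters in that order."""
    for pmid_filter in pmid_filters:
        pmids = pmid_filter.filter(pmids)
    documents = read_documents(pmids)
    for document_filter in document_filters:
        documents = document_filter.filter(documents)
    return list(documents)


# Made-up records standing in for abstracts downloaded from PubMed.
FAKE_PUBMED = {
    "1001": {"title": "A novel missense mutation in CFTR", "abstract": "We report a case."},
    "1002": {"title": "Mutation screening in BRCA1", "abstract": "Seen in iteration 1."},
    "1003": {"title": "Protein folding kinetics", "abstract": "We measure folding rates."},
}


def fake_reader(pmids):
    """Stand-in for the online reader: looks ids up in FAKE_PUBMED."""
    return ({"pmid": pmid, **FAKE_PUBMED[pmid]} for pmid in pmids)


selected = select_documents(
    pmids=["1001", "1002", "1003"],
    pmid_filters=[FilterByAlreadySeen(["1002"])],
    read_documents=fake_reader,
    document_filters=[KeywordsFilter(["mutation", "variant"])],
)
print([doc["pmid"] for doc in selected])
```

Running the id filters before the download keeps the number of fetched abstracts small, which is the reason for splitting the filters into two groups.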

Structure

# TODO diagram

Generation of PMIDs through an initial fetching of ids (in our case: UniProtDocumentSelector)
Online Filters (running on a list of pubmed ids) - need an internet connection, thus named "Online" Filters
Conversion of PMIDs to Documents by downloading each of them through the FromOnlinePubmedReader
Offline Filters (running on a list of Documents) - need no connection, only the document text, thus named "Offline" Filters

High Recall Regex Document Filter

This DocumentFilter uses predictions from both Nala and tmVar in order to find new, unknown and natural language mutation mentions using a customised set of regexes. The following diagram shows the data flow from filtered documents:

High Recall Regex DocumentFilter - Diagram
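
One possible reading of this filter, sketched under stated assumptions: deliberately loose regexes are run over the text, and only matches that do not overlap a span already predicted by the taggers are reported as new candidate mentions. The two patterns, the function names, and the example sentence are made up for illustration; they are not the project's regex set, and the real filter may combine the predictions differently.

```python
import re

# Illustrative high-recall patterns (assumptions, not the project's regexes).
PATTERNS = [
    re.compile(r"\b[A-Z][a-z]{2}\d+[A-Z][a-z]{2}\b"),     # e.g. Arg117His
    re.compile(r"\b[a-z]+ to [a-z]+ substitution\b"),     # natural language form
]


def overlaps(span_a, span_b):
    """True if two (start, end) character spans share at least one character."""
    return span_a[0] < span_b[1] and span_b[0] < span_a[1]


def new_mentions(text, predicted_spans):
    """Regex matches that no tagger prediction already covers."""
    found = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            span = match.span()
            if not any(overlaps(span, predicted) for predicted in predicted_spans):
                found.append(match.group())
    return found


text = "The Arg117His variant and a glycine to serine substitution were found."
start = text.index("Arg117His")
# Pretend the taggers already predicted the standard mention.
predicted = [(start, start + len("Arg117His"))]
print(new_mentions(text, predicted))
```

A document with at least one such new match would then be kept by the filter.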