Developer Info - Rostlab/nalaf GitHub Wiki
Commit Messages
Scheme:
(#NR) (?major|minor) (?FIX) write short descriptive imperative message
#NR: issue number
(?major|minor): optional, importance of commit
FIX: optional, if the commit fixes the issue
body description: optional, describe sub-tasks in imperative voice
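The scheme above can be checked mechanically. The following is a minimal sketch, assuming the subject line is written as `#<issue> [major|minor] [FIX] <message>`; the wiki does not pin down the exact punctuation, so the pattern, the name `parse_subject` and the sample subjects are illustrative assumptions rather than project conventions.

```python
import re

# Assumed concrete syntax: "#<issue> [major|minor] [FIX] <message>".
# The wiki leaves the exact punctuation open, so this pattern is illustrative.
COMMIT_SUBJECT = re.compile(
    r"^#(?P<issue>\d+)"                    # issue number, e.g. #42
    r"(?: (?P<importance>major|minor))?"   # optional importance of the commit
    r"(?: (?P<fix>FIX))?"                  # optional marker: commit fixes the issue
    r" (?P<message>\S.*)$"                 # short imperative message
)


def parse_subject(subject):
    """Return the parts of a commit subject line, or None if it does not match."""
    match = COMMIT_SUBJECT.match(subject)
    return match.groupdict() if match else None


# Hypothetical subjects, made up for this example.
print(parse_subject("#42 minor FIX handle empty abstracts"))
print(parse_subject("#7 add keyword filter"))
print(parse_subject("add keyword filter"))
```

A check like this could run in a commit hook, but whether the project enforces the scheme automatically is not stated here.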
Branches
Use git-flow with master, developer and feature branches.
Dev Stack
- We use Python 3 because:
  - it will be better supported in the future
  - it uses UTF-8/Unicode by default
  - it is difficult to write software that works on both Python 2 and 3
Data structure / Database
- We store in a text file a list of the PMIDs that were analyzed to get sentences for annotation (with a high probability of including mutation mentions)
- We store in ann.json files who annotated what (either ml: or user (manual)), and the confidence. When an automatic annotation had to be manually reviewed, the list of who will be ml:..., user:... (As for how to filter annotations by confidence, we either do it ourselves or use a possible tagtog feature.)
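If the filtering by confidence is done on our side, it could look roughly like the sketch below. The field names (`entities`, `confidence`, `who`, `prob`), the annotator names and the sample values are assumptions made for illustration; they are not taken from a documented schema and should be checked against real ann.json files.

```python
import json

# Hypothetical ann.json content. Field names and values are assumptions
# for illustration only, not a documented schema.
EXAMPLE_ANN_JSON = """
{
  "entities": [
    {"text": "p.V600E",
     "confidence": {"who": ["ml:nala"], "prob": 0.95}},
    {"text": "loss of the residue",
     "confidence": {"who": ["ml:nala", "user:alice"], "prob": 1.0}},
    {"text": "c.35G>A",
     "confidence": {"who": ["ml:nala"], "prob": 0.40}}
  ]
}
"""


def filter_by_confidence(ann, min_prob):
    """Keep only the entities whose confidence probability is at least min_prob."""
    return [e for e in ann["entities"] if e["confidence"]["prob"] >= min_prob]


def was_manually_reviewed(entity):
    """True if any annotator in `who` is a user rather than the ML tagger."""
    return any(w.startswith("user:") for w in entity["confidence"]["who"])


ann = json.loads(EXAMPLE_ANN_JSON)
confident = filter_by_confidence(ann, min_prob=0.9)
print([e["text"] for e in confident])
print([e["text"] for e in ann["entities"] if was_manually_reviewed(e)])
```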
Resources
Clean (empty) environment test
Testing procedure for the nala package: installation and unit tests in a clean Anaconda environment.
conda create --no-default-packages -n cleanenv python setuptools
activate cleanenv
python setup.py install
python -m nala.download_corpora
python setup.py test
deactivate
conda env remove --name cleanenv
Bootstrapping Module
Folder Structure
root
  iteration_0
    base --> idp4
  iteration_1
    candidates
    reviewed
  iteration_2
    candidates
    reviewed
  iteration_3
    candidates
    reviewed
  stats.xls
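As a rough sketch of how code could address this layout, the helpers below build the paths that feed iteration N (the base of iteration 0 plus the reviewed folders of earlier iterations) and the folder that receives the new candidates. The function names are hypothetical and `base` is treated as a plain folder name, which is an assumption.

```python
from pathlib import Path


def training_folders(root, n):
    """Folders that feed the model of iteration n:
    iteration_0/base plus iteration_i/reviewed for i in 1..(n-1)."""
    root = Path(root)
    folders = [root / "iteration_0" / "base"]
    folders += [root / f"iteration_{i}" / "reviewed" for i in range(1, n)]
    return folders


def candidates_folder(root, n):
    """Folder where the predictions of iteration n are saved for review."""
    return Path(root) / f"iteration_{n}" / "candidates"


for folder in training_folders("root", 3):
    print(folder.as_posix())
print(candidates_folder("root", 3).as_posix())
```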
Algorithm for Iteration N
- build the training base:
  base = read in base of iteration 0
  for i in 1..(n-1):
    rev = read in reviewed of iteration i
    base.append(rev)
- generate binary model by training with base
- generate candidates
  - use docselector to get filtered pubmed ids
  - retrieve html documents of pubmed ids and import them into our dataset
  - run tagger using binary model on retrieved articles
  - save retrieved articles with predictions into candidates folder
- do manual annotation by
  - using threshold module to divide predicted labels into confirmed and preselected annotations (predicted: threshold)
  - importing candidates into tagtog (an alternative could also be made available, e.g. an interactive command line)
  - manually reviewing imported data
  - exporting from tagtog into anndoc format
  - saving into reviewed folder
- do evaluation by
  - defining dataset = current base (iteration 0) + reviewed iterations (iterations 1..n)
  - doing k-fold cross-validation on the defined dataset
    - divide data into k sets
    - repeat k times
      - train on 1..k-1
      - test on k
    - save performance (average of k x k runs)
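The evaluation step can be sketched as a plain k-fold loop. This is a minimal illustration, not the nala implementation: `train_fn` and `score_fn` are hypothetical stand-ins for training the binary model and measuring its performance, and the sketch averages one score per fold. If the evaluation is repeated several times, the same function can be called with different `seed` values and the results averaged again.

```python
import random


def k_fold_splits(items, k, seed=0):
    """Yield (train, test) pairs; each of the k folds is the test set once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def cross_validate(items, k, train_fn, score_fn, seed=0):
    """Train on k-1 folds, score on the held-out fold, and average the scores."""
    scores = []
    for train, test in k_fold_splits(items, k, seed=seed):
        model = train_fn(train)
        scores.append(score_fn(model, test))
    return sum(scores) / len(scores)


# Toy usage: the "model" is the set of training items and the score is the
# fraction of test items the model has never seen (always 1.0 for a clean split).
docs = list(range(10))
print(cross_validate(docs, 5, set, lambda m, te: sum(x not in m for x in te) / len(te)))
```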
DocSelector - Selecting new viable documents
Algorithm DocSelector for Iteration N
Document selection with DocSelector to add new unknown documents
1. run UniProtDocumentSelector
  * input: given query (by default human Swiss-Prot proteins)
  * output: pubmed ids of docs that are likely to contain mutations
2. run a series of online Filters, only on pubmed ids
  * input: list of pubmed_ids
  * output: smaller list of pubmed_ids
  * Instances:
    * FilterByAlreadySeen: filter out all pubmed_ids used in iterations 1..N-1
3. run FromOnlinePubmedReader:
  * input: pubmed ids from step 2
  * download the abstracts for each article
  * output: Dataset object
4. run a series of offline Filters one after the other
  * input: Dataset object
  * output: Dataset object with fewer documents in it
  * Instances:
    * KeywordsFilter: only keeps articles that have a given set of keywords in their title and/or abstract
    * Natural Language filter
5. return Dataset object
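The chain of stages can be sketched with plain functions and generators. The names below only mirror the steps described here; they are not the nalaf classes, their signatures are made up, and documents are modelled as simple dicts instead of a Dataset object. The keyword filter keeps a document if any keyword occurs, which is one possible reading of "a given set of keywords".

```python
def filter_already_seen(pmids, seen):
    """Online-style filter: drop PubMed ids used in earlier iterations."""
    return (pmid for pmid in pmids if pmid not in seen)


def keywords_filter(documents, keywords):
    """Offline-style filter: keep documents whose title or abstract
    mentions at least one keyword (assumed semantics)."""
    lowered = [k.lower() for k in keywords]
    for doc in documents:
        text = (doc["title"] + " " + doc["abstract"]).lower()
        if any(k in text for k in lowered):
            yield doc


def select_documents(pmids, seen, fetch, keywords):
    """Chain the stages: id filter -> download -> document filter."""
    fresh = filter_already_seen(pmids, seen)
    documents = (fetch(pmid) for pmid in fresh)  # stand-in for the PubMed download
    return list(keywords_filter(documents, keywords))


# Toy data in place of real PubMed downloads (all values are made up).
FAKE_PUBMED = {
    "1": {"pmid": "1", "title": "A novel mutation in gene X", "abstract": "Text."},
    "2": {"pmid": "2", "title": "Protein folding review", "abstract": "No variants."},
    "3": {"pmid": "3", "title": "Structure of Y", "abstract": "A point mutation alters binding."},
}

selected = select_documents(
    ["1", "2", "3"], seen={"1"}, fetch=FAKE_PUBMED.__getitem__, keywords=["mutation"]
)
print([doc["pmid"] for doc in selected])
```

Running the id filters before the download keeps network traffic low, since only ids that survive them are fetched.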
Structure
# TODO diagram
- Generation of PMIDs through some initial fetching of ids (in our case: UniProtDocumentSelector)
- Online Filters (running on a list of pubmed ids): they need an internet connection, hence the name "Online" Filters
- Convert PMIDs to Documents by downloading each of them through the FromOnlinePubmedReader
- Offline Filters (running on a list of Documents): they need no connection, only the document text, hence the name "Offline" Filters
High Recall Regex Document Filter
This DocumentFilter uses predictions from both Nala and tmVar in order to find new, unknown, natural language mutation mentions using a customised set of regexes. The following diagram shows the data flow from filtered documents: