Build STAG database for genomes - zellerlab/stag GitHub Wiki
To create a STAG database for genomes you will need several inputs. The analysis of a genome is based on a selection of marker genes. Let's say that we have two marker genes: COG12 and COG18.
Create STAG database for single genes
The first step is to create a STAG database for each of the marker genes, using:
stag train -o COG12.stagDB [...] and stag train -o COG18.stagDB [...]
Create STAG database for concatenated sequences
Concatenating alignments gives higher resolution for the assignments. Hence, for the annotation of the entire genome, we build a new STAG database trained on the concatenation of all the marker genes. Note that the order of the genes is important.
To create this database, we first need to create STAG alignments of the training sequences using stag align. Then you will need to concatenate the alignments manually, taking care of missing genes as well. The result is a single file, let's call it concatenated_alis.tsv.
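The manual concatenation could be sketched as follows. This is only an illustration under stated assumptions, not the STAG authors' procedure: it assumes each alignment file is tab-separated with an identifier in the first column that is shared across genes, and it pads a missing gene with a placeholder value (`fill=0`) that is a guess. The file names COG12_ali.tsv and COG18_ali.tsv and the toy rows are hypothetical; check the real output of stag align before adapting it.

```shell
# Toy alignments (hypothetical format: identifier, then one column per position).
printf 'genomeA\t1\t0\ngenomeB\t0\t1\n' > COG12_ali.tsv
printf 'genomeA\t1\t1\t0\n' > COG18_ali.tsv   # genomeB lacks COG18

# Concatenate in a fixed gene order: COG12 first, then COG18.
# "fill" is an assumed placeholder for the columns of a missing gene.
awk -F'\t' -v OFS='\t' -v fill=0 '
    FNR == 1 { nfile++ }                        # index of the current input file
    {
        id = $1
        if (!(id in seen)) { seen[id] = 1; ids[++n] = id }   # keep first-seen order
        row = $2
        for (i = 3; i <= NF; i++) row = row OFS $i
        ali[nfile, id] = row                    # alignment columns of this gene
        if (NF - 1 > width[nfile]) width[nfile] = NF - 1
    }
    END {
        for (k = 1; k <= n; k++) {
            id = ids[k]
            line = id
            for (f = 1; f <= nfile; f++) {
                if ((f, id) in ali) {
                    line = line OFS ali[f, id]
                } else {
                    # Gene missing for this identifier: pad to the gene width.
                    for (i = 1; i <= width[f]; i++) line = line OFS fill
                }
            }
            print line
        }
    }
' COG12_ali.tsv COG18_ali.tsv > concatenated_alis.tsv
```

The order of the file arguments fixes the gene order of the concatenation, so the same order has to be used in every later step.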
Finally, you can use stag create_db to create a new database. Note: you need to provide an HMM file, but it will not be used here, so you can pass any dummy file.
A call like:
stag create_db -s concatenated_alis.tsv -x taxonomy_file -a dummy_file -o concatenated_ali.stagDB
will produce the needed database.
Define HMM thresholds
We need a file that contains the HMM score threshold for each gene. The thresholds refer to the "full sequence" score returned by hmmsearch.
This is a tab-separated file (let's call it thresholds.tsv) like:
COG12.stagDB 60
COG18.stagDB 120
where the first column should have the same name as the corresponding single-gene STAG database, and the second column is the threshold score.
Note that the order of the genes MUST be the same as the order used in the concatenation of genes to create concatenated_ali.stagDB.
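As a small sketch, the thresholds file can be written with printf so that the columns are separated by real tabs, and its gene order can be compared with the order used elsewhere. The variable names `genes` and `order_in_file` are only illustrative, and the check is a convenience, not part of STAG.

```shell
# Gene order used for the concatenation (here: COG12, then COG18).
genes="COG12.stagDB,COG18.stagDB"

# Write the thresholds with a literal tab between the two columns.
printf 'COG12.stagDB\t60\nCOG18.stagDB\t120\n' > thresholds.tsv

# Join the first column with commas and compare it with the expected order.
order_in_file=$(cut -f1 thresholds.tsv | paste -sd, -)
if [ "$order_in_file" = "$genes" ]; then
    echo "order OK"
else
    echo "order mismatch: $order_in_file" >&2
fi
```

The same comma-separated string can then be reused for the -i option of stag train_genome, which keeps the two orders from drifting apart.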
Create final STAG database
We can create the genome STAG database with:
stag train_genome -i COG12.stagDB,COG18.stagDB -T thresholds.tsv -C concatenated_ali.stagDB -o genomes.stagDB
Now, you can classify a genome with:
stag classify_genome -d genomes.stagDB [...]