Getting our pre built reference databases - AstrobioMike/JPL-HBCU-2020 GitHub Wiki
The databases can take a while to build, so we built them beforehand (code used to build is here). Before downloading them with the commands below, we need to have added our additional storage volume as laid out here, so we can put them in that storage volume. They will not fit in our instance's regular storage.
Finding our additional storage drive location
As detailed a little more here, after adding our additional storage, if we run ls / we should see our additional drive listed as something like vol_a, vol_b, or vol_c. We want to be in that location when running the copying commands below, e.g.:
cd /vol_b/
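Since the databases will not fit in the regular storage, it can be worth confirming there is enough free space before starting a long download. Below is a minimal sketch, assuming a POSIX-style df; the helper name has_space is made up for illustration, and the 48 is just the approximate Kraken2 database size quoted on this page.

```shell
# Hypothetical helper (not part of any tool on this page):
# succeeds if the filesystem holding directory $1 has at least $2 GB free.
has_space() {
    # df -Pk prints one data line per filesystem; column 4 is available space in 1K blocks
    df -Pk "$1" | awk -v need="$2" '
        NR == 2 { ok = ($4 / 1048576 >= need) }
        END     { exit !ok }
    '
}

# Check the current directory against the ~48 GB Kraken2 database
if has_space . 48; then
    echo "Enough free space here for the Kraken2 database"
else
    echo "Not enough free space in $(pwd), are we in the added storage volume?"
fi
```

The same check can be repeated with the sizes listed for the other databases further down the page.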
Copying over the databases
Downloading these can take a while, possibly over an hour for the larger ones, so you might want to run the commands inside a screen session, as introduced on this page 🙂
Kraken2/Bracken
This one was assembled as detailed here. It is ~48 GB. Be sure you are in your added storage volume (see top of page); it can then be downloaded from one of my instances with the following command (the password is the same one we've been using):
scp -r [email protected]:/vol_b/kraken2-db/ .
Example run
This example assumes we have installed these programs with conda and activated their environment as demonstrated here.
Getting tiny example data:
curl -L -o sample-1-R1.fq.gz https://ndownloader.figshare.com/files/23237460
curl -L -o sample-1-R2.fq.gz https://ndownloader.figshare.com/files/23237460
Kraken2
This is just an example. Parameters and settings are not special here. Consult their documentation and help menu (kraken2 -h) while figuring out how you want to run things 🙂
kraken2 --db kraken2-db/ --threads 6 \
--output sample-1-kraken2-out.txt --report sample-1-kraken2-report.txt \
--paired sample-1-R1.fq.gz sample-1-R2.fq.gz
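To get a quick look at what the report holds, here is a small sketch that pulls out the species-level rows and sorts them by read count. It assumes the standard tab-delimited Kraken2 report layout (percent, reads in clade, reads assigned directly, rank code, taxid, name); the helper name top_species is made up for illustration.

```shell
# Hypothetical helper: print the top N (default 10) species-level rows
# of a Kraken2 report, sorted by the number of reads in each clade.
top_species() {
    # column 4 is the rank code ("S" = species); column 2 is reads in the clade
    awk -F'\t' '$4 == "S"' "$1" \
        | LC_ALL=C sort -t "$(printf '\t')" -k2,2nr \
        | head -n "${2:-10}"
}

# Only runs if the report from the kraken2 command above exists here
if [ -f sample-1-kraken2-report.txt ]; then
    top_species sample-1-kraken2-report.txt 5
fi
```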
Bracken
Same deal, this is just an example. Parameters and settings are not special here. Consult their documentation and help menu (bracken -h) while figuring out how you want to run things 🙂
bracken -r 150 -d kraken2-db/ -i sample-1-kraken2-report.txt \
-o sample-1-bracken-out.tsv
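For a first look at the Bracken table, the sketch below lists the taxa with the largest estimated fraction of reads. It assumes the usual Bracken output layout, with a header line and fraction_total_reads as the seventh tab-delimited column; the helper name top_bracken_taxa is made up for illustration.

```shell
# Hypothetical helper: print the top N (default 10) rows of a Bracken
# output table, sorted by the estimated fraction of total reads.
top_bracken_taxa() {
    # skip the header line, then sort on column 7 (assumed: fraction_total_reads)
    tail -n +2 "$1" \
        | LC_ALL=C sort -t "$(printf '\t')" -k7,7nr \
        | head -n "${2:-10}"
}

# Only runs if the table from the bracken command above exists here
if [ -f sample-1-bracken-out.tsv ]; then
    top_bracken_taxa sample-1-bracken-out.tsv 5
fi
```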
NOTE
Depending on how things are being evaluated, we may or may not need/want the bracken step. If the goal is to track what each individual read was assigned to, that might be better done with just the kraken2 output. If the goal is to compare expected relative abundances of taxa, that would be better done with the bracken output.
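As a sketch of the per-read route, the kraken2 output file has one line per read (or read pair), so it can be summarized with standard tools. This assumes the standard tab-delimited layout (C/U status, read ID, assigned taxid, length, LCA mapping); the helper names count_read_status and reads_per_taxid are made up for illustration.

```shell
# Hypothetical helper: count classified (C) and unclassified (U) reads
count_read_status() {
    cut -f1 "$1" | sort | uniq -c
}

# Hypothetical helper: list the N (default 10) taxids with the most reads assigned
reads_per_taxid() {
    cut -f3 "$1" | sort | uniq -c | sort -k1,1nr | head -n "${2:-10}"
}

# Only runs if the output from the kraken2 command above exists here
if [ -f sample-1-kraken2-out.txt ]; then
    count_read_status sample-1-kraken2-out.txt
    reads_per_taxid sample-1-kraken2-out.txt 5
fi
```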
Ganon
This one was assembled as detailed here. It is ~100 GB. Be sure you are in your added storage volume (see top of page); it can then be downloaded from one of my instances with the following command (the password is the same one we've been using):
scp -r [email protected]:/vol_b/ganon-db/ .
Example run
This example assumes we have installed ganon with conda and activated its environment as demonstrated here.
Getting tiny example data:
curl -L -o sample-1-R1.fq.gz https://ndownloader.figshare.com/files/23237460
curl -L -o sample-1-R2.fq.gz https://ndownloader.figshare.com/files/23237460
This is just an example. Parameters and settings are not special here. Consult their documentation and help menu (ganon -h, ganon classify -h) while figuring out how you want to run things 🙂
ganon classify --db-prefix ganon-db/ganon-complete-genomes-arc-bac-human-viral-fungi \
--paired-reads sample-1-R1.fq.gz sample-1-R2.fq.gz \
-t 6 -o sample-1-ganon-out
We can filter/modify the output from that with the ganon report command. Here's one example:
ganon report --db-prefix ganon-db/ganon-complete-genomes-arc-bac-human-viral-fungi \
--rep-file sample-1-ganon-out.rep --ranks species \
--output-report sample-1-ganon-out-species.tre
Centrifuge
This one was assembled as detailed here. It is ~34 GB. Be sure you are in your added storage volume (see top of page); it can then be downloaded from one of my instances with the following command (the password is the same one we've been using):
scp -r [email protected]:/vol_b/centrifuge-db/ .
Example run
This example assumes we have installed centrifuge with conda and activated its environment as demonstrated here.
Getting tiny example data:
curl -L -o sample-1-R1.fq.gz https://ndownloader.figshare.com/files/23237460
curl -L -o sample-1-R2.fq.gz https://ndownloader.figshare.com/files/23237460
This is just an example. Parameters and settings are not special here. Consult their documentation and help menu (centrifuge -h) while figuring out how you want to run things 🙂
centrifuge -x centrifuge-db/centrifuge-complete-genomes-arc-bac-human-viral-fungi \
-1 sample-1-R1.fq.gz -2 sample-1-R2.fq.gz \
-S sample-1-centrifuge-out.tsv --report-file sample-1-centrifuge-report.tsv \
-k 1 -p 6
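To peek at the file written by --report-file, the sketch below ranks taxa by their uniquely assigned reads. It assumes the report has a header line with columns named name and numUniqueReads, which is the layout I would expect from centrifuge but is worth checking against your own file; the helper name top_centrifuge_taxa is made up for illustration.

```shell
# Hypothetical helper: print the top N (default 10) taxa from a centrifuge
# report file as "numUniqueReads<TAB>name", sorted by numUniqueReads.
top_centrifuge_taxa() {
    awk -F'\t' '
        # find columns by header name rather than by fixed position
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        { print $(col["numUniqueReads"]) "\t" $(col["name"]) }
    ' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k1,1nr | head -n "${2:-10}"
}

# Only runs if the report from the centrifuge command above exists here
if [ -f sample-1-centrifuge-report.tsv ]; then
    top_centrifuge_taxa sample-1-centrifuge-report.tsv 5
fi
```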
If helpful, we can make a kraken-style summary output, e.g.:
centrifuge-kreport -x centrifuge-db/centrifuge-complete-genomes-arc-bac-human-viral-fungi sample-1-centrifuge-out.tsv > sample-1-centrifuge-reformatted-out.tsv