Getting our pre-built reference databases - AstrobioMike/JPL-HBCU-2020 GitHub Wiki

The databases can take a while to build, so we built them beforehand (code used to build is here). Before downloading them with the commands below, we want to have added our additional storage volume as laid out here, so that we can put them there. They will not fit in our instance's regular storage.

Finding our additional storage drive location

As detailed a little bit more here, after adding our additional storage, if we run ls / we should see our additional drive listed as something like vol_a, vol_b, or vol_c. We want to be in that location when running the copying commands below, e.g.:

cd /vol_b/
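
Since the databases won't fit in regular storage, it can be worth confirming that the current location really has the room before starting a long download. Here is a minimal sketch of one way to check, assuming a standard `df` is available; `has_free_gb` is a hypothetical helper name made up for this example, and the 182 GB figure is just the sum of the approximate sizes listed further down this page:

```shell
# Hypothetical helper (not part of any tool used on this page):
# succeeds if the filesystem holding DIR has at least NEED_GB gigabytes free.
has_free_gb() {
    dir="$1"
    need_gb="$2"
    # df -P gives one line per filesystem, -k reports 1024-byte blocks;
    # the 4th column of the 2nd line is the available space
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( $need_gb * 1024 * 1024 )) ]
}

# The three databases on this page add up to roughly 48 + 100 + 34 = 182 GB
if has_free_gb . 182; then
    echo "enough room here for all three databases"
else
    echo "not enough room here, are we in the added storage volume?"
fi
```

Running `df -h .` by hand gives the same information in a human-readable form.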

Copying over the databases

Downloading these can take a while, possibly over an hour for the larger ones, so you might want to consider running the commands inside a screen as introduced on this page 🙂
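
If we're not sure whether we are currently inside a screen, one quick way to tell is sketched below. This assumes GNU screen, which sets the `STY` environment variable in the shells it starts; `in_screen` is a hypothetical helper name and `db-download` is just an arbitrary session name picked for this example:

```shell
# Hypothetical helper: succeeds only when run from inside a GNU screen session
# (screen sets the STY environment variable in the shells it starts)
in_screen() {
    [ -n "${STY:-}" ]
}

if in_screen; then
    echo "inside screen session: $STY"
else
    echo "not in a screen; 'screen -S db-download' would start one"
fi

# Typical screen usage, for reference:
#   screen -S db-download    # start a named session
#   Ctrl-a then d            # detach, leaving things running
#   screen -r db-download    # reattach later
```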

Kraken2/Bracken

This one was assembled as detailed here. It is ~48 GB. Be sure you are in your added storage volume (see top of page), then download it from one of my instances with the following command (the password is the same as we've been using):

scp -r [email protected]:/vol_b/kraken2-db/ .

Example run

This example assumes we have installed these programs with conda and activated their environment as demonstrated here.

Getting tiny example data:

curl -L -o sample-1-R1.fq.gz https://ndownloader.figshare.com/files/23237460
curl -L -o sample-1-R2.fq.gz https://ndownloader.figshare.com/files/23237460
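
Downloads occasionally get cut short, so before running anything on these files we might want to confirm they are intact gzip files. A minimal sketch, assuming `gzip` is installed; `check_gz` is a hypothetical helper name, not something provided by any of the programs here:

```shell
# Hypothetical helper: test each given file with 'gzip -t' and report the result.
# Returns non-zero if any file fails the integrity test.
check_gz() {
    status=0
    for f in "$@"; do
        if gzip -t "$f" 2>/dev/null; then
            echo "ok: $f"
        else
            echo "PROBLEM: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Example use once the downloads above have finished:
#   check_gz sample-1-R1.fq.gz sample-1-R2.fq.gz
```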

Kraken2

This is just an example. Parameters and settings are not special here. Consult their documentation and help menu (kraken2 -h) while figuring out how you want to run things 🙂

kraken2 --db kraken2-db/ --threads 6 \
        --output sample-1-kraken2-out.txt --report sample-1-kraken2-report.txt \
        --paired sample-1-R1.fq.gz sample-1-R2.fq.gz
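
To get a quick look at what came out, we can pull the species-level rows from the report file. This is only a sketch, assuming the standard six-column, tab-separated kraken2 report layout (percent, reads in clade, reads assigned directly, rank code, taxid, indented name), which is what we'd expect from the command above since it doesn't request minimizer data; `top_species` is a hypothetical helper name:

```shell
# Hypothetical helper: print the N species (rank code "S") with the most reads
# from a kraken2 report, as "clade_reads<TAB>name". N defaults to 10.
top_species() {
    report="$1"
    n="${2:-10}"
    tab=$(printf '\t')
    # column 4 is the rank code, column 2 the reads in the clade,
    # column 6 the name (indented with spaces, which we strip)
    awk -F'\t' '$4 == "S" { gsub(/^ +/, "", $6); print $2 "\t" $6 }' "$report" \
        | sort -t "$tab" -k1,1nr \
        | head -n "$n"
}

# Example use after the kraken2 command above:
#   top_species sample-1-kraken2-report.txt 5
```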

Bracken

Same deal, this is just an example. Parameters and settings are not special here. Consult their documentation and help menu (bracken -h) while figuring out how you want to run things 🙂

bracken -r 150 -d kraken2-db/ -i sample-1-kraken2-report.txt \
        -o sample-1-bracken-out.tsv

NOTE
Depending on how things are being evaluated, we may or may not need/want the bracken step. If the goal is to track what each individual read was assigned to, that might be better done with just the kraken2 output. If the goal is to compare expected relative abundances of taxa, that would be better done with the bracken output.
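
For the per-read case, the file given to kraken2's --output has one line per read (or read pair). As a small sketch of working with it, assuming the standard kraken2 output layout where the first tab-separated column is C (classified) or U (unclassified), we could tally the two; `count_read_calls` is a hypothetical helper name:

```shell
# Hypothetical helper: count classified (C) and unclassified (U) lines
# in a kraken2 per-read output file.
count_read_calls() {
    awk -F'\t' '
        { n[$1]++ }
        END { printf "classified\t%d\nunclassified\t%d\n", n["C"], n["U"] }
    ' "$1"
}

# Example use after the kraken2 command above:
#   count_read_calls sample-1-kraken2-out.txt
```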

Ganon

This one was assembled as detailed here. It is ~100 GB. Be sure you are in your added storage volume (see top of page), then download it from one of my instances with the following command (the password is the same as we've been using):

scp -r [email protected]:/vol_b/ganon-db/ .

Example run

This example assumes we have installed ganon with conda and activated its environment as demonstrated here.

Getting tiny example data:

curl -L -o sample-1-R1.fq.gz https://ndownloader.figshare.com/files/23237460
curl -L -o sample-1-R2.fq.gz https://ndownloader.figshare.com/files/23237460

This is just an example. Parameters and settings are not special here. Consult their documentation and help menu (ganon -h, ganon classify -h) while figuring out how you want to run things 🙂

ganon classify --db-prefix ganon-db/ganon-complete-genomes-arc-bac-human-viral-fungi \
               --paired-reads sample-1-R1.fq.gz sample-1-R2.fq.gz \
               -t 6 -o sample-1-ganon-out

We can filter/modify the output from that with the ganon report command. Here's one example:

ganon report --db-prefix ganon-db/ganon-complete-genomes-arc-bac-human-viral-fungi \
             --rep-file sample-1-ganon-out.rep --ranks species \
             --output-report sample-1-ganon-out-species.tre

Centrifuge

This one was assembled as detailed here. It is ~34 GB. Be sure you are in your added storage volume (see top of page), then download it from one of my instances with the following command (the password is the same as we've been using):

scp -r [email protected]:/vol_b/centrifuge-db/ .

Example run

This example assumes we have installed centrifuge with conda and activated its environment as demonstrated here.

Getting tiny example data:

curl -L -o sample-1-R1.fq.gz https://ndownloader.figshare.com/files/23237460
curl -L -o sample-1-R2.fq.gz https://ndownloader.figshare.com/files/23237460

This is just an example. Parameters and settings are not special here. Consult their documentation and help menu (centrifuge -h) while figuring out how you want to run things 🙂

centrifuge -x centrifuge-db/centrifuge-complete-genomes-arc-bac-human-viral-fungi \
           -1 sample-1-R1.fq.gz -2 sample-1-R2.fq.gz \
           -S sample-1-centrifuge-out.tsv --report-file sample-1-centrifuge-report.tsv \
           -k 1 -p 6

If helpful, we can make a kraken-style summary output, e.g.:

centrifuge-kreport -x centrifuge-db/centrifuge-complete-genomes-arc-bac-human-viral-fungi sample-1-centrifuge-out.tsv > sample-1-centrifuge-reformatted-out.tsv