Building the reference databases - AstrobioMike/JPL-HBCU-2020 GitHub Wiki

This page holds the code that was used to build each of our databases.

NOTE
This page only documents how the databases were built. We don't need to run these commands, since we can just use the pre-built databases as discussed here 🙂

Kraken2/Bracken

This was initially built on 20-July-2020, and fungi were added on 29-July-2020. (Initially built on S1.Xxlarge instance.)

Creating conda environment

conda create -y -n kraken2 -c conda-forge -c bioconda -c defaults kraken2=2.0.9beta bracken=2.6.0

conda activate kraken2

Setting up kraken2 standard database

Following along with here.

mkdir kraken2-standard-db

Downloading and building reference database

This was initially built without fungi, and the steps below still reflect that. If building from scratch, this is not the most time-efficient approach, because the database ends up being built twice.

Downloading the reference data and building the standard database (note that this also masks low-complexity regions by default):

kraken2-build --standard --db kraken2-standard-db/ --threads 42

Adding fungi

Making a copy in case things go south:

cp -r kraken2-standard-db/ full-kraken2-standard-db-plus-fungi/
kraken2-build --download-library fungi --db full-kraken2-standard-db-plus-fungi

Need to delete db files for it to build again:

rm full-kraken2-standard-db-plus-fungi/*.k2d full-kraken2-standard-db-plus-fungi/*kraken full-kraken2-standard-db-plus-fungi/*distrib full-kraken2-standard-db-plus-fungi/seqid2taxid.map
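Since `rm` with globs is unforgiving, it can help to list what would be removed first. The following is only a minimal sketch, assuming bash; the function name `list_build_outputs` is hypothetical and not part of kraken2:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of kraken2): print the previous build outputs
# in a database directory that would need removing before a rebuild.
list_build_outputs() {
    local db_dir="${1%/}"   # drop a trailing slash, if any
    local f
    for f in "$db_dir"/*.k2d "$db_dir"/*kraken "$db_dir"/*distrib "$db_dir"/seqid2taxid.map; do
        # Unmatched globs stay literal, so only report paths that exist.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

For example, `list_build_outputs full-kraken2-standard-db-plus-fungi` prints one path per line, which can be checked by eye before running the `rm`.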

And building:

kraken2-build --build --db full-kraken2-standard-db-plus-fungi --threads 42
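A built kraken2 database consists of three files: `hash.k2d`, `opts.k2d`, and `taxo.k2d`. As a rough sanity check after a long build, something like the following could confirm they are all present. This is a sketch, assuming bash; `kraken2_db_is_built` is a hypothetical name, not a kraken2 command:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the three files that make up a built
# kraken2 database are present and non-empty in the given directory.
kraken2_db_is_built() {
    local db_dir="${1%/}"
    local name
    for name in hash.k2d opts.k2d taxo.k2d; do
        if [ ! -s "$db_dir/$name" ]; then
            echo "missing or empty: $db_dir/$name" >&2
            return 1
        fi
    done
    return 0
}
```

It could be used as `kraken2_db_is_built full-kraken2-standard-db-plus-fungi && echo "build looks complete"`.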

Setting up Bracken

Roughly following along from here.

bracken-build -d full-kraken2-standard-db-plus-fungi -t 42 -l 150
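The `-l 150` above is the read length the Bracken files are built for, and Bracken names its k-mer distribution file after it (here `database150mers.kmer_distrib`). A small sketch, assuming bash, for checking whether a database directory already has the files for a given read length; `bracken_has_read_length` is a hypothetical name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the Bracken k-mer distribution file for the
# given read length exists (and is non-empty) in the database directory.
bracken_has_read_length() {
    local db_dir="${1%/}"
    local read_len="$2"
    [ -s "$db_dir/database${read_len}mers.kmer_distrib" ]
}
```

For example, `bracken_has_read_length full-kraken2-standard-db-plus-fungi 150 || echo "need to run bracken-build for this read length"`.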

Clean up

Removing intermediate files (saves a lot of space):

kraken2-build --clean --db full-kraken2-standard-db-plus-fungi/

See here for an example kraken2 and bracken run.


Ganon

Built on 29-July-2020. (Initially built on S1.Xxlarge instance.)

Creating conda environment

conda create -y -n ganon -c conda-forge -c bioconda -c defaults ganon=0.2.3 genome_updater=0.2.2

conda activate ganon

Setting up reference db

Downloading reference genomes

Generally following their instructions here. This matches what will be in kraken2's standard db: bacterial, archaeal, viral, fungal, and human genomes (only UniVec_core is missing, as I can't find it, but it's tiny and is meant to capture synthetic sequences like adapters).

genome_updater.sh -g archaea,bacteria,human,viral,fungi -d refseq -l "Complete Genome" -f genomic.fna.gz,assembly_report.txt -o refseq-complete-genomes-arc-bac-human-viral-fungi -b v1 -a -m -u -r -p -t 42

Building reference db

ganon build --db-prefix ganon-complete-genomes-arc-bac-human-viral-fungi --input-directory refseq-complete-genomes-arc-bac-human-viral-fungi/v1/files/ --input-extension "_genomic.fna.gz" -t 42
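Before kicking off a long build, it may be worth confirming how many input files the directory and extension given above actually match. A minimal sketch, assuming bash and `find`; the function name `count_build_inputs` is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the entries directly inside a directory whose
# names end with the given extension (e.g. "_genomic.fna.gz").
count_build_inputs() {
    local input_dir="$1"
    local ext="$2"
    find "$input_dir" -maxdepth 1 -name "*${ext}" | wc -l | tr -d ' '
}
```

For example, `count_build_inputs refseq-complete-genomes-arc-bac-human-viral-fungi/v1/files/ "_genomic.fna.gz"` prints the number of genome files that would be passed to the build.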

See here for an example ganon run.


Centrifuge

Built on 29-July-2020. (I had to do it on a server with more RAM than our instances can provide.)

Creating conda environment

# blast is included for dustmasker
conda create -y -n centrifuge -c conda-forge -c bioconda -c defaults centrifuge=1.0.4_beta blast=2.9.0

conda activate centrifuge

Setting up reference db

Downloading reference genomes

Generally following their instructions here. Roughly matching what will be in kraken2's standard db: bacterial, archaeal, viral, fungal, and human genomes (same as ganon, just missing UniVec_core, as I can't find it, but it's tiny).

centrifuge-download -o taxonomy taxonomy

centrifuge-download -P 42 -o library -m -d "archaea,bacteria,viral,fungi" refseq > seqid2taxid.map

Adding human:

centrifuge-download -o library -d "vertebrate_mammalian" -a "Chromosome" -t 9606 -c 'reference genome' refseq >> seqid2taxid.map

Catting sequences together:

cat library/archaea/*.fna > input-sequences.fna
cat library/vertebrate_mammalian/*.fna >> input-sequences.fna
cat library/viral/*.fna >> input-sequences.fna
cat library/fungi/*.fna >> input-sequences.fna
cat library/bacteria/*.fna >> input-sequences.fna
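Every sequence in the combined FASTA needs an entry in `seqid2taxid.map` for the build to place it in the taxonomy. The following is a sketch of one way to check that, assuming bash and awk, and assuming the map is whitespace-delimited with the sequence ID in the first column; `missing_taxids` is a hypothetical name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the sequence IDs found in a FASTA file that have
# no entry in the seqid-to-taxid conversion table.
missing_taxids() {
    local fasta="$1"
    local map="$2"
    # First file (the map): remember every sequence ID in column 1.
    # Second file (the FASTA): for each header, strip ">" and report unknown IDs.
    awk 'FILENAME == ARGV[1] { has_taxid[$1] = 1; next }
         /^>/ { id = substr($1, 2); if (!(id in has_taxid)) print id }' "$map" "$fasta"
}
```

Running `missing_taxids input-sequences.fna seqid2taxid.map` should print nothing if every sequence has a mapping.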

Building reference db

centrifuge-build -p 16 --bmax 1342177280 --conversion-table seqid2taxid.map \
                 --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \
                 input-sequences.fna centrifuge-complete-genomes-arc-bac-human-viral-fungi

Removing unneeded files

rm -rf input-sequences.fna library/

See here for an example centrifuge run.