Building the reference databases - AstrobioMike/JPL-HBCU-2020 GitHub Wiki
This page holds the code that was used to build each of our databases.
NOTE
This page only documents how the databases were built. We don't need to run these commands, since we can just use the pre-built databases as discussed here 🙂
Kraken2/Bracken
This was initially built on 20-July-2020, and fungi were added on 29-July-2020. (Initially built on S1.Xxlarge instance.)
Creating conda environment
conda create -y -n kraken2 -c conda-forge -c bioconda -c defaults kraken2=2.0.9beta bracken=2.6.0
conda activate kraken2
Setting up kraken2 standard database
Following along with here.
mkdir kraken2-standard-db
Downloading and building reference database
This was initially built without fungi, so the steps below are still documented that way. If building from scratch, this is not the most time-efficient approach, because the database ends up being built twice.
Downloading reference info (note, this also masks low-complexity regions by default):
kraken2-build --standard --db kraken2-standard-db/ --threads 42
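To avoid building twice, the libraries can be downloaded one by one, fungi included, followed by a single build. Below is a minimal sketch of that idea, not the commands that were actually run here. The `DRY_RUN`, `DB`, `THREADS`, and `LIBRARIES` names are illustrative, and the library list is an assumption about what `--standard` pulls in for this kraken2 version, so it should be checked against the installed release before use. By default the script only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: download all libraries (standard set plus fungi), then build once.
# DB, THREADS, DRY_RUN and LIBRARIES are illustrative names (assumptions).

DB="${DB:-full-kraken2-standard-db-plus-fungi}"
THREADS="${THREADS:-42}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = actually run them

# Assumed contents of the standard database, plus fungi.
LIBRARIES=(archaea bacteria plasmid viral human UniVec_Core fungi)

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

single_pass_build() {
    run kraken2-build --download-taxonomy --db "$DB" || return 1
    local lib
    for lib in "${LIBRARIES[@]}"; do
        run kraken2-build --download-library "$lib" --db "$DB" || return 1
    done
    # One build at the end, after every library is in place.
    run kraken2-build --build --db "$DB" --threads "$THREADS"
}

single_pass_build
```

Setting `DRY_RUN=0` in the environment would run the printed commands for real, inside the kraken2 conda environment.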
Adding fungi
Making a copy in case things go south:
cp -r kraken2-standard-db/ full-kraken2-standard-db-plus-fungi/
kraken2-build --download-library fungi --db full-kraken2-standard-db-plus-fungi
Need to delete db files for it to build again:
rm full-kraken2-standard-db-plus-fungi/*.k2d full-kraken2-standard-db-plus-fungi/*kraken full-kraken2-standard-db-plus-fungi/*distrib full-kraken2-standard-db-plus-fungi/seqid2taxid.map
And building:
kraken2-build --build --db full-kraken2-standard-db-plus-fungi --threads 42
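Since the rebuild only happens if the old database files are gone, it can help to confirm afterwards that the new ones were written. The sketch below checks for the three `.k2d` files that make up a kraken2 database (`hash.k2d`, `opts.k2d`, `taxo.k2d`). The function name `k2_db_is_built` is hypothetical, and the demonstration runs on a throwaway directory with placeholder files rather than a real database.

```shell
# Sketch: report whether a kraken2 database directory holds all three .k2d files.
# k2_db_is_built is a hypothetical helper name, not part of kraken2.
k2_db_is_built() {
    local db="$1" f
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [ ! -s "$db/$f" ]; then
            echo "missing or empty: $db/$f" >&2
            return 1
        fi
    done
}

# Tool-free demonstration on a temporary directory.
demo_db="$(mktemp -d)"
k2_db_is_built "$demo_db" 2>/dev/null || echo "not built yet"
for f in hash.k2d opts.k2d taxo.k2d; do
    echo placeholder > "$demo_db/$f"
done
k2_db_is_built "$demo_db" && echo "looks built"
rm -rf "$demo_db"
```

For the real database this would be called as `k2_db_is_built full-kraken2-standard-db-plus-fungi`.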
Setting up Bracken
Roughly following along from here.
bracken-build -d full-kraken2-standard-db-plus-fungi -t 42 -l 150
Clean up
Removing intermediate files (saves a lot of space):
kraken2-build --clean --db full-kraken2-standard-db-plus-fungi/
See here for an example kraken2 and bracken run.
Ganon
Built on 29-July-2020. (Initially built on S1.Xxlarge instance.)
Creating conda environment
conda create -y -n ganon -c conda-forge -c bioconda -c defaults ganon=0.2.3 genome_updater=0.2.2
conda activate ganon
Setting up reference db
Downloading reference genomes
Generally following their instructions here. Matching what will be in kraken2's standard db: bacterial, archaeal, viral, fungal, and human genomes (just missing UniVec_Core, as I can't find it, but it's tiny and made to capture synthetic sequencing material like adapters).
genome_updater.sh -g archaea,bacteria,human,viral,fungi -d refseq -l "Complete Genome" -f genomic.fna.gz,assembly_report.txt -o refseq-complete-genomes-arc-bac-human-viral-fungi -b v1 -a -m -u -r -p -t 42
Building reference db
ganon build --db-prefix ganon-complete-genomes-arc-bac-human-viral-fungi --input-directory refseq-complete-genomes-arc-bac-human-viral-fungi/v1/files/ --input-extension "_genomic.fna.gz" -t 42
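Because the build selects its inputs by directory and file extension, a quick count of matching files before starting can catch an empty or mistyped path. This is a minimal sketch: `count_inputs` is a hypothetical helper, and it assumes the genomes sit directly in the input directory rather than in subdirectories. The demonstration uses empty placeholder files in a temporary directory.

```shell
# Sketch: count the files in a directory that end with a given extension.
# count_inputs is a hypothetical helper; it does not look into subdirectories.
count_inputs() {
    local dir="$1" ext="$2"
    find "$dir" -maxdepth 1 -type f -name "*${ext}" | wc -l | tr -d ' '
}

# Tool-free demonstration with placeholder files.
demo_dir="$(mktemp -d)"
touch "$demo_dir/GCF_1_genomic.fna.gz" \
      "$demo_dir/GCF_2_genomic.fna.gz" \
      "$demo_dir/GCF_1_assembly_report.txt"
n="$(count_inputs "$demo_dir" "_genomic.fna.gz")"
echo "found $n input files"
rm -rf "$demo_dir"
```

For the real download this would be `count_inputs refseq-complete-genomes-arc-bac-human-viral-fungi/v1/files/ "_genomic.fna.gz"`.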
See here for an example ganon run.
Centrifuge
Built on 29-July-2020. (I had to do it on a server with more RAM than our instances can provide.)
Creating conda environment
# blast is included for dustmasker
conda create -y -n centrifuge -c conda-forge -c bioconda -c defaults centrifuge=1.0.4_beta blast=2.9.0
conda activate centrifuge
Setting up reference db
Downloading reference genomes
Generally following their instructions here. Roughly matching what will be in kraken2's standard db: bacterial, archaeal, viral, fungal, and human genomes (same as ganon, just missing UniVec_Core, as I can't find it, but it's tiny).
centrifuge-download -o taxonomy taxonomy
centrifuge-download -P 42 -o library -m -d "archaea,bacteria,viral,fungi" refseq > seqid2taxid.map
Adding human:
centrifuge-download -o library -d "vertebrate_mammalian" -a "Chromosome" -t 9606 -c 'reference genome' refseq >> seqid2taxid.map
Catting sequences together:
cat library/archaea/*.fna > input-sequences.fna
cat library/vertebrate_mammalian/*.fna >> input-sequences.fna
cat library/viral/*.fna >> input-sequences.fna
cat library/fungi/*.fna >> input-sequences.fna
cat library/bacteria/*.fna >> input-sequences.fna
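Since the conversion table was written in two separate download steps, it may be worth checking that every sequence in the concatenated FASTA has a taxid entry before starting the long build. The sketch below is one way to do that, assuming the table is whitespace-separated with the sequence ID in the first column and that each FASTA header starts with that same ID. The function name `missing_taxids` is hypothetical, and the demonstration uses small made-up files.

```shell
# Sketch: print sequence IDs from a FASTA that have no entry in the conversion table.
# missing_taxids is a hypothetical helper name. Assumes column 1 of the table is
# the sequence ID, and that the FASTA header up to the first space is that ID.
missing_taxids() {
    local fasta="$1" map="$2"
    awk 'FILENAME == ARGV[1] { seen[$1] = 1; next }   # table: remember each ID
         /^>/ { id = substr($1, 2)                    # FASTA header: drop the ">"
                if (!(id in seen)) print id }' "$map" "$fasta"
}

# Tool-free demonstration with made-up IDs.
tmp="$(mktemp -d)"
printf '>seqA some description\nACGT\n>seqB\nGGCC\n' > "$tmp/in.fna"
printf 'seqA\t562\n' > "$tmp/map.tsv"
missing_taxids "$tmp/in.fna" "$tmp/map.tsv"   # prints: seqB
rm -rf "$tmp"
```

For the real files this would be `missing_taxids input-sequences.fna seqid2taxid.map`, where empty output means every sequence is covered.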
Building reference db
centrifuge-build -p 16 --bmax 1342177280 --conversion-table seqid2taxid.map \
--taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \
input-sequences.fna centrifuge-complete-genomes-arc-bac-human-viral-fungi
Removing unneeded files
rm -rf input-sequences.fna library/
See here for an example centrifuge run.