Kraken2 - CBC-UCONN/software-example-guide GitHub Wiki

Kraken2 Database Creation

#!/bin/bash
#SBATCH --job-name=kraken2
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 8
#SBATCH --mem=50G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=[email protected]
#SBATCH -o %x_%A.out
#SBATCH -e %x_%A.err

hostname
date

module load blast/2.7.1
module load kraken/2.0.8-beta

kraken2-build --standard --threads 8 --db standard

This downloads and builds the standard Kraken2 database, which includes the following libraries:

├── archaea
├── bacteria
├── human
├── UniVec_Core
└── viral

Creating a Custom Database

Kraken2 provides reference libraries. Several sets of standard genomes/proteins are made easily available through the kraken2-build command.

  • archaea: RefSeq complete archaeal genomes/proteins
  • bacteria: RefSeq complete bacterial genomes/proteins
  • plasmid: RefSeq plasmid nucleotide/protein sequences
  • viral: RefSeq complete viral genomes/proteins
  • human: GRCh38 human genome/proteins
  • fungi: RefSeq complete fungal genomes/proteins
  • plant: RefSeq complete plant genomes/proteins
  • protozoa: RefSeq complete protozoan genomes/proteins
  • nr: NCBI non-redundant protein database
  • nt: NCBI non-redundant nucleotide database
  • env_nr: NCBI non-redundant protein database with sequences from large environmental sequencing projects
  • env_nt: NCBI non-redundant nucleotide database with sequences from large environmental sequencing projects
  • UniVec: NCBI-supplied database of vector, adapter, linker, and primer sequences that may be contaminating sequencing projects and/or assemblies
  • UniVec_Core: A subset of UniVec chosen to minimize false positive hits to the vector database

Steps to creating a reference database

  1. Download the taxonomy
    This can be done using:
    kraken2-build --download-taxonomy --db $DBNAME

  2. Download one or more of the reference libraries listed above, for example:
    kraken2-build --download-library bacteria --db $DBNAME

  3. Once the desired libraries have been downloaded, finalize the database with the following command:
    kraken2-build --build --db $DBNAME
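
The three steps can be chained for several libraries at once. The sketch below is an illustration under stated assumptions, not part of the original guide: the function name `build_kraken2_db`, the `DRY_RUN` variable, and the database name `my_db` are hypothetical. With `DRY_RUN=1` (the default assumed here) the commands are only printed, so the sequence can be reviewed before a job is submitted.

```shell
#!/bin/bash
# Hypothetical helper: run (or just print) the kraken2-build steps
# for one database and one or more reference libraries.
# Usage: build_kraken2_db DBNAME LIBRARY [LIBRARY ...]
build_kraken2_db() {
    local db="$1"
    shift
    local run=""
    # DRY_RUN=1 (assumed default) prints each command instead of running it.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        run="echo"
    fi

    # Step 1: download the taxonomy.
    $run kraken2-build --download-taxonomy --db "$db" || return 1

    # Step 2: download each requested reference library.
    local lib
    for lib in "$@"; do
        $run kraken2-build --download-library "$lib" --db "$db" || return 1
    done

    # Step 3: build the database from the downloaded libraries.
    $run kraken2-build --build --db "$db" || return 1
}

# Example: print the commands for a database holding archaeal and viral genomes.
DRY_RUN=1 build_kraken2_db my_db archaea viral
```

Setting `DRY_RUN=0` inside a SLURM script would execute the same commands instead of printing them.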

Example script for creating a bacteria reference database:

#!/bin/bash
#SBATCH --job-name=kraken2
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 8
#SBATCH --mem=50G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=[email protected]
#SBATCH -o %x_%A.out
#SBATCH -e %x_%A.err

hostname
date

module load blast/2.7.1
module load kraken/2.0.8-beta

kraken2-build --download-taxonomy --db bacteria 

kraken2-build --download-library bacteria --db bacteria --threads 8  

kraken2-build --build --db bacteria --threads 8

This will create the following folder structure once it completes:

bacteria/
├── hash.k2d
├── library
│   └── bacteria
│       ├── assembly_summary.txt
│       ├── library.fna
│       ├── library.fna.masked
│       ├── manifest.txt
│       └── prelim_map.txt
├── opts.k2d
├── seqid2taxid.map
├── taxo.k2d
└── taxonomy
    ├── accmap.dlflag
    ├── citations.dmp
    ├── delnodes.dmp
    ├── division.dmp
    ├── gc.prt
    ├── gencode.dmp
    ├── merged.dmp
    ├── names.dmp
    ├── nodes.dmp
    ├── nucl_gb.accession2taxid
    ├── nucl_wgs.accession2taxid
    ├── prelim_map.txt
    ├── readme.txt
    ├── taxdump.dlflag
    ├── taxdump.tar.gz
    └── taxdump.untarflag

More specific information on building Kraken2 databases can be found at the Kraken2 home page.
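
Of the files in the listing above, Kraken2 reads only hash.k2d, opts.k2d, and taxo.k2d at classification time; the rest are intermediate files from the download and build. A minimal sketch for checking that a build finished is shown below. The function name `check_kraken2_db` is a hypothetical helper, not part of Kraken2.

```shell
#!/bin/bash
# Hypothetical helper: verify that a Kraken2 database directory contains
# the three files Kraken2 reads at classification time.
# Usage: check_kraken2_db DBDIR
check_kraken2_db() {
    local db="$1"
    local missing=0
    local f
    for f in hash.k2d opts.k2d taxo.k2d; do
        # -s is true only when the file exists and is not empty.
        if [ ! -s "$db/$f" ]; then
            echo "missing or empty: $db/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: check the directory given as the first argument
# (falls back to the "bacteria" database directory).
if check_kraken2_db "${1:-bacteria}"; then
    echo "database looks complete"
else
    echo "database is incomplete"
fi
```

If the check passes and disk space is a concern, the intermediate files can be removed with `kraken2-build --clean --db bacteria`.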

Classify a Set of Sequences

Using Paired-end Sequences

#!/bin/bash
#SBATCH --job-name=kraken2
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 8
#SBATCH --mem=50G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=[email protected]
#SBATCH -o %x_%A.out
#SBATCH -e %x_%A.err

hostname
date

module load kraken/2.0.8-beta

kraken2 --db /isg/shared/databases/kraken2/Minikraken2_v1 \
        --fastq-input --paired R1.fastq R2.fastq \
        --use-names \
        --threads 8 \
        --output kraken_test.out \
        --unclassified-out unclassified#.fastq \
        --classified-out classified#.fastq \
        --report kraken_report.txt \
        --use-mpa-style

This creates the per-read output, the classified and unclassified sequence files, and the report. Kraken2's run summary is written to the SLURM log files.

├── classified_1.fastq
├── classified_2.fastq
├── kraken2_198062.err
├── kraken2_198062.out
├── kraken_report.txt
├── kraken_test.out
├── unclassified_1.fastq
└── unclassified_2.fastq
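
The per-read output file (kraken_test.out) is tab-delimited, and its first column is `C` for a classified read or `U` for an unclassified one. A minimal sketch for tallying the two is shown below. The function name `summarize_kraken_output` is a hypothetical helper, and the percentage it prints is computed here rather than taken from Kraken2's own summary.

```shell
#!/bin/bash
# Hypothetical helper: count classified (C) and unclassified (U) reads
# from the first column of Kraken2's per-read output.
# Usage: summarize_kraken_output FILE
summarize_kraken_output() {
    awk -F'\t' '
        $1 == "C" { classified++ }
        $1 == "U" { unclassified++ }
        END {
            total = classified + unclassified
            pct = (total > 0) ? 100 * classified / total : 0
            printf "classified\t%d\nunclassified\t%d\npercent_classified\t%.2f\n",
                   classified, unclassified, pct
        }
    ' "$1"
}

# Example: summarize the output file from the job, if it exists.
if [ -f kraken_test.out ]; then
    summarize_kraken_output kraken_test.out
fi
```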

NOTE
Kraken2 databases are prepared on the Xanadu cluster. For more information, please refer to the database web page.