Kraken2 - CBC-UCONN/software-example-guide GitHub Wiki
Contents
- Kraken2 Database Creation
- Creating a Custom Specific Database
- Classify a Set of Sequences (using paired-end sequences)
Kraken2 Database Creation
#!/bin/bash
#SBATCH --job-name=kraken2
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 8
#SBATCH --mem=50G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your-email-address>
#SBATCH -o %x_%A.out
#SBATCH -e %x_%A.err
hostname
date
module load blast/2.7.1
module load kraken/2.0.8-beta
kraken2-build --standard --threads 8 --db standard
This will download the standard Kraken2 database, which includes:
├── archaea
├── bacteria
├── human
├── UniVec_Core
└── viral
Creating a Custom Specific Database
Kraken2 provides reference libraries. Several sets of standard genomes/proteins are made easily available through the kraken2-build command:
- archaea: RefSeq complete archaeal genomes/proteins
- bacteria: RefSeq complete bacterial genomes/proteins
- plasmid: RefSeq plasmid nucleotide/protein sequences
- viral: RefSeq complete viral genomes/proteins
- human: GRCh38 human genome/proteins
- fungi: RefSeq complete fungal genomes/proteins
- plant: RefSeq complete plant genomes/proteins
- protozoa: RefSeq complete protozoan genomes/proteins
- nr: NCBI non-redundant protein database
- nt: NCBI non-redundant nucleotide database
- env_nr: NCBI non-redundant protein database with sequences from large environmental sequencing projects
- env_nt: NCBI non-redundant nucleotide database with sequences from large environmental sequencing projects
- UniVec: NCBI-supplied database of vector, adapter, linker, and primer sequences that may be contaminating sequencing projects and/or assemblies
- UniVec_Core: A subset of UniVec chosen to minimize false positive hits to the vector database
Steps for creating a reference database
- Download the taxonomy:
kraken2-build --download-taxonomy --db $DBNAME
- Download any one of the libraries listed above, for example bacteria:
kraken2-build --download-library bacteria --db $DBNAME
- Once the desired libraries have been downloaded, finalize the database with the following command:
kraken2-build --build --db $DBNAME
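The three steps above can be combined when a database needs more than one library. The following is a minimal sketch, not an official Kraken2 workflow: the function names `run` and `build_custom_db` and the `DRY_RUN` variable are hypothetical helpers introduced here for illustration. Only the `kraken2-build` options already shown on this page are used.

```shell
#!/bin/bash
# Sketch: build a custom Kraken2 database from several reference libraries.
# run, build_custom_db and DRY_RUN are hypothetical names, not part of Kraken2.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: build_custom_db <db_name> <threads> <library> [<library> ...]
build_custom_db() {
    local db="$1" threads="$2" lib
    shift 2
    # Step 1: download the taxonomy.
    run kraken2-build --download-taxonomy --db "$db" || return 1
    # Step 2: download each requested reference library.
    for lib in "$@"; do
        run kraken2-build --download-library "$lib" --db "$db" --threads "$threads" || return 1
    done
    # Step 3: build the database from everything downloaded so far.
    run kraken2-build --build --db "$db" --threads "$threads"
}
```

For example, setting `DRY_RUN=1` and calling `build_custom_db mydb 8 bacteria viral` prints the four commands without running them, which is a cheap way to review the plan before submitting the job.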
Example script for creating a bacteria reference database:
#!/bin/bash
#SBATCH --job-name=kraken2
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 8
#SBATCH --mem=50G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your-email-address>
#SBATCH -o %x_%A.out
#SBATCH -e %x_%A.err
hostname
date
module load blast/2.7.1
module load kraken/2.0.8-beta
kraken2-build --download-taxonomy --db bacteria
kraken2-build --download-library bacteria --db bacteria --threads 8
kraken2-build --build --db bacteria --threads 8
This will create the following folder structure once it completes:
bacteria/
├── hash.k2d
├── library
│   └── bacteria
│       ├── assembly_summary.txt
│       ├── library.fna
│       ├── library.fna.masked
│       ├── manifest.txt
│       └── prelim_map.txt
├── opts.k2d
├── seqid2taxid.map
├── taxo.k2d
└── taxonomy
    ├── accmap.dlflag
    ├── citations.dmp
    ├── delnodes.dmp
    ├── division.dmp
    ├── gc.prt
    ├── gencode.dmp
    ├── merged.dmp
    ├── names.dmp
    ├── nodes.dmp
    ├── nucl_gb.accession2taxid
    ├── nucl_wgs.accession2taxid
    ├── prelim_map.txt
    ├── readme.txt
    ├── taxdump.dlflag
    ├── taxdump.tar.gz
    └── taxdump.untarflag
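Before using a freshly built database, it can help to confirm that the build actually finished. The sketch below checks for the three `.k2d` files shown in the listing above, which are the files Kraken2 reads at classification time. `check_kraken2_db` is a hypothetical helper name, not a Kraken2 command.

```shell
#!/bin/bash
# Sketch: verify that a Kraken2 database directory contains the built index files.
# check_kraken2_db is a hypothetical helper, not part of Kraken2.

# Usage: check_kraken2_db <db_dir>
# Returns 0 if all three .k2d files exist and are non-empty, 1 otherwise.
check_kraken2_db() {
    local db="$1" missing=0 f
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [ ! -s "$db/$f" ]; then
            echo "missing or empty: $db/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

In a job script this could be called as `check_kraken2_db bacteria || exit 1` right after the `kraken2-build --build` step, so that a failed build stops the job instead of surfacing later during classification.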
More specific information on building Kraken2 databases can be found at the Kraken2 home page.
Classify a Set of Sequences
Using Paired-end Sequences
#!/bin/bash
#SBATCH --job-name=kraken2
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 8
#SBATCH --mem=50G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your-email-address>
#SBATCH -o %x_%A.out
#SBATCH -e %x_%A.err
hostname
date
module load kraken/2.0.8-beta
kraken2 --db /isg/shared/databases/kraken2/Minikraken2_v1 \
--fastq-input --paired R1.fastq R2.fastq \
--use-names \
--threads 8 \
--output kraken_test.out \
--unclassified-out unclassified#.fastq \
--classified-out classified#.fastq \
--report kraken_report.txt \
--use-mpa-style
This will write the classified and unclassified reads, the per-sequence output, and the report. The run summary is written to the job's SLURM log files. In the file names given to --classified-out and --unclassified-out, the # character is replaced by _1 and _2 for the two mates of each pair.
├── classified_1.fastq
├── classified_2.fastq
├── kraken2_198062.err
├── kraken2_198062.out
├── kraken_report.txt
├── kraken_test.out
├── unclassified_1.fastq
└── unclassified_2.fastq
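The two text outputs can be summarized with standard tools. The sketch below assumes the usual layouts: in the per-sequence output (`kraken_test.out`) the first tab-separated column is `C` or `U`, and in an MPA-style report (`kraken_report.txt`) each line is a `|`-separated lineage followed by a tab and a read count. `count_classified` and `top_species` are hypothetical helper names introduced here.

```shell
#!/bin/bash
# Sketch: quick summaries of Kraken2 output files.
# count_classified and top_species are hypothetical helpers, not part of Kraken2.

# Count classified (C) and unclassified (U) entries in the per-sequence output.
# Usage: count_classified <kraken_output_file>
count_classified() {
    awk -F'\t' '
        $1 == "C" { c++ }
        $1 == "U" { u++ }
        END { printf "classified=%d unclassified=%d\n", c, u }
    ' "$1"
}

# List the N most abundant species in an MPA-style report (default N=10).
# A line is species-level when the last element of its lineage starts with "s__".
# Usage: top_species <mpa_report_file> [N]
top_species() {
    local report="$1" n="${2:-10}"
    awk -F'\t' '
        {
            k = split($1, path, "|")
            if (path[k] ~ /^s__/) print $2 "\t" substr(path[k], 4)
        }
    ' "$report" | sort -t "$(printf '\t')" -k1,1nr | head -n "$n"
}
```

For the run above, `count_classified kraken_test.out` prints a single line of counts and `top_species kraken_report.txt 5` prints the five species with the most assigned reads, one `count<TAB>name` pair per line.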
NOTE
Kraken2 databases are already prepared on the Xanadu cluster; for more information, please refer to the database web page.