# Annotation of reads with Kraken2 (and taxonomic profiling with Bracken)

*Microbiome Helper 2 (LangilleLab/microbiome_helper GitHub Wiki)*
Authors: Robyn Wright
Modifications by: NA
Based on initial versions by: André Comeau, Dhwani Desai and Gina Filloramo (Metagenomics Standard Operating Procedure v3)
Please note: We are still testing/developing this so use with caution :)
Use Kraken2 with memory-/time-saving parameters to create the taxonomic profiles, using the NCBI RefSeq Complete database for identification. Kraken2 assigns taxonomy to all reads (unlike marker-gene approaches such as MetaPhlAn), so we use that information both here and later when constructing the stratified functional output.

Note that here we are using a very large (~1.2 TB) reference database that will be prohibitive for most users, since it requires >1.2 TB of RAM. It can be substituted with one of the smaller pre-compiled Kraken2 databases, or one of the other databases that we have provided on Dropbox here, or you can follow our instructions for creating your own custom database (link coming soon). You can read some further discussion on choosing the best database you are able to use, given any computational constraints, in our paper.
We have provided a few links for downloading full Kraken2 databases here.

Our full 1.2 TB Kraken2 RefSeq Complete V205 database used in the paper:

```
aws s3 cp --recursive --no-sign-request s3://kraken2-ncbi-refseq-complete-v205/Kraken2_RefSeqCompleteV205 .
```

Langmead lab core nt database (238.2 GB):

```
wget https://genome-idx.s3.amazonaws.com/kraken/k2_core_nt_20241228.tar.gz
```

Langmead lab RefSeq bacteria, archaea, viral, plasmid, human, UniVec Core, protozoa and fungi (93.2 GB):

```
wget https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_20250402.tar.gz
```

Langmead lab RefSeq bacteria, archaea, viral, plasmid, human, UniVec Core, protozoa and fungi (reduced size; 14.9 GB):

```
wget https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_16gb_20250402.tar.gz
```
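Before downloading, it can be worth checking that the machine actually has enough free memory for the database you have picked. This is only a sketch for Linux systems; `DB_SIZE_GB` is a placeholder you fill in yourself from the sizes listed above (here, the 14.9 GB reduced PlusPF build, rounded up):

```shell
# Sketch: compare available memory against an advertised database size.
# DB_SIZE_GB is an assumption -- set it to the size of the database you
# intend to use (rounded up).
DB_SIZE_GB=16
avail_kb=$(awk '/MemAvailable/ { print $2 }' /proc/meminfo)
avail_gb=$((avail_kb / 1024 / 1024))
if [ "$avail_gb" -lt "$DB_SIZE_GB" ]; then
    echo "Only ${avail_gb} GB available; a ${DB_SIZE_GB} GB database will not fit in RAM"
else
    echo "${avail_gb} GB available; a ${DB_SIZE_GB} GB database should fit in RAM"
fi
```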
We recommend copying the Kraken2 database into a ramdisk (a virtual drive created from the computer's RAM that behaves like a hard drive but with much faster access). This means the database won't need to be loaded into RAM separately for each sample you run, and it will be much quicker. For our 1.2 TB database, copying into the ramdisk takes about an hour.
Make the ramdisk:

```
sudo mkdir /mnt/ramdisk
```

Set the size of the ramdisk:

```
sudo mount -t ramfs -o size=20g ramfs /mnt/ramdisk
```

Note that you will need to change this depending on the size of the database that you are using, and you'll need at least this amount of RAM. (Also be aware that ramfs does not actually enforce the `size=` limit, so make sure the database genuinely fits in memory.)

Now copy the database into the ramdisk:

```
sudo cp -r k2_pluspf_16gb_20250402/ /mnt/ramdisk/
```
Now, create the output directories:

```
mkdir kraken2_outraw
mkdir kraken2_kreport
```
Then run Kraken2 with the `--memory-mapping` option, which looks for the database in memory instead of loading it from disk. Note that, aside from the underlying database used, the most important parameter determining the quality of your results is the confidence threshold; we are using a value of 0.5. There are several considerations when choosing a confidence threshold, such as whether you are more concerned with not missing any taxa (no false negatives) or with not identifying taxa that weren't actually there (no false positives), and the best threshold also depends on which database you are using. We provide guidance on choosing an appropriate threshold in our paper. If you are using environmental samples, a threshold of 0.5 will likely leave very few of your reads classified.
We recommend initially choosing a representative subset of your samples and running them with a few different confidence thresholds - this will allow you to see at what point the proportion of reads classified really starts to drop off, as well as whether there is a point at which you no longer see taxa that you know to be in your samples. The appropriate threshold to use will depend a lot on the environment that your samples are from and the database that you are using.
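As a sketch of what such a sweep might look like, the loop below tries a few thresholds on one representative sample. `SAMPLE.fastq` and the database path are placeholders, and each kraken2 command is echoed rather than run; remove the `echo` to execute for real:

```shell
# Sketch: sweep a few confidence thresholds on one representative
# sample. SAMPLE and DB are placeholders; the echo makes this a dry
# run -- remove it to actually run kraken2.
SAMPLE=cat_reads/SAMPLE.fastq
DB=/mnt/ramdisk/k2_pluspf_16gb_20250402/
for conf in 0.0 0.1 0.3 0.5 0.7; do
    echo kraken2 --db "$DB" --threads 4 --confidence "$conf" \
        --output "kraken2_outraw/SAMPLE.conf${conf}.kraken.txt" \
        --report "kraken2_kreport/SAMPLE.conf${conf}.kreport" \
        --use-names --memory-mapping "$SAMPLE"
done
```

Comparing the resulting kreports shows where the proportion of classified reads starts to drop off.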
```
parallel -j 1 'kraken2 --db /mnt/ramdisk/k2_pluspf_16gb_20250402/ \
    --threads 4 --output kraken2_outraw/{/.}.kraken.txt --report kraken2_kreport/{/.}.kreport \
    --use-names --confidence 0.5 --memory-mapping {}' \
    ::: cat_reads/*.fastq
```
Note that if your reads were already joined before running kneaddata, you should replace `cat_reads` with `kneaddata_out`. You will also need to replace the database path with whichever database you are using.
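Once the kreports exist, a quick awk pass summarises how many reads were classified per sample. This is just a sketch; the mock file below stands in for real output so the format is visible (rank code in column 4: `U` = unclassified, `R` = root), and real kreports are tab-separated:

```shell
# Sketch: report the fraction of reads classified in each kreport.
# The mock file demonstrates the kreport column layout; point the
# loop at your real kraken2_kreport/*.kreport files instead.
mkdir -p kraken2_kreport
cat > kraken2_kreport/demo.kreport <<'EOF'
 40.00  4000  4000  U  0  unclassified
 60.00  6000  10    R  1  root
EOF
for f in kraken2_kreport/*.kreport; do
    awk -v f="$f" '
        $4 == "U" { u = $2 }
        $4 == "R" { c = $2 }
        END { printf "%s: %.1f%% classified\n", f, 100 * c / (c + u) }
    ' "$f"
done
```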
The raw numbers output by Kraken2 should always be corrected afterwards with Bracken. Below are examples for running the genus- and phylum-level corrections, but you can go through all levels by changing the `-l` flag (remember to change the output file extension, e.g. ".genus.bracken", each time you change levels, or you will overwrite previous files).

Important: Bracken annoyingly creates a temporary copy of each kreport in the same directory, with a "bracken" label appended (i.e. a `*_bracken.kreport` file for each sample). You have to delete those files between each level that you make new outputs for, or the collation commands below will make extra/incorrect merged Bracken files (since they work off of all .kreport files in the folder). Also note that we require at least 10 reads to have mapped to any specific taxon (`-t 10`) in order to filter out some of the false-positive noise (the Bracken-recommended default).
```
mkdir bracken_out

parallel -j 4 'bracken -d /mnt/ramdisk/k2_pluspf_16gb_20250402/ \
    -i {} -o bracken_out/{/.}.genus.bracken \
    -r 150 -l G -t 10' \
    ::: kraken2_kreport/*.kreport

rm kraken2_kreport/*_bracken.kreport

parallel -j 4 'bracken -d /mnt/ramdisk/k2_pluspf_16gb_20250402/ \
    -i {} -o bracken_out/{/.}.phylum.bracken \
    -r 150 -l P -t 10' \
    ::: kraken2_kreport/*.kreport
```
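If you want all levels rather than just genus and phylum, the pattern above can be wrapped in a loop that also removes the temporary `*_bracken.kreport` files between levels. This is a sketch shown as a dry run (each command is echoed rather than executed, so nothing here is Bracken's own interface); remove the `echo` prefixes and adjust the database path to actually run it:

```shell
# Sketch: loop Bracken over all taxonomic levels, cleaning up the
# temporary *_bracken.kreport files between levels. Dry run only:
# commands are echoed, not executed.
DB=/mnt/ramdisk/k2_pluspf_16gb_20250402/
for pair in P:phylum C:class O:order F:family G:genus S:species; do
    lvl=${pair%%:*}     # single-letter level code for -l
    name=${pair##*:}    # human-readable name for the file extension
    echo parallel -j 4 "'bracken -d $DB -i {} -o bracken_out/{/.}.${name}.bracken -r 150 -l $lvl -t 10'" "::: kraken2_kreport/*.kreport"
    echo rm "kraken2_kreport/*_bracken.kreport"
done
```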
We can now use an additional Bracken script (in the `analysis_scripts` folder of your Bracken install) to collate the result files (one per sample) into final summary files (one per level). Below are again the examples for just the genus- and phylum-level files.
```
mkdir bracken_out_merged

combine_bracken_outputs.py --files bracken_out/*.genus.bracken -o bracken_out_merged/merged_output.genus.bracken

combine_bracken_outputs.py --files bracken_out/*.phylum.bracken -o bracken_out_merged/merged_output.phylum.bracken
```
Note that if you are using Python 3 then you may get an error here like `TypeError: 'dict_keys' object is not subscriptable`. If you do, you can simply edit the following lines in `combine_bracken_outputs.py`:

Line 113:

```
sys.exit("Taxonomy IDs not matching for species %s: (%s\t%s)" % (name, taxid, sample_counts[name].keys()[0]))
```

becomes:

```
sys.exit("Taxonomy IDs not matching for species %s: (%s\t%s)" % (name, taxid, list(sample_counts[name].keys())[0]))
```

Line 133:

```
taxid = sample_counts[name].keys()[0]
```

becomes:

```
taxid = list(sample_counts[name].keys())[0]
```
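As a final sanity check on the per-sample Bracken outputs, the `fraction_total_reads` column (the 7th) should sum to roughly 1 within each file. This is a sketch using a small mock file with made-up taxa standing in for a real `bracken_out/*.bracken` file (real Bracken output is tab-separated; the mock uses spaces so awk's default field splitting works either way):

```shell
# Sketch: check that the fraction_total_reads column of a Bracken
# output sums to ~1. The file written here is a mock with invented
# counts; substitute a real bracken_out/*.bracken file.
cat > demo.genus.bracken <<'EOF'
name taxonomy_id taxonomy_lvl kraken_assigned_reads added_reads new_est_reads fraction_total_reads
Bacteroides 816 G 6000 500 6500 0.65000
Prevotella 838 G 3000 500 3500 0.35000
EOF
total=$(awk 'NR > 1 { s += $7 } END { printf "%.2f", s }' demo.genus.bracken)
echo "fraction_total_reads sums to $total"
```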