Microbiome Helper 2 Annotation of reads with Kraken2 (and taxonomic profiling with Bracken) - LangilleLab/microbiome_helper GitHub Wiki
Authors: Robyn Wright,
Modifications by: NA
Based on initial versions by: André Comeau, Dhwani Desai and Gina Filloramo (Metagenomics Standard Operating Procedure v3)
Please note: We are still testing/developing this so use with caution :)
Use Kraken2 with memory-/time-saving parameters to create the taxonomic profiles using the NCBI RefSeq Complete database for identification. Kraken2 assigns taxonomy to all reads (vs. marker-gene approaches such as MetaPhlAn) and so we use that information here and for later constructing the stratified functional output. Note that here we are using a very large (~1.2 TB) reference database that will be prohibitive for most users (since it requires >1.2 TB of RAM). This can be substituted for one of the smaller pre-compiled Kraken2 databases, or one of the other databases that we have provided on Dropbox here or you can follow our instructions for creating your own custom database (link coming soon). You can read some further discussion on choosing the best database that you are able to given any computational restraints in our paper.
We have provided a few links for downloading full Kraken2 databases here.
Our full 1.2TB Kraken2 Complete RefSeq V205 database used in the paper:
aws s3 cp --recursive --no-sign-request s3://kraken2-ncbi-refseq-complete-v205/Kraken2_RefSeqCompleteV205 .
Langmead lab core nt database (238.2GB):
wget https://genome-idx.s3.amazonaws.com/kraken/k2_core_nt_20241228.tar.gz
Langmead lab RefSeq bacteria, archaea, viral, plasmic, human, UniVec Core, protozoa and fungi (93.2GB):
wget https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_20250402.tar.gz
Langmead lab RefSeq bacteria, archaea, viral, plasmic, human, UniVec Core, protozoa and fungi (reduced size; 14.9GB):
wget https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_16gb_20250402.tar.gz
We recommend copying the Kraken2 database into a ramdisk (a virtual storage space created from a computer's system memory (RAM), designed to function like a hard drive but with significantly faster access speeds); this means that it won't need to be loaded into RAM for each sample that you run, and it will be much quicker. For our 1.2TB database, it takes about an hour to copy the database into a ramdisk.
Make the ramdisk:
sudo mkdir /mnt/ramdisk
Set the size of the ramdisk:
sudo mount -t ramfs -o size=20g ramfs /mnt/ramdisk
Note that you will need to change this depending on the size of the database that you are using. You'll need at least this amount of RAM, though!
Now copy the database into the ramdisk:
sudo cp -r k2_pluspf_16gb_20250402/ /mnt/ramdisk/
Now, create the output directories:
mkdir kraken2_outraw
mkdir kraken2_kreport
Then run Kraken2 with the --memory-mapping mode which looks for the database in memory instead of loading it from the disk. Note that, aside from the underlying database used, the most important parameter which determines the quality of your results is the confidence threshold and we are using a value of 0.5 - there are several considerations when choosing a confidence threshold to use, such as whether you are more concerned with ensuring that you don't miss any taxa (no false negatives) or that you don't identify taxa that weren't actually there (no false positives), and the best confidence threshold to use also depends on which database you are using. We provide guidance on choosing an appropriate threshold in our paper. If you are using environmental samples, the threshold of 0.5 will likely lead to very few of your reads being classified.
We recommend initially choosing a representative subset of your samples and running them with a few different confidence thresholds - this will allow you to see at what point the proportion of reads classified really starts to drop off, as well as whether there is a point at which you no longer see taxa that you know to be in your samples. The appropriate threshold to use will depend a lot on the environment that your samples are from and the database that you are using.
We are assuming here that you already have Kraken 2 installed somewhere. For us, this is in an environment as described here. You'll want to activate the environment if this is the case:
conda activate kraken2-v2.17.1
parallel -j 1 'kraken2 --db /mnt/ramdisk/RefSeqV214_5Oct2022_AllBacArcVir_CompEuk_KrakOther/ \
--threads 4 --output kraken2_outraw/{/.}_0.5.kraken --report kraken2_kreport/{/.}_0.5.kreport \
--use-names --confidence 0.5 --memory-mapping {}' \
::: cat_reads/*.fastq
Note that if your reads were already joined before running kneaddata, you should replace
cat_readswithkneaddata_out. You will also need to replace the database with whichever you are using.
If you also plan to run GeCoCheck (see below) then you may want to run this with a confidence threshold of 0 also:
parallel -j 1 'kraken2 --db /mnt/ramdisk/RefSeqV214_5Oct2022_AllBacArcVir_CompEuk_KrakOther/ \
--threads 4 --output kraken2_outraw/{/.}.kraken --report kraken2_kreport/{/.}.kreport \
--use-names --confidence 0 --memory-mapping {}' \
::: cat_reads/*.fastq
The raw numbers output by Kraken2 are always supposed to be corrected afterwards with Bracken. Below are the examples for running the genus- and phylum-level corrections, but you can go through all levels by changing the -l flag (remember to change the output file extension such as ".genus.bracken" each time you change levels or it will overwrite previous files).
Important: Bracken annoyingly creates a temp file of the KReport in the same origin directory and appends a "Bracken" label to it (ie: makes a
*_bracken.kreportfile for each sample) - you have to delete those files, between each level that you make new outputs for, or the below commands will make extra/incorrect collated Bracken files (since works off of all .kreport files in the folder). Also note that we are also requiring at least 10 reads to have mapped to any specific taxon (-t 10) in order to filter out some of the false-positive noise (Bracken-recommended default).
mkdir bracken_out
parallel -j 4 'bracken -d /mnt/ramdisk/k2_pluspf_16gb_20250402/ \
-i {} -o bracken_out/{/.}.genus.bracken \
-r 150 -l G -t 10' \
::: kraken2_kreport/*.kreport
rm kraken2_kreport/*_bracken.kreport
parallel -j 4 'bracken -d /mnt/ramdisk/k2_pluspf_16gb_20250402/ \
-i {} -o bracken_out/{/.}.phylum.bracken \
-r 150 -l P -t 10' \
::: kraken2_kreport/*.kreport
We can now use an additional Bracken script (in the analysis_scripts folder of your Bracken install) to collate the result files (one per sample) into final summary files (one per level). Note below are again the examples for just the genus- and phylum-level files.
mkdir bracken_out_merged
combine_bracken_outputs.py --files bracken_out/*.genus.bracken -o bracken_out_merged/merged_output.genus.bracken
combine_bracken_outputs.py --files bracken_out/*.phylum.bracken -o bracken_out_merged/merged_output.phylum.bracken
Note that if you are using Python 3.x.x then you may get an error here of something like TypeError: 'dict_keys' object is not subscriptable. If you do, then you can simply edit some of the lines in the combine_bracken_outputs.py as such:
Line 113: sys.exit("Taxonomy IDs not matching for species %s: (%s\t%s)" % (name, taxid, sample_counts[name].keys()[0])) -> sys.exit("Taxonomy IDs not matching for species %s: (%s\t%s)" % (name, taxid, list(sample_counts[name].keys())[0]))
Line 133: taxid = sample_counts[name].keys()[0] -> taxid = list(sample_counts[name].keys())[0]
We have recently developed a new tool for "verifying" the taxonomic annotations given by Kraken 2 by mapping the reads back to reference genomes for the taxa identified by Kraken: GeCoCheck (see the preprint here). There are full directions for running and installing this on the GitHub page, but you can see how we're integrating it with our workflows here.
Activate the environment:
conda activate GeCoCheck-v0.0.4
Run GeCoCheck with a read limit of 100:
coverage_pipeline.py \
--processors 12 \
--sample_metadata metadata.csv \
--project_name NAME \
--fastq_dir cat_reads \
--kraken_kreport_dir kraken2_kreport \
--kraken_outraw_dir kraken2_outraw \
--output_dir GeCoCheck_out \
--genome_dir /bigpool/shared/GeCoCheck/genomes \
--bowtie2_db_dir /bigpool/shared/GeCoCheck/bowtie2_db \
--coverage_program Bowtie2 \
--read_lim 100 \
--skip_duplicate_check
Filter GeCoCheck output:
filter_kraken_outraw.py \
--kraken_outraw_dir kraken2_outraw/ \
--gecocheck_out_dir GeCoCheck_out/ \
--kraken_kreport_dir kraken2_kreport/ \
--read_limit 10 \
--sample_metadata metadata_GeCoCheck.csv \
--project_name KeslerMGS_RunNS191 \
--unchecked classified \
--for_stratified_out
mkdir GeCoCheck_out/kraken2_outraw_unverified
mv GeCoCheck_out/kraken2_outraw_checked/*unverified.kraken GeCoCheck_out/kraken2_outraw_unverified/
Use Kraken tools scripts to re-make kreport files, re-run Bracken, and combine outputs again:
conda activate kraken2-v2.17.1
python make_ktaxonomy.py --nodes /home/shared/RefSeqV214_5Oct2022_AllBacArcVir_CompEuk_KrakOther/taxonomy/nodes.dmp --names /home/shared/RefSeqV214_5Oct2022_AllBacArcVir_CompEuk_KrakOther/taxonomy/names.dmp --seqid2taxid /home/shared/RefSeqV214_5Oct2022_AllBacArcVir_CompEuk_KrakOther/seqid2taxid.map -o RefSeqV214_5Oct2022_AllBacArcVir_CompEuk_KrakOther_taxonomy.txt
mkdir GeCoCheck_out/kraken2_kreport_checked
parallel -j 1 'python make_kreport.py -i {} -t RefSeqV214_5Oct2022_AllBacArcVir_CompEuk_KrakOther_taxonomy.txt -o GeCoCheck_out/kraken2_kreport_checked/{/.}.kreport' ::: GeCoCheck_out/kraken2_outraw_checked/*
mkdir GeCoCheck_out/bracken_out_checked
parallel -j 12 'bracken -d /home/shared/RefSeqV214_5Oct2022_AllBacArcVir_CompEuk_KrakOther/ \
-i {} -o GeCoCheck_out/bracken_out_checked/{/.}.genus.bracken \
-r 150 -l G -t 10' \
::: GeCoCheck_out/kraken2_kreport_checked/*.kreport
rm GeCoCheck_out/kraken2_kreport_checked/*_bracken_genuses.kreport
parallel -j 12 'bracken -d /home/shared/RefSeqV214_5Oct2022_AllBacArcVir_CompEuk_KrakOther/ \
-i {} -o GeCoCheck_out/bracken_out_checked/{/.}.phylum.bracken \
-r 150 -l P -t 10' \
::: GeCoCheck_out/kraken2_kreport_checked/*.kreport
rm GeCoCheck_out/kraken2_kreport_checked/*_bracken_phylums.kreport
parallel -j 12 'bracken -d /home/shared/RefSeqV214_5Oct2022_AllBacArcVir_CompEuk_KrakOther/ \
-i {} -o GeCoCheck_out/bracken_out_checked/{/.}.species.bracken \
-r 150 -l S -t 10' \
::: GeCoCheck_out/kraken2_kreport_checked/*.kreport
rm GeCoCheck_out/kraken2_kreport_checked/*_bracken_species.kreport
combine_bracken_outputs.py --files GeCoCheck_out/bracken_out_checked/*.genus.bracken -o bracken_out_merged/merged_output_GeCoCheck.genus.bracken
combine_bracken_outputs.py --files GeCoCheck_out/bracken_out_checked/*.phylum.bracken -o bracken_out_merged/merged_output_GeCoCheck.phylum.bracken
combine_bracken_outputs.py --files GeCoCheck_out/bracken_out_checked/*.species.bracken -o bracken_out_merged/merged_output_GeCoCheck.species.bracken