Kraken2 - ACHG2018/metagenomics-classification-tools GitHub Wiki

Installation

Please check system requirements before the installation.

Official installation instructions can be found here.

Other required installs:

  • Bracken - computes the abundance of species in DNA sequences

Custom Database

Once Kraken2 is installed, a custom database must be created to suit our needs.

-- Instructions to be updated --

For now, refer to the official documentation here.

Results

Run #1

Issues with creating custom DB from .fasta files

Creation of custom DB for Kraken2 consists of 3 steps:

  1. Install taxonomy (utilizing information from NCBI)
  2. Install reference library (.fasta files from ChallengeRefGenomes.tar.gz in our case)
  3. Build database using kraken2-build command
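The three steps above can be sketched as a small Python helper that assembles the corresponding kraken2-build invocations. This is a minimal sketch, not the exact commands run for this wiki: the database name test and the file CR_9.fasta come from Run #1, while the function name build_db_commands is an assumption for illustration. The helper only returns the command lines; running them requires Kraken2 on the PATH.

```python
import shlex


def build_db_commands(db, fasta_files):
    """Return the kraken2-build command lines for a custom database.

    db          -- database directory name (e.g. "test")
    fasta_files -- reference FASTA files to add to the library
    """
    commands = []
    # Step 1: install the NCBI taxonomy into the database directory.
    commands.append(["kraken2-build", "--download-taxonomy", "--db", db])
    # Step 2: add each reference FASTA file to the library.
    for fasta in fasta_files:
        commands.append(["kraken2-build", "--add-to-library", fasta, "--db", db])
    # Step 3: build the database from the taxonomy and the library.
    commands.append(["kraken2-build", "--build", "--db", db])
    return commands


if __name__ == "__main__":
    # Print the commands instead of executing them.
    for cmd in build_db_commands("test", ["CR_9.fasta"]):
        print(shlex.join(cmd))
```

Step 2 is the one that fails in Run #1 when the FASTA headers carry neither an accession number nor a kraken:taxid assignment.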

Problem Encountered

To add .fasta files to the database, the following requirements from the manual must be met:

  • Sequences must be in a FASTA file (multi-FASTA is allowed)
  • Each sequence's ID (the string between the > and the first whitespace character on the header line) must contain either an NCBI accession number to allow Kraken 2 to lookup the correct taxa, or an explicit assignment of the taxonomy ID using kraken:taxid.

The headers in the .fasta files from ChallengeRefGenomes.tar.gz look similar to:

>CR_9_Contig_0

Attempting to add them without the taxonomy ID using 'kraken2-build --add-to-library CR_9.fasta --db test' throws an error:

scan_fasta_file.pl: unable to determine taxonomy ID for sequence CR_9.fasta

Run #2

Using BLAST, we first retrieve taxonomy IDs of the contigs residing within the multi-FASTA files and annotate the headers of the contigs accordingly. This is done to avoid problems encountered during Run #1.
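The header annotation step could look roughly like the sketch below. It is not the script used for this run (that is still to be added); the function name annotate_fasta_headers and the contig-to-taxid mapping are assumptions for illustration, and the taxid 119857 is only a placeholder borrowed from the example output further down this page, not a BLAST result for CR_9_Contig_0. The |kraken:taxid|ID suffix follows the explicit-assignment form described in the Kraken 2 manual.

```python
def annotate_fasta_headers(lines, taxid_by_contig):
    """Append |kraken:taxid|<id> to the sequence ID of each FASTA header.

    lines           -- iterable of FASTA lines (trailing newlines are stripped)
    taxid_by_contig -- dict mapping sequence ID -> taxonomy ID (hypothetical input,
                       e.g. derived from BLAST hits)
    Returns a list of lines without trailing newlines.
    """
    out = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # The sequence ID is the text between ">" and the first whitespace.
            parts = line[1:].split(None, 1)
            if parts:
                seq_id = parts[0]
                rest = parts[1] if len(parts) > 1 else ""
                taxid = taxid_by_contig.get(seq_id)
                # Leave headers alone if no taxid is known or one is already set.
                if taxid is not None and "kraken:taxid|" not in seq_id:
                    seq_id = f"{seq_id}|kraken:taxid|{taxid}"
                line = ">" + seq_id + (" " + rest if rest else "")
        out.append(line)
    return out


if __name__ == "__main__":
    fasta = [">CR_9_Contig_0", "ACGTACGT"]
    mapping = {"CR_9_Contig_0": 119857}  # placeholder taxid, for illustration only
    print("\n".join(annotate_fasta_headers(fasta, mapping)))
```

Headers without an entry in the mapping are passed through unchanged, so they would still trigger the error from Run #1 and are easy to spot.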

-- Add scripts and other reference files used --

Command used:

kraken2 --paired Hello_World_R1.fa Hello_World_R2.fa --db kraken_db --report Hello_World_test_sample_result.txt --use-names --output Hello_World_test_result.txt

Option details:

  • --report FILENAME: Print a report with aggregate counts/clade to file
  • --paired: The filenames provided have paired-end reads
  • --use-names: Print scientific names instead of just taxids

Standard Output

Detailed info about output format here.

-- Q: How is the representative taxonomy ID determined when the k-mers of a sequence belong to more than one taxonomy ID?

C Example.2 Francisella tularensis subsp. holarctica (taxid 119857) 151|151 263:4 119857:3 A:35 263:23 119857:30 263:22 |:| 263:117

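-- A: A simple majority rule does not apply here: more k-mers map to taxonomy ID 263 (166 in total) than to 119857 (33). Kraken 2 scores each root-to-leaf path in the taxonomy rather than each taxon in isolation, so if 263 (the species) is an ancestor of 119857 (the subspecies), the k-mers assigned to 263 also count toward 119857, which then has the highest-scoring path.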
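To make the question concrete, the sketch below tallies the k-mer hits from the example line and scores each candidate taxon by the hits on itself plus all of its ancestors, which is how Kraken describes its root-to-leaf path scoring. Several things here are assumptions: the function names are made up for illustration, the parent map contains only the single edge 119857 -> 263 (taken to be the subspecies-to-species link in the NCBI taxonomy), and ties, which Kraken resolves via the LCA, as well as the confidence threshold are ignored.

```python
from collections import Counter


def tally_kmer_hits(kmer_field):
    """Sum k-mer counts per taxid from a Kraken 2 LCA mapping string.

    Returns (hits, ambiguous): hits maps taxid -> k-mer count, ambiguous is
    the number of k-mers flagged "A" (ambiguous nucleotide).
    """
    hits = Counter()
    ambiguous = 0
    for token in kmer_field.split():
        if token == "|:|":  # separator between the two mates of a pair
            continue
        key, count = token.split(":")
        if key == "A":
            ambiguous += int(count)
        else:
            hits[int(key)] += int(count)
    return hits, ambiguous


def path_score(taxid, hits, parent):
    """Hits on taxid plus hits on every ancestor listed in the parent map."""
    score = 0
    while taxid is not None:
        score += hits.get(taxid, 0)
        taxid = parent.get(taxid)
    return score


def classify(hits, parent):
    """Pick the taxid with the highest path score (taxid 0 = no hit, skipped)."""
    candidates = [t for t in hits if t != 0]
    return max(candidates, key=lambda t: path_score(t, hits, parent))


if __name__ == "__main__":
    field = "263:4 119857:3 A:35 263:23 119857:30 263:22 |:| 263:117"
    parent = {119857: 263}  # assumed taxonomy edge; the rest of the tree is omitted
    hits, ambiguous = tally_kmer_hits(field)
    for taxid in hits:
        print(taxid, hits[taxid], path_score(taxid, hits, parent))
    print("classified as", classify(hits, parent))
```

Under these assumptions 263 collects 166 k-mers and 119857 only 33, but the path ending at 119857 scores 166 + 33 = 199, which matches the reported classification of the read as taxid 119857.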

Sample Report Output

Detailed info about output format here.

Only one of the 3000 Example.# sequences in the Kraken output is unclassified. At first glance, the numbers of bacteria (1857) and viruses (1000) do not add up: together with the unclassified sequence they account for only 2858 of 3000.

-- Q: Do these numbers from sample report output represent something else?

-- A: The remainder is assigned above the domain level: 137 of the contigs belong to "root" and 5 to "cellular organisms" (1 + 1857 + 1000 + 137 + 5 = 3000).
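The bookkeeping can be checked with a short sketch that separates sequences assigned directly to a node from the clade total rooted at that node. Only the five numbers quoted on this page are used. As a simplifying assumption, Bacteria and Viruses are treated as single lumps (their breakdown into lower ranks is not shown here), and Bacteria is placed under "cellular organisms" as in the NCBI taxonomy. The function name clade_count is hypothetical.

```python
def clade_count(node, direct, children):
    """Sequences assigned directly to node plus all sequences assigned below it."""
    return direct.get(node, 0) + sum(
        clade_count(child, direct, children) for child in children.get(node, [])
    )


# Direct assignments, using only the counts quoted above.
direct = {
    "root": 137,
    "cellular organisms": 5,
    "Bacteria": 1857,  # treated as one lump (assumption)
    "Viruses": 1000,   # treated as one lump (assumption)
}
# Minimal tree connecting the four nodes.
children = {
    "root": ["cellular organisms", "Viruses"],
    "cellular organisms": ["Bacteria"],
}
unclassified = 1

total = unclassified + clade_count("root", direct, children)
print(total)  # 3000
```

The 137 and 5 sequences are classified, just not specifically enough to fall under Bacteria or Viruses, which is why the two domain counts alone fall short of the total.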