Kraken2 - ACHG2018/metagenomics-classification-tools GitHub Wiki
Installation
Check the system requirements before installing.
The official installation instructions can be found here.
Other required installs:
- Bracken - computes the abundance of species in DNA sequences
Custom Database
Once Kraken2 is installed, a custom database must be created to fit our needs.
-- Instruction to be updated --
For now, refer to official documentation here.
Results
Run #1
Issues with creating custom DB from .fasta files
Creating a custom DB for Kraken2 consists of three steps:
- Install taxonomy (utilizing information from NCBI)
- Install reference library (.fasta files from ChallengeRefGenomes.tar.gz in our case)
- Build the database using the kraken2-build command
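The three steps above can be sketched as a small Python helper that assembles the corresponding kraken2-build invocations. This is only a sketch, not the exact procedure used for this project: the helper names (kraken2_build_commands, run_all) are made up for illustration, and the database name "test" and the file CR_9.fasta are taken from the command shown under Run #1.

```python
import subprocess


def kraken2_build_commands(db, fasta_files):
    """Return the kraken2-build command lines for the three steps, in order."""
    # Step 1: install the NCBI taxonomy into the database directory.
    commands = [["kraken2-build", "--download-taxonomy", "--db", db]]
    # Step 2: add each reference FASTA file to the library.
    for fasta in fasta_files:
        commands.append(["kraken2-build", "--add-to-library", fasta, "--db", db])
    # Step 3: build the database from the taxonomy and the library.
    commands.append(["kraken2-build", "--build", "--db", db])
    return commands


def run_all(commands):
    """Run the commands one after another, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = kraken2_build_commands("test", ["CR_9.fasta"])
for cmd in commands:
    print(" ".join(cmd))
# Call run_all(commands) on a machine where Kraken2 is installed.
```

Printing the commands first makes it easy to check them before anything is downloaded or built.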
Problem Encountered
To add .fasta files to the database, the following requirements from the manual must be met:
- Sequences must be in a FASTA file (multi-FASTA is allowed)
- Each sequence's ID (the string between the > and the first whitespace character on the header line) must contain either an NCBI accession number to allow Kraken 2 to lookup the correct taxa, or an explicit assignment of the taxonomy ID using kraken:taxid.
The headers in the .fasta files from ChallengeRefGenomes.tar.gz look similar to:
>CR_9_Contig_0
Attempting to add them without the taxonomy ID using 'kraken2-build --add-to-library CR_9.fasta --db test' throws an error:
scan_fasta_file.pl: unable to determine taxonomy ID for sequence CR_9.fasta
Run #2
Using BLAST, we first retrieve taxonomy IDs of the contigs residing within the multi-FASTA files and annotate the headers of the contigs accordingly. This is done to avoid problems encountered during Run #1.
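As an illustration of the header annotation step, here is a minimal Python sketch that appends the kraken:taxid tag to each sequence ID. It is not the script actually used for Run #2. The function names and the contig-to-taxid mapping are hypothetical, and the taxid 12345 is a placeholder, not the real BLAST assignment for this contig.

```python
def annotate_header(header, taxid):
    """Append '|kraken:taxid|<taxid>' to the sequence ID of a FASTA header.

    Assumes the header starts with '>' and has a non-empty sequence ID.
    """
    parts = header[1:].split(None, 1)  # [sequence ID, optional description]
    seq_id = parts[0]
    description = " " + parts[1] if len(parts) > 1 else ""
    return f">{seq_id}|kraken:taxid|{taxid}{description}"


def annotate_fasta(lines, taxid_by_contig):
    """Yield FASTA lines with headers annotated where a taxid is known."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            seq_id = line[1:].split(None, 1)[0]
            taxid = taxid_by_contig.get(seq_id)
            if taxid is not None:
                line = annotate_header(line, taxid)
        yield line


# Placeholder mapping; in practice it would come from the BLAST results.
taxid_by_contig = {"CR_9_Contig_0": 12345}
example = [">CR_9_Contig_0\n", "ACGT\n"]
annotated = list(annotate_fasta(example, taxid_by_contig))
print("\n".join(annotated))
```

Headers without a known taxid are passed through unchanged, so kraken2-build would still reject those sequences.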
-- Add scripts and other reference files used --
Command used:
kraken2 --paired Hello_World_R1.fa Hello_World_R2.fa --db kraken_db --report Hello_World_test_sample_result.txt --use-names --output Hello_World_test_result.txt
Option details:
- --report FILENAME: Print a report with aggregate counts/clade to file
- --paired: The filenames provided have paired-end reads
- --use-names: Print scientific names instead of just taxids
Standard Output
Detailed info about output format here.
-- Q: How is the representative taxonomy ID determined when the k-mers map to more than one taxonomy ID?
C Example.2 Francisella tularensis subsp. holarctica (taxid 119857) 151|151 263:4 119857:3 A:35 263:23 119857:30 263:22 |:| 263:117
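-- A: A simple majority rule does not seem to apply to the example above, as more k-mers map to taxonomy ID 263 (166) than to 119857 (33). The Kraken 2 manual describes the classification as the leaf of the highest-scoring root-to-leaf path in the taxonomy tree, so k-mers mapped to an ancestor also count toward its descendants. This would explain the result if 263 is an ancestor of 119857.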
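To make the k-mer counts in the example line easier to check, the sketch below tallies the last column per taxonomy ID and then scores each taxon by the hits collected along its path to the root. It is a simplified illustration, not Kraken 2's actual implementation: the function names are made up, tie handling is ignored, and the parent map assumes that 263 is an ancestor of 119857.

```python
from collections import Counter


def tally_kmer_hits(lca_field):
    """Sum k-mer counts per taxid from the LCA-mapping column."""
    counts = Counter()
    for token in lca_field.split():
        if token == "|:|":  # separator between the two mates of a paired read
            continue
        taxid, n = token.rsplit(":", 1)
        counts[taxid] += int(n)
    return counts


def best_path_taxid(counts, parent):
    """Return the taxid whose path to the root collects the most k-mer hits."""
    def path_score(taxid):
        score = 0
        node = taxid
        while node is not None:
            score += counts.get(node, 0)
            node = parent.get(node)
        return score

    # "A" marks ambiguous k-mers and "0" unclassified ones; neither is a taxon.
    candidates = [t for t in counts if t not in ("A", "0")]
    return max(candidates, key=path_score)


lca_field = "263:4 119857:3 A:35 263:23 119857:30 263:22 |:| 263:117"
counts = tally_kmer_hits(lca_field)
# Assumed parent map (hypothetical): 263 is treated as an ancestor of 119857.
parent = {"119857": "263", "263": None}
print(dict(counts))
print(best_path_taxid(counts, parent))
```

Under that assumption the path ending at 119857 collects 33 + 166 = 199 hits, against 166 for the path ending at 263, which is consistent with the classification reported by Kraken 2.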
Sample Report Output
Detailed info about output format here.
Only one Example.# entry is unclassified in the Kraken output (1/3000). However, the number of bacteria (1857) and the number of viruses (1000) do not add up to the remaining 2999.
-- Q: Do the numbers in the sample report output represent something else?
-- A: 137 of the contigs are assigned to "root" and 5 to "cellular organisms". Together with the other counts this gives 1 + 137 + 5 + 1857 + 1000 = 3000.
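The bookkeeping can be checked with a few lines of Python. The sketch assumes that the Bacteria and Viruses figures are clade-level counts and that the "root" and "cellular organisms" figures are assigned directly at those nodes, so that the five categories do not overlap.

```python
# Counts quoted above from the sample report (assumed non-overlapping).
counts = {
    "unclassified": 1,
    "root": 137,
    "cellular organisms": 5,
    "Bacteria": 1857,
    "Viruses": 1000,
}

total = sum(counts.values())
print(total)
```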