Methods - undiagnosed/metagenomics GitHub Wiki

Metagenomic classifiers

Many metagenomic classifiers are available. This page documents the ones that have been tried.

Taxonomer

Available as a web service at Taxonomer. See the Taxonomer Classification Verification page for more information.

Kaiju

Available as a web service at Kaiju.

Aperiomics

Aperiomics uses a derivative of PathoScope. Their classifier is not publicly available, but reports are generated and delivered to customers.

Kraken

As far as I know, there is no web service for running the Kraken classifier. Running the full classifier on a typical desktop computer is not possible because the standard database requires 174 GB of RAM and 500 GB of disk space. MiniKraken, a reduced version of the database, can run on desktops with at least 8 GB of RAM and 4 GB of disk space [http://ccb.jhu.edu/software/kraken/MANUAL.html]. The sensitivity of MiniKraken is considerably lower than that of the full version, but the specificity is just as good. Kraken can process over 1 million reads per minute, so it is fast as well [https://ccb.jhu.edu/software/kraken].

Example usage with paired-end data

kraken --preload --db kraken_db --output kraken_results.txt --classified kraken_classified.fasta --unclassified kraken_unclassified.fasta --paired 10095201_R1.fastq 10095201_R2.fastq
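
Once the run finishes, a quick summary of kraken_results.txt helps judge how much of the sample was classified. The following is a minimal sketch, assuming the five tab-separated columns described in the Kraken manual (C/U flag, read ID, taxonomy ID, read length, LCA mapping). The function name summarize_kraken and the example rows are made up for illustration and are not real results.

```python
import io
from collections import Counter


def summarize_kraken(handle):
    """Count classified/unclassified reads and reads per taxonomy ID."""
    status_counts = Counter()
    taxid_counts = Counter()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, taxid = fields[0], fields[2]
        status_counts[status] += 1
        if status == "C":  # "C" = classified, "U" = unclassified
            taxid_counts[taxid] += 1
    return status_counts, taxid_counts


# Made-up rows in the five-column layout, for illustration only.
example = (
    "C\tread1\t11676\t250\t11676:220\n"
    "C\tread2\t11676\t250\t0:5 11676:215\n"
    "U\tread3\t0\t250\t0:220\n"
)
status_counts, taxid_counts = summarize_kraken(io.StringIO(example))
total = sum(status_counts.values())
print(f"classified reads: {status_counts['C']} of {total}")
for taxid, n_reads in taxid_counts.most_common(10):
    print(taxid, n_reads)

# For a real run, pass an open file instead of the example string:
#   with open("kraken_results.txt") as fh:
#       status_counts, taxid_counts = summarize_kraken(fh)
```

The taxonomy IDs can then be looked up in the NCBI taxonomy, or the kraken-translate and kraken-report scripts shipped with Kraken can be used to get names directly.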

BLAST

Nucleotide

BLAST is computationally expensive, but it is sensitive, and it is feasible to use when looking at a limited number of reads. The following steps are borrowed from the shiver HIV sequence reconstruction pipeline and use the ungap program (UngapFasta.py) that is part of that pipeline. You will also need the BLAST tools installed, as well as Fastaq for the FASTQ-to-FASTA conversion. The example below shows how to run a nucleotide BLAST of reads against a reference database, with HIV as the target pathogen.

First, download the reference sequences from the database of your choice as a FASTA file. In this example, the Los Alamos 2016 Compendium reference sequences are used. The following commands ungap the reference sequences, make a BLAST database from them, convert the reads from FASTQ to FASTA, and BLAST the reads against the database. In this case, only the forward reads are used.

python UngapFasta.py HIV1_COM_2016_1-9719_DNA.fasta > ungapped_hiv1_refs.fasta
makeblastdb -dbtype nucl -in ungapped_hiv1_refs.fasta -input_type fasta -out hiv_1
fastaq to_fasta Reads_1.fastq Reads_1.fasta
blastn -query Reads_1.fasta -db hiv_1 -out Reads_1.blast -max_target_seqs 1 -outfmt '10 qacc sacc sseqid evalue pident qstart qend sstart send'
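
The resulting Reads_1.blast file is comma-separated, with one row per alignment, in the column order given in the -outfmt string. Below is a minimal sketch of one way to summarize it by counting how many distinct reads hit each reference. The thresholds max_evalue and min_pident, the function name count_hits, and the example rows are illustrative assumptions, not values taken from shiver or from an actual run.

```python
import csv
import io
from collections import Counter

# Column order matches the -outfmt string in the blastn command above.
COLUMNS = ["qacc", "sacc", "sseqid", "evalue", "pident",
           "qstart", "qend", "sstart", "send"]


def count_hits(handle, max_evalue=1e-5, min_pident=90.0):
    """Count distinct reads with a passing hit, per reference accession.

    max_evalue and min_pident are illustrative cutoffs (assumptions).
    """
    reads_per_ref = {}
    for row in csv.DictReader(handle, fieldnames=COLUMNS):
        if float(row["evalue"]) > max_evalue:
            continue  # alignment not significant enough
        if float(row["pident"]) < min_pident:
            continue  # alignment too divergent
        # A read can produce several alignments to the same reference,
        # so collect read IDs in a set and count each read once.
        reads_per_ref.setdefault(row["sacc"], set()).add(row["qacc"])
    return Counter({ref: len(reads) for ref, reads in reads_per_ref.items()})


# Made-up rows with hypothetical reference names, for illustration only.
example = (
    "read1,refA,refA,2e-61,98.5,1,150,790,939\n"
    "read1,refA,refA,3e-10,95.0,160,200,950,990\n"
    "read2,refA,refA,1e-40,97.0,1,120,2000,2119\n"
    "read3,refB,refB,0.5,80.0,1,30,10,39\n"
)
hits = count_hits(io.StringIO(example))
for ref, n_reads in hits.most_common():
    print(ref, n_reads)

# For a real run, pass an open file instead of the example string:
#   with open("Reads_1.blast", newline="") as fh:
#       hits = count_hits(fh)
```

Counting reads rather than rows avoids inflating the totals when a single read aligns to the same reference in more than one segment.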

Protein

blastx and tblastx TODO

HMMsearch and vFAM

Hidden Markov Models for divergent virus detection

TODO