Module 4 Metagenome quality control and compositional analysis - HRGV/Marmics_Metagenomics GitHub Wiki

Metagenome quality control

We have looked at the metagenomic libraries from a read quality and DNA compositional perspective using FastQC. Another critical point to evaluate for any library you work with is its taxonomic composition.

We want to address the simple question - What is in the library?

There are multiple tools available to analyze the composition of metagenomic read sets at the read level; we have selected two that are complementary in their approach.

Kaiju

Kaiju - classifies the reads at the protein level against several available databases. The most comprehensive database implemented for Kaiju contains all bacterial, archaeal and small eukaryote sequences in NCBI nr. For detailed instructions on how to install and run Kaiju, refer to their readme.

NOTE Kaiju only accepts uncompressed fastq files, but uncompressed read files are about 3x as big as the .gz files and also prone to corruption. This can be solved by decompressing the fastq.gz files on the fly with the zcat command using a pipe (actually a process substitution, because it uses brackets) into the -i and -j options that are the file handles for the paired-end reads: <(zcat 2425_A_run375_CTCACCAA_S5_L006_R1_001_f.fastq.gz). The advantage of this piping is that inside the process substitution we can control the number of reads fed to Kaiju by adding an internal pipe via head -n #reads*4, as head only hands down the defined number of lines and each read in our libraries comes on 4 lines.
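The 4-lines-per-read arithmetic is easy to verify on a toy example; a minimal sketch (the file name and read contents here are made up):

```shell
# Build a toy FASTQ with 3 reads (4 lines each) and compress it
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nACGT\n+\nIIII\n' \
    | gzip > toy.fastq.gz

# Decompress on the fly and keep only the first 2 reads (2 * 4 = 8 lines),
# exactly what head -n does inside the process substitution
n_lines=$(zcat toy.fastq.gz | head -n 8 | wc -l)
echo "$n_lines"
```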

Log into bigmem-2 (the database is installed only there).

HINT Replace the library name in the example command with the actual library name of your data.
A final command set to run Kaiju on 100k test reads would be

kaiju -f /scratch/databases/KaijuDB/kaiju_db_nr_euk.fmi -t /scratch/databases/KaijuDB/nodes.dmp -i <(zcat 2425_A_run375_CTCACCAA_S5_L006_R1_001_f.fastq.gz | head -n 400000) -j <(zcat 2425_A_run375_CTCACCAA_S5_L006_R2_001_f.fastq.gz | head -n 400000) -a greedy -e 5 -o 2425_A.100k.kaiju.out.txt

-f sets the database file
-t sets the taxonomy tree database
-i and -j take the read1 and read2 files, we pipe into them
-a sets the run mode; greedy allows mismatches and is the more sensitive one
-e sets the number of mismatches that are allowed, 5 is a good recommendation
-o sets the output file
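The raw Kaiju output is a tab-separated file whose first column marks each read as C (classified) or U (unclassified), so a quick classified/unclassified tally needs nothing more than cut and uniq. A sketch on a fabricated three-read output:

```shell
# Fabricated Kaiju output: status (C/U), read name, taxon id
printf 'C\tread1\t1234\nU\tread2\t0\nC\tread3\t5678\n' > toy.kaiju.out.txt

# Tally classified vs unclassified reads
cut -f1 toy.kaiju.out.txt | sort | uniq -c
```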

To analyze the taxonomic classification at the genus level, reporting all genera that are more than 0.1% abundant, you need to run

kaiju2table -n /scratch/databases/KaijuDB/names.dmp -t /scratch/databases/KaijuDB/nodes.dmp -r genus -m .1 -o 2425_A.100k.kaiju.genus-0.1.out.tsv 2425_A.100k.kaiju.out.txt

-n sets the taxonomic names database
-t sets the taxonomy tree database
the Kaiju output file to analyze is passed as the last positional argument (kaiju2table has no -i flag for this)
-o sets the classification output file
-r sets the taxonomic level, we use genus here, could be species to phylum
-m sets a cutoff of minimum % a taxon has to reach in the dataset to be reported, we use 0.1%
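The resulting table is a plain TSV (in recent Kaiju versions the columns are file, percent, reads, taxon_id, taxon_name - treat this layout as an assumption to check against your output), so ordinary shell tools are enough to rank the reported genera. A sketch on fabricated numbers:

```shell
# Fabricated kaiju2table output (column layout is an assumption, see above)
printf 'file\tpercent\treads\ttaxon_id\ttaxon_name\n' > toy.genus.tsv
printf 'a.out\t40.0\t40000\t561\tEscherichia\n' >> toy.genus.tsv
printf 'a.out\t10.5\t10500\t590\tSalmonella\n' >> toy.genus.tsv
printf 'a.out\t25.2\t25200\t286\tPseudomonas\n' >> toy.genus.tsv

# Skip the header and sort by the percent column, most abundant first
tail -n +2 toy.genus.tsv | sort -k2,2nr
```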

Please do not run Kaiju with more than 1 million reads as this will not finish within our practical.
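This cap translates directly into the head -n value for the process substitutions above: number of reads times 4 lines. As a quick check of the arithmetic:

```shell
# head -n value for the 1 million read cap: 4 lines per read
max_reads=1000000
head_n=$((max_reads * 4))
echo "$head_n"   # 4000000
```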

phyloFlash

phyloFlash - is the perfect complement to Kaiju, as it looks at the samples at the rRNA level. It sorts the SSU rRNA reads for all domains of life and classifies them against a specially curated SILVA database by mapping them with bbmap. It then takes the sorted SSU reads and reconstructs or assembles full-length SSU genes from them. The assembled SSUs are then classified against the SILVA database.
For detailed instructions on how to install and run phyloFlash, refer to their readme.

phyloFlash is called with

  phyloFlash.pl -lib 2425_A -read1 2425_A_run375_CTCACCAA_S5_L006_R1_001_f.fastq.gz -read2 2425_A_run375_CTCACCAA_S5_L006_R2_001_f.fastq.gz -CPUs 10 -readlength 150 -almosteverything

-lib defines the library name, this is used for all output files
-read1 and -read2 define the paired end reads, they can be .gzipped
-CPUs sets the number of CPUs to 10, this is important as we are sharing a big machine and do not want to be greedy
-readlength sets the readlength of the library
-almosteverything sets the recommended options: run SPAdes (skip EMIRGE) and produce all optional outputs

The results are summarized in a file called LIBRARY.phyloFlash.html, an HTML web page. We can display the page with any web browser, e.g. firefox

 firefox 2425_A.phyloFlash.html

Multisample comparison with phyloFlash

If you have several phyloFlash runs and you want to compare them, you can use phyloFlash_compare.pl. This assumes that you have a working R installation and the following R packages installed: “optparse”, “ggplot2”, “reshape2”, “ggdendro”, “gtable”, “plyr”, “RColorBrewer”, “methods”, “grid”. Details on the process can be found in the phyloFlash Utilities Manual.

As a first step, collect all zipped phyloFlash result files you want to compare in a single folder. Move into this folder and execute on the commandline

  phyloFlash_compare.pl --allzip --task barplot --displaytaxa 10 --level 2 --out pF_compare_t10_l2

--allzip sets the analysis to all zipped phyloFlash output files found in the folder
--task barplot provides a barplot style graph of the taxa and their abundances
--displaytaxa sets the number of taxa to be displayed, in this case 10; all others will be lumped into "other"
--level sets the taxonomic analysis level, 2 would be phylum level

Phylogeny based comparison using phyloFlash assembled SSUs

Another way to complement the barplots based on the hit counts is to generate a collective phylogenetic tree of the assembled SSU sequences and their reference sequences. This approach does not need R and might be easier to implement. For this, you need to collect all SSU assemblies, collect the references, deduplicate the references, align the sequences and generate a tree. A quick workflow for this, based on easy-to-install or already available tools, uses vsearch, mafft and fasttree to generate the tree in nwk format, which can be explored e.g. with itol. This workflow assumes that you have already collected all zipped phyloFlash result files you want to compare in a single folder.

Let's extract all the spades assemblies of the SSU genes from the phyloFlash zips

  for i in *tar.gz; do  tar -xf $i ${i%phy*}spades_rRNAs.final.fasta; done

as well as the hits in the SILVA database

  for i in *tar.gz; do  tar -xf $i ${i%phy*}all.dbhits.NR97.fa; done
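The ${i%phy*} in the two loops above is shell suffix stripping: it deletes the shortest suffix matching phy* from the archive name, leaving the library prefix that phyloFlash puts in front of every file inside the archive. A sketch with a made-up archive name:

```shell
# Suffix stripping: ${i%phy*} removes the shortest trailing match of "phy*"
i="2425_A.phyloFlash.tar.gz"
prefix=${i%phy*}
echo "$prefix"                               # 2425_A.
echo "${prefix}spades_rRNAs.final.fasta"     # 2425_A.spades_rRNAs.final.fasta
```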

and collect the spades assemblies of the SSU genes in a single multifasta

  cat *spades_rRNAs.final.fasta > pf_collection.fasta

and collect the database hits of the assemblies of the SSU genes in a second separate multifasta

  cat *all.dbhits.NR97.fa > pf_collection_dbhits.fasta

The database hits will be redundant, so let's cluster them at approximately species level and only keep one per species level clade

  vsearch --cluster_smallmem pf_collection_dbhits.fasta --usersort --id 0.98 --centroids pf_collection_dbhits.NR98.fasta --notrunclabels

Now we can collect the non-redundant reference hits and the spades assembled SSUs

  cat pf_collection_dbhits.NR98.fasta pf_collection.fasta > pf_collection_dbhits_and_spades.fasta
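A quick sanity check after concatenation is to count FASTA headers: the combined file should contain exactly the sum of the two inputs. A sketch on toy files:

```shell
# Toy stand-ins for the dbhits and SPAdes collections
printf '>ref1\nACGT\n>ref2\nACGT\n' > toy_dbhits.fasta
printf '>asm1\nACGT\n' > toy_spades.fasta

cat toy_dbhits.fasta toy_spades.fasta > toy_combined.fasta

# Header counts: 2 + 1 input sequences should give 3 in the combined file
grep -c '>' toy_combined.fasta   # 3
```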

We have to align the sequences; mafft should do the trick quite well. Assuming that you have fewer than 100 sequences to compare, mafft-xinsi, which considers the secondary structure of the SSU, would likely be the best choice: just replace mafft with mafft-xinsi in the command and output label.

  mafft --adjustdirection --thread -1 pf_collection_dbhits_and_spades.fasta > pf_collection_dbhits_and_spades_mafft.fasta

Now we need a tree - fasttree will provide a quick and reasonably good tree to assess the diversity

  fasttree -nt -gtr -quote pf_collection_dbhits_and_spades_mafft.fasta > pf_collection_dbhits_and_spades_mafft_fasttree.nwk
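Before uploading, you can sanity-check that the output really is Newick: a Newick tree is a parenthesized string terminated by a semicolon. A sketch on a hand-written toy tree standing in for the fasttree output:

```shell
# A hand-written toy Newick tree standing in for the fasttree output
printf '((A:0.1,B:0.2):0.05,C:0.3);\n' > toy.nwk

# The last non-newline character of a valid Newick file is ';'
last_char=$(tail -c 2 toy.nwk | head -c 1)
echo "$last_char"   # ;
```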

You can now upload the tree to itol and start exploring.