Module 6 Binning and Visualization - HRGV/Marmics_Metagenomics GitHub Wiki

Assembly visualization

In this wiki library, 909_A is used as an example. You need to replace it with your library.

Visualization of your assembly helps you to understand how complex your community is and to get an overview of the basic parameters that might help you to tease apart your genomes of interest

1. Contig connectivity

The assemblers we have used, spades and flye, are both graph based and generate an assembly graph from the data that we can use to understand the connections of assembled contigs. This concept is very powerful and can be used for automated binning, see e.g. GraphBin. To really understand what is going on, it is important to visualize the assembly. This can be best done with a tool called Bandage. Bandage constructs a visual representation of the assembly graph to see which contigs are connected to each other. These connections come from the assembly program and are based on the de Bruijn graph.

On the linux-desktop, start Bandage from terminal with

Bandage

For the spades assembly, please open Bandage and import the file called assembly-graph.fastg, then click draw graph. The input file is automatically produced by the spades assembler.

Example for library 909_A: assembly graph library 909_A

Bandage has many more features - one useful tool is to use BLAST to search e.g. for SSU genes in the assembly that were reconstructed with phyloFlash - see module 4. Create a BLAST database and load the rRNAs from phyloFlash that you can find in the LIBNAME.spades_rRNAs.final.fasta file. You can then e.g. draw only contigs that are connected to certain rRNAs.

2. Coverage-GC plot

During the assembly session, we have produced a mapping file with all reads aligned to our assembly. We also have created a file that contains our coverage statistics.

To visualize our assembly in a coverage-GC plot, we will use gbtlite.

--> you need the coverage statistics file that you have produced 909_A.covstats.txt

You can directly load the file generated, but for optimal use, please edit the file and remove the # in the top line in front of ID. For this, you can use sed: sed 's/#//' YOUR_COVSTATSFILE_HERE.txt > COVSTATS_fixed.txt. This will allow gbtlite to show the contig IDs.

Binning of an assembly

Automated binning with MetaBat

MetaBat uses tetranucleotide frequencies and abundance information (as inferred from read coverage of each contig) to automatically group your contigs into separate bins.

0. Create links to your input files

ln -s path-to/909_A.bam ./909_A.bam

ln -s path-to/scaffolds.fasta ./909_A.fasta

1. create depth file:

This file provides coverage information of each contig to the binning tool. To create this file run the following 3 commands:

samtools sort 909_A.bam 909_A.sort

samtools index 909_A.sort.bam

jgi_summarize_bam_contig_depths --outputDepth 909_A.depth.txt 909_A.sort.bam

--outputDepth specifies the output file with coverage information needed for running metabat in step 2

2. run metabat:

To run the binning tool, using the depth file 909_A.depth.txt that you have created in step 1, run:

metabat2 -i 909_A.fasta -a 909_A.depth.txt -o 909_A.metabat -t 10 -m 1500 -v --unbinned

-i your input assembly file

-o your output identifier

-t number of CPUs

-m minimum contig length

-v verbose mode

--unbinned put everything that could not be binned into one file

3. extract contig names from created bin for visualization with gbtools

Now we want to see which contigs were grouped together into separate bins by MetaBat. For the visualization in gbtools we need a list of contigs that belong to each bin. Run the following command on each of your metabat bins to create this list.

grep ">" 909_A.metabat.1.fa | sed 's/>//g' > 909_A.metabat.1.contigNames

Visualization of bin with gbtools

Before you do anything you need to delete the # in the beginning of your covstats file:

nano 909_A.covstats.txt

See gbtools for details

Log into R

Rstudio or R alternatively

Then install required libraries:

install.packages("/bioinf/transfer/Marmics_SPMG2019/software/Rcpp_1.0.0.tar.gz",repos=NULL,type="source")

install.packages("/bioinf/transfer/Marmics_SPMG2019/software/plyr_1.8.4.tar.gz",repos=NULL,type="source")

install.packages("/bioinf/transfer/Marmics_SPMG2019/software/sp_1.3-1.tar.gz",repos=NULL,type="source")

install.packages("/bioinf/transfer/Marmics_SPMG2019/software/genome-bin-tools/R_source_package/gbtools_2.6.0.tar.gz",repos=NULL,type="source")

Now you you are ready to plot your genome:

open gbtools.Rmd in Rstudio

or, if working in R console:

To load required libraries:

library(plyr)

library(sp)

library(gbtools)

Load coverage-GC file

d <- gbt(covstats="/home/ransorge/teaching/909_A.covstats.txt")

Plot coverage-GC:

plot(d)

Import and visualize bins:

bin5.contigNames <- scan(file="/home/ransorge/teaching/909_A.metabat.1.contigNames",what=character())

d.bin5 <- gbtbin(shortlist=bin5.contigNames,x=d,slice=NA)

points(d.bin5,col="green",slice=1)

Save plot as picture

png("PATH-TO/bin.909_A.png")

plot(d)

points(d.bin5,col="green",slice=1)

dev.off()

OPTIONAL: manual binning with gbtools

Follow data import as before into R or Rstudio.

d <- gbt(covstats="/home/ransorge/teaching/909_A.covstats.txt")

plot(d)

for bin selection run the following command and select bin by manual clicking around the region of choise in coverage-GC plot:

bin1 <- choosebin(d, slice=1, save=TRUE, file="bin1.909_A.list")

Example how chosen bin can look:

Further options included in gbtools wiki to improve binning
Use PhylaAmphora to assign marker genes
extract 16S rRNA info

Check the quality of your genome-bins

Conserved marker gene counts

You can get a good idea how complete and clean your genome bins are by looking at the presence of single copy maker genes. Single copy marker genes are different for different levels of taxonomy, e.g. Bacteria share <100 genes while Gammaproteobacteria ~300. To optimally compute the completeness of a genome, one should use the specific gene set for the taxon you are analyzing. One tool that fully automates the process of finding the optimal gene set and then calculates completeness for the given gene set is Checkm.

Please collect all bins you would like to check in the folder /bioinf/transfer/Marmics_SPMG2019/checkm_tests in your course folder.
The files should be labelled

YOURNAME_LIBRARY_BINDETAILS.fa

we will then collectively run the analysis for you with the following command:

checkm taxonomy_wf class Gammaproteobacteria ../checkm_tests/ ./checkm_Gammaproteobacteria -t 16 -x fa > checkm_all_Gammaproteobacteria_v1.txt

rRNAs and tRNAs

Every bacterial genome should have rRNAS - at least one 16S and one 23S rRNA - as well as a good set of tRNAs that cover the amino acids that are encoded for in the coding sequences.

You can quickly search for rRNAs in your genomes with barrnap and for tRNAs using ARAGORN(http://mbio-serv2.mbioekol.lu.se/ARAGORN/). Both programs are installed and available to you, run them on your bins and check to outputs.

barrnap YOURNAME_LIBRARY_BINDETAILS.fa

aragorn YOURNAME_LIBRARY_BINDETAILS.fa

Other binning resources

Binning and visualization

Blobology

Differential coverage binning

gbtools

Tools for automated binning

Setup of the environment in case Bandage installed conda fails

change samtools PATH and load the qt library

nano ~/.bashrc

and add the following lines:

export LD_LIBRARY_PATH=/bioinf/software/qt/5.9/gcc_64/lib:$LD_LIBRARY_PATH

export PATH=/bioinf/transfer/Marmics_SPMG2019/software:$PATH

Write out the file (Ctrl+X) and then do

source ~/.bashrc

to be able to use the new software paths.

Test other software to make sure it runs:

metabat2

if you have problems with metabat2, please add the following line to your .bashrc

export PATH=/bioinf/transfer/Marmics_SPMG2019/software/metabat:$PATH