Module 6 Binning and Visualization - HRGV/Marmics_Metagenomics GitHub Wiki
Assembly visualization
In this wiki library, 909_A
is used as an example. You need to replace it with your library.
Visualization of your assembly helps you to understand how complex your community is and to get an overview of the basic parameters that might help you to tease apart your genomes of interest
1. Contig connectivity
The assemblers we have used, spades and flye, are both graph based and generate an assembly graph from the data that we can use to understand the connections of assembled contigs. This concept is very powerful and can be used for automated binning, see e.g. GraphBin. To really understand what is going on, it is important to visualize the assembly. This can be best done with a tool called Bandage. Bandage constructs a visual representation of the assembly graph to see which contigs are connected to each other. These connections come from the assembly program and are based on the de Bruijn graph.
On the linux-desktop, start Bandage from terminal with
Bandage
For the spades assembly, please open Bandage and import the file called assembly-graph.fastg
, then click draw graph
. The input file is automatically produced by the spades assembler.
Example for library 909_A
:
Bandage has many more features - one useful tool is to use BLAST to search e.g. for SSU genes in the assembly that were reconstructed with phyloFlash - see module 4. Create a BLAST database and load the rRNAs from phyloFlash that you can find in the LIBNAME.spades_rRNAs.final.fasta
file. You can then e.g. draw only contigs that are connected to certain rRNAs.
2. Coverage-GC plot
During the assembly session, we have produced a mapping file with all reads aligned to our assembly. We also have created a file that contains our coverage statistics.
To visualize our assembly in a coverage-GC plot, we will use gbtlite.
--> you need the coverage statistics file that you have produced 909_A.covstats.txt
You can directly load the file generated, but for optimal use, please edit the file and remove the #
in the top line in front of ID. For this, you can use sed
: sed 's/#//' YOUR_COVSTATSFILE_HERE.txt > COVSTATS_fixed.txt
. This will allow gbtlite to show the contig IDs.
Binning of an assembly
MetaBat
Automated binning withMetaBat uses tetranucleotide frequencies and abundance information (as inferred from read coverage of each contig) to automatically group your contigs into separate bins.
0. Create links to your input files
ln -s path-to/909_A.bam ./909_A.bam
ln -s path-to/scaffolds.fasta ./909_A.fasta
1. create depth file:
This file provides coverage information of each contig to the binning tool. To create this file run the following 3 commands:
samtools sort 909_A.bam 909_A.sort
samtools index 909_A.sort.bam
jgi_summarize_bam_contig_depths --outputDepth 909_A.depth.txt 909_A.sort.bam
--outputDepth
specifies the output file with coverage information needed for running metabat in step 2
2. run metabat:
To run the binning tool, using the depth file 909_A.depth.txt
that you have created in step 1, run:
metabat2 -i 909_A.fasta -a 909_A.depth.txt -o 909_A.metabat -t 10 -m 1500 -v --unbinned
-i
your input assembly file
-o
your output identifier
-t
number of CPUs
-m
minimum contig length
-v
verbose mode
--unbinned
put everything that could not be binned into one file
3. extract contig names from created bin for visualization with gbtools
Now we want to see which contigs were grouped together into separate bins by MetaBat. For the visualization in gbtools we need a list of contigs that belong to each bin. Run the following command on each of your metabat bins to create this list.
grep ">" 909_A.metabat.1.fa | sed 's/>//g' > 909_A.metabat.1.contigNames
Visualization of bin with gbtools
Before you do anything you need to delete the #
in the beginning of your covstats file:
nano 909_A.covstats.txt
See gbtools for details
Log into R
Rstudio or R
alternatively
Then install required libraries:
install.packages("/bioinf/transfer/Marmics_SPMG2019/software/Rcpp_1.0.0.tar.gz",repos=NULL,type="source")
install.packages("/bioinf/transfer/Marmics_SPMG2019/software/plyr_1.8.4.tar.gz",repos=NULL,type="source")
install.packages("/bioinf/transfer/Marmics_SPMG2019/software/sp_1.3-1.tar.gz",repos=NULL,type="source")
install.packages("/bioinf/transfer/Marmics_SPMG2019/software/genome-bin-tools/R_source_package/gbtools_2.6.0.tar.gz",repos=NULL,type="source")
Now you you are ready to plot your genome:
open gbtools.Rmd in Rstudio
or, if working in R console:
To load required libraries:
library(plyr)
library(sp)
library(gbtools)
Load coverage-GC file
d <- gbt(covstats="/home/ransorge/teaching/909_A.covstats.txt")
Plot coverage-GC:
plot(d)
Import and visualize bins:
bin5.contigNames <- scan(file="/home/ransorge/teaching/909_A.metabat.1.contigNames",what=character())
d.bin5 <- gbtbin(shortlist=bin5.contigNames,x=d,slice=NA)
points(d.bin5,col="green",slice=1)
Save plot as picture
png("PATH-TO/bin.909_A.png")
plot(d)
points(d.bin5,col="green",slice=1)
dev.off()
OPTIONAL: manual binning with gbtools
Follow data import as before into R or Rstudio.
d <- gbt(covstats="/home/ransorge/teaching/909_A.covstats.txt")
plot(d)
for bin selection run the following command and select bin by manual clicking around the region of choise in coverage-GC plot:
bin1 <- choosebin(d, slice=1, save=TRUE, file="bin1.909_A.list")
Example how chosen bin can look:
Further options included in gbtools wiki to improve binning
Use PhylaAmphora to assign marker genes
extract 16S rRNA info
Check the quality of your genome-bins
Conserved marker gene counts
You can get a good idea how complete and clean your genome bins are by looking at the presence of single copy maker genes. Single copy marker genes are different for different levels of taxonomy, e.g. Bacteria share <100 genes while Gammaproteobacteria ~300. To optimally compute the completeness of a genome, one should use the specific gene set for the taxon you are analyzing. One tool that fully automates the process of finding the optimal gene set and then calculates completeness for the given gene set is Checkm.
Please collect all bins you would like to check in the folder /bioinf/transfer/Marmics_SPMG2019/checkm_tests
in your course folder.
The files should be labelled
YOURNAME_LIBRARY_BINDETAILS.fa
we will then collectively run the analysis for you with the following command:
checkm taxonomy_wf class Gammaproteobacteria ../checkm_tests/ ./checkm_Gammaproteobacteria -t 16 -x fa > checkm_all_Gammaproteobacteria_v1.txt
rRNAs and tRNAs
Every bacterial genome should have rRNAS - at least one 16S and one 23S rRNA - as well as a good set of tRNAs that cover the amino acids that are encoded for in the coding sequences.
You can quickly search for rRNAs in your genomes with barrnap and for tRNAs using ARAGORN(http://mbio-serv2.mbioekol.lu.se/ARAGORN/). Both programs are installed and available to you, run them on your bins and check to outputs.
barrnap YOURNAME_LIBRARY_BINDETAILS.fa
aragorn YOURNAME_LIBRARY_BINDETAILS.fa
Other binning resources
Binning and visualization
Tools for automated binning
Setup of the environment in case Bandage installed conda fails
change samtools PATH and load the qt library
nano ~/.bashrc
and add the following lines:
export LD_LIBRARY_PATH=/bioinf/software/qt/5.9/gcc_64/lib:$LD_LIBRARY_PATH
export PATH=/bioinf/transfer/Marmics_SPMG2019/software:$PATH
Write out the file (Ctrl+X) and then do
source ~/.bashrc
to be able to use the new software paths.
Test other software to make sure it runs:
metabat2
if you have problems with metabat2, please add the following line to your .bashrc
export PATH=/bioinf/transfer/Marmics_SPMG2019/software/metabat:$PATH