01_Metagenome bin visualization - esogin/seagrassOmics GitHub Wiki

Visualize binning results

EM Sogin

Created May 19,2019

Updated: May 25, 2019

Description: Visaulize binning results with both GB tools to get cov. by gc stats and differential coverage. Also, experiment with anvio to see wha this tool can help with in terms of analyses.

This page will focus on different ways to try to visualize bins from sediment metagenomes. Options for visualization:

  1. gbtools + r graphics
  2. Anvio
  3. Vizbin here

1. With gbtools

Goal is to create differential coverage plots between inside and the edge of the seagrass meadows to better understand which MAGs are more abundant inside the meadow. Also, we can create gc - coverage plots colored by bin.

First we need to prep the files for gbtools using the parse_bin_fasta_files.pl script that is part of the gbtools package. This works really well if you have a list of bins (file names) and bin ids. I split the lists up by high taxonomic groups to facilitate visualization so we are not trying to plot all bins at once.

general call: ~/tools/genome-bin-tools-master/accessory_scripts/parse_bin_fasta_files.pl -i archaea.bins -o archaea.bins.table.tsv

We also need to reformat covstats files from the mapping. See gbtools wiki on best practices.

Then we can start R and run the following commands for each bin set:

library(gbtools)
files<-list.files(pattern="*COVSTATS", recursive=T, include.dirs=T)
#Import covstats files 
d <- gbt(covstats=c("covstats/3847_A_COVSTATS", "covstats/3847_B_COVSTATS", "covstats/3847_C_COVSTATS","covstats/3847_D_COVSTATS","covstats/3847_E_COVSTATS", "covstats/3847_F_COVSTATS","covstats/3847_G_COVSTATS", "covstats/3847_H_COVSTATS","covstats/3847_I_COVSTATS"))
summary(d)
d.bins.archaea<-importBins(x=d, file="data/archaea.bins.table.tsv", to.list=T)

libs<-c('3847_A','3847_B','3847_C','3847_D','3847_E','3847_F','3847_G','3847_H','3847_I') 

bl<-list(d.bins.acido,d.bins.actino, d.bins.alpha, d.bins.alpha.kilo,d.bins.alpha.rhizo,d.bins.alpha.rhodo,d.bins.archaea,d.bins.bact,d.bins.desulfo,d.bins.gamma.gr1,d.bins.gamma.gr2,d.bins.gamma.gr3,d.bins.gamma.gr4,d.bins.gamma.pseudomonadales,d.bins.myo,d.bins.other,d.bins.other2)
lists<-c('acido','actino','alpha','alpha.kilo', 'alpha.rhizo','alpha.rhodo','archaea','bacteroida','desulfobacteria','gamma_gr1','gamma_gr2','gamma_gr3','gamma_gr4','gamma_pseudomonadales', 'myxococcota','other','other2')

# Create plots for all lists --> inspect plots
for (l in 1:length(bl)){
pdf(file=paste("gc_cov_plot_",lists[l],sep='',".pdf"),width=15, height=15)
par(mfrow=c(3,3))
	for(i in 1:9){
		multiBinPlot(d, bins=bl[l](/esogin/seagrassOmics/wiki/l), binNames=names(bl[l](/esogin/seagrassOmics/wiki/l)),slice=i,cutoff=10000, main=libs[i],legend=T, ylim=c(0.0001,1E3))
	}
	dev.off()
}

I used for loops to plot all libraries for each set of bins to make life a bit easier. In my .md notebooks, there is also code for generated diff coverage plots pairwise across habitats.

Overall impression:

  1. I identified 25 bins that showed higher abundance inside then at the edge of the meadow (from the diff coverage plots)
  2. These are really hard to plot all at once, need a different method to make it work.

New idea! Plot contig coverage for each bin of interest across libraries using a violin plot with ggplot2 Still take advantage of the info learned above with gbtools and the bin parser script developed by Brandon.

Workflow:

  1. get columns of interest from covstats files
libs=$(echo 3847_{A..I}_COVSTATS)
for i in $libs; do cut -f 1,2,3 $i > "$i"_forR;done

1a. Use Brandon's scripts to get bins+ contigs into R

~/tools/genome-bin-tools-master/accessory_scripts/parse_bin_fasta_files.pl -i bins -o all.bins.tsv 
  1. import covstats files into R and associated with bins
# Import COV Stats Files 
files<-list.files(pattern='forR',recursive=T, full.names=T)
LIBS<-c('3847_A','3847_B','3847_C','3847_D','3847_E','3847_F','3847_G','3847_H','3847_I')

#Read in covstats and paste into dataframe
covstats<-data.frame()
for (i in 1:length(files)){
lib_import<-read.table(files[i],header=T)
lib_import<-lib_import[which(lib_import$Length > 1000),]
lib_import$library<-LIBS[i]
covstats<-rbind(covstats, data.frame(lib_import))
}
# read bin files of meadow bins 
bins.meadow<-read.table('data/meadow.bins.tsv', header=F) #for the identified 25
bins.all<-read.table('data/all.bins.tsv', header=F) # for all

#choose which one to use
bins<-bins.all
colnames(bins)<-c('name','contigID')

#match to names
covstats$name<-bins[match(covstats$ID,bins$contigID),'name'] 
covstats2<-covstats[complete.cases(covstats),]

#Save for plotting
write.csv(covstats2,'results/annotated_covstats.csv')

This approach worked really nicely to highlight where bins are more abundant in the different habitats. Instead of focusing only on the 25 bins identified above, re-do plots for all bins (code now implemented above)

Next steps:

  1. Develop lists of In, Edge, Out bins across libraries --> Which bins associate more strongly with which habitat, what is background noise?
  2. Investigate meadow bins in greater detail
  3. Check to see if meadow bins align with predicting from 16S Reads -> in general we can find ASVs that have similar taxonomic strings to bins. They are now noted. ✅
  4. Re-create plot with all bins x habitat (may need to split up table to be effective)