Metagenomic Co Assembly Proof of Concept - ababaian/serratus GitHub Wiki

By Tomer Altman, Connor Morgan-Lang, and Ryan McLaughlin**

(** Big thanks to Ryan for doing a ton of work to put all the data together!)

Abstract

We set out to demonstrate that when you co-assemble read sets from many samples from a common source, you can recover more complete genomes, and genomes of rare members of the community. To this end, we fetched all 50 runs of PRJNA602689, which contains runs SRR10951656 and SRR10951660; the "low coverage Infectious Bronchitis Virus in Sus scrofa" study. So while we did not perform assembly for all of the benchmark datasets, we zeroed in on a particular use case to see if we could recover additional viral contigs. Below we present our approach, our findings, and instructions on fetching the data resources.

Methods

Samples were downloaded from the SRA using the prefetch (v2.8.0) and fastq files extracted using fastq-dump (v2.8.0). The Sus scrofa reference genome (GCF_000003025.6) was downloaded from NCBI. The Sus scrofa genome was indexed for bmfilter (v.3.102.4) and srprism (2.4.24-alpha) and a blastdb was created for blastn (2.5.0+), all for use by bmtagger (1.1.0). Paired-end reads for all samples were run through bmtagger to remove contaminating reads from the host (Sus s.). Blacklisted reads were filtered from all fastqs using seqmagick (v0.8.0). Reads were trimmed for quality using bbduk (v38.79). Finally all QC'ed samples were concatenated then co-assembled with MegaHIT (v1.2.9). Binning was preformed on contigs from the co-assembly with MetaBat2 (v2.12.1). Bin quality, presence and quality of recovered viral contigs, was assessed using CheckV (v0.6.0). We ran NCBI Blast in blastn mode against a local copy of the NT database, restricting the search to only viral entries, to annotate the contigs.

The entire workflow except for the Blast annotation was performed at a UBC computing grid, on a machine with 8 cores and approximately 64 GB RAM.

Results

Pipeline Performance

Read filtering and trimming took approximately 45 minutes and <16 GB RAM. The combined QC'ed read pairs numbered 30,152,318, and took up 15.3 GB on disk. It took 127 minutes to run the co-assembly, with an approximate RAM utilization of 53 GB and resulted in 2,395 contigs of length greater than 500bp. The binning took 20 minutes and generated 70 bins. Running CheckV on all of the bins took <1 minute and <1 GB RAM per bin.

Quality metric output from MegaHIT for the co-assembly:

2395 contigs, total 1818119 bp, min 500 bp, max 15999 bp, avg 759 bp, N50 702 bp
Time elapsed: 2346.165702 seconds

Quality metric output from MegaHIT for run SRR10951656:

2020-06-04 06:37:36 - 62 contigs, total 48807 bp, min 501 bp, max 2010 bp, avg 787 bp, N50 760 bp
2020-06-04 06:37:36 - ALL DONE. Time elapsed: 25.971600 seconds

Quality metric output from MegaHIT for run SRR10951660:

2020-06-04 06:39:34 - 49 contigs, total 38468 bp, min 502 bp, max 3121 bp, avg 785 bp, N50 783 bp
2020-06-04 06:39:34 - ALL DONE. Time elapsed: 28.095327 seconds 

Viral Findings

The following table lists the name of the organism, and the cumulative length of contigs with annotations to that organism, for organisms with a cumulative length greater than 1,000:

Infectious bronchitis virus     27753
Dickeya phage phiDP10.3 15708
Avian avulavirus 1      15205
Porcine astrovirus 4    10996
Moraxella phage Mcat16  8915
uncultured human fecal virus    8261
Mamastrovirus 3 6978
Pseudomonas phage PN05  4099
Epstein-Barr virus      2652
Pseudomonas phage vB_Pae_CF53a  1336
Porcine bocavirus       1221
Porcine rotavirus       1172

We discuss a select number of these sequences below.

Infectious Bronchitis Virus

We recovered a bin of IBV with 27,753 bp across three contigs of length 15,999, 8,971, 2,783, respectively. While the right target size for a coronavirus, CheckV declared the largest contig of medium-quality, and the other contigs of low quality, despite the Blast results showing high identity:

# Fields: query acc.ver, subject acc.ver, subject tax id, subject com names, query length, subject length, alignment length, % identity, % query coverage per subject, bit score, evalue
k119_5330       MH878976.1      11120   Infectious bronchitis virus     2783    27467   2783    99.928  100     5011    0.0
k119_5774       FJ904716.1      11120   Infectious bronchitis virus     15999   27629   16002   93.670  100     24288   0.0
k119_6018       MH878976.1      11120   Infectious bronchitis virus     8971    27467   8885    99.989  99      16019   0.0

High-Quality Bins

CheckV flagged the following bins as being of high quality:

# Fields: contig_id	contig_length	genome_copies	gene_count	viral_genes	host_genes	checkv_quality	miuvig_quality	completeness	completeness_method	contamination	provirus
checkv_bin_12/quality_summary.tsv:k119_5808     15205   1.0     6       6       0       High-quality    High-quality    99.74   AAI-based       0.0     No
checkv_bin_17/quality_summary.tsv:k119_6451     6458    1.0     3       3       0       High-quality    High-quality    99.35   AAI-based       0.0     No
checkv_bin_47/quality_summary.tsv:k119_5360     6670    1.0     3       3       0       High-quality    High-quality    100.0   AAI-based       0.0     No

Looking up their annotation from Blast:

# Fields: query acc.ver, subject acc.ver, subject tax id, subject com names, query length, subject length, alignment length, % identity, % query coverage per subject, bit score, evalue
k119_5808       MH996950.1      11176   Avian avulavirus 1      15205   15147   15138   99.908  99      27237   0.0
k119_6451       KP747574.1      1239567 Mamastrovirus 3 6458    6402    6400    95.828  99      10356   0.0
k119_5360       MK613068.1      1105379 Porcine astrovirus 4    6670    6733    6770    77.947  99      5383    0.0

So we recovered two high-quality genomes, one for Avian avulavirus 1 and one for Mamastrovirus 3, which have high identity to the reference database. And we recovered a third high-quality genome, Porcine astrovirus 4, which has 77.95 % identity to the closest reference sequence in the database.

Code & Data Files

The following data files are available at the S3 URI below:

  • Two individual assemblies
  • The co-assembly
  • Binning & CheckV output for the co-assembly
  • Blast results for the co-assembly
  • Rough shell script documenting all of the processing steps taken
s3://serratus-public/notebook/200526_assembly/ta_cml_rm/