Metagenomic Co Assembly Proof of Concept - ababaian/serratus GitHub Wiki
By Tomer Altman, Connor Morgan-Lang, and Ryan McLaughlin**
(** Big thanks to Ryan for doing a ton of work to put all the data together!)
Abstract
We set out to demonstrate that when you co-assemble read sets from many samples from a common source, you can recover more complete genomes, and genomes of rare members of the community. To this end, we fetched all 50 runs of PRJNA602689
, which contains runs SRR10951656
and SRR10951660
; the "low coverage Infectious Bronchitis Virus in Sus scrofa" study. So while we did not perform assembly for all of the benchmark datasets, we zeroed in on a particular use case to see if we could recover additional viral contigs. Below we present our approach, our findings, and instructions on fetching the data resources.
Methods
Samples were downloaded from the SRA using the prefetch (v2.8.0) and fastq files extracted using fastq-dump (v2.8.0). The Sus scrofa reference genome (GCF_000003025.6) was downloaded from NCBI. The Sus scrofa genome was indexed for bmfilter (v.3.102.4) and srprism (2.4.24-alpha) and a blastdb was created for blastn (2.5.0+), all for use by bmtagger (1.1.0). Paired-end reads for all samples were run through bmtagger to remove contaminating reads from the host (Sus s.). Blacklisted reads were filtered from all fastqs using seqmagick (v0.8.0). Reads were trimmed for quality using bbduk (v38.79). Finally all QC'ed samples were concatenated then co-assembled with MegaHIT (v1.2.9). Binning was preformed on contigs from the co-assembly with MetaBat2 (v2.12.1). Bin quality, presence and quality of recovered viral contigs, was assessed using CheckV (v0.6.0). We ran NCBI Blast in blastn mode
against a local copy of the NT database, restricting the search to only viral entries, to annotate the contigs.
The entire workflow except for the Blast annotation was performed at a UBC computing grid, on a machine with 8 cores and approximately 64 GB RAM.
Results
Pipeline Performance
Read filtering and trimming took approximately 45 minutes and <16 GB RAM. The combined QC'ed read pairs numbered 30,152,318, and took up 15.3 GB on disk. It took 127 minutes to run the co-assembly, with an approximate RAM utilization of 53 GB and resulted in 2,395 contigs of length greater than 500bp. The binning took 20 minutes and generated 70 bins. Running CheckV on all of the bins took <1 minute and <1 GB RAM per bin.
Quality metric output from MegaHIT for the co-assembly:
2395 contigs, total 1818119 bp, min 500 bp, max 15999 bp, avg 759 bp, N50 702 bp
Time elapsed: 2346.165702 seconds
Quality metric output from MegaHIT for run SRR10951656
:
2020-06-04 06:37:36 - 62 contigs, total 48807 bp, min 501 bp, max 2010 bp, avg 787 bp, N50 760 bp
2020-06-04 06:37:36 - ALL DONE. Time elapsed: 25.971600 seconds
Quality metric output from MegaHIT for run SRR10951660
:
2020-06-04 06:39:34 - 49 contigs, total 38468 bp, min 502 bp, max 3121 bp, avg 785 bp, N50 783 bp
2020-06-04 06:39:34 - ALL DONE. Time elapsed: 28.095327 seconds
Viral Findings
The following table lists the name of the organism, and the cumulative length of contigs with annotations to that organism, for organisms with a cumulative length greater than 1,000:
Infectious bronchitis virus 27753
Dickeya phage phiDP10.3 15708
Avian avulavirus 1 15205
Porcine astrovirus 4 10996
Moraxella phage Mcat16 8915
uncultured human fecal virus 8261
Mamastrovirus 3 6978
Pseudomonas phage PN05 4099
Epstein-Barr virus 2652
Pseudomonas phage vB_Pae_CF53a 1336
Porcine bocavirus 1221
Porcine rotavirus 1172
We discuss a select number of these sequences below.
Infectious Bronchitis Virus
We recovered a bin of IBV with 27,753 bp across three contigs of length 15,999, 8,971, 2,783, respectively. While the right target size for a coronavirus, CheckV declared the largest contig of medium-quality, and the other contigs of low quality, despite the Blast results showing high identity:
# Fields: query acc.ver, subject acc.ver, subject tax id, subject com names, query length, subject length, alignment length, % identity, % query coverage per subject, bit score, evalue
k119_5330 MH878976.1 11120 Infectious bronchitis virus 2783 27467 2783 99.928 100 5011 0.0
k119_5774 FJ904716.1 11120 Infectious bronchitis virus 15999 27629 16002 93.670 100 24288 0.0
k119_6018 MH878976.1 11120 Infectious bronchitis virus 8971 27467 8885 99.989 99 16019 0.0
High-Quality Bins
CheckV flagged the following bins as being of high quality:
# Fields: contig_id contig_length genome_copies gene_count viral_genes host_genes checkv_quality miuvig_quality completeness completeness_method contamination provirus
checkv_bin_12/quality_summary.tsv:k119_5808 15205 1.0 6 6 0 High-quality High-quality 99.74 AAI-based 0.0 No
checkv_bin_17/quality_summary.tsv:k119_6451 6458 1.0 3 3 0 High-quality High-quality 99.35 AAI-based 0.0 No
checkv_bin_47/quality_summary.tsv:k119_5360 6670 1.0 3 3 0 High-quality High-quality 100.0 AAI-based 0.0 No
Looking up their annotation from Blast:
# Fields: query acc.ver, subject acc.ver, subject tax id, subject com names, query length, subject length, alignment length, % identity, % query coverage per subject, bit score, evalue
k119_5808 MH996950.1 11176 Avian avulavirus 1 15205 15147 15138 99.908 99 27237 0.0
k119_6451 KP747574.1 1239567 Mamastrovirus 3 6458 6402 6400 95.828 99 10356 0.0
k119_5360 MK613068.1 1105379 Porcine astrovirus 4 6670 6733 6770 77.947 99 5383 0.0
So we recovered two high-quality genomes, one for Avian avulavirus 1 and one for Mamastrovirus 3, which have high identity to the reference database. And we recovered a third high-quality genome, Porcine astrovirus 4, which has 77.95 % identity to the closest reference sequence in the database.
Code & Data Files
The following data files are available at the S3 URI below:
- Two individual assemblies
- The co-assembly
- Binning & CheckV output for the co-assembly
- Blast results for the co-assembly
- Rough shell script documenting all of the processing steps taken
s3://serratus-public/notebook/200526_assembly/ta_cml_rm/