Proteome guided assembly for high divergence low coverage genomes - ababaian/serratus GitHub Wiki

RCE suggestion for an assembly protocol. There may be existing tools for this; if you know of one, please update the wiki, or email me at [email protected] if you don't have edit rights. Otherwise, I can hack something together in a day or two.

We always have at least one nucleotide alignment to a known virus genome, call this G. Let the SRA dataset be S.

Take the amino acid proteome of G, or a close neighbor if more convenient; call this P. If we don't have a good proteome, make P by extracting all ORFs from the nt sequence of G.
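As a rough illustration of the fallback in the last sentence, the sketch below builds P from the nt sequence of G by six-frame translation. It assumes the standard genetic code and stop-to-stop ORFs (no start codon required, which seems reasonable for polyproteins and diverged genomes). The function names and the `min_aa` length cutoff are hypothetical and not part of any existing Serratus code.

```python
# Standard genetic code, with codons enumerated in TCAG x TCAG x TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Reverse-complement an upper-case ACGT string (other letters pass through)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate in frame 0; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def extract_orfs(genome, min_aa=50):
    """Return stop-to-stop ORF translations of at least min_aa residues.

    All six reading frames are scanned. min_aa is an illustrative cutoff,
    not a recommended value.
    """
    genome = genome.upper().replace("U", "T")
    orfs = []
    for strand in (genome, reverse_complement(genome)):
        for frame in range(3):
            for segment in translate(strand[frame:]).split("*"):
                if len(segment) >= min_aa:
                    orfs.append(segment)
    return orfs
```

The returned list can be written out as a protein FASTA and used as P. A short cutoff keeps more spurious ORFs, which only costs search time, since spurious ORFs are unlikely to attract many read alignments.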

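Align all reads of S to P using translated BLAST (tblast), i.e. a search that compares the six-frame translation of each nucleotide read against the protein sequences in P (blastx-style when the reads are the queries).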

This search will produce a set of reads (V) which is highly enriched for virus and strongly filtered for host. In most cases, V should capture all of the viral coding sequence (CDS) that is present in S.

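This search will be much more sensitive to highly diverged viruses than bowtie2, because amino acid sequence is conserved over much greater evolutionary distances than nucleotide sequence.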

It is easy to automate in the cloud without knowing anything about the host.
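One way the read set V could be collected from the search results is sketched below. It assumes the translated search was run with the reads as queries and wrote the standard 12-column tabular format (BLAST+ `-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name and the `max_evalue` threshold are hypothetical.

```python
def select_viral_reads(tabular_lines, max_evalue=1e-3):
    """Map each read ID to its best-scoring (protein ID, bitscore) hit.

    tabular_lines: iterable of tab-separated lines in the 12-column
    BLAST tabular layout, with reads as queries (column 1) and proteins
    of P as subjects (column 2).
    max_evalue: illustrative cutoff; hits with a larger e-value are dropped.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        read_id, protein_id = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if read_id not in best or bitscore > best[read_id][1]:
            best[read_id] = (protein_id, bitscore)
    return best


example_hits = [
    "read1\tORF1ab\t62.5\t48\t18\t0\t1\t144\t310\t357\t2e-12\t61.2",
    "read1\tSpike\t30.0\t40\t28\t0\t5\t124\t90\t129\t5.0\t20.1",
    "read2\tSpike\t55.0\t40\t18\t0\t3\t122\t700\t739\t1e-08\t48.9",
    "read3\tN\t28.0\t25\t18\t0\t2\t76\t10\t34\t9.1\t18.0",
]
V = select_viral_reads(example_hits)
print(sorted(V))
```

The keys of the returned dictionary are the read IDs that make up V. Because nothing here depends on the host, the same filter applies unchanged to any SRA dataset.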

The tblast alignments enable consensus nt and aa sequences scaffolded to G. This will work even if coverage is <<1; uncovered positions become runs of amino acid Xs, analogous to the Ns in nt assemblies. This captures amino acid sequence in cases where de novo assembly fails completely because the reads do not overlap.
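The scaffolded amino acid consensus could be computed along the following lines. This is a minimal sketch under simplifying assumptions: each hit has already been reduced to a 1-based start position on the reference protein plus an aligned residue string with exactly one character per reference position (insertions in the read removed, deletions written as `-`), and the consensus is a simple majority vote. The function name and input layout are hypothetical.

```python
from collections import Counter


def aa_consensus(ref_length, hits):
    """Majority-vote amino acid consensus in reference protein coordinates.

    ref_length: length of the reference protein from P.
    hits: iterable of (start, aligned_aa) pairs; start is the 1-based
          reference position of the first aligned residue, and aligned_aa
          has one character per reference position.
    Positions with no informative residue are reported as 'X'.
    """
    columns = [Counter() for _ in range(ref_length)]
    for start, aligned_aa in hits:
        for offset, residue in enumerate(aligned_aa):
            pos = start - 1 + offset
            # Ignore gaps, stops and unknown residues when voting.
            if 0 <= pos < ref_length and residue not in "-*X":
                columns[pos][residue] += 1
    return "".join(
        col.most_common(1)[0][0] if col else "X" for col in columns
    )


# Four short translated reads on a 10-residue reference, with coverage gaps.
hits = [(2, "KLM"), (3, "LMN"), (3, "QMN"), (8, "WY")]
print(aa_consensus(10, hits))  # XKLMNXXWYX
```

Reads that do not overlap each other still land in the right place because each is positioned by its alignment to the reference rather than by overlap with other reads.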

PRICE can then be used to extend V into domains which are not alignable by tblast, if any, and to distill the reads into contigs if possible.