Assembly Data - ababaian/serratus GitHub Wiki

Accessing Assembly Data

This directory stores all assembly data generated by Serratus: the coronaviridae-assemblage as well as targeted assemblies of other viral families.

SRA Assemblies Master List: s3://lovelywater2/aindex.tsv

s3://lovelywater2/     # A Read-Only Archive of Serratus Data Releases
├── assembly/         # Viral assembly and annotation data
│   ├── cov/          # .fasta  : Assembled/filtered coronaviruses
│   ├── contigs/      # CoronaSPAdes output, contigs, graphs, stats...
│   ├── annotation/   # CoV annotation and taxonomic assignments
│   └── micro/        # Micro-assembly contigs (e.g. RdRP)
│       └── rdrp1/    # contigs per individual run (for web ui)
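
The layout above maps directly onto S3 object paths. The following is a minimal sketch, assuming only the bucket and folder names shown in the tree; the helper name `s3_uri` is hypothetical and is not part of any Serratus tooling.

```python
BUCKET = "lovelywater2"  # bucket name taken from the listing above


def s3_uri(*parts: str) -> str:
    """Join path components into an s3:// URI inside the archive bucket.

    Hypothetical helper for illustration only.
    """
    # Strip stray slashes so "assembly/" and "assembly" behave the same.
    key = "/".join(p.strip("/") for p in parts if p)
    return f"s3://{BUCKET}/{key}"


print(s3_uri("aindex.tsv"))
print(s3_uri("assembly", "cov"))
print(s3_uri("assembly", "micro", "rdrp1"))
```

The resulting URIs can be handed to any S3 client, for example `aws s3 cp`. Depending on how the bucket is configured, unsigned requests (`--no-sign-request`) may be needed; that is an assumption to check, not something stated on this page.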

Target assemblies

assembly/cov:

Extracted Coronaviridae contigs (11,120) assembled with coronaSPAdes and filtered using either CheckV or coronaSPAdes' bgc-statistics. See the Serratus manuscript for more details.

assembly/micro:

Reads mapping to a reference sequence (rdrp1) are aligned in isolation to yield a targeted or "micro-assembly". These files may contain non-RdRP or off-target assembled sequences. They may also contain "deep" RdRP sequences not yet recognized by HMM or palmscan. See the Serratus manuscript for more details.

The sub-folders contain the same data, "expanded" to one file per individual SRA run for use by the website UI.

RdRP palmprints

For a validated and unique set of RdRP (i.e. 130K novel + 15K known sOTU), a barcode sub-sequence of RdRP called the palmprint is extracted. The collection of all RdRP palmprints is stored in the PALMdb repository. See the Palmprint manuscript for details.

De novo assemblies

Whole-library assemblies, combined from several ongoing and past experiments.

assembly/contigs:

SRRXXXXXX.[assembler].assembly_graph_with_scaffolds.gfa.gz
SRRXXXXXX.[assembler].bgc_statistics.txt
SRRXXXXXX.[assembler].contigs.fa.mfc
SRRXXXXXX.[assembler].domain_graph.dot
SRRXXXXXX.[assembler].gene_clusters.fa
SRRXXXXXX.[assembler].scaffolds.fasta.gz
SRRXXXXXX.[assembler].scaffolds.paths
SRRXXXXXX.[assembler].log
SRRXXXXXX.[assembler].txt

All of these are [assembler] outputs, where [assembler] is either coronaSPAdes or rnaviralSPAdes. Depending on the assembler, a subset of these files will be present for each accession. Beware: contigs.fa.mfc actually contains the content of coronaSPAdes' scaffolds.fasta compressed with MFCompress.
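
The naming pattern above can be split into its three parts programmatically. This is a minimal sketch, assuming the `[assembler]` token contains no dot; the token `coronaspades` and the accession `SRR0000000` used below are placeholders, since this page does not state how the assembler name is actually spelled in file names.

```python
from typing import NamedTuple


class ContigFile(NamedTuple):
    accession: str  # e.g. an SRA run accession
    assembler: str  # the [assembler] token from the file name
    suffix: str     # everything after the assembler token


def parse_contig_filename(name: str) -> ContigFile:
    """Split '<accession>.<assembler>.<suffix>' into its parts.

    Hypothetical helper; assumes neither the accession nor the
    assembler token contains a dot.
    """
    parts = name.split(".", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected file name: {name!r}")
    return ContigFile(*parts)


def is_mfcompress(f: ContigFile) -> bool:
    """True for files that need MFCompress rather than gzip to decode."""
    return f.suffix.endswith(".mfc")


f = parse_contig_filename("SRR0000000.coronaspades.contigs.fa.mfc")
print(f.accession, f.assembler, f.suffix, is_mfcompress(f))
```

Splitting with a maximum of two splits keeps multi-part suffixes such as `scaffolds.fasta.gz` intact, which is why the suffix is returned as a single string.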

assembly/annotation:

This folder contains the annotation results of several programs applied to different inputs.

CheckV applied to the scaffolds.fasta and/or gene_clusters.fasta:

SRRXXXXXX.[assembler].checkv.completeness.tsv.gz
SRRXXXXXX.[assembler].checkv.contamination.tsv.gz
SRRXXXXXX.[assembler].checkv.quality_summary.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.completeness.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.contamination.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.quality_summary.tsv.gz
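
The CheckV tables are gzip-compressed TSV files, so they can be read with the Python standard library alone. The sketch below assumes a header row with `contig_id` and `checkv_quality` columns and quality labels such as `Complete` and `High-quality`; these follow CheckV's usual `quality_summary.tsv` output but are not confirmed on this page, so check them against the header of an actual file.

```python
import csv
import gzip


def read_tsv_gz(path):
    """Yield each row of a gzip-compressed TSV as a dict keyed by header."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def contigs_with_quality(path, wanted=("Complete", "High-quality")):
    """Return contig ids whose quality label is in `wanted`.

    Column names and labels are assumptions (see the note above).
    """
    return [
        row["contig_id"]
        for row in read_tsv_gz(path)
        if row.get("checkv_quality") in wanted
    ]
```

For example, `contigs_with_quality("SRRXXXXXX.[assembler].checkv.quality_summary.tsv.gz")` would list the contigs passing that filter once the placeholders are replaced with a real file name.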

serraplace (phylo placement) output of CheckV-filtered gene clusters:

SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serraplace.tar.gz

serratax (taxonomic identification) output of CheckV-filtered gene clusters:

SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serratax.final
SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serratax.tar.gz

The following are annotations of the assemblies in cov/. They include the outputs of Darth, a pipeline created within Serratus for annotating coronavirus assemblies.

SRRXXXXXX.fa.darth.alignments.fasta
SRRXXXXXX.fa.darth.alignments.sto
SRRXXXXXX.fa.darth.input_md5
SRRXXXXXX.fa.darth.stripped.tar.gz
SRRXXXXXX.fa.darth.tar.gz
SRRXXXXXX.fa.darth.transeq.alignments.fasta
SRRXXXXXX.fa.serraplace.tar.gz
SRRXXXXXX.fa.serratax.final
SRRXXXXXX.fa.serratax.tar.gz
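
Several of the outputs above are `.tar.gz` bundles whose internal layout is not described on this page. A minimal sketch for inspecting one before extracting it, using only the standard library; the function name `list_archive_members` is hypothetical.

```python
import tarfile


def list_archive_members(path):
    """Return the names of the regular files inside a .tar.gz archive."""
    with tarfile.open(path, "r:gz") as archive:
        return [m.name for m in archive.getmembers() if m.isfile()]
```

Calling it on a downloaded `SRRXXXXXX.fa.darth.tar.gz` (with a real accession substituted) would show which files the bundle contains without unpacking anything to disk.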

See also: Accessing Serratus Data