Assembly Data - ababaian/serratus GitHub Wiki
Accessing Assembly Data
This directory stores all assembly data generated by Serratus. This includes the coronaviridae-assemblage as well as targeted assemblies of other viral families we have done.
SRA Assemblies Master List: s3://lovelywater2/aindex.tsv
s3://lovelywater2/ # A Read-Only Archive of Serratus Data Releases
├── assembly/ # Viral assembly and annotation data
│ └─── cov/ # .fasta : Assembled/filtered coronaviruses
│ └─── contigs/ # CoronaSPAdes output, contigs, graphs, stats...
│ └─── annotation/ # CoV annotation and taxonomic assignments
│ └─── micro/ # Micro-assembly contigs (e.g. RdRP)
│ └─ rdrp1/ # contigs per individual run (for web ui)
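The layout above can be addressed programmatically. The following is a minimal sketch, not an official Serratus client: the helper names `s3_to_https` and `assembly_prefix` are hypothetical, and it assumes the bucket allows anonymous reads through the standard virtual-hosted-style S3 endpoint.

```python
from urllib.parse import urlparse

BUCKET = "lovelywater2"  # bucket name taken from the tree above


def s3_to_https(s3_uri: str) -> str:
    """Convert s3://bucket/key into a virtual-hosted-style HTTPS URL.

    Assumption: the bucket is publicly readable over HTTPS.
    """
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    key = parsed.path.lstrip("/")
    return f"https://{parsed.netloc}.s3.amazonaws.com/{key}"


def assembly_prefix(subdir: str) -> str:
    """Build the S3 prefix for one sub-folder of assembly/ (e.g. 'cov')."""
    return f"s3://{BUCKET}/assembly/{subdir.strip('/')}/"


if __name__ == "__main__":
    print(s3_to_https(f"s3://{BUCKET}/aindex.tsv"))
    print(assembly_prefix("contigs"))
```

If the AWS CLI is installed, the same objects can likely be copied with `aws s3 cp <uri> . --no-sign-request`, again assuming anonymous access is permitted.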
Target assemblies
assembly/cov: Coronaviridae extracted contigs (11,120) made with coronaSPAdes, where contigs have been filtered either using CheckV or using coronaSPAdes' bgc-statistics. See Serratus manuscript for more details.
assembly/micro: Reads mapping to a reference sequence (rdrp1) are aligned in isolation to yield a targeted assembly, or "micro-assembly". These files may contain non-RdRP or off-target assembled sequences. They may also contain "deep" RdRP sequences not yet recognized by HMM or palmscan. See Serratus manuscript for more details.
The sub-folders contain the same data but "expanded" to one file per individual SRA run meant for the website UI.
RdRP palmprints
For a validated and unique set of RdRP (i.e. 130K novel + 15K known sOTU), a barcode sub-sequence of RdRP called the palmprint is extracted. The collection of all RdRP palmprints is stored in the PALMdb repository. See Palmprint manuscript for details.
De novo assemblies
Whole library assemblies. These are combined from several ongoing and past experiments.
assembly/contigs:
SRRXXXXXX.[assembler].assembly_graph_with_scaffolds.gfa.gz
SRRXXXXXX.[assembler].bgc_statistics.txt
SRRXXXXXX.[assembler].contigs.fa.mfc
SRRXXXXXX.[assembler].domain_graph.dot
SRRXXXXXX.[assembler].gene_clusters.fa
SRRXXXXXX.[assembler].scaffolds.fasta.gz
SRRXXXXXX.[assembler].scaffolds.paths
SRRXXXXXX.[assembler].log
SRRXXXXXX.[assembler].txt
All of these are [assembler] outputs, where [assembler] is either coronaSPAdes or rnaviralSPAdes.
Depending on the assembler, a subset of these files will be present for each accession.
Beware: contigs.fa.mfc actually contains the content of coronaSPAdes' scaffolds.fasta compressed with MFCompress.
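The naming pattern `SRRXXXXXX.[assembler].<file type>` can be split mechanically. This is a minimal sketch under assumptions: `ContigFile`, `parse_contig_filename` and `holds_scaffolds` are hypothetical names, and the literal assembler token used in real filenames (shown here as `coronaspades`) is a guess rather than something this page specifies.

```python
from typing import NamedTuple


class ContigFile(NamedTuple):
    accession: str  # SRA run accession, e.g. "SRR0000001"
    assembler: str  # assembler token as it appears in the filename
    kind: str       # remainder of the name, e.g. "scaffolds.fasta.gz"


def parse_contig_filename(name: str) -> ContigFile:
    """Split 'ACCESSION.ASSEMBLER.KIND' on the first two dots."""
    parts = name.split(".", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected filename: {name!r}")
    return ContigFile(*parts)


def holds_scaffolds(kind: str) -> bool:
    """True for file types that carry scaffold sequences.

    Per the note above, contigs.fa.mfc holds scaffolds.fasta content
    (MFCompress-compressed), so it is treated like scaffolds.fasta.gz.
    """
    return kind in {"scaffolds.fasta.gz", "contigs.fa.mfc"}


if __name__ == "__main__":
    # "coronaspades" is a placeholder token, not a confirmed filename part.
    parsed = parse_contig_filename("SRR0000001.coronaspades.contigs.fa.mfc")
    print(parsed, holds_scaffolds(parsed.kind))
```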
assembly/annotation:
This folder contains the annotation results of several programs applied to different inputs.
CheckV applied to the scaffolds.fasta and/or gene_clusters.fasta:
SRRXXXXXX.[assembler].checkv.completeness.tsv.gz
SRRXXXXXX.[assembler].checkv.contamination.tsv.gz
SRRXXXXXX.[assembler].checkv.quality_summary.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.completeness.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.contamination.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.quality_summary.tsv.gz
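Once downloaded, the gzipped CheckV tables listed above can be read with the Python standard library alone. The sketch below is generic TSV handling; the helper name `read_tsv_gz` is hypothetical, and the column names in the usage comment (`contig_id`, `checkv_quality`) are assumed from typical CheckV output and are not confirmed by this page.

```python
import csv
import gzip
from typing import Dict, List


def read_tsv_gz(source) -> List[Dict[str, str]]:
    """Read a gzip-compressed, tab-separated table with a header row.

    `source` may be a path or a binary file object. Each row is returned
    as a dict keyed by the header names found in the file.
    """
    with gzip.open(source, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Usage sketch (the path is a placeholder for a file fetched locally):
# rows = read_tsv_gz("SRRXXXXXX.[assembler].checkv.quality_summary.tsv.gz")
# for row in rows:
#     print(row.get("contig_id"), row.get("checkv_quality"))
```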
serraplace (phylo placement) output of CheckV-filtered gene clusters:
SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serraplace.tar.gz
serratax (taxonomic identification) output of CheckV-filtered gene clusters:
SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serratax.final
SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serratax.tar.gz
Then, the following are annotations of the assemblies in cov/. They include the outputs of Darth, a pipeline created within Serratus for annotation of coronavirus assemblies.
SRRXXXXXX.fa.darth.alignments.fasta
SRRXXXXXX.fa.darth.alignments.sto
SRRXXXXXX.fa.darth.input_md5
SRRXXXXXX.fa.darth.stripped.tar.gz
SRRXXXXXX.fa.darth.tar.gz
SRRXXXXXX.fa.darth.transeq.alignments.fasta
SRRXXXXXX.fa.serraplace.tar.gz
SRRXXXXXX.fa.serratax.final
SRRXXXXXX.fa.serratax.tar.gz
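Because the producing tool is encoded in each annotation filename, a listing of this folder can be summarized per accession. The following is a rough sketch: `annotation_tool` and `tools_per_accession` are hypothetical helpers, and the matching rule (the right-most dot-separated token that equals a tool name) is an assumption inferred from the patterns listed above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

# Tool names as they appear in the filename patterns above.
TOOLS = ("checkv", "serraplace", "serratax", "darth")


def annotation_tool(filename: str) -> Optional[str]:
    """Return the tool that produced a file, or None if unrecognized.

    Tokens are scanned right to left so that
    '...checkv_filtered.fa.serratax.final' resolves to 'serratax'.
    """
    for token in reversed(filename.split(".")):
        if token in TOOLS:
            return token
    return None


def tools_per_accession(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Map each accession (text before the first dot) to its sorted tools."""
    found = defaultdict(set)
    for name in filenames:
        tool = annotation_tool(name)
        if tool is not None:
            found[name.split(".", 1)[0]].add(tool)
    return {acc: sorted(tools) for acc, tools in found.items()}
```

Feeding this the object names from a bucket listing gives a quick view of which accessions have Darth, serraplace, serratax or CheckV results available.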
See also: Accessing Serratus Data