Access Data Release - ababaian/serratus GitHub Wiki
Current Version: v230110
Versioned and structured data releases are freely hosted on AWS S3 in our data-warehouse: "lovelywater2".
Unstructured data and intermediate files are in the Working Data Directories.
- Search sequence references
- SRA Run Info Queries
- Summary-level data
- Alignment-level data (`.bam` or `.pro`, see notes below)
- Assembly-level data
- RdRP barcode sequences (PALMdb)
## Folder organization
Entries marked `***` are NEW/UPDATED.
s3://lovelywater2/ # A Read-Only Archive of Serratus Data Releases
⦿ Common files
├── assembly/ # Viral assembly and annotation data
│ └─── cov/ # .fasta : Assembled/filtered coronaviruses
│ └─── contigs/ # CoronaSPAdes output, contigs, graphs, stats...
│ └─── annotation/ # CoV annotation and taxonomic assignments
├── seq/ # Reference sequences used in data-releases
│ └─── cov3ma/ # Nucleotide viral pangenome
│ └─── protref5/ # Protein viral panproteome
│ └─── rdrp1/ # viral RNA dependent RNA polymerase collection 1
│ └─── rdrp5/ # dark RNA dependent RNA polymerase collection 5 ***
├── sra/ # sraRunInfo.csv files and queries for data (per query)
│ └─── README.md # see github.com/ababaian/serratus/wiki/SRA-queries ***
│ └─── *query* # (see below) ***
⦿ Nucleotide search files
├── bam/ # .bam : Aligned files
├── summary/ # .summary: Original alignment summaries (deprecated)
├── summary2/ # .summary: Alignment summaries
⦿ Translated-nucleotide (protein) search files
├── pro/ # .pro.gz : Translated-nucleotide alignments (diamond)
├── psummary/ # .psummary: Protein
⦿ RdRP 1 translated-nucleotide search files
├── rpro/ # .pro.gz : Aligned files ***
├── rsummary/ # .psummary: Alignment summaries for rdrp-search ***
⦿ Dark RdRP 5 translated-nucleotide search files
├── dpro/ # .pro.gz : Aligned files ***
├── dsummary/ # .psummary: Alignment summaries for rdrp-search ***
⦿ Index Files
├ index.tsv # Index file of completed SRA accessions
├ pindex.tsv # Index file of completed protein SRA accessions
├ rindex.tsv # Index file of completed rdrp SRA accessions ***
├ dindex.tsv # Index file of completed dark rdrp SRA accessions ***
├ LICENSE.md #
└ README.md # This README.md **
s3://lovelywater2/sra/
* QUERY SETS *
├ v201210/ # Query sets from major version v210225 and prior
├ v220113/ # Query sets from major version v210225
└ v230116_SraRunInfo.csv # master query CSV for v230116 ***
See also: SRA Query Sets
All folders are flat, with files named `{sra_accession}.{ext}`.
For example, the SRA library `SRA123456` processed in the 'viro' query will have the files:
- s3://lovelywater2/bam/SRA123456.bam
- s3://lovelywater2/summary/SRA123456.summary
- s3://lovelywater2/assembly/contigs/SRA123456.coronaSPAdes.gene_clusters.fa
The S3 bucket has public read-only permissions. All files can be downloaded via `aws cli` or `wget`/`curl`.
- `aws-cli`: `aws s3 cp s3://lovelywater2/<file_path> .`
- `wget`/`curl`: `wget https://lovelywater2.s3.amazonaws.com/<file_path>`
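
The flat naming convention and the two access routes above can be combined into a pair of small helpers. This is a minimal sketch: the function names `serratus_s3_uri` and `serratus_https_url` are hypothetical and not part of any Serratus tooling; only the bucket name and the URL patterns come from this page.

```shell
# Sketch only: both function names are hypothetical helpers for illustration.
BUCKET="lovelywater2"

# Print the S3 URI for {folder}/{sra_accession}.{ext}
serratus_s3_uri() {
  local folder="$1" accession="$2" ext="$3"
  printf 's3://%s/%s/%s.%s\n' "$BUCKET" "$folder" "$accession" "$ext"
}

# Print the equivalent public HTTPS URL for the same file
serratus_https_url() {
  local folder="$1" accession="$2" ext="$3"
  printf 'https://%s.s3.amazonaws.com/%s/%s.%s\n' "$BUCKET" "$folder" "$accession" "$ext"
}

serratus_s3_uri bam SRA123456 bam
serratus_https_url summary SRA123456 summary
```

The first form can be passed to `aws s3 cp`, the second to `wget` or `curl`.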
To find or access a subset of the data, use the index file:
`aws s3 cp s3://lovelywater2/index.tsv ./`
`grep "SRR1234" index.tsv > matches`
`aws s3 cp --recursive --exclude "*" --include "SRR1234*" s3://lovelywater2/summary/ ./SRR1234/`
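
For users without the AWS CLI, the same subset can be expressed as a list of HTTPS URLs built from the index. The sketch below rests on an assumption this page does not confirm: that `index.tsv` is tab-separated with the SRA accession in its first column. The function name `index_to_summary_urls` is hypothetical.

```shell
# Sketch only. ASSUMPTION: index.tsv is tab-separated and column 1 holds the
# SRA accession (not confirmed here). index_to_summary_urls is a hypothetical name.
index_to_summary_urls() {
  local prefix="$1" index_file="$2" folder="${3:-summary}"
  # index($1, p) == 1 keeps rows whose first column starts with the prefix
  awk -F'\t' -v p="$prefix" -v folder="$folder" \
    'index($1, p) == 1 {
       printf "https://lovelywater2.s3.amazonaws.com/%s/%s.summary\n", folder, $1
     }' "$index_file"
}
```

The resulting list can be piped to a downloader, for example `index_to_summary_urls SRR1234 index.tsv | wget -i - -P SRR1234/`. Passing `summary2` as the third argument targets the current summary folder instead of the deprecated one.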
As of version 20200821, all `.bam` files are sorted and have an associated `.bai` index file in the `~/bam/` directory. These alignment files can be visualized directly in a genome browser such as `igv`, using `cov3ma` as the reference genome.
IGV Stream Alignment: `File` --> `Load from URL` --> `https://lovelywater2.s3.amazonaws.com/bam/ERR2756788.bam`
You can then navigate to a relevant accession such as "EU769558.1" and directly visualize read alignments.
Translated-nucleotide alignment data are saved as `.pro` files, the output of `diamond -f 6` with the following ordered fields:
qseqid qstart qend qlen qstrand sseqid sstart send slen pident evalue cigar qseq_translated full_qseq full_qseq_mate
(See also: Diamond Wiki)
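
Because the columns have a fixed order, individual fields can be pulled out by position. The sketch below assumes the files are tab-separated (the usual layout for DIAMOND tabular output) and gzip-compressed, as the `.pro.gz` entries in the folder listing suggest. The function name `pro_filter_hits` and the default identity cutoff of 50 are illustrative choices, not Serratus defaults.

```shell
# Sketch only. ASSUMPTIONS: tab-separated, gzip-compressed input with columns in
# the order listed above (qseqid=1, sseqid=6, pident=10, evalue=11).
# pro_filter_hits and the cutoff of 50 are hypothetical, for illustration.
pro_filter_hits() {
  local pro_gz="$1" min_pident="${2:-50}"
  gzip -dc "$pro_gz" |
    awk -F'\t' -v OFS='\t' -v min="$min_pident" \
      '$10 + 0 >= min + 0 { print $1, $6, $10, $11 }'
}
```

Calling `pro_filter_hits SRR01234.pro.gz 80` would print the read ID, matched reference, percent identity, and e-value for every alignment at or above 80% identity.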
FASTA assemblies are compressed using MFCompress.
# Quick install (linux 64bit)
wget http://sweet.ua.pt/ap/software/mfcompress/MFCompress-linux64-1.01.tgz
tar -xvf MFCompress-linux64-1.01.tgz
cp MFC*/MFC* ./; rm -rf MFCompress-linux64-1.01
# Decompress (the binaries were copied into the current directory above)
./MFCompressD SRR01234.fa.mfc
All data in `s3://lovelywater2/` are released under the `cc0 v1.0` license, as defined in `s3://lovelywater2/LICENSE.md`.
- Coronavirus assemblies (11K, revision 1)
- RdRP micro-assembly contigs (from 5.7M library search)
- RdRp contigs, from assemblies
- MmonDV: Marmota monax Deltavirus
- OvirDV: Odocoileus virginianus Deltavirus
- PmacDV: Peropteryx macrotis Deltavirus
- TgutDV: Taeniopygia guttata Deltavirus
- BglaDV: Benthosema glaciale Deltavirus
- IchiDV: Indirana chiravasi Deltavirus
- Ribozyviria Assemblies (Delta/Zetaviruses)
PALMdb is a database of viral polymerase palmprint (barcode) sequences classified by (1) taxonomy and (2) species-like operational taxonomic units (OTUs) obtained by clustering at 90% sequence identity. PALMdb was created using the palmscan algorithm to mine public sequence databases and Serratus contigs. The 2021-03-14 update includes 250,799 novel Serratus palmprint sequences, representing 132,992 new OTUs.