SQL Schema - ababaian/serratus GitHub Wiki
Overview
Serratus
SQL Database is an ordered and accessible way to access the entire collection of Viral Summary Reports) in the project along with additional metadata.
There are two primary ways to access the SQL server:
Connection Details
- Endpoint:
serratus-aurora-20210406.cluster-ro-ccz9y6yshbls.us-east-1.rds.amazonaws.com
- Database:
summary
- Username:
public_reader
- Password:
serratus
SQL Schema - Nucleotide
The nucleotide (nt
) search was the first large SRA search using the Serratus architecture. Read more about it and explore results at https://serratus.io/explorer/nt.
nsra
nsra
is the first line of the summary file, where the data for the SRA, reference genome, and date is present. This table has a one-to-many relationship with the other tables, all linked by the SRA accession run_id
.
Columns:
run_id
: SRA run accession for the analysis represented in the summary fileread_length
: ?genome
: Name of the reference genome used for analysisdate
: Date the summary file was generated
nfamily
nfamily
corresponds to per-viral family pangenome. Reads from all virus accessions (nsequence
) are pooled to increase sensitivity (especially for divergent viruses). Detecting a virus of interest is best done initially on the family level, followed by a specific search within matching entries.
Columns:
run_id
: SRA run accession for the analysis represented in the summary filefamily_name
: Name of the family of the pan-genome that is being analyzedcoverage_bins
: Coverage cartoon generated, giving a picture of the quality of alignment throughout the specific sequencescore
: Score given for the quality of the alignmentpercent_identity
: Percent identity of the sequences aligned (wrt the reference genome)depth
: Depth of the sequencen_reads
: Number of aligned readsn_global_reads
: Number of global aligned readslength
: Pangenome lengthtop_genbank_id
: Top accessiontop_score
: Top accession scoretop_length
: Length of the top accessiontop_name
: Study name linking to the top accession
nsequence
nsequence
corresponds to lowest level of sequence-matches. Each reference sequence with mapped reads is an entry.
Columns:
run_id
: SRA run accession for the analysis represented in the summary filefamily_name
: Name of the family of the pan-genome that is being analyzedsequence_accession
: Name of the accessioncoverage_bins
: Coverage cartoon generated, giving a picture of the quality of alignment throughout the specific sequencescore
: Score given for the quality of the alignmentpercent_identity
: Percent identity of the sequences aligned (wrt the reference genome)depth
: Depth of the sequencen_reads
: Number of aligned readsn_global_reads
: Number of global aligned reads (excludes soft-clipped reads)length
: Length of accessionvirus_name
: Study name linking to the accession
SQL Schema - RdRp
The RdRp (rdrp
) search was another large SRA search using the Serratus architecture. Read more about it and explore results at https://serratus.io/explorer/rdrp.
rsra
This is the same as the nsra
table but for the RdRp search.
rphylum
This is analagous to the nfamily
table but for the RdRp search and on the phylum level.
rfamily
This is the same as the nfamily
table but for the RdRp search.
rsequence
This is the same as the nsequence
table but for the RdRp search.