SQL Schema - ababaian/serratus GitHub Wiki

Overview

Serratus SQL Database is an ordered and accessible way to access the entire collection of Viral Summary Reports) in the project along with additional metadata.

There are two primary ways to access the SQL server:

  1. Tantalus, the R-package companion to Serratus
  2. pgadmin4 or your favorite PostgreSQL program

Connection Details

  • Endpoint: serratus-aurora-20210406.cluster-ro-ccz9y6yshbls.us-east-1.rds.amazonaws.com
  • Database: summary
  • Username: public_reader
  • Password: serratus

SQL Schema - Nucleotide

The nucleotide (nt) search was the first large SRA search using the Serratus architecture. Read more about it and explore results at https://serratus.io/explorer/nt.

nsra

nsra is the first line of the summary file, where the data for the SRA, reference genome, and date is present. This table has a one-to-many relationship with the other tables, all linked by the SRA accession run_id.

Columns:

  • run_id: SRA run accession for the analysis represented in the summary file
  • read_length: ?
  • genome: Name of the reference genome used for analysis
  • date: Date the summary file was generated

nfamily

nfamily corresponds to per-viral family pangenome. Reads from all virus accessions (nsequence) are pooled to increase sensitivity (especially for divergent viruses). Detecting a virus of interest is best done initially on the family level, followed by a specific search within matching entries.

Columns:

  • run_id: SRA run accession for the analysis represented in the summary file
  • family_name: Name of the family of the pan-genome that is being analyzed
  • coverage_bins: Coverage cartoon generated, giving a picture of the quality of alignment throughout the specific sequence
  • score: Score given for the quality of the alignment
  • percent_identity: Percent identity of the sequences aligned (wrt the reference genome)
  • depth: Depth of the sequence
  • n_reads: Number of aligned reads
  • n_global_reads: Number of global aligned reads
  • length: Pangenome length
  • top_genbank_id: Top accession
  • top_score: Top accession score
  • top_length: Length of the top accession
  • top_name: Study name linking to the top accession

nsequence

nsequence corresponds to lowest level of sequence-matches. Each reference sequence with mapped reads is an entry.

Columns:

  • run_id: SRA run accession for the analysis represented in the summary file
  • family_name: Name of the family of the pan-genome that is being analyzed
  • sequence_accession: Name of the accession
  • coverage_bins: Coverage cartoon generated, giving a picture of the quality of alignment throughout the specific sequence
  • score: Score given for the quality of the alignment
  • percent_identity: Percent identity of the sequences aligned (wrt the reference genome)
  • depth: Depth of the sequence
  • n_reads: Number of aligned reads
  • n_global_reads: Number of global aligned reads (excludes soft-clipped reads)
  • length: Length of accession
  • virus_name: Study name linking to the accession

SQL Schema - RdRp

The RdRp (rdrp) search was another large SRA search using the Serratus architecture. Read more about it and explore results at https://serratus.io/explorer/rdrp.

rsra

This is the same as the nsra table but for the RdRp search.

rphylum

This is analagous to the nfamily table but for the RdRp search and on the phylum level.

rfamily

This is the same as the nfamily table but for the RdRp search.

rsequence

This is the same as the nsequence table but for the RdRp search.