SerratusSQL - ababaian/serratus GitHub Wiki

Overview

Serratus data is stored in a series of tables inside a relational database. It provides an accessible way to access the entire collection of Viral Summary Reports in the project along with additional metadata.

Access Credentials

  • Username: public_reader
  • Password: serratus
  • Endpoint: serratus-aurora-20210406.cluster-ro-ccz9y6yshbls.us-east-1.rds.amazonaws.com
  • Port: 5432 (default)
  • Database: summary

Client Applications

To access Serratus data, you could use either:

  1. psql, from the command line. A handy one-liner to connect to the db is:

    psql postgresql://public_reader:serratus@serratus-aurora-20210406.cluster-ro-ccz9y6yshbls.us-east-1.rds.amazonaws.com/summary

  2. Tantalus, the R-package companion to Serratus

  3. pgadmin4 or your favorite PostgreSQL GUI

Query Sets

Serratus is made up of several datasets which are described below, and referenced later in the table schemas.

  • "Pilot study" query set: reated by aligning a set of SRA sequence to a curated set of know (translated-)protein sequences for virii, (list here). All tables that start with "p..." are derived from the pilot study query set.
    Read more at ref_protref5.

  • NT query set: created by aligning ~3.8 million SRA sequences to a set of known complete genomes from all RefSeq vertebrate viruses (excluding retroviruses) and ~10k known Coronaviridae sequences. All tables that start with "n..." are derived from the NT query set.
    Read more at red_cov3ma.

  • RdRp query set: created by aligning ~5.7 million SRA sequences to a set of known RNA-dependent RNA polymerase sequences. All tables that start with "r..." are derived from the RdRp query set.
    Read more at ref_rdrp1.

  • Dark RdRp query set: an extension to the RdRp query set that also includes sequences that could be remote orthologs of RdRps or that most likely perform a similar function. All tables that start with "d..." are derived from the RdRp query set.
    Read more at ...
    (Artem: does this include dark rdrp data as well?)

Table Schemas

SerratusSQL encourages adherence to the SQL Style Guide where possible.

(Notes for maintainers)

  • Primary Data Data Tables: should represent external data-sources (.fa, .pro, .summary, ...) as closely to the source material as possible, and do not process, mix, add columns.

  • Secondary Data Materialized Views: should represent processed, filtered, and/or useful combinations of data tables. Useful is defined as a performing a slow/common operation and/or serving an API-endpoint which combines data from several `data tables.

For each table or view in SerratusSQL, include the following minimal set of information:

* Name and is it a data table or a materialized view?

* Summary of the contents and purpose of the table.

* Where is the origin of the data, and when was the data last updated?

* A list of table column names along, their data type (not rigorous [boolean, datetime, json, number, text]) and a brief explanation of what they are.

* Maintainer tasks (if applicable).


Data Tables (Alphabetical)

biosample biosample_geocode biosample_sex biosample_temperature biosample_tissue biosample_tissue_tags dfamily dphylum dsequence dsra nfamily nsequence nsra palm_acc palm_gb palm_graph palm_sra palm_sra2 palmdb palmdb2 pfamily protein psequence psra rfamily rphylum rsequence rsra sra_stat sra_stat_group srarun tax_lineage tax_names tax_nodes tismap_summary tissue_map

Materialized Views (Alphabetical)

analysis_index biosample_geo_coordinates nfamily_counts nfamily_list nsequence_counts nsequence_list palm_tax palm_virome rfamily_counts rfamily_list rphylum_counts rphylum_list rsequence_counts rsequence_list srarun_geo_coordinates

Deprecated Tables (Alphabetical, Scheduled for Removal)

biosample_tissue_OLD rdrp_pos srarun_v1


analysis_index (materialized view)

A boolean index of which sra run_id has been analyzed in the different major analyses of the Serratus project. This table is an API-endpoint on serratus.io to serve which buttons/data-links to display for a given input SRA run.

The materialized view is generated by the following SQL query:

SELECT run_id,
  (s.run_id IS NOT NULL) AS srarun,
  (nsra.run_id IS NOT NULL) AS nsra,
  (psra.run_id IS NOT NULL) AS psra,
  (rsra.run_id IS NOT NULL) AS rsra,
  (dsra.run_id IS NOT NULL) AS dsra,
  (sgc.run_id IS NOT NULL) AS geo,
  (upl.micro IS TRUE) AS micro,
  (upl.assembly IS TRUE) AS assembly,
  upl.assembly_file
FROM ((((((( SELECT srarun.run AS run_id FROM srarun) s
  FULL JOIN nsra USING (run_id))
  FULL JOIN rsra USING (run_id))
  FULL JOIN psra USING (run_id))
  FULL JOIN dsra USING (run_id))
  FULL JOIN srarun_geo_coordinates sgc USING (run_id))
  FULL JOIN assembly_index upl USING (run_id));
  • run_id (text): Run ID from NCBI's SRA
  • srarun (boolean): true if run_id is present on the srarun table
  • nsra (boolean): true if run_id is present on the nsra table
  • psra (boolean): true if run_id is present on the psra table
  • rsra (boolean): true if run_id is present on the rsra table
  • dsra (boolean): true if run_id is present on the dsra table
  • geo (boolean): true if run_id is present on the srarun_geo_coordinates table
  • micro (boolean): true if run_id is present on the assembly_index table
  • assembly (boolean): true if run_id is present on the srarun table
  • assembly_file (text): if there's an assembly, the filename where it is stored

assembly_index (data table)

NOT PUBLIC

(Alex: Document)

(Artem: Why is this not public? Should it be?)

biosample (data table)

Geographical information extracted from the metadata attributes found on the BioSample(s) corresponding to each SRA run.

BioSample attributes make up an extensive list. Harmonized names were used to extract the values present on this table, as it's described below. (Artem: Is this true?).

(Artem: Is this code somewhere?)

  • biosample_id (text): BioSample ID from NCBI's BioSample
  • run_id (text): Run ID from NCBI's SRA
  • geo_coord_all (jsonb): An object with all the attributes that reference a point on Earth
  • geo_text_all (jsonb): An object with all the attributes that reference a location by it's name
  • collection_date (text): Date on which the sample was collected, as reported on the collection_date harmonized attribute from the BioSample metadata.
  • geo_coordinate_x (number): Longitude component found on geo_coord_all (if present)
  • geo_coordinate_y (number): Latitude component found on geo_coord_all (if present)
  • geo_text_extracted (text): Location name derived from geo_text_all (if present)

biosample_geocode (data table)

Coordinates corresponding to locations names inferred on biosample.geo_text_extracted.

The coordinates were found by ... (Artem: ???)

  • geo_text_extracted (text): Set of values from biosample.geo_text_extracted
  • coordinate_x (number): Longitude component of location
  • coordinate_y (number): Latitude component of location

biosample_geo_coordinates (materialized view)

Geographic location data for all BioSample(s) whose runs are present on the Serratus project.

This materializedd view combines the results of:

* biosample: coordinates that were explicitly annotated * biosample_geocode: coordinates that were derived from a location name on biosample.geo_text_extracted

The materialized view is generated by the following SQL query:

SELECT b.biosample_id,
  bgeo.coordinate_x,
  bgeo.coordinate_y,
  b.geo_text_extracted AS from_text
FROM (biosample b
  JOIN biosample_geocode bgeo ON ((b.geo_text_extracted = bgeo.geo_text_extracted)))
WHERE (bgeo.coordinate_x IS NOT NULL)
UNION ALL
  SELECT biosample.biosample_id,
    biosample.geo_coordinate_x AS coordinate_x,
    biosample.geo_coordinate_y AS coordinate_y,
    NULL::text AS from_text
  FROM biosample
  WHERE (biosample.geo_coordinate_x IS NOT NULL);
  • biosample_id (text): BioSample ID from NCBI's BioSample
  • coordinate_x (number): Longitude component of location
  • coordinate_y (number): Latitude component of location
  • from_text (text): The location name the coordinates were inferred from, as derived on biosample.geo_text_extracted

biosample_sex (data table)

Sex information extracted from the metadata attributes found on the BioSample(s) corresponding to each SRA run.

BioSample attributes make up an extensive list. Harmonized names were used to extract the values present on this table, as it's described below. The attributes considered are host_sex, sex. (Artem: Is this true?)

  • biosample_id (text): BioSample ID from NCBI's BioSample
  • run_id (text): Run ID from NCBI's SRA
  • male (boolean): true if any sexual attribute contain the term male
  • female (boolean): true if any sexual attribute contain the term female
  • other (boolean): true if any sexual attribute contain informative sexual terms which are not male or female. Examples: hermaphrodite, intersex, neuter.

biosample_temperature (data table)

Temperature information extracted from the metadata attributes found on the BioSample(s) corresponding to each SRA run.

BioSample attributes make up an extensive list. Harmonized names were used to extract the values present on this table, as it's described below. The attributes considered are ... (Artem: Where is this code? I'll get the attributes from there).

  • biosample_id (text): BioSample ID from NCBI's BioSample
  • run_id (text): Run ID from NCBI's SRA
  • temperature (number): Value of temperature in °C

biosample_tissue (data table)

Tissue information extracted from the metadata attributes found on the BioSample(s) corresponding to each SRA run.

BioSample attributes make up an extensive list. Harmonized names were used to extract the values present on this table, as it's described below. Many attributes were considered for this, to get a list of them run SELECT DISTINCT(source) FROM biosample_tissue;.

  • biosample_id (text): BioSample ID from NCBI's BioSample
  • srs_id (text): SRS (sequence entry) ID corresponding to the BioSample ID
  • source (text): BioSample attribute from where the tissue information was extracted from
  • text (text): Text of the aforementioned attribute
  • tissue (text): Tissue term that was extracted from the attribute's text
  • bto_id (text): Brenda Tissue Ontology ID of the corresponding tissue term

biosample_tissue_OLD (data table)

DEPRECATED: An early parsing of tissue metadata terms from BioSample and BTO. Use biosample_tissue instead.

  • biosample_id (text): BioSample ID from NCBI's BioSample
  • srs_id (text): SRS (sequence entry) ID corresponding to the BioSample ID.
  • tissue (text): Tissue term that was extracted from the attribute's text
  • bto_id (text): Brenda Tissue Ontology ID of the

biosample_tissue_tags (data table)

DEPRECATED? (Artem, Declan)

  • biosample_id (text): BioSample ID from NCBI's BioSample
  • biosample_tags (text): List of tissue-like(?) terms extracted from BioSample attributes
  • run (text): ???

dfamily (data table)

description

  • run_id (text): Run ID from NCBI's SRA
  • phylum_name (text): ...
  • family_name (text): ...
  • family_group (text): ...
  • coverage_bins (text): ...
  • score (number): ...
  • percent_identity (number): ...
  • depth (number): ...
  • n_reads (number): ...
  • aligned_length (number): ...

dphylum (data table)

description

  • run_id (text): Run ID from NCBI's SRA
  • phylum_name (text): ...
  • coverage_bins (text): ...
  • score (number): ...
  • percent_identity (number): ...
  • depth (number): ...
  • n_reads (number): ...
  • aligned_length (number): ...

dsequence (data table)

description

  • run_id (text): Run ID from NCBI's SRA
  • phylum_name (text): ...
  • family_name (text): ...
  • family_group (text): ...
  • virus_name (text): ...
  • sequence_accession (text): ...
  • coverage_bins (text): ...
  • score (number): ...
  • percent_identity (number): ...
  • depth (number): ...
  • n_reads (number): ...
  • aligned_length (number): ...

dsra (data table)

Summary files corresponding to the Dark RdRp query set. Columns/rows here were copied verbatim from the first lines of all summary files.

  • run_id (text): Run ID from NCBI's SRA
  • read_length (number): Average read length of all reads in the current SRA entry
  • genome (text): Name of the pangenome used for analysis (see Query Sets)
  • aligned_reads (number): Number of reads that were found alignmed to the query set
  • date (text): Date when the summary file was created
  • truncated (text): Whether or not the summary file is truncated

nfamily (data table)

description

  • run_id (text): Run ID from NCBI's SRA
  • family_name (text): ...
  • coverage_bins (text): ...
  • score (number): ...
  • percent_identity (number): ...
  • depth (number): ...
  • n_reads (number): ...
  • n_global_reads (number): ...
  • length (number): ...
  • top_genbank_id (text): ...
  • top_score (number): ...
  • top_length (number): ...
  • top_name (text): ...

nfamily_counts (materialized view)

description

The materialized view is generated by the following SQL query:

SELECT nfamily.family_name,
  nfamily.score,
  nfamily.percent_identity,
  count(*) AS count
FROM nfamily
GROUP BY nfamily.family_name, nfamily.score, nfamily.percent_identity;
  • family_name (text): ...
  • score (number): ...
  • percent_identity (number): ...
  • count (number): ...

nfamily_list (materialized view)

description

The materialized view is generated by the following SQL query:

SELECT nfamily.family_name
FROM nfamily
GROUP BY nfamily.family_name
ORDER BY nfamily.family_name;
  • family_name (text): ...

nsequence (data table)

description

  • run_id (text): Run ID from NCBI's SRA
  • family_name (text): ...
  • sequence_accession (text): ...
  • coverage_bins (text): ...
  • score (number): ...
  • percent_identity (number): ...
  • depth (number): ...
  • n_reads (number): ...
  • n_global_reads (number): ...
  • length (number): ...
  • virus_name (text): ...

nsequence_counts (materialized view)

description

The materialized view is generated by the following SQL query:

SELECT nsequence.sequence_accession,
  nsequence.score,
  nsequence.percent_identity,
  count(*) AS count
FROM nsequence
GROUP BY nsequence.sequence_accession, nsequence.score, nsequence.percent_identity;
  • sequence_accession (text): ...
  • score (number): ...
  • percent_identity (number): ...
  • count (number): ...

nsequence_list (materialized view)

description

The materialized view is generated by the following SQL query:

SELECT nsequence.sequence_accession,
  nsequence.virus_name
FROM nsequence
GROUP BY nsequence.sequence_accession, nsequence.virus_name
ORDER BY nsequence.sequence_accession;
  • sequence_accession (text): ...
  • virus_name (text): ...

nsra (data table)

Summary files corresponding to the NT query set. Columns/rows here were copied verbatim from the first lines of all summary files.

  • run_id (text): Run ID from NCBI's SRA
  • read_length (number): Average read length of all reads in the current SRA entry
  • genome (text): Name of the pangenome used for analysis (see Query Sets)
  • version (text): ...
  • date (text): Date when the summary file was created

palm_acc (data table)

description

  • row_id (number): ...
  • accession (text): ...
  • database (text): ...
  • palm_id (text): ...

palm_gb (data table)

description

  • row_id (number): ...
  • palm_id (text): ...
  • gb_acc (text): ...
  • percent_identity (number): ...
  • pp_cov (number): ...
  • evalue (number): ...
  • tax_id (number): ...

palm_graph (data table)

description

  • row_index (number): ...
  • palm_id1 (text): ...
  • palm_id2 (text): ...
  • pident (number): ...

palm_sra (data table)

description

  • row_id (number): ...
  • run_id (text): Run ID from NCBI's SRA
  • assembly_node (number): ...
  • coverage (number): ...
  • q_start (number): ...
  • q_end (number): ...
  • q_len (number): ...
  • q_strand (text): ...
  • palm_id (text): ...
  • pp_start (number): ...
  • pp_end (number): ...
  • pp_len (number): ...
  • percent_identity (number): ...
  • evalue (number): ...
  • cigar (text): ...
  • q_sequence (text): ...
  • qc_pass (boolean): ...

palm_sra2 (data table)

A version of palm_sra meant for development. Use palm_sra instead.

  • row_id (number): ...
  • run_id (text): Run ID from NCBI's SRA
  • assembly_node (number): ...
  • coverage (number): ...
  • q_start (number): ...
  • q_end (number): ...
  • q_len (number): ...
  • q_strand (text): ...
  • palm_id (text): ...
  • sotu (text): ...
  • pp_start (number): ...
  • pp_end (number): ...
  • pp_len (number): ...
  • percent_identity (number): ...
  • evalue (number): ...
  • qc_pass (boolean): ...
  • cigar (text): ...
  • q_sequence (text): ...

palm_tax (materialized view)

palm_tax is a materialized view combining the data-tables palmdb2 (palmdb sequences), with palm_gb (palmdb2 sequences aligned with DIAMOND against GenBank - July '23). The top-scoring hit in GenBank for each palmdb2 sequence is retained. For each palmDB sequenced aligned to GenBank, the ncbi_taxonomy for that GenBank sequence was extracted from the tax_lineage table.

palmDB to GenBank Alignment Command:

diamond blastp \
  -q unique.palmprint.faa \
  -d ~/nr/nr.dmnd \
  --masking seg --unal 1 \
  -p 14 -k 1 \
  -f 6 qseqid  qstart qend qlen \
       sseqid  sstart send slen \
       pident evalue \
       sscinames staxids sphylums  \
  > palmdb_to_gb.pro

The materialized view is generated by the following SQL query:

SELECT gb.palm_id,
  pp.sotu,
  pp.nickname,
  pp.centroid,
  gb.gb_acc,
  gb.percent_identity,
  gb.pp_cov,
  gb.evalue,
  tax.tax_id,
  tax.tax_kingdom,
  tax.tax_phylum,
  tax.tax_order,
  tax.tax_family,
  tax.tax_genus,
  tax.tax_species
FROM ((palm_gb gb
  LEFT JOIN palmdb2 pp ON ((gb.palm_id = pp.palm_id)))
  LEFT JOIN tax_lineage tax ON ((gb.tax_id = tax.tax_id)));
  • palm_id (text): Unique ID for every RdRp identified within palmdb database to date
  • sotu (text): species Operational Taxonomic Unit. sOTUs are the centroid sequence as a result of clustering palmprints (palm_id) at a threshold of 90% amino acid identity, chosen to approximate taxonomic species within the Serratus platform
  • nickname (text): Unique text identifier for each sotu
  • centroid (boolean): Is this specific palm_id a centroid in palmdb?
  • gb_acc (text): Top-scoring GenBank accession to which this palm_id aligned (DIAMOND2). NULL if no-alignment found.
  • percent_identity (number): Percent identity of gb alignment
  • pp_cov (number): Percent query-coverage of the palmprint sequence against gb alignment
  • evalue (number): E-value for gb alignment
  • tax_id (number): NCBI taxonomic ID of genbank sequence
  • tax_kingdom (text): NCBI taxonmy rank of genbank sequence
  • tax_phylum (text): NCBI taxonmy rank of genbank sequence
  • tax_order (text): NCBI taxonmy rank of genbank sequence
  • tax_family (text): NCBI taxonmy rank of genbank sequence
  • tax_genus (text): NCBI taxonmy rank of genbank sequence
  • tax_species (text): NCBI taxonmy rank of genbank sequence

palm_virome (materialized view)

description

The materialized view is generated by the following SQL query:

SELECT palm.run_id AS run,
  meta.scientific_name,
  meta.bio_sample,
  meta.bio_project,
  palm.palm_id,
  palm.sotu,
  tax.nickname,
  tax.gb_acc,
  tax.percent_identity AS gb_pid,
  tax.evalue AS gb_eval,
  tax.tax_species,
  tax.tax_family,
  palm.assembly_node AS node,
  palm.coverage AS node_coverage,
  palm.percent_identity AS node_pid,
  palm.evalue AS node_eval,
  palm.qc_pass AS node_qc,
  palm.q_sequence AS node_seq
FROM ((palm_sra2 palm
  JOIN srarun meta ON (((palm.qc_pass = true) AND (palm.run_id = meta.run))))
  LEFT JOIN palm_tax tax ON (((tax.sotu = palm.sotu) AND (tax.centroid = true))))
ORDER BY palm.run_id;
  • run (text): ...
  • scientific_name (text): ...
  • bio_sample (text): ...
  • bio_project (text): ...
  • palm_id (text): ...
  • sotu (text): ...
  • nickname (text): ...
  • gb_acc (text): ...
  • gb_pid (number): ...
  • gb_eval (number): ...
  • tax_species (text): ...
  • tax_family (text): ...
  • node (number): ...
  • node_coverage (number): ...
  • node_pid (number): ...
  • node_eval (number): ...
  • node_qc (boolean): ...
  • node_seq (text): ...

palmdb (data table)

description

  • palm_id (text): ...
  • sotu (text): ...
  • centroid (boolean): ...
  • palmprint (text): ...
  • tax_phylum (text): ...
  • tax_class (text): ...
  • tax_order (text): ...
  • tax_family (text): ...
  • tax_genus (text): ...
  • tax_species (text): ...
  • nickname (text): ...

palmdb2 (data table)

description

  • palm_id (text): ...
  • sotu (text): ...
  • nickname (text): ...
  • percent_identity (number): ...
  • centroid (boolean): ...
  • palmprint (text): ...

pfamily (data table)

description

  • run_id (text): Run ID from NCBI's SRA
  • family_name (text): ...
  • coverage_bins (text): ...
  • score (number): ...
  • percent_identity (number): ...
  • n_reads (number): ...
  • aligned_length (number): ...

protein (data table)

description

  • run_id (text): Run ID from NCBI's SRA
  • family_name (text): ...
  • protein_name (text): ...
  • coverage_bins (text): ...
  • score (number): ...
  • percent_identity (number): ...
  • n_reads (number): ...
  • aligned_length (number): ...

psequence (data table)

description

  • run_id (text): Run ID from NCBI's SRA
  • family_name (text): ...
  • protein_name (text): ...
  • genbank_id (text): ...
  • coverage_bins (text): ...
  • score (number): ...
  • percent_identity (number): ...
  • n_reads (number): ...
  • aligned_length (number): ...

psra (data table)

Summary files corresponding to the pilot study query set. Columns/rows here were copied verbatim from the first lines of all summary files.

  • run_id (text): Run ID from NCBI's SRA
  • read_length (number): Average read length of all reads in the current SRA entry
  • genome (text): Name of the pangenome used for analysis (see Query Sets)
  • aligned_reads (number): Number of reads that were found alignmed to the query set
  • date (text): Date when the summary file was created

rdrp_pos (data table)

DEPRECATED

rfamily (data table)

description

  • run_id (text): Run ID from NCBI's SRA
  • phylum_name (text): ...
  • family_name (text): ...
  • family_group (text): ...
  • coverage_bins (text): ...
  • score (number): ...
  • percent_identity (number): ...
  • depth (number): ...
  • n_reads (number): ...
  • aligned_length (number): ...

rfamily_counts (materialized view)

description

The materialized view is generated by the following SQL query:

SELECT rfamily.family_name,
  rfamily.score,
  rfamily.percent_identity,
  count(*) AS count
FROM rfamily
GROUP BY rfamily.family_name, rfamily.score, rfamily.percent_identity;
  • family_name (text): ...
  • score (number): ...
  • percent_identity (number): ...
  • count (number): ...

rfamily_list (materialized view)

description

The materialized view is generated by the following SQL query:

SELECT rfamily.family_name
FROM rfamily
GROUP BY rfamily.family_name
ORDER BY rfamily.family_name;
  • family_name (text): ...

rphylum (data table)

description

  • run_id (text): Run ID from NCBI's SRA
  • phylum_name (text): ...
  • coverage_bins (text): ...
  • score (number): ...
  • percent_identity (number): ...
  • depth (number): ...
  • n_reads (number): ...
  • aligned_length (number): ...

rphylum_counts (materialized view)

description

The materialized view is generated by the following SQL query:

SELECT rphylum.phylum_name,
  rphylum.score,
  rphylum.percent_identity,
  count(*) AS count
FROM rphylum
GROUP BY rphylum.phylum_name, rphylum.score, rphylum.percent_identity;
  • phylum_name (text): ...
  • score (number): ...
  • percent_identity (number): ...
  • count (number): ...

rphylum_list (materialized view)

description

The materialized view is generated by the following SQL query:

SELECT rphylum.phylum_name
FROM rphylum
GROUP BY rphylum.phylum_name
ORDER BY rphylum.phylum_name;
  • phylum_name (text): ...

rsequence (data table)

description

  • run_id (text): Run ID from NCBI's SRA
  • phylum_name (text): ...
  • family_name (text): ...
  • family_group (text): ...
  • virus_name (text): ...
  • sequence_accession (text): ...
  • coverage_bins (text): ...
  • score (number): ...
  • percent_identity (number): ...
  • depth (number): ...
  • n_reads (number): ...
  • aligned_length (number): ...

rsequence_counts (materialized view)

description

The materialized view is generated by the following SQL query:

SELECT rsequence.sequence_accession,
  rsequence.score,
  rsequence.percent_identity,
  count(*) AS count
FROM rsequence
GROUP BY rsequence.sequence_accession, rsequence.score, rsequence.percent_identity;
  • sequence_accession (text): ...
  • score (number): ...
  • percent_identity (number): ...
  • count (number): ...

rsequence_list (materialized view)

description

The materialized view is generated by the following SQL query:

SELECT rsequence.sequence_accession,
  rsequence.virus_name
FROM rsequence
GROUP BY rsequence.sequence_accession, rsequence.virus_name
ORDER BY rsequence.sequence_accession;
  • sequence_accession (text): ...
  • virus_name (text): ...

rsra (data table)

Summary files corresponding to the RdRp query set. Columns/rows here were copied verbatim from the first lines of all summary files.

  • run_id (text): Run ID from NCBI's SRA
  • read_length (number): Average read length of all reads in the current SRA entry
  • genome (text): Name of the pangenome used for analysis (see Query Sets)
  • aligned_reads (number): Number of reads that were found alignmed to the query set
  • date (text): Date when the summary file was created
  • truncated (text): Whether or not the summary file is truncated

sra_stat (data table)

description

  • row_id (number): ...
  • run (text): ...
  • taxid (number): ...
  • rank (text): ...
  • name (text): ...
  • kmer (number): ...
  • total_kmers (number): ...
  • kmer_perc (number): ...
  • tax_label (text): ...

sra_stat_group (data table)

description

  • taxid (number): ...
  • tax_rank (text): ...
  • tax_name (text): ...
  • tax_label (text): ...

srarun (data table)

description (child table of srarun_v1)

  • run (text): ...
  • release_date (datetime): ...
  • load_date (datetime): ...
  • spots (number): ...
  • bases (number): ...
  • spots_with_mates (number): ...
  • avg_length (number): ...
  • size_mb (number): ...
  • assembly_name (text): ...
  • download_path (text): ...
  • experiment (text): ...
  • library_name (text): ...
  • library_strategy (text): ...
  • library_selection (text): ...
  • library_source (text): ...
  • library_layout (text): ...
  • insert_size (number): ...
  • insert_dev (number): ...
  • platform (text): ...
  • model (text): ...
  • sra_study (text): ...
  • bio_project (text): ...
  • study_pubmed_id (text): ...
  • project_id (text): ...
  • sample (text): ...
  • bio_sample (text): ...
  • sample_type (text): ...
  • tax_id (text): ...
  • scientific_name (text): ...
  • sample_name (text): ...
  • g1k_pop_code (text): ...
  • source (text): ...
  • g1k_analysis_group (text): ...
  • subject_id (text): ...
  • sex (text): ...
  • disease (text): ...
  • tumor (text): ...
  • affection_status (text): ...
  • analyte_type (text): ...
  • histological_type (text): ...
  • body_site (text): ...
  • center_name (text): ...
  • submission (text): ...
  • dbgap_study_accession (text): ...
  • consent (text): ...
  • run_hash (text): ...
  • read_hash (text): ...

srarun_geo_coordinates (materialized view)

description

The materialized view is generated by the following SQL query:

SELECT srarun.run AS run_id,
  srarun.bio_sample AS biosample_id,
  srarun.release_date,
  bgeo.coordinate_x,
  bgeo.coordinate_y,
  bgeo.from_text
FROM (srarun
  JOIN biosample_geo_coordinates bgeo ON ((srarun.bio_sample = bgeo.biosample_id)));
  • run_id (text): Run ID from NCBI's SRA
  • biosample_id (text): BioSample ID from NCBI's BioSample
  • release_date (datetime): ...
  • coordinate_x (number): ...
  • coordinate_y (number): ...
  • from_text (text): ...

srarun_v1 (data table)

DEPRECATED: Use srarun instead.

  • run (text): ...
  • release_date (datetime): ...
  • load_date (datetime): ...
  • spots (number): ...
  • bases (number): ...
  • spots_with_mates (number): ...
  • avg_length (number): ...
  • size_mb (number): ...
  • assembly_name (text): ...
  • download_path (text): ...
  • experiment (text): ...
  • library_name (text): ...
  • library_strategy (text): ...
  • library_selection (text): ...
  • library_source (text): ...
  • library_layout (text): ...
  • insert_size (number): ...
  • insert_dev (number): ...
  • platform (text): ...
  • model (text): ...
  • sra_study (text): ...
  • bio_project (text): ...
  • study_pubmed_id (text): ...
  • project_id (text): ...
  • sample (text): ...
  • bio_sample (text): ...
  • sample_type (text): ...
  • tax_id (text): ...
  • scientific_name (text): ...
  • sample_name (text): ...
  • g1k_pop_code (text): ...
  • source (text): ...
  • g1k_analysis_group (text): ...
  • subject_id (text): ...
  • sex (text): ...
  • disease (text): ...
  • tumor (text): ...
  • affection_status (text): ...
  • analyte_type (text): ...
  • histological_type (text): ...
  • body_site (text): ...
  • center_name (text): ...
  • submission (text): ...
  • dbgap_study_accession (text): ...
  • consent (text): ...
  • run_hash (text): ...
  • read_hash (text): ...

tax_lineage (data table)

description

  • row_id (text): ...
  • tax_id (number): ...
  • tax_kingdom (text): ...
  • tax_phylum (text): ...
  • tax_order (text): ...
  • tax_family (text): ...
  • tax_genus (text): ...
  • tax_species (text): ...

tax_names (data table)

description

  • row_id (text): ...
  • tax_id (number): ...
  • name_txt (text): ...
  • unique_name (text): ...
  • name_class (text): ...

tax_nodes (data table)

description

  • row_id (text): ...
  • tax_id (number): ...
  • parent_tax_id (number): ...
  • rank (text): ...
  • embl_code (text): ...
  • division_id (number): ...
  • inherited_div (boolean): ...
  • genetic_code_id (number): ...
  • inherted_gc (boolean): ...
  • mitochondrial_genetic_code_id (number): ...
  • inherited_mgc (boolean): ...
  • gb_hidden (boolean): ...
  • hidden_subtree (boolean): ...
  • comments (text): ...

tismap_summary (data table)

(Declan: please annotate)

description

  • bio_project (text): ...
  • run (text): ...
  • bio_sample (text): ...
  • biosample_tags (text): ...
  • scientific_name (text): ...
  • percent_identity (number): ...
  • coverage (number): ...
  • palm_id (text): ...
  • sotu (text): ...
  • centroid (boolean): ...
  • palmprint (text): ...
  • tax_phylum (text): ...
  • tax_class (text): ...
  • tax_order (text): ...
  • tax_family (text): ...
  • tax_genus (text): ...
  • tax_species (text): ...
  • nickname (text): ...
  • tissue (text): ...

tissue_map (data table)

(Declan: please annotate, you could just provide a list of all the features and just mention they boolean columns, but add a meaningful description and caveats, if any)

description

  • biosample_id (text): BioSample ID from NCBI's BioSample
  • derived_cell_line (boolean): ...
  • cancer (boolean): ...
  • abnormal_other (boolean): ...
  • stool (boolean): ...
  • whole_body (boolean): ...
  • embryonic_structure (boolean): ...
  • pluripotent_stem_cell (boolean): ...
  • nervous_system (boolean): ...
  • brain (boolean): ...
  • stem (boolean): ...
  • glia (boolean): ...
  • forebrain (boolean): ...
  • cardiovascular_system (boolean): ...
  • heart (boolean): ...
  • atrium (boolean): ...
  • heart_ventricle (boolean): ...
  • vascular_system (boolean): ...
  • artery (boolean): ...
  • aorta (boolean): ...
  • vein (boolean): ...
  • connective_tissue (boolean): ...
  • adipose_tissue (boolean): ...
  • fibroblast (boolean): ...
  • gland (boolean): ...
  • endocrine_gland (boolean): ...
  • thyroid_gland (boolean): ...
  • head (boolean): ...
  • mouth (boolean): ...
  • teeth (boolean): ...
  • hematopoietic_system (boolean): ...
  • blood (boolean): ...
  • hematopoietic_cell (boolean): ...
  • erythrocyte (boolean): ...
  • leukocyte (boolean): ...
  • monocyte (boolean): ...
  • macrophage (boolean): ...
  • lymphocyte (boolean): ...
  • t_cell (boolean): ...
  • b_cell (boolean): ...
  • natural_killer_cell (boolean): ...
  • granulocyte (boolean): ...
  • neutrophil (boolean): ...
  • basophil (boolean): ...
  • eosinophil (boolean): ...
  • plasma (boolean): ...
  • platelet (boolean): ...
  • integument (boolean): ...
  • mucosa (boolean): ...
  • skin (boolean): ...
  • dermis (boolean): ...
  • epidermis (boolean): ...
  • feather (boolean): ...
  • s_ulcer_tissue (boolean): ...
  • skin_cancer (boolean): ...
  • epithelium (boolean): ...
  • endothelium (boolean): ...
  • epithelioma_cell (boolean): ...
  • adenoma (boolean): ...
  • carcinoma (boolean): ...
  • limb (boolean): ...
  • forelimb (boolean): ...
  • arm (boolean): ...
  • leg (boolean): ...
  • muscular_system (boolean): ...
  • muscle (boolean): ...
  • muscle_fibre (boolean): ...
  • myocyte (boolean): ...
  • skeletal_muscle (boolean): ...
  • respiratory_system (boolean): ...
  • lung (boolean): ...
  • skeletal_system (boolean): ...
  • bone (boolean): ...
  • bone_marrow (boolean): ...
  • trunk (boolean): ...
  • thorax (boolean): ...
  • breast (boolean): ...
  • urogenital_system (boolean): ...
  • reproductive_system (boolean): ...
  • female_repro (boolean): ...
  • ovary (boolean): ...
  • oviduct (boolean): ...
  • vagina (boolean): ...
  • uterus (boolean): ...
  • cervix (boolean): ...
  • male_repro (boolean): ...
  • prostate (boolean): ...
  • viscus (boolean): ...
  • alimentary_canal (boolean): ...
  • anus (boolean): ...
  • esophagus (boolean): ...
  • gastrointestinal_tract (boolean): ...
  • intestine (boolean): ...
  • small_intestine (boolean): ...
  • large_intestine (boolean): ...
  • stomach (boolean): ...
  • gut (boolean): ...
  • foregut (boolean): ...
  • midgut (boolean): ...
  • hindgut (boolean): ...
  • bladder (boolean): ...
  • gall_bladder (boolean): ...
  • urinary_bladder (boolean): ...
  • kidney (boolean): ...
  • liver (boolean): ...
  • pancreas (boolean): ...
  • immune_system (boolean): ...
  • lympoid_tissue (boolean): ...
  • spleen (boolean): ...
  • whole_plant (boolean): ...
  • fruit (boolean): ...
  • shoot (boolean): ...
  • leaf (boolean): ...
  • roots (boolean): ...
  • fungus (boolean): ...
  • fruit_body (boolean): ...