SerratusSQL - ababaian/serratus GitHub Wiki
Overview
Serratus data is stored in a series of tables inside a relational database. It provides an accessible way to access the entire collection of Viral Summary Reports in the project along with additional metadata.
Access Credentials
- Username:
public_reader
- Password:
serratus
- Endpoint:
serratus-aurora-20210406.cluster-ro-ccz9y6yshbls.us-east-1.rds.amazonaws.com
- Port:
5432
(default) - Database:
summary
Client Applications
To access Serratus data, you could use either:
-
psql
, from the command line. A handy one-liner to connect to the db is:psql postgresql://public_reader:serratus@serratus-aurora-20210406.cluster-ro-ccz9y6yshbls.us-east-1.rds.amazonaws.com/summary
-
Tantalus
, the R-package companion toSerratus
-
pgadmin4
or your favorite PostgreSQL GUI
Query Sets
Serratus is made up of several datasets which are described below, and referenced later in the table schemas.
-
"Pilot study" query set: reated by aligning a set of SRA sequence to a curated set of know (translated-)protein sequences for virii, (list here). All tables that start with "p..." are derived from the pilot study query set.
Read more at ref_protref5. -
NT query set: created by aligning ~3.8 million SRA sequences to a set of known complete genomes from all RefSeq vertebrate viruses (excluding retroviruses) and ~10k known Coronaviridae sequences. All tables that start with "n..." are derived from the NT query set.
Read more at red_cov3ma. -
RdRp query set: created by aligning ~5.7 million SRA sequences to a set of known RNA-dependent RNA polymerase sequences. All tables that start with "r..." are derived from the RdRp query set.
Read more at ref_rdrp1. -
Dark RdRp query set: an extension to the RdRp query set that also includes sequences that could be remote orthologs of RdRps or that most likely perform a similar function. All tables that start with "d..." are derived from the RdRp query set.
Read more at ...
(Artem: does this include dark rdrp data as well?)
Table Schemas
SerratusSQL encourages adherence to the SQL Style Guide where possible.
(Notes for maintainers)
-
Primary Data
Data Tables
: should represent external data-sources (.fa
,.pro
,.summary
, ...) as closely to the source material as possible, and do not process, mix, add columns. -
Secondary Data
Materialized Views
: should represent processed, filtered, and/or useful combinations ofdata tables
. Useful is defined as a performing a slow/common operation and/or serving an API-endpoint which combines data from several `data tables.
For each table or view in SerratusSQL, include the following minimal set of information:
* Name and is it a data table
or a materialized view
?
* Summary of the contents and purpose of the table.
* Where is the origin of the data, and when was the data last updated?
* A list of table column names
along, their data type
(not rigorous [boolean, datetime, json, number, text]) and a brief explanation
of what they are.
* Maintainer tasks (if applicable).
Data Tables (Alphabetical)
biosample
biosample_geocode
biosample_sex
biosample_temperature
biosample_tissue
biosample_tissue_tags
dfamily
dphylum
dsequence
dsra
nfamily
nsequence
nsra
palm_acc
palm_gb
palm_graph
palm_sra
palm_sra2
palmdb
palmdb2
pfamily
protein
psequence
psra
rfamily
rphylum
rsequence
rsra
sra_stat
sra_stat_group
srarun
tax_lineage
tax_names
tax_nodes
tismap_summary
tissue_map
Materialized Views (Alphabetical)
analysis_index
biosample_geo_coordinates
nfamily_counts
nfamily_list
nsequence_counts
nsequence_list
palm_tax
palm_virome
rfamily_counts
rfamily_list
rphylum_counts
rphylum_list
rsequence_counts
rsequence_list
srarun_geo_coordinates
Deprecated Tables (Alphabetical, Scheduled for Removal)
biosample_tissue_OLD
rdrp_pos
srarun_v1
analysis_index (materialized view)
A boolean index of which sra run_id
has been analyzed in the different major analyses of the Serratus
project. This table is an API-endpoint on serratus.io to serve which buttons/data-links to display for a given input SRA run.
The materialized view is generated by the following SQL query:
SELECT run_id,
(s.run_id IS NOT NULL) AS srarun,
(nsra.run_id IS NOT NULL) AS nsra,
(psra.run_id IS NOT NULL) AS psra,
(rsra.run_id IS NOT NULL) AS rsra,
(dsra.run_id IS NOT NULL) AS dsra,
(sgc.run_id IS NOT NULL) AS geo,
(upl.micro IS TRUE) AS micro,
(upl.assembly IS TRUE) AS assembly,
upl.assembly_file
FROM ((((((( SELECT srarun.run AS run_id FROM srarun) s
FULL JOIN nsra USING (run_id))
FULL JOIN rsra USING (run_id))
FULL JOIN psra USING (run_id))
FULL JOIN dsra USING (run_id))
FULL JOIN srarun_geo_coordinates sgc USING (run_id))
FULL JOIN assembly_index upl USING (run_id));
run_id (text)
: Run ID from NCBI's SRAsrarun (boolean)
: true ifrun_id
is present on thesrarun
tablensra (boolean)
: true ifrun_id
is present on thensra
tablepsra (boolean)
: true ifrun_id
is present on thepsra
tablersra (boolean)
: true ifrun_id
is present on thersra
tabledsra (boolean)
: true ifrun_id
is present on thedsra
tablegeo (boolean)
: true ifrun_id
is present on thesrarun_geo_coordinates
tablemicro (boolean)
: true ifrun_id
is present on theassembly_index
tableassembly (boolean)
: true ifrun_id
is present on thesrarun
tableassembly_file (text)
: if there's anassembly
, the filename where it is stored
assembly_index (data table)
NOT PUBLIC
(Alex: Document)
(Artem: Why is this not public? Should it be?)
biosample (data table)
Geographical information extracted from the metadata attributes found on the BioSample(s) corresponding to each SRA run.
BioSample attributes make up an extensive list. Harmonized names were used to extract the values present on this table, as it's described below. (Artem: Is this true?).
(Artem: Is this code somewhere?)
biosample_id (text)
: BioSample ID from NCBI's BioSamplerun_id (text)
: Run ID from NCBI's SRAgeo_coord_all (jsonb)
: An object with all the attributes that reference a point on Earthgeo_text_all (jsonb)
: An object with all the attributes that reference a location by it's namecollection_date (text)
: Date on which the sample was collected, as reported on thecollection_date
harmonized attribute from the BioSample metadata.geo_coordinate_x (number)
: Longitude component found ongeo_coord_all
(if present)geo_coordinate_y (number)
: Latitude component found ongeo_coord_all
(if present)geo_text_extracted (text)
: Location name derived fromgeo_text_all
(if present)
biosample_geocode (data table)
Coordinates corresponding to locations names inferred on biosample.geo_text_extracted
.
The coordinates were found by ... (Artem: ???)
geo_text_extracted (text)
: Set of values frombiosample.geo_text_extracted
coordinate_x (number)
: Longitude component of locationcoordinate_y (number)
: Latitude component of location
biosample_geo_coordinates (materialized view)
Geographic location data for all BioSample(s) whose runs are present on the Serratus
project.
This materializedd view combines the results of:
* biosample
: coordinates that were explicitly annotated
* biosample_geocode
: coordinates that were derived from a location name on biosample.geo_text_extracted
The materialized view is generated by the following SQL query:
SELECT b.biosample_id,
bgeo.coordinate_x,
bgeo.coordinate_y,
b.geo_text_extracted AS from_text
FROM (biosample b
JOIN biosample_geocode bgeo ON ((b.geo_text_extracted = bgeo.geo_text_extracted)))
WHERE (bgeo.coordinate_x IS NOT NULL)
UNION ALL
SELECT biosample.biosample_id,
biosample.geo_coordinate_x AS coordinate_x,
biosample.geo_coordinate_y AS coordinate_y,
NULL::text AS from_text
FROM biosample
WHERE (biosample.geo_coordinate_x IS NOT NULL);
biosample_id (text)
: BioSample ID from NCBI's BioSamplecoordinate_x (number)
: Longitude component of locationcoordinate_y (number)
: Latitude component of locationfrom_text (text)
: The location name the coordinates were inferred from, as derived onbiosample.geo_text_extracted
biosample_sex (data table)
Sex information extracted from the metadata attributes found on the BioSample(s) corresponding to each SRA run.
BioSample attributes make up an extensive list. Harmonized names were used to extract the values present on this table, as it's described below. The attributes considered are host_sex
, sex
. (Artem: Is this true?)
biosample_id (text)
: BioSample ID from NCBI's BioSamplerun_id (text)
: Run ID from NCBI's SRAmale (boolean)
: true if any sexual attribute contain the termmale
female (boolean)
: true if any sexual attribute contain the termfemale
other (boolean)
: true if any sexual attribute contain informative sexual terms which are notmale
orfemale
. Examples:hermaphrodite
,intersex
,neuter
.
biosample_temperature (data table)
Temperature information extracted from the metadata attributes found on the BioSample(s) corresponding to each SRA run.
BioSample attributes make up an extensive list. Harmonized names were used to extract the values present on this table, as it's described below. The attributes considered are ... (Artem: Where is this code? I'll get the attributes from there).
biosample_id (text)
: BioSample ID from NCBI's BioSamplerun_id (text)
: Run ID from NCBI's SRAtemperature (number)
: Value of temperature in °C
biosample_tissue (data table)
Tissue information extracted from the metadata attributes found on the BioSample(s) corresponding to each SRA run.
BioSample attributes make up an extensive list. Harmonized names were used to extract the values present on this table, as it's described below. Many attributes were considered for this, to get a list of them run SELECT DISTINCT(source) FROM biosample_tissue;
.
biosample_id (text)
: BioSample ID from NCBI's BioSamplesrs_id (text)
: SRS (sequence entry) ID corresponding to the BioSample IDsource (text)
: BioSample attribute from where the tissue information was extracted fromtext (text)
: Text of the aforementioned attributetissue (text)
: Tissue term that was extracted from the attribute's textbto_id (text)
: Brenda Tissue Ontology ID of the corresponding tissue term
biosample_tissue_OLD (data table)
DEPRECATED: An early parsing of tissue metadata terms from BioSample and BTO. Use biosample_tissue
instead.
biosample_id (text)
: BioSample ID from NCBI's BioSamplesrs_id (text)
: SRS (sequence entry) ID corresponding to the BioSample ID.tissue (text)
: Tissue term that was extracted from the attribute's textbto_id (text)
: Brenda Tissue Ontology ID of the
biosample_tissue_tags (data table)
DEPRECATED? (Artem, Declan)
biosample_id (text)
: BioSample ID from NCBI's BioSamplebiosample_tags (text)
: List of tissue-like(?) terms extracted from BioSample attributesrun (text)
: ???
dfamily (data table)
description
run_id (text)
: Run ID from NCBI's SRAphylum_name (text)
: ...family_name (text)
: ...family_group (text)
: ...coverage_bins (text)
: ...score (number)
: ...percent_identity (number)
: ...depth (number)
: ...n_reads (number)
: ...aligned_length (number)
: ...
dphylum (data table)
description
run_id (text)
: Run ID from NCBI's SRAphylum_name (text)
: ...coverage_bins (text)
: ...score (number)
: ...percent_identity (number)
: ...depth (number)
: ...n_reads (number)
: ...aligned_length (number)
: ...
dsequence (data table)
description
run_id (text)
: Run ID from NCBI's SRAphylum_name (text)
: ...family_name (text)
: ...family_group (text)
: ...virus_name (text)
: ...sequence_accession (text)
: ...coverage_bins (text)
: ...score (number)
: ...percent_identity (number)
: ...depth (number)
: ...n_reads (number)
: ...aligned_length (number)
: ...
dsra (data table)
Summary files corresponding to the Dark RdRp query set. Columns/rows here were copied verbatim from the first lines of all summary files.
run_id (text)
: Run ID from NCBI's SRAread_length (number)
: Average read length of all reads in the current SRA entrygenome (text)
: Name of the pangenome used for analysis (see Query Sets)aligned_reads (number)
: Number of reads that were found alignmed to the query setdate (text)
: Date when the summary file was createdtruncated (text)
: Whether or not the summary file is truncated
nfamily (data table)
description
run_id (text)
: Run ID from NCBI's SRAfamily_name (text)
: ...coverage_bins (text)
: ...score (number)
: ...percent_identity (number)
: ...depth (number)
: ...n_reads (number)
: ...n_global_reads (number)
: ...length (number)
: ...top_genbank_id (text)
: ...top_score (number)
: ...top_length (number)
: ...top_name (text)
: ...
nfamily_counts (materialized view)
description
The materialized view is generated by the following SQL query:
SELECT nfamily.family_name,
nfamily.score,
nfamily.percent_identity,
count(*) AS count
FROM nfamily
GROUP BY nfamily.family_name, nfamily.score, nfamily.percent_identity;
family_name (text)
: ...score (number)
: ...percent_identity (number)
: ...count (number)
: ...
nfamily_list (materialized view)
description
The materialized view is generated by the following SQL query:
SELECT nfamily.family_name
FROM nfamily
GROUP BY nfamily.family_name
ORDER BY nfamily.family_name;
family_name (text)
: ...
nsequence (data table)
description
run_id (text)
: Run ID from NCBI's SRAfamily_name (text)
: ...sequence_accession (text)
: ...coverage_bins (text)
: ...score (number)
: ...percent_identity (number)
: ...depth (number)
: ...n_reads (number)
: ...n_global_reads (number)
: ...length (number)
: ...virus_name (text)
: ...
nsequence_counts (materialized view)
description
The materialized view is generated by the following SQL query:
SELECT nsequence.sequence_accession,
nsequence.score,
nsequence.percent_identity,
count(*) AS count
FROM nsequence
GROUP BY nsequence.sequence_accession, nsequence.score, nsequence.percent_identity;
sequence_accession (text)
: ...score (number)
: ...percent_identity (number)
: ...count (number)
: ...
nsequence_list (materialized view)
description
The materialized view is generated by the following SQL query:
SELECT nsequence.sequence_accession,
nsequence.virus_name
FROM nsequence
GROUP BY nsequence.sequence_accession, nsequence.virus_name
ORDER BY nsequence.sequence_accession;
sequence_accession (text)
: ...virus_name (text)
: ...
nsra (data table)
Summary files corresponding to the NT query set. Columns/rows here were copied verbatim from the first lines of all summary files.
run_id (text)
: Run ID from NCBI's SRAread_length (number)
: Average read length of all reads in the current SRA entrygenome (text)
: Name of the pangenome used for analysis (see Query Sets)version (text)
: ...date (text)
: Date when the summary file was created
palm_acc (data table)
description
row_id (number)
: ...accession (text)
: ...database (text)
: ...palm_id (text)
: ...
palm_gb (data table)
description
row_id (number)
: ...palm_id (text)
: ...gb_acc (text)
: ...percent_identity (number)
: ...pp_cov (number)
: ...evalue (number)
: ...tax_id (number)
: ...
palm_graph (data table)
description
row_index (number)
: ...palm_id1 (text)
: ...palm_id2 (text)
: ...pident (number)
: ...
palm_sra (data table)
description
row_id (number)
: ...run_id (text)
: Run ID from NCBI's SRAassembly_node (number)
: ...coverage (number)
: ...q_start (number)
: ...q_end (number)
: ...q_len (number)
: ...q_strand (text)
: ...palm_id (text)
: ...pp_start (number)
: ...pp_end (number)
: ...pp_len (number)
: ...percent_identity (number)
: ...evalue (number)
: ...cigar (text)
: ...q_sequence (text)
: ...qc_pass (boolean)
: ...
palm_sra2 (data table)
A version of palm_sra meant for development. Use palm_sra instead.
row_id (number)
: ...run_id (text)
: Run ID from NCBI's SRAassembly_node (number)
: ...coverage (number)
: ...q_start (number)
: ...q_end (number)
: ...q_len (number)
: ...q_strand (text)
: ...palm_id (text)
: ...sotu (text)
: ...pp_start (number)
: ...pp_end (number)
: ...pp_len (number)
: ...percent_identity (number)
: ...evalue (number)
: ...qc_pass (boolean)
: ...cigar (text)
: ...q_sequence (text)
: ...
palm_tax (materialized view)
palm_tax
is a materialized view combining the data-tables palmdb2
(palmdb sequences), with palm_gb
(palmdb2 sequences aligned with DIAMOND
against GenBank - July '23). The top-scoring hit in GenBank for each palmdb2
sequence is retained. For each palmDB sequenced aligned to GenBank, the ncbi_taxonomy
for that GenBank sequence was extracted from the tax_lineage
table.
palmDB to GenBank Alignment Command:
diamond blastp \
-q unique.palmprint.faa \
-d ~/nr/nr.dmnd \
--masking seg --unal 1 \
-p 14 -k 1 \
-f 6 qseqid qstart qend qlen \
sseqid sstart send slen \
pident evalue \
sscinames staxids sphylums \
> palmdb_to_gb.pro
The materialized view is generated by the following SQL query:
SELECT gb.palm_id,
pp.sotu,
pp.nickname,
pp.centroid,
gb.gb_acc,
gb.percent_identity,
gb.pp_cov,
gb.evalue,
tax.tax_id,
tax.tax_kingdom,
tax.tax_phylum,
tax.tax_order,
tax.tax_family,
tax.tax_genus,
tax.tax_species
FROM ((palm_gb gb
LEFT JOIN palmdb2 pp ON ((gb.palm_id = pp.palm_id)))
LEFT JOIN tax_lineage tax ON ((gb.tax_id = tax.tax_id)));
palm_id (text)
: Unique ID for every RdRp identified within palmdb database to datesotu (text)
: species Operational Taxonomic Unit. sOTUs are the centroid sequence as a result of clustering palmprints (palm_id) at a threshold of 90% amino acid identity, chosen to approximate taxonomic species within the Serratus platformnickname (text)
: Unique text identifier for each sotucentroid (boolean)
: Is this specificpalm_id
a centroid in palmdb?gb_acc (text)
: Top-scoring GenBank accession to which thispalm_id
aligned (DIAMOND2).NULL
if no-alignment found.percent_identity (number)
: Percent identity of gb alignmentpp_cov (number)
: Percent query-coverage of the palmprint sequence against gb alignmentevalue (number)
: E-value for gb alignmenttax_id (number)
: NCBI taxonomic ID of genbank sequencetax_kingdom (text)
: NCBI taxonmy rank of genbank sequencetax_phylum (text)
: NCBI taxonmy rank of genbank sequencetax_order (text)
: NCBI taxonmy rank of genbank sequencetax_family (text)
: NCBI taxonmy rank of genbank sequencetax_genus (text)
: NCBI taxonmy rank of genbank sequencetax_species (text)
: NCBI taxonmy rank of genbank sequence
palm_virome (materialized view)
description
The materialized view is generated by the following SQL query:
SELECT palm.run_id AS run,
meta.scientific_name,
meta.bio_sample,
meta.bio_project,
palm.palm_id,
palm.sotu,
tax.nickname,
tax.gb_acc,
tax.percent_identity AS gb_pid,
tax.evalue AS gb_eval,
tax.tax_species,
tax.tax_family,
palm.assembly_node AS node,
palm.coverage AS node_coverage,
palm.percent_identity AS node_pid,
palm.evalue AS node_eval,
palm.qc_pass AS node_qc,
palm.q_sequence AS node_seq
FROM ((palm_sra2 palm
JOIN srarun meta ON (((palm.qc_pass = true) AND (palm.run_id = meta.run))))
LEFT JOIN palm_tax tax ON (((tax.sotu = palm.sotu) AND (tax.centroid = true))))
ORDER BY palm.run_id;
run (text)
: ...scientific_name (text)
: ...bio_sample (text)
: ...bio_project (text)
: ...palm_id (text)
: ...sotu (text)
: ...nickname (text)
: ...gb_acc (text)
: ...gb_pid (number)
: ...gb_eval (number)
: ...tax_species (text)
: ...tax_family (text)
: ...node (number)
: ...node_coverage (number)
: ...node_pid (number)
: ...node_eval (number)
: ...node_qc (boolean)
: ...node_seq (text)
: ...
palmdb (data table)
description
palm_id (text)
: ...sotu (text)
: ...centroid (boolean)
: ...palmprint (text)
: ...tax_phylum (text)
: ...tax_class (text)
: ...tax_order (text)
: ...tax_family (text)
: ...tax_genus (text)
: ...tax_species (text)
: ...nickname (text)
: ...
palmdb2 (data table)
description
palm_id (text)
: ...sotu (text)
: ...nickname (text)
: ...percent_identity (number)
: ...centroid (boolean)
: ...palmprint (text)
: ...
pfamily (data table)
description
run_id (text)
: Run ID from NCBI's SRAfamily_name (text)
: ...coverage_bins (text)
: ...score (number)
: ...percent_identity (number)
: ...n_reads (number)
: ...aligned_length (number)
: ...
protein (data table)
description
run_id (text)
: Run ID from NCBI's SRAfamily_name (text)
: ...protein_name (text)
: ...coverage_bins (text)
: ...score (number)
: ...percent_identity (number)
: ...n_reads (number)
: ...aligned_length (number)
: ...
psequence (data table)
description
run_id (text)
: Run ID from NCBI's SRAfamily_name (text)
: ...protein_name (text)
: ...genbank_id (text)
: ...coverage_bins (text)
: ...score (number)
: ...percent_identity (number)
: ...n_reads (number)
: ...aligned_length (number)
: ...
psra (data table)
Summary files corresponding to the pilot study query set. Columns/rows here were copied verbatim from the first lines of all summary files.
run_id (text)
: Run ID from NCBI's SRAread_length (number)
: Average read length of all reads in the current SRA entrygenome (text)
: Name of the pangenome used for analysis (see Query Sets)aligned_reads (number)
: Number of reads that were found alignmed to the query setdate (text)
: Date when the summary file was created
rdrp_pos (data table)
DEPRECATED
run_id (text)
: Run ID from NCBI's SRA
rfamily (data table)
description
run_id (text)
: Run ID from NCBI's SRAphylum_name (text)
: ...family_name (text)
: ...family_group (text)
: ...coverage_bins (text)
: ...score (number)
: ...percent_identity (number)
: ...depth (number)
: ...n_reads (number)
: ...aligned_length (number)
: ...
rfamily_counts (materialized view)
description
The materialized view is generated by the following SQL query:
SELECT rfamily.family_name,
rfamily.score,
rfamily.percent_identity,
count(*) AS count
FROM rfamily
GROUP BY rfamily.family_name, rfamily.score, rfamily.percent_identity;
family_name (text)
: ...score (number)
: ...percent_identity (number)
: ...count (number)
: ...
rfamily_list (materialized view)
description
The materialized view is generated by the following SQL query:
SELECT rfamily.family_name
FROM rfamily
GROUP BY rfamily.family_name
ORDER BY rfamily.family_name;
family_name (text)
: ...
rphylum (data table)
description
run_id (text)
: Run ID from NCBI's SRAphylum_name (text)
: ...coverage_bins (text)
: ...score (number)
: ...percent_identity (number)
: ...depth (number)
: ...n_reads (number)
: ...aligned_length (number)
: ...
rphylum_counts (materialized view)
description
The materialized view is generated by the following SQL query:
SELECT rphylum.phylum_name,
rphylum.score,
rphylum.percent_identity,
count(*) AS count
FROM rphylum
GROUP BY rphylum.phylum_name, rphylum.score, rphylum.percent_identity;
phylum_name (text)
: ...score (number)
: ...percent_identity (number)
: ...count (number)
: ...
rphylum_list (materialized view)
description
The materialized view is generated by the following SQL query:
SELECT rphylum.phylum_name
FROM rphylum
GROUP BY rphylum.phylum_name
ORDER BY rphylum.phylum_name;
phylum_name (text)
: ...
rsequence (data table)
description
run_id (text)
: Run ID from NCBI's SRAphylum_name (text)
: ...family_name (text)
: ...family_group (text)
: ...virus_name (text)
: ...sequence_accession (text)
: ...coverage_bins (text)
: ...score (number)
: ...percent_identity (number)
: ...depth (number)
: ...n_reads (number)
: ...aligned_length (number)
: ...
rsequence_counts (materialized view)
description
The materialized view is generated by the following SQL query:
SELECT rsequence.sequence_accession,
rsequence.score,
rsequence.percent_identity,
count(*) AS count
FROM rsequence
GROUP BY rsequence.sequence_accession, rsequence.score, rsequence.percent_identity;
sequence_accession (text)
: ...score (number)
: ...percent_identity (number)
: ...count (number)
: ...
rsequence_list (materialized view)
description
The materialized view is generated by the following SQL query:
SELECT rsequence.sequence_accession,
rsequence.virus_name
FROM rsequence
GROUP BY rsequence.sequence_accession, rsequence.virus_name
ORDER BY rsequence.sequence_accession;
sequence_accession (text)
: ...virus_name (text)
: ...
rsra (data table)
Summary files corresponding to the RdRp query set. Columns/rows here were copied verbatim from the first lines of all summary files.
run_id (text)
: Run ID from NCBI's SRAread_length (number)
: Average read length of all reads in the current SRA entrygenome (text)
: Name of the pangenome used for analysis (see Query Sets)aligned_reads (number)
: Number of reads that were found alignmed to the query setdate (text)
: Date when the summary file was createdtruncated (text)
: Whether or not the summary file is truncated
sra_stat (data table)
description
row_id (number)
: ...run (text)
: ...taxid (number)
: ...rank (text)
: ...name (text)
: ...kmer (number)
: ...total_kmers (number)
: ...kmer_perc (number)
: ...tax_label (text)
: ...
sra_stat_group (data table)
description
taxid (number)
: ...tax_rank (text)
: ...tax_name (text)
: ...tax_label (text)
: ...
srarun (data table)
description (child table of srarun_v1)
run (text)
: ...release_date (datetime)
: ...load_date (datetime)
: ...spots (number)
: ...bases (number)
: ...spots_with_mates (number)
: ...avg_length (number)
: ...size_mb (number)
: ...assembly_name (text)
: ...download_path (text)
: ...experiment (text)
: ...library_name (text)
: ...library_strategy (text)
: ...library_selection (text)
: ...library_source (text)
: ...library_layout (text)
: ...insert_size (number)
: ...insert_dev (number)
: ...platform (text)
: ...model (text)
: ...sra_study (text)
: ...bio_project (text)
: ...study_pubmed_id (text)
: ...project_id (text)
: ...sample (text)
: ...bio_sample (text)
: ...sample_type (text)
: ...tax_id (text)
: ...scientific_name (text)
: ...sample_name (text)
: ...g1k_pop_code (text)
: ...source (text)
: ...g1k_analysis_group (text)
: ...subject_id (text)
: ...sex (text)
: ...disease (text)
: ...tumor (text)
: ...affection_status (text)
: ...analyte_type (text)
: ...histological_type (text)
: ...body_site (text)
: ...center_name (text)
: ...submission (text)
: ...dbgap_study_accession (text)
: ...consent (text)
: ...run_hash (text)
: ...read_hash (text)
: ...
srarun_geo_coordinates (materialized view)
description
The materialized view is generated by the following SQL query:
SELECT srarun.run AS run_id,
srarun.bio_sample AS biosample_id,
srarun.release_date,
bgeo.coordinate_x,
bgeo.coordinate_y,
bgeo.from_text
FROM (srarun
JOIN biosample_geo_coordinates bgeo ON ((srarun.bio_sample = bgeo.biosample_id)));
run_id (text)
: Run ID from NCBI's SRAbiosample_id (text)
: BioSample ID from NCBI's BioSamplerelease_date (datetime)
: ...coordinate_x (number)
: ...coordinate_y (number)
: ...from_text (text)
: ...
srarun_v1 (data table)
DEPRECATED: Use srarun instead.
run (text)
: ...release_date (datetime)
: ...load_date (datetime)
: ...spots (number)
: ...bases (number)
: ...spots_with_mates (number)
: ...avg_length (number)
: ...size_mb (number)
: ...assembly_name (text)
: ...download_path (text)
: ...experiment (text)
: ...library_name (text)
: ...library_strategy (text)
: ...library_selection (text)
: ...library_source (text)
: ...library_layout (text)
: ...insert_size (number)
: ...insert_dev (number)
: ...platform (text)
: ...model (text)
: ...sra_study (text)
: ...bio_project (text)
: ...study_pubmed_id (text)
: ...project_id (text)
: ...sample (text)
: ...bio_sample (text)
: ...sample_type (text)
: ...tax_id (text)
: ...scientific_name (text)
: ...sample_name (text)
: ...g1k_pop_code (text)
: ...source (text)
: ...g1k_analysis_group (text)
: ...subject_id (text)
: ...sex (text)
: ...disease (text)
: ...tumor (text)
: ...affection_status (text)
: ...analyte_type (text)
: ...histological_type (text)
: ...body_site (text)
: ...center_name (text)
: ...submission (text)
: ...dbgap_study_accession (text)
: ...consent (text)
: ...run_hash (text)
: ...read_hash (text)
: ...
tax_lineage (data table)
description
row_id (text)
: ...tax_id (number)
: ...tax_kingdom (text)
: ...tax_phylum (text)
: ...tax_order (text)
: ...tax_family (text)
: ...tax_genus (text)
: ...tax_species (text)
: ...
tax_names (data table)
description
row_id (text)
: ...tax_id (number)
: ...name_txt (text)
: ...unique_name (text)
: ...name_class (text)
: ...
tax_nodes (data table)
description
row_id (text)
: ...tax_id (number)
: ...parent_tax_id (number)
: ...rank (text)
: ...embl_code (text)
: ...division_id (number)
: ...inherited_div (boolean)
: ...genetic_code_id (number)
: ...inherted_gc (boolean)
: ...mitochondrial_genetic_code_id (number)
: ...inherited_mgc (boolean)
: ...gb_hidden (boolean)
: ...hidden_subtree (boolean)
: ...comments (text)
: ...
tismap_summary (data table)
(Declan: please annotate)
description
bio_project (text)
: ...run (text)
: ...bio_sample (text)
: ...biosample_tags (text)
: ...scientific_name (text)
: ...percent_identity (number)
: ...coverage (number)
: ...palm_id (text)
: ...sotu (text)
: ...centroid (boolean)
: ...palmprint (text)
: ...tax_phylum (text)
: ...tax_class (text)
: ...tax_order (text)
: ...tax_family (text)
: ...tax_genus (text)
: ...tax_species (text)
: ...nickname (text)
: ...tissue (text)
: ...
tissue_map (data table)
(Declan: please annotate, you could just provide a list of all the features and just mention they boolean columns, but add a meaningful description and caveats, if any)
description
biosample_id (text)
: BioSample ID from NCBI's BioSamplederived_cell_line (boolean)
: ...cancer (boolean)
: ...abnormal_other (boolean)
: ...stool (boolean)
: ...whole_body (boolean)
: ...embryonic_structure (boolean)
: ...pluripotent_stem_cell (boolean)
: ...nervous_system (boolean)
: ...brain (boolean)
: ...stem (boolean)
: ...glia (boolean)
: ...forebrain (boolean)
: ...cardiovascular_system (boolean)
: ...heart (boolean)
: ...atrium (boolean)
: ...heart_ventricle (boolean)
: ...vascular_system (boolean)
: ...artery (boolean)
: ...aorta (boolean)
: ...vein (boolean)
: ...connective_tissue (boolean)
: ...adipose_tissue (boolean)
: ...fibroblast (boolean)
: ...gland (boolean)
: ...endocrine_gland (boolean)
: ...thyroid_gland (boolean)
: ...head (boolean)
: ...mouth (boolean)
: ...teeth (boolean)
: ...hematopoietic_system (boolean)
: ...blood (boolean)
: ...hematopoietic_cell (boolean)
: ...erythrocyte (boolean)
: ...leukocyte (boolean)
: ...monocyte (boolean)
: ...macrophage (boolean)
: ...lymphocyte (boolean)
: ...t_cell (boolean)
: ...b_cell (boolean)
: ...natural_killer_cell (boolean)
: ...granulocyte (boolean)
: ...neutrophil (boolean)
: ...basophil (boolean)
: ...eosinophil (boolean)
: ...plasma (boolean)
: ...platelet (boolean)
: ...integument (boolean)
: ...mucosa (boolean)
: ...skin (boolean)
: ...dermis (boolean)
: ...epidermis (boolean)
: ...feather (boolean)
: ...s_ulcer_tissue (boolean)
: ...skin_cancer (boolean)
: ...epithelium (boolean)
: ...endothelium (boolean)
: ...epithelioma_cell (boolean)
: ...adenoma (boolean)
: ...carcinoma (boolean)
: ...limb (boolean)
: ...forelimb (boolean)
: ...arm (boolean)
: ...leg (boolean)
: ...muscular_system (boolean)
: ...muscle (boolean)
: ...muscle_fibre (boolean)
: ...myocyte (boolean)
: ...skeletal_muscle (boolean)
: ...respiratory_system (boolean)
: ...lung (boolean)
: ...skeletal_system (boolean)
: ...bone (boolean)
: ...bone_marrow (boolean)
: ...trunk (boolean)
: ...thorax (boolean)
: ...breast (boolean)
: ...urogenital_system (boolean)
: ...reproductive_system (boolean)
: ...female_repro (boolean)
: ...ovary (boolean)
: ...oviduct (boolean)
: ...vagina (boolean)
: ...uterus (boolean)
: ...cervix (boolean)
: ...male_repro (boolean)
: ...prostate (boolean)
: ...viscus (boolean)
: ...alimentary_canal (boolean)
: ...anus (boolean)
: ...esophagus (boolean)
: ...gastrointestinal_tract (boolean)
: ...intestine (boolean)
: ...small_intestine (boolean)
: ...large_intestine (boolean)
: ...stomach (boolean)
: ...gut (boolean)
: ...foregut (boolean)
: ...midgut (boolean)
: ...hindgut (boolean)
: ...bladder (boolean)
: ...gall_bladder (boolean)
: ...urinary_bladder (boolean)
: ...kidney (boolean)
: ...liver (boolean)
: ...pancreas (boolean)
: ...immune_system (boolean)
: ...lympoid_tissue (boolean)
: ...spleen (boolean)
: ...whole_plant (boolean)
: ...fruit (boolean)
: ...shoot (boolean)
: ...leaf (boolean)
: ...roots (boolean)
: ...fungus (boolean)
: ...fruit_body (boolean)
: ...