db schema - nextstrain/flora GitHub Wiki
The database consists of a number of tables in a relational-type format; however, this relational mapping must be verified and maintained by us. Each pathogen gets a separate set of tables, i.e. this page defines the schema for a single pathogen. This design was chosen to avoid duplication of data.
Geographical data
https://github.com/nextstrain/flora/wiki/geo
Sacra representation of tables === flora upload JSON === flora download JSON
In our schema, each table is representable as JSON: the top-level object maps each table name to an array of dictionaries, with each dictionary representing a row.
{
table_name: [
{row 1}
{row 2}
...
],
table_name: [...]
}
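As a minimal sketch of this layout, the Python below builds a two-table document and round-trips it through JSON. The row values (strain name, accession, locus, sequence) are made up for illustration; only the table and field names come from this page.

```python
import json

# Hypothetical rows for illustration; field names follow the schema on this page.
tables = {
    "samples": [
        {
            "sample": "A/Example/1/2017",
            "strain": "A/Example/1/2017",
            "accessions": ["ACC0001"],
        },
    ],
    "sequences": [
        {"accession": "ACC0001", "locus": "HA", "sequence": "ATGC"},
    ],
}

# Each table is an array of dictionaries, one dictionary per row.
text = json.dumps(tables, indent=2, sort_keys=True)
restored = json.loads(text)
```

Because the whole document is plain JSON, the same structure could serve as both the upload and the download format.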
missing data
All values are strings unless specified otherwise. Missing data should always be none (??)
Please edit this page as necessary.
owners / permissions
Data should be associated with an owner field, which exists as a key in an owners table. This allows members of that group (i.e. OAuth logins) to manipulate data which they own (even publicly available data should have an owner). The owners table will map owner to an array of OAuth logins. There seem to be two main questions:
- Where should the owners table live, i.e. as part of each pathogen DB or in a separate DB?
- Should the owner field be associated with a strain or a sequence? If the latter, what happens when multiple sequences of the same strain have different owners?
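The owner lookup described above might be sketched as follows. This is only an illustration: the owner names, logins, and the `can_modify` helper are hypothetical, and the sketch does not settle either of the open questions (it works on any row that carries an owner field, whether that row is a strain or a sequence).

```python
# Hypothetical owners table: owner key -> array of OAuth logins.
owners = {
    "lab_a": ["alice", "bob"],
    "public": ["carol"],
}


def can_modify(login, row, owners_table):
    """Return True if `login` belongs to the group that owns `row`."""
    # A row without an owner, or with an unknown owner, is editable by nobody.
    return login in owners_table.get(row.get("owner"), [])
```

For example, `can_modify("alice", {"owner": "lab_a"}, owners)` is true, while the same call for `"carol"` is false.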
Tables.
dbinfo table
- hash - the current state of the DB. Changes every time the DB changes.
- name - the name of the pathogen / DB
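This page does not say how the hash is computed. One possible approach, shown here purely as an assumption, is to hash a canonical JSON serialization of all tables with SHA-256, so that any change to the data changes the hash. The `db_hash` function name is hypothetical.

```python
import hashlib
import json


def db_hash(tables):
    """Hash a canonical JSON serialization of all tables (assumed scheme)."""
    # sort_keys makes the result independent of dictionary key order;
    # the order of rows within a table still affects the hash.
    canonical = json.dumps(tables, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A scheme like this lets a client compare its cached hash with the one in dbinfo to decide whether a fresh download is needed.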
samples table (formerly "virus")
- sample (primary key)
- strain - often identical to sample. This schema allows multiple samples from the same strain. Note that currently this table is collapsed for augur / auspice such that there are only unique strains, and that sample is not currently used in those codebases.
- accessions -> array of values, each appearing as a key in the Sequences table
- host
- host_age (numeric)
- date (XXXX-XX-XX) sample collection date
- isolate_ids -> array (what are these? should they be here or just in sequences?)
- lineage
- num_segments (numeric) (and/or array of segment names?)
- region (perhaps not in this table)
- country (admin0)
- division (admin1)
- admin2 - city / area smaller than admin1
- lat/long: ? I expect isolates to come with a GPS tag soon
- public (bool)
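The typed fields of a samples row could be checked with something like the sketch below. Several points are assumptions rather than part of the schema: the `check_sample` helper is hypothetical, missing values are taken to be Python `None`, and the date pattern accepts `X` as a placeholder for unknown digits.

```python
import re

# Assumption: dates look like 2017-03-21, with X allowed for unknown digits.
DATE_PATTERN = re.compile(r"^[0-9X]{4}-[0-9X]{2}-[0-9X]{2}$")


def check_sample(row):
    """Return the names of fields in one samples row that look invalid."""
    problems = []
    if not row.get("sample"):
        problems.append("sample")  # primary key must be present
    date = row.get("date")
    if date is not None and not (isinstance(date, str) and DATE_PATTERN.match(date)):
        problems.append("date")
    age = row.get("host_age")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if age is not None and (isinstance(age, bool) or not isinstance(age, (int, float))):
        problems.append("host_age")
    public = row.get("public")
    if public is not None and not isinstance(public, bool):
        problems.append("public")
    if not isinstance(row.get("accessions", []), list):
        problems.append("accessions")
    return problems
```

An empty list means the row passed every check; otherwise the list names the offending fields in the order they were checked.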
Sequences table
- accession
- locus
- passage_category
- submitting_lab
- isolate_id
- sequence
- initial_upload_date - date of initial upload to gisaid/vipr/github/etc., i.e. the first time this sequence was available for analysis.
- URL
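Since the relational mapping has to be verified by us, a check that every accession listed in the samples table exists in the sequences table might look like the sketch below. The `check_accessions` helper is hypothetical, and the lowercase table names `samples` and `sequences` are assumed to be the JSON keys.

```python
def check_accessions(tables):
    """Return (sample, accession) pairs whose accession is absent from sequences."""
    known = {row["accession"] for row in tables.get("sequences", [])}
    missing = []
    for row in tables.get("samples", []):
        for accession in row.get("accessions", []):
            if accession not in known:
                missing.append((row["sample"], accession))
    return missing
```

Running a check like this after each upload would catch dangling references before they reach downstream tools.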
references table
This (proposed) table would store the data currently in genbank / augur config files. Since the current situation is acceptable, this can be left as is for the foreseeable future.
Publication / Accreditation table
https://github.com/nextstrain/flora/wiki/geo/accreditation
Dropped_sequences table
Replaces the ad-hoc specification currently in augur