db schema - nextstrain/flora GitHub Wiki

The database consists of a number of tables in a relational-type format; however, this relational mapping must be verified and maintained by us. Each pathogen gets a separate set of tables, i.e. this page defines the schema for a single pathogen. This design was chosen to avoid duplication of data.

Geographical data

https://github.com/nextstrain/flora/wiki/geo

Sacra representation of tables === flora upload JSON === flora download JSON

In our schema, each table can be represented as JSON: a table is an array of dictionaries, with each dictionary representing one row.

{
   table_name: [
       {row 1}
       {row 2}
       ...
   ],
   table_name: [...]
}
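As a concrete illustration of the layout above, here is a minimal Python sketch of a two-table database and its JSON round trip. The values (sample names, accessions, sequences) are invented for the example; only the table and column names come from this page.

```python
import json

# Hypothetical example data: values are invented, column names follow this page.
db = {
    "samples": [
        {"sample": "sampleA", "strain": "strainA", "accessions": ["ACC001"]},
        {"sample": "sampleB", "strain": "strainA", "accessions": ["ACC002"]},
    ],
    "sequences": [
        {"accession": "ACC001", "locus": "genome", "sequence": "ATGC"},
        {"accession": "ACC002", "locus": "genome", "sequence": "ATGG"},
    ],
}

# Serialise to JSON text and parse it back.
text = json.dumps(db, indent=2)
restored = json.loads(text)

# Each table is a list of rows; each row is a dictionary.
for table_name, rows in restored.items():
    print(table_name, len(rows))
```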

missing data

All values are strings unless specified otherwise. Missing data should always be none (??)

Please edit this page as necessary.

owners / permissions

Data should be associated with an owner field, which exists as a key in an owners table. This allows members of that group (i.e. OAuth logins) to manipulate data which they own (even publicly available data should have an owner). The owners table will map each owner to an array of OAuth logins. There seem to be two main questions:

  1. Where should the owners table live - i.e. part of each pathogen DB or in a separate DB?
  2. Should the owner field be associated with a strain or a sequence? If the latter, what happens when multiple sequences of the same strain have different owners?
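One way the owner lookup could work is sketched below in Python. This is only an illustration under assumptions: the owner names, logins, and the `can_edit` helper are hypothetical, and the sketch attaches the owner to a sample row purely for convenience; it does not answer either of the open questions above.

```python
# Hypothetical owners table: owner name -> list of OAuth logins.
owners = {
    "lab_a": ["alice", "bob"],
    "lab_b": ["carol"],
}


def can_edit(login, row, owners_table):
    """Return True if `login` belongs to the group that owns `row`."""
    owner = row.get("owner")
    # Rows without an owner, or with an unknown owner, are editable by nobody.
    return login in owners_table.get(owner, [])


# Even public data carries an owner.
row = {"sample": "sampleA", "owner": "lab_a", "public": True}
print(can_edit("alice", row, owners))
print(can_edit("carol", row, owners))
```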

Tables

dbinfo table

  • hash - the current state of the DB. Changes every time the DB changes.
  • name - the name of the pathogen / DB
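This page does not say how the hash is computed. One possible approach, shown here as a sketch only, is to hash a canonical JSON serialisation of all tables so that any change to any row produces a different value. The `db_hash` function and the choice of SHA-256 are assumptions, not part of the schema.

```python
import hashlib
import json


def db_hash(tables):
    """Hash the full table contents; any change to any row changes the digest."""
    # sort_keys gives a stable serialisation regardless of dict key order.
    canonical = json.dumps(tables, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


tables = {"samples": [{"sample": "sampleA", "strain": "strainA"}]}
before = db_hash(tables)

# Adding a row changes the state of the DB, and therefore the hash.
tables["samples"].append({"sample": "sampleB", "strain": "strainA"})
after = db_hash(tables)
print(before != after)
```

Note that in this sketch the order of rows within a table also affects the hash, since rows are stored as a list.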

samples table (formerly "virus")

  • sample (primary key)
  • strain - often identical to sample. This schema allows multiple samples from the same strain. Note that this table is currently collapsed for augur / auspice so that only unique strains remain, and that sample is not currently used in those codebases.
  • accessions -> array of values, each appearing as a key in the Sequences table
  • host
  • host_age (numeric)
  • date (XXXX-XX-XX) - sample collection date
  • isolate_ids -> array (what are these? should they be here or just in sequences?)
  • lineage
  • num_segments (numeric) (and/or array of segment names?)
  • region (perhaps not in this table)
  • country (admin0)
  • division (admin1)
  • admin2 - city / area smaller than admin1
  • lat/long: ? I expect isolates to come with a GPS tag soon
  • public (bool)
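The collapse of samples to unique strains mentioned under strain could look roughly like the Python sketch below. The rule used here (keep the first sample seen for each strain and merge the accessions of the others into it) is an assumption made for illustration; this page does not specify how augur / auspice actually choose. The function name `collapse_to_strains` and the example values are hypothetical.

```python
def collapse_to_strains(samples):
    """Keep one row per strain (the first encountered), merging accessions."""
    by_strain = {}
    for row in samples:
        strain = row["strain"]
        if strain not in by_strain:
            # Copy the row and its accession list so the input is not modified.
            by_strain[strain] = dict(row, accessions=list(row.get("accessions", [])))
        else:
            by_strain[strain]["accessions"].extend(row.get("accessions", []))
    return list(by_strain.values())


# Hypothetical rows: two samples share a strain.
samples = [
    {"sample": "sampleA", "strain": "strainA", "accessions": ["ACC001"]},
    {"sample": "sampleB", "strain": "strainA", "accessions": ["ACC002"]},
    {"sample": "sampleC", "strain": "strainC", "accessions": ["ACC003"]},
]
collapsed = collapse_to_strains(samples)
print(len(collapsed))
```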

Sequences table

  • accession
  • locus
  • passage_category
  • submitting_lab
  • isolate_id
  • sequence
  • initial_upload_date - date of initial upload to GISAID/ViPR/GitHub/etc., i.e. the first time this sequence was available for analysis.
  • URL
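Because the relational mapping between tables has to be verified by us, a check along the following lines may be useful. It is a minimal sketch, assuming the samples and sequences tables are held as lists of dictionaries as described earlier; the helper name `find_dangling_accessions` and the example values are hypothetical.

```python
def find_dangling_accessions(samples, sequences):
    """Return (sample, accession) pairs with no matching row in sequences."""
    known = {row["accession"] for row in sequences}
    dangling = []
    for row in samples:
        for acc in row.get("accessions", []):
            if acc not in known:
                dangling.append((row["sample"], acc))
    return dangling


# Hypothetical data: ACC002 is referenced by a sample but has no sequence row.
samples = [
    {"sample": "sampleA", "strain": "strainA", "accessions": ["ACC001", "ACC002"]},
]
sequences = [
    {"accession": "ACC001", "locus": "genome", "sequence": "ATGC"},
]
problems = find_dangling_accessions(samples, sequences)
print(problems)
```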

references table

This (proposed) table would store the data currently in GenBank / augur config files. Since the current situation is acceptable, this can be left as is for the foreseeable future.

Publication / Accreditation table

https://github.com/nextstrain/flora/wiki/geo/accreditation

Dropped_sequences table

Replaces the ad-hoc specification currently in augur.