Home - nextstrain/flora GitHub Wiki
Flora is the next iteration of fauna. It aims to be a cloud-based service through which data can be added to, or retrieved from, a database. Sacra will be used for data cleaning and formatting, and as such the DB schema is enforced there. This wiki serves as the roadmap for flora, the database, and the desired user interaction.
Aims:
To create a piece of software that can run both locally and on a server, with two main aims:
- Take genomic data / metadata, transform it into a consistent schema (via sacra), verify with the user the salient changes that this data would cause in the database, and then modify the database.
- View and download data from the database.
Flora can be seen as a middle layer between raw data and the database. Communication between users and flora will be via scripts (if run locally) or an API. This API can be called either via the command line or a web GUI (see below). Views into the data will also be provided via the API (returning JSON documents) or web GUIs. Ideally, modification of the database can also occur via a GUI. Users will be authenticated (OAuth) in flora, and flora will have full read/write access to the DB (see below). The cleaned JSONs (sacra output) which are added to the DB will be stored, which in principle makes it possible to rebuild the database from them. This makes modifying the database in-place (e.g. via a GUI) extremely difficult, but it could be possible if each such modification generates an intermediate JSON which is stored alongside the others. Backups should happen daily.
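To make the "stored JSONs allow a rebuild" idea concrete, here is a minimal sketch. It is not flora code: the function names, the numbered-file layout and the `id` key are all assumptions made for illustration, and a plain dict stands in for the database.

```python
import json
from pathlib import Path


def store_update(log_dir: Path, seq_no: int, documents: list[dict]) -> Path:
    """Write one batch of cleaned documents as a numbered JSON file (hypothetical layout)."""
    path = log_dir / f"{seq_no:06d}.json"
    path.write_text(json.dumps(documents, sort_keys=True))
    return path


def rebuild(log_dir: Path, key: str = "id") -> dict[str, dict]:
    """Replay every stored batch in file-name order; later batches overwrite earlier ones."""
    table: dict[str, dict] = {}
    for path in sorted(log_dir.glob("*.json")):
        for doc in json.loads(path.read_text()):
            table[doc[key]] = doc  # "id" as primary key is an assumption
    return table
```

Under this scheme an in-place edit made through a GUI would simply be written as one more numbered batch, so replaying the files still reproduces the current state.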
Database structure
https://github.com/nextstrain/flora/wiki/db-schema
Steps:
- IN PROGRESS Sacra (not described here) and database schema (https://github.com/nextstrain/flora/wiki/db-schema)
- DONE Choice of database - see https://github.com/nextstrain/fauna/wiki/DB-feature-spec. For the foreseeable future, we are sticking with RethinkDB.
- IN PROGRESS Basic database functionality, e.g. table creation, JSON -> DB -> JSON
- JSON & DB diff tooling: this is essential to check that an operation is desired and to know what it will change. It also allows one to understand how the database changes over time, assuming that JSON backups of the database have been made.
- Relational-enforcement scripts // DB verification scripts
- Backup scripts
- Design flora API
- Ability to run / access API from a server
- Web-GUI access to the API
- Web-GUI views into the data (this is a lot of work); this could also include stats, maps, etc.
- Web-GUI modification of the database (HARD)
- Authentication & Permissions (see below)
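As an illustration of the JSON & DB diff step in the list above, the following sketch compares two lists of JSON documents (for example the current table contents and a proposed upload, or two backups). The function name and the `id` key are assumptions for illustration only, not part of flora or sacra.

```python
def diff_documents(before: list[dict], after: list[dict], key: str = "id") -> dict:
    """Report which documents were added, removed or changed between two snapshots."""
    old = {doc[key]: doc for doc in before}
    new = {doc[key]: doc for doc in after}
    changed = {}
    for k in sorted(old.keys() & new.keys()):
        # For each shared document, record (old value, new value) per differing field.
        fields = {
            f: (old[k].get(f), new[k].get(f))
            for f in sorted(old[k].keys() | new[k].keys())
            if old[k].get(f) != new[k].get(f)
        }
        if fields:
            changed[k] = fields
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": changed,
    }
```

When previewing an upload, the `removed` entry only means "present in the DB but absent from the upload" and would normally be ignored; it is mainly useful when comparing two backups.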
Authentication / Permissions / Owners
This is a complicated arena, and we haven't got a perfect idea of how to achieve this yet.
Here's the general idea:
Firstly, sequences are marked as public / private, and anyone can see public data.
Each piece of data (defined as a row in the pathogens table) has an owner (even public data, where the owner is the person who uploaded it), and only the owner may modify that data.
Users authenticate using GitHub accounts & OAuth.
The DB contains a (hidden) table mapping owners to GitHub logins.
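The rules above could be checked roughly as follows. This is only a sketch: the field names `public` and `owner`, and the shape of the owner-to-logins mapping (one owner with possibly several GitHub logins), are assumptions, since the schema for this has not been settled.

```python
from typing import Optional


def can_view(row: dict, login: Optional[str], owner_logins: dict[str, set]) -> bool:
    """Public rows are visible to anyone; private rows only to their owner."""
    if row.get("public", False):
        return True
    return can_modify(row, login, owner_logins)


def can_modify(row: dict, login: Optional[str], owner_logins: dict[str, set]) -> bool:
    """Only a login mapped to the row's owner may modify it, even if the row is public."""
    if login is None:  # unauthenticated user
        return False
    return login in owner_logins.get(row.get("owner"), set())
```

Here `owner_logins` plays the role of the hidden table, for example `{"lab-a": {"alice", "bob"}}`, and `login` is whatever GitHub login the OAuth flow returned.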
Known problems:
- What happens if a different owner adds a sequence to an already present virus? Tentative answer: create a separate (almost duplicated) row in the pathogens table.
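One way to read the tentative answer is that rows are keyed by the pair (virus, owner) rather than by the virus alone. The sketch below uses a plain dict as the table, and the field names `strain` and `owner` are hypothetical.

```python
def upsert(table: dict, doc: dict) -> tuple:
    """Insert or update a row keyed by (strain, owner), so each owner gets their own row."""
    key = (doc["strain"], doc["owner"])
    # Merge into this owner's existing row if there is one; other owners' rows are untouched.
    table[key] = {**table.get(key, {}), **doc}
    return key
```

With this keying, a second owner uploading a sequence for an already present virus creates a near-duplicate row instead of touching the first owner's data.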