Architecture: data storage - metaspace2020/metaspace GitHub Wiki

PostgreSQL

The database schema is "owned" by sm-graphql. All changes must be made to the TypeScript model.ts files and applied through TypeORM migrations. More info in SQL migrations.

Development

See Dev-environment-tips#set-up-datagrip-to-connect-to-the-production-databases for an easy way to connect DataGrip to prod without needing the VPN.

If you're good with SQL, sometimes it's fastest to just SSH to a server and run commands directly e.g. sudo -u postgres psql sm postgres -c "SELECT * FROM graphql.dataset WHERE id = '...'"

Be aware that the engine tests need a static copy of the DB schema. This file is generated in the sm-graphql project by running yarn run gen-sql-schema. It should be checked in to git so that the CI can run engine tests without needing to set up an environment to run sm-graphql.

Development decisions

The decision to use TypeORM was made after evaluating all popular Python and TypeScript ORMs at the time, and finding that only TypeORM had satisfactory performance, automated migration generation and a strong connection between the database schema and the language's static type checking. Additionally, as sm-graphql handles significantly more SQL operations than engine, it made sense to prioritize convenience in sm-graphql.

There are currently two SQL namespaces (called "schemas" in SQL terminology, but they shouldn't be confused with the schema definition), public and graphql. graphql was introduced in an attempt to keep the engine and sm-graphql applications independent - a decision that was soon found to slow development with no benefit. After a group discussion, we decided all new tables should be in the public namespace unless there's a good organizational reason not to, such as a new many-to-many table between two existing graphql tables.

Gotchas

There are two `dataset` tables: `public.dataset` and `graphql.dataset`

This is an awkward separation of responsibility - engine manages creation/deletion of public.dataset and sm-graphql manages creation/modification/deletion of graphql.dataset, however both applications can read both tables. They are not connected by a foreign key, and occasionally they go out of sync.

This was not an intentional design decision, it was just too difficult to refactor and migrate into a clean state at the time. engine's lack of an ORM makes it tedious and error-prone to add new columns to public.dataset.

Where to place a column depends on which application is responsible for saving it to the database. Most new columns should go in graphql.dataset because sm-graphql handles most basic CRUD operations. Columns should only be added to public.dataset if the value is calculated/modified from engine code.

METASPACE runs Postgres 9.5

No particular reason for this - we just haven't invested the time to upgrade yet.

It has a limited set of JSON functions
It doesn't cache evaluation of JSON operations, which can make JSON access in the annotation table very slow. E.g. if you SELECT or use a WHERE filter on dataset.config->'isotope_generation' in a query that joins to annotation, it will JSON-decode dataset.config for EVERY output row, even if there's only 1 dataset and 10000 annotations. To get around this, use WITH CTEs to make subqueries that filter/extract the dataset JSON fields separately from the query that joins to annotation.

ElasticSearch

ElasticSearch is used for searching, retrieving and applying authorization rules for datasets and annotations. It's possible to 100% recreate the ElasticSearch index based on data stored in Postgres using scripts/update_es_index.py, however: 1) this process takes days to run, and 2) some of the data (m/z values, isomers/isobars) is computed on-the-fly and can only be computed in Python code.

It's pretty messy to work with, so we've try to minimize the surface area interacting with ElasticSearch. All schema definitions and write operations should be kept in es_export.py and all queries should be kept in esConnector.ts.

I really can't give any other development tips - every time I've worked with ElasticSearch I've needed to read the docs.

Web Server Filesystem

/var/www/mol-images - Molecular database molecule images (served through nginx)
/var/www/webapp - Compiled webapp (served through nginx)
/var/www/webapp/old_ver - Previous compiled version of webapp (also served through nginx, allowing the website to still work even if people don't refresh for a while after a deployment)
/opt/dev/metaspace/metaspace/graphql/health.json - When there is a maintenance message or read-only mode is active, graphql will use this file to persist the status even if the process is restarted. This file is usually removed automatically during Ansible deployment.
/opt/data/metaspace/public/raw_optical_images/ - Contains raw user-uploaded optical images. After these are submitted, sm-api also makes several lower-resolution copies that are stored in S3.

S3

The AWS CLI is much easier to use than the web interface. Once set up you can do e.g. aws s3 ls s3://sm-engine-upload | less to view a bucket, or just aws s3 ls to list buckets.