Architecture: overview - metaspace2020/metaspace GitHub Wiki

Component layout

System layout diagram

(NOTE: This can be edited - grab the raw file from /docs/System Layout.drawio.svg and edit it in https://app.diagrams.net/)

Web Server

nginx

Configuration is primarily defined through Ansible settings. The compiled result can be seen at /etc/nginx/sites-enabled/default on the web server.

At the time of writing, it's used for:

  • a gateway to the sm-graphql API server, with separate mappings for:
    • GraphQL API (HTTP and web sockets)
    • Authentication/signup HTTP endpoints
    • Dataset and MolDB upload signing HTTP endpoints (actual uploads go to S3, these endpoints just generate signatures)
    • Raw optical image upload/download.
  • serving the compiled webapp files
    • index.html has a 10 minute cache expiry. Resource files have a 1 year/immutable cache expiry.
    • The JS/CSS/image files from the previously deployed version are also kept and served, so that the website doesn't break for people who loaded a slightly old version of index.html. All resource files except index.html have a hash in their filename to avoid name conflicts between versions. Only one previous version is kept (in /var/www/webapp/old_ver/)
    • All unrecognized URLs that don't look like filenames are served index.html. This is necessary so that Vue Router paths like /datasets load the app
  • serving molecule images (shown in the "Molecules" section on the annotations page)
    • These images are either obtained manually from the provided database, or generated manually from the molecules' InChI/SMILES descriptors
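
The caching and fallback behavior above could be expressed in nginx terms roughly like this. This is an illustrative sketch only - the real config is compiled from Ansible settings, and the paths and exact directives here are assumptions:

```nginx
# Illustrative sketch - the actual config is generated by Ansible.
location / {
    root /var/www/webapp;
    # Unrecognized paths (e.g. Vue Router routes like /datasets) fall back to index.html
    try_files $uri /index.html;
}

location = /index.html {
    root /var/www/webapp;
    add_header Cache-Control "max-age=600";                   # 10 minute expiry
}

location ~* \.(js|css|png|svg)$ {
    root /var/www/webapp;
    add_header Cache-Control "max-age=31536000, immutable";   # 1 year, immutable
}
```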

RabbitMQ

RabbitMQ is used as a service bus for Python/engine tasks that don't require immediate feedback. It has very little manual configuration - just a username and password configured through Ansible. Engine code automatically creates queues if they are missing.

There is a management UI at http://metaspace2020.eu:15672 - only accessible from inside EMBL or via VPN. The password can be found in /opt/dev/metaspace/metaspace/engine/conf/config.json. Be careful: "Get"ting a message without "requeue" mode consumes it, which can leave the dataset out of sync or stuck, requiring manual reprocessing.

Queues

  • Update queue: responsible for ES indexing tasks and managing off-sample classification batches
    • Message producers: sm-api
    • Message consumers: update-daemon
  • (Spark) Annotate queue: responsible for annotation jobs that run on the Spark cluster
    • Message producers: sm-api, lithops-daemon (when lithops fails)
    • Message listeners: cluster-auto-start-daemon (starts cluster when queue isn't empty)
    • Message consumers: annotate-daemon
  • Lithops queue: responsible for annotation jobs that run on Lithops
    • Message producers: sm-api, lithops-daemon (re-queues tasks on first failure)
    • Message consumers: lithops-daemon
  • Dataset status queue: allows engine code to notify graphql about changes to datasets so that the webapp can be notified of updates
    • Message producers: potentially all engine services
    • Message consumers: sm-graphql
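
The producer/consumer relationships above can be summarized as a small routing table. This is a hypothetical sketch for orientation only - the queue and daemon names follow the list above, but the routing function is illustrative, not actual engine code:

```python
# Hypothetical sketch: which daemon consumes each queue (names from the list above).
QUEUE_CONSUMERS = {
    'update': 'update-daemon',
    'annotate': 'annotate-daemon',        # Spark path
    'lithops': 'lithops-daemon',
    'dataset_status': 'sm-graphql',
}

def route_annotation(use_lithops):
    """Pick the annotation queue: Lithops normally, Spark as the fallback."""
    return 'lithops' if use_lithops else 'annotate'

print(QUEUE_CONSUMERS[route_annotation(use_lithops=True)])   # lithops-daemon
print(QUEUE_CONSUMERS[route_annotation(use_lithops=False)])  # annotate-daemon
```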

Redis: web session storage & cluster lock

When a user goes to https://metaspace2020.eu, their browser is sent an api.sid cookie. The cookie's value follows a clear pattern: s%3A (URL-encoded s:), then the 32-character session ID, then ., then a signature, e.g.

Cookie: api.sid=s%3A4fo70qd-SG8gf2nkbs-ohkUSXDgeMvoW.Z%2FkO5l2yXZ9fCNuqAHHMfL1OcpDf94nNx9qatc%2BNOJQ
                    |^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^|
                      This part is the Session ID
                    4fo70qd-SG8gf2nkbs-ohkUSXDgeMvoW

If you take the session ID, you can examine the session through the Redis CLI on the web server:

> redis-cli get "sess:4fo70qd-SG8gf2nkbs-ohkUSXDgeMvoW"
"{\"cookie\":{\"originalMaxAge\":2592000000,\"expires\":\"2021-12-31T14:23:17.850Z\",\"httpOnly\":true,\"path\":\"/\"}}"

For a logged-in user's session, this will also include the user ID and any other data in session storage, such as project review tokens.
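
Extracting the Redis key from the cookie can be scripted with the standard library alone. A minimal sketch (the `session_key` helper is hypothetical, not part of the codebase):

```python
# Decode the api.sid cookie value and build the corresponding Redis key.
# express-session-style format: "s:" + session ID + "." + signature, URL-encoded
# so "s:" appears as "s%3A".
from urllib.parse import unquote

def session_key(cookie_value):
    decoded = unquote(cookie_value)              # "s:<session id>.<signature>"
    assert decoded.startswith('s:')
    session_id, _signature = decoded[2:].split('.', 1)
    return f'sess:{session_id}'

cookie = 's%3A4fo70qd-SG8gf2nkbs-ohkUSXDgeMvoW.Z%2FkO5l2yXZ9fCNuqAHHMfL1OcpDf94nNx9qatc%2BNOJQ'
print(session_key(cookie))
# sess:4fo70qd-SG8gf2nkbs-ohkUSXDgeMvoW
```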

Additionally, cluster-auto-start-daemon and annotate-daemon use the cluster-busy key to communicate whether annotate-daemon is currently processing. cluster-auto-start-daemon will only attempt to shut down the cluster if annotate-queue is empty AND cluster-busy is not 'yes'. Queue length alone is not a reliable indicator of when it's safe to shut down: if annotate-daemon crashes while the cluster is running, the queue will not be consumed, and the cluster would stay running until manually stopped. The cluster-busy key is created with a 13-hour expiry so that it is reliably unset if annotate-daemon crashes.
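
The shutdown condition can be captured as a tiny pure function. This is a sketch of the decision logic described above, not actual engine code (the function name and signature are illustrative):

```python
# Hypothetical sketch of cluster-auto-start-daemon's shutdown decision:
# only shut down when the annotate queue is empty AND cluster-busy != 'yes'.
def can_shut_down_cluster(queue_length, cluster_busy):
    return queue_length == 0 and cluster_busy != 'yes'

# annotate-daemon mid-job: queue may be drained but cluster-busy is still set
assert can_shut_down_cluster(0, 'yes') is False
# cluster-busy expired (13 h TTL) after a crash - now safe to shut down
assert can_shut_down_cluster(0, None) is True
# work still queued - keep the cluster up regardless
assert can_shut_down_cluster(3, None) is False
```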

sm-graphql

Called "sm-graphql" so that it's not confused with GraphQL (the protocol). "SM" stands for Spatial Metabolomics - the early codename for METASPACE.

sm-graphql handles all interaction with the browser (except for serving static content). This is intentional - all authentication/authorization is handled in one place. Server requests that can be handled entirely in Python should still call through sm-graphql - this keeps the API surface area small and makes it easier to validate security.

See Architecture: graphql

engine: sm-api

Called "sm-api" to differentiate it from the API (i.e. sm-graphql and/or python-client). sm-api is an internal web server to allow sm-graphql access to Python-based functions in the engine project. It handles both synchronous operations (e.g. deleting a dataset or adding a new molecular database), and asynchronous operations that are just added as tasks to one of the queues (e.g. annotating a dataset)

sm-api uses the Bottle web framework with the CherryPy server backend so that multiple requests can be handled in parallel through multithreading. Performance is poor, though; FastAPI (also used in the ImzML Browser prototype backend) would be a good replacement.

engine: update-daemon

Update daemon processes messages from the update-queue:

  • Index: Fully (re-)index a dataset and its annotations into ElasticSearch
  • Update: Do a partial update of a dataset in ElasticSearch (updated fields are specified in the message)
  • Delete: Delete a dataset from all places (ElasticSearch, PostgreSQL, ion image storage, etc.)
  • Classify off-sample: Pass all unclassified ion images for the dataset to the off-sample classification model (which runs on ECS), then save the results to database and reindex the dataset

Update daemon normally runs 4 queue processor threads so that multiple indexing operations can happen in parallel.
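
The message handling above amounts to a dispatch table keyed on the message's action. A hypothetical sketch (handler bodies and the dataset ID format are placeholders, not engine code):

```python
# Illustrative dispatch of update-queue messages; action names mirror the list above.
def index(ds_id):               return f'reindexed {ds_id}'
def update(ds_id):              return f'partially updated {ds_id}'
def delete(ds_id):              return f'deleted {ds_id}'
def classify_off_sample(ds_id): return f'classified off-sample images for {ds_id}'

HANDLERS = {
    'index': index,
    'update': update,
    'delete': delete,
    'classify_off_sample': classify_off_sample,
}

def process(msg):
    """Route a queue message to its handler based on the 'action' field."""
    return HANDLERS[msg['action']](msg['ds_id'])

print(process({'action': 'index', 'ds_id': '2021-01-01_12h00m00s'}))
# reindexed 2021-01-01_12h00m00s
```

In the real daemon, 4 such processors consume from the queue concurrently, so several datasets can be indexed at once.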

engine: lithops-daemon

Runs annotation jobs received from the lithops queue, first copying the ImzML file from S3 to IBM COS, then running annotation in IBM Cloud using Lithops, then saving the results to PostgreSQL and S3.

These daemons have to deal with a lot of instability, so the queue consumers run in 4 separate processes instead of separate threads. If any error occurs or a Lithops call takes >1 hour to execute, the daemon will exit to prevent further errors. Supervisor will usually restart the daemon.

If the first annotation attempt fails, it will retry. If it fails again, it will re-queue the annotation task in the (Spark) annotate-queue.
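
The retry-then-fallback flow can be sketched as follows. This is an illustrative model of the behavior described above - `run_job` and `requeue_to_spark` are hypothetical placeholders, not real engine functions:

```python
# Sketch of lithops-daemon's retry/fallback: try Lithops twice, then hand the
# job over to the (Spark) annotate-queue.
def annotate_with_fallback(run_job, requeue_to_spark, ds_id, max_attempts=2):
    for _attempt in range(max_attempts):
        try:
            return run_job(ds_id)
        except Exception:
            continue                 # first failure: retry on Lithops
    requeue_to_spark(ds_id)          # repeated failure: fall back to Spark
    return None

calls, requeued = [], []

def flaky(ds_id):
    calls.append(ds_id)
    raise RuntimeError('lithops failed')

annotate_with_fallback(flaky, requeued.append, 'ds-1')
print(len(calls), requeued)   # 2 ['ds-1']  (two Lithops attempts, then re-queued)
```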

PostgreSQL, ElasticSearch, file storage, etc.

See Architecture: data storage

Other AWS

engine: annotate daemon, Spark cluster, cluster auto-start daemon

Cluster auto-start daemon monitors the annotate-queue and Redis, and runs Ansible commands to handle provisioning, deploying, and destroying the EC2 Spark cluster.

The Spark Cluster usually consists of one "master" instance running annotate-daemon and 3 "slave" instances which just run Spark and wait to receive work from the "master" instance.

Annotate-daemon was intended to be deprecated after the introduction of lithops-daemon. However, we decided to keep it as a fallback due to instability with Lithops.

off-sample service

  • Runs on AWS Elastic Container Service, mostly configured by hand.
  • Uses auto-scaling load balancer rules - minimum 1 container instance, increasing as more requests are made
  • Docker build instructions
  • Files used in publication are in the sm-off-sample S3 bucket
  • Unsure if the model file is saved anywhere... If needed, run the Docker container and copy the model file out.

IBM Cloud

Used exclusively for Lithops
