Architecture: engine - metaspace2020/metaspace Wiki

Engine contains several components, mainly connected by a RabbitMQ message bus:

(NOTE: This can be edited - grab the raw file from /docs/Engine Layout.drawio.svg and edit it in https://app.diagrams.net/)

API

API is mainly a bridge for sm-graphql to talk with the Python codebase.

Update Daemon / Update Queue

The Update Daemon performs several functions, all are triggered by Update Queue messages:

The Update Daemon runs multiple threads (usually 4), each can process 1 message at a time.

Some

(Spark) Annotate Daemon / Annotate Queue / Cluster Auto-Start Daemon

When a message is added to the annotate queue:

  1. Cluster Auto-Start detects the message without consuming it, and starts the Spark cluster if it's not already running
  2. The Spark cluster starts Annotate Daemon
  3. Annotate daemon processes the message:
    1. it sets a cluster-busy: yes lock in Redis
    2. it fetches the dataset information from postgres/S3
    3. it sets the dataset status to "Processing" (including sending an UPDATE message to the Update queue)
    4. it runs the annotation pipeline against the dataset
    5. it saves the results to postgres/S3
    6. it adds an INDEX message and a CLASSIFY_OFFSAMPLE message to the Update queue
    7. it releases the lock by setting cluster-busy: no in Redis
  4. Cluster Auto-Start waits for the queue to become empty AND for cluster-busy: no then destroys the Spark cluster.

When running in a development docker setup, Cluster Auto-Start isn't used at all. Annotate Daemon runs continuously.

Lithops Daemon / Lithops Queue

This daemon listens to the Lithops Queue and processes messages one-at-a-time. It is parallelized by running multiple instances of the Daemon, managed via supervisor.

Lithops has historically had resource/memory leak issues, so this daemon is configured to kill itself after every failed annotation, or after running for 24 hours. Supervisor restarts it after it exits.

Status Update Queue

This queue is for signalling to graphql that a dataset has changed, so that webapp can show notifications and update data needing to refresh. All modules that can modify the dataset (API, Update Daemon, Lithops Daemon, Annotate Daemon) can potentially create messages on this queue.

The GraphQL side is handled in graphql/src/dataset/controller/Subscription.ts. Because ElasticSearch updates are async, graphql will wait for ES to finish updating before notifying webapp. It does this by repeatedly querying ElasticSearch until it sees that its ES status matches the status in the message.

Off-Sample Service

Update Daemon sends it batches of base-64-encoded images in a JSON request, and it returns their off-sample classification, e.g. {"label": "on", "prob": 0.123} which indicates that the image is classified as on-sample and the raw model output was 0.123 (values <0.5 are on-sample, >0.5 are off-sample).

Off-Sample Service does not connect directly to any other database or storage. It just responds to HTTP requests.