# Cloud Integration
This is a quick list of the cloud services we're using for the job pipeline, and what they do.
All URLs below are project-agnostic. To view a given service properly, you may need to select a project in the top left-hand corner of the screen.
Service | Description | GCP Link |
---|---|---|
Cloud Run | Runs persistent "Services" (the main site, the API, etc) and temporary "Jobs" (tool computations, etc). | link |
Container Registry | Storage for containerized code - compiled versions of individual modules. | link |
Datastore | Storage for "entities", a type of unstructured data (i.e. the set of possible fields is unspecified). Generally used to store metadata about files, services, conceptual items, etc. Our code imposes field names / types on these values. | link |
Cloud Storage | Generalized data file ("blob") storage. The bulk of CaeNDR data lives here, or at least starts its life here. | link |
Cloud Tasks | Schedules jobs (read: tool computations) to be run asynchronously. | link |
Pub/Sub | Polls jobs to see if they've finished running yet. | link |
## Cloud Run

Cloud Run lets you run a container asynchronously in the cloud. This generally treats a container as a "black box" -- we don't know anything about how a given container works internally, but we know what data to give it as input. There are two ways to run containers in Cloud Run: as "services" and as "jobs".
Services are persistent -- a service is always running, and is open to new connections (i.e. HTTP requests). This is where we run the main site and the API. When we want to deploy a new version of e.g. the site (or roll back to an old one), this is where we go.
Jobs are temporary -- they are created to run a specific computation, and shut down when that computation is complete. They are generally not open to new connections. This is how we run the containerized tools provided by the Andersen Lab.
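As a rough sketch of what starting a job execution looks like under the hood (the project, region, and job names below are placeholders, and the real code wraps this differently), the `google-cloud-run` client can trigger one execution of a job like this:

```python
# Hypothetical sketch: triggering a Cloud Run job execution with the
# google-cloud-run client library. Project/region/job names are placeholders,
# not the actual CaeNDR resource names.
from google.cloud import run_v2

client = run_v2.JobsClient()

# Fully qualified job name: projects/PROJECT/locations/REGION/jobs/JOB
job_name = client.job_path("my-project", "us-central1", "my-tool-job")

# run_job starts one execution of the job and returns a long-running
# operation. Calling .result() on it would block until the execution
# finishes; we skip that here, since completion is signalled
# asynchronously (see Pub/Sub below).
operation = client.run_job(request=run_v2.RunJobRequest(name=job_name))
```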
## Container Registry

This is storage for containers, e.g. the site code or the tools. When we publish a container, this is where it ends up. Each Cloud Run deployment (either type) is based on a specific version of a container that lives here.
Note: Container Registry is being deprecated and replaced by Artifact Registry. We will need to migrate to the new service.
This is storage for "entities", a type of unstructured data. In this case, "unstructured" means that Datastore doesn't define the set of possible fields -- whatever fields exist in the data will show up as table columns.
Each entity has a "kind", which defines what type of data it is. Each kind (that we use, at least) has a corresponding Entity subclass in the codebase.
Most of our entities store metadata about a particular type of "thing" in the project, and provide enough information to find the associated data elsewhere on GCP. Reports, for example, store the container version & data hash, which can be used to uniquely find the input file in Cloud Storage.
For example, we store the following as entities:
- Browser tracks (for Genome Browser & Indel Finder)
- Users' Carts
- Job Reports (metadata like the user, data hash, container, etc -- these can be used to find the input/output data in Cloud Storage)
- Versions of data, like dataset releases and containers (i.e. metadata about the release and where to find the associated files)
- User accounts
- Profiles (for the "People" page)
- Species
For more on Entities, see the page ???.
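As a loose illustration of the entity model (the kind name, ID, and fields below are made up for the example; the real schema is defined by the Entity subclasses), writing and reading a report-style entity with the `google-cloud-datastore` client looks roughly like this:

```python
# Hypothetical sketch of how a job-report entity might be stored.
from google.cloud import datastore

client = datastore.Client()

# The "kind" ("Report") plus an ID form the entity's key.
key = client.key("Report", "report-abc123")

entity = datastore.Entity(key=key)
# Datastore itself doesn't constrain these fields; our code does.
entity.update({
    "username": "some-user",
    "data_hash": "d41d8cd98f00b204",
    "container_version": "v1.2.3",
})
client.put(entity)

# Reading it back: fields appear as dict-style keys on the entity.
report = client.get(key)
print(report["data_hash"])
```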
This is "Bucket" storage, where we store files (or more generally, "blobs"). For a list of buckets used by the project, see the section Buckets.
## Cloud Tasks

This service schedules jobs to be run. Each job type (read: each tool) has its own queue. To start a job, we use the appropriate Task subclass to schedule the job in its queue. Cloud Tasks will then send a message to the API "start" endpoint to start that job.
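A minimal sketch of what that scheduling looks like, assuming a hypothetical queue name and endpoint URL (in our codebase this is wrapped by the Task subclasses rather than called directly):

```python
# Hypothetical sketch: enqueueing a job with google-cloud-tasks.
import json
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()

# Each tool has its own queue.
queue_path = client.queue_path("my-project", "us-central1", "my-tool-queue")

# The task is an HTTP request against the API's "start" endpoint.
task = {
    "http_request": {
        "http_method": tasks_v2.HttpMethod.POST,
        "url": "https://api.example.org/jobs/start",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"report_id": "report-abc123"}).encode(),
    }
}
client.create_task(request={"parent": queue_path, "task": task})
```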
## Pub/Sub

TL;DR: We use this to poll jobs to see if they've finished running yet.
This service helps coordinate events across a distributed system. For example, our API (Cloud Run Service) wants to be updated when a tool (Cloud Run Job) finishes running. A given service can send a message to its "Topic", which will propagate to one or more "Subscribers"; in turn, each Subscriber sends a message to its associated service.
Our app takes advantage of a nice feature of Subscribers: if they send a request to a service and don't get a "200 OK" response, they will keep periodically re-sending that request. Sending a 200 response is called "acknowledging" or "acking" the message, and sending an error response is called "nacking" (negatively acknowledging) the message.
When we start a Cloud Run job for a tool, we also publish a message to Pub/Sub. The Subscriber then repeatedly re-sends that message to the API "status" endpoint, effectively polling the job's status: the API returns an error code while the job is still running, and a success code once it has finished. At that point, the API notifies the user that their job is complete and "acks" the message, removing it from Pub/Sub's queue.
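A minimal sketch of what such a push endpoint could look like in Flask, with hypothetical helper functions standing in for the real job lookup and notification logic:

```python
# Hypothetical sketch of a push-subscription "status" endpoint in Flask.
# The route and helpers are illustrative, not the actual CaeNDR code.
from flask import Flask, abort

app = Flask(__name__)

def job_is_finished() -> bool:
    """Placeholder: would check the Cloud Run job's execution status."""
    return True

def notify_user() -> None:
    """Placeholder: would notify the user that their result is ready."""

@app.route("/jobs/status", methods=["POST"])
def job_status():
    if not job_is_finished():
        # A non-2xx response "nacks" the message; Pub/Sub re-delivers it
        # later, which is what makes this behave like polling.
        abort(500)
    notify_user()
    # A 2xx response "acks" the message, removing it from the queue.
    return ("", 204)
```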