RabbitMQ Microservice reliability - department-of-veterans-affairs/abd-vro GitHub Wiki

This is a condensation of a short tech talk discussing findings and recommendations from issue #565.

RabbitMQ uses acknowledgments to ensure that queue consumers reliably receive messages. A consumer can listen for messages on a queue using one of two settings:

Auto-ack (no-ack) - Once RabbitMQ has successfully written a message to the TCP socket, its job is done and it forgets about the message. RabbitMQ does not expect any acknowledgment back from the consumer. From the consumer's perspective, receipt of the message implies acknowledgment (TCP itself acknowledges the delivered bytes at the transport level); no further action is needed to tell RabbitMQ to forget about the message.
Explicit ack - The consumer is responsible for explicitly sending an acknowledgment back to RabbitMQ. If it doesn't (due to buggy code, or a system failure that the consumer can't handle), RabbitMQ will wait a certain amount of time (30 minutes by default), then re-queue the message for delivery to another consumer. Note that an ack can mean either (1) the consumer received the message (redundant with TCP's own acknowledgment) or (2) the consumer successfully processed the message (which may not be needed if the caller is expecting a response to the request). The semantics are ours to decide, but we should be consistent.
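The two settings above can be sketched as a single switch when registering a consumer. This is a minimal illustration that assumes a pika-style channel object (pika's `basic_consume` takes `queue`, `on_message_callback`, and `auto_ack`); the helper name `start_consumer` and its `explicit_ack` parameter are hypothetical and not taken from the VRO codebase.

```python
def start_consumer(channel, queue_name, on_message, explicit_ack=True):
    """Register on_message on queue_name in one of the two acknowledgment modes.

    `channel` is assumed to be a pika-style channel (hypothetical wiring).
    """
    # auto_ack=True  -> the broker forgets the message once it is written to the socket
    # auto_ack=False -> the broker keeps the message until the consumer calls basic_ack
    return channel.basic_consume(
        queue=queue_name,
        on_message_callback=on_message,
        auto_ack=not explicit_ack,
    )
```

Defaulting to explicit acks in this sketch is a design choice for illustration only; which mode fits depends on the scenarios described in the rest of this page.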

The remainder of this page describes a few scenarios for how VRO microservices could interact with RabbitMQ to ensure reliability (or not), starting with some obviously unworkable ones.

No acknowledgment

The microservice consumes requests with auto-ack on, and performs the work without reporting success or failure to anyone.

sequenceDiagram
    App->>RabbitMQ: publish request
    RabbitMQ->>Microservice: consume & auto-ack request
    Note right of Microservice: do some work

This does not provide any safeguards in case of failure, and should only be used if the microservice is intended to perform optional "best-effort" work.

Explicit acknowledgment

The most basic way to address the above issue is to have the microservice explicitly acknowledge that it performed its work successfully.

sequenceDiagram
    App->>RabbitMQ: publish request
    RabbitMQ->>Microservice: consume request
    Note right of Microservice: do some work
    Microservice->>RabbitMQ: ack request

See the re-queuing section below for a description of RabbitMQ's behavior if it never receives the explicit ack.
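As a rough sketch of the explicit-ack flow, the callback below acknowledges only after the work has finished. It assumes pika's callback signature `(channel, method, properties, body)` and JSON request bodies; `do_work` is a hypothetical placeholder for the microservice's real task.

```python
import json


def do_work(request):
    """Hypothetical placeholder for the microservice's real task."""
    return {"echo": request}


def on_request(channel, method, properties, body):
    """pika-style callback for a queue consumed with auto_ack=False."""
    request = json.loads(body)
    do_work(request)
    # Ack last: if anything above raises, or the process dies before this line,
    # the message stays unacknowledged and the broker can re-queue it.
    channel.basic_ack(delivery_tag=method.delivery_tag)
```

Here the ack means "successfully processed" rather than "received", which is one of the two possible semantics mentioned at the top of this page.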

Responding to App

Currently, every microservice in VRO is invoked by a caller (for example the API server, labeled here as "App") that depends on some sort of result from the microservice. As with the request sent to the microservice, the response is published to RabbitMQ, and subsequently consumed by the caller.

sequenceDiagram
    App->>RabbitMQ: publish request
    RabbitMQ->>Microservice: consume & auto-ack request
    Note right of Microservice: do some work
    Microservice->>RabbitMQ: publish response
    RabbitMQ->>App: consume response

In this case, the response message also serves conceptually as an acknowledgment from the microservice that it performed its work successfully. The next two scenarios describe what happens when a failure occurs.
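A minimal sketch of the request/response flow on the microservice side might look like the following. It assumes the pika client library, JSON bodies, and that the caller sets the standard AMQP `reply_to` and `correlation_id` properties on its request; `build_response` and the `status` field are hypothetical and not taken from VRO.

```python
import json


def build_response(request: dict) -> bytes:
    """Hypothetical work step: returns the serialized response body."""
    result = dict(request, status="done")
    return json.dumps(result).encode("utf-8")


def on_request(channel, method, properties, body):
    """pika-style callback for a queue consumed with auto_ack=True."""
    import pika  # assumption: the pika client library is installed

    response_body = build_response(json.loads(body))
    channel.basic_publish(
        exchange="",  # default exchange routes directly to the named queue
        routing_key=properties.reply_to,  # queue the caller is listening on
        properties=pika.BasicProperties(
            correlation_id=properties.correlation_id  # lets the caller match the reply
        ),
        body=response_body,
    )
```

No `basic_ack` call appears because the request was consumed with auto-ack; the published response is the only signal the caller receives.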

Responding to App with error

If the microservice encounters a failure and can recover (in Java and Python microservices that wrap the entire task in a try/catch, this likely accounts for nearly all failures), the microservice can send an error response back to the app.

sequenceDiagram
    App->>RabbitMQ: publish request
    RabbitMQ->>Microservice: consume & auto-ack request
    Note right of Microservice: try to do some work
    Note right of Microservice: handle a failure
    Microservice->>RabbitMQ: publish error response
    RabbitMQ->>App: consume error response

This is nearly identical to the previous scenario, just with different content in the response message -- the error response acts like the ack.
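The "wrap the entire task" pattern can be sketched as a function that always produces a response, whether the work succeeded or failed. The response fields (`status`, `result`, `error`) and the `work` callable are hypothetical names chosen for illustration; the actual VRO message format may differ.

```python
import json


def handle_request(body: bytes, work) -> dict:
    """Run work(request) and always return a response dict, even on failure."""
    try:
        request = json.loads(body)
        return {"status": "ok", "result": work(request)}
    except Exception as exc:
        # Broad on purpose: any failure the process survives becomes an
        # error response, so the caller is never left waiting.
        return {"status": "error", "error": f"{type(exc).__name__}: {exc}"}
```

For example, `handle_request(b'{"x": 2}', lambda r: 1 / 0)` returns a dict whose `status` is `"error"`, which the microservice would then publish back to the App like any other response.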

Letting the App time out

Occasionally there are failures that the microservice can't handle, e.g. if its container gets killed by the OOM killer. In this case, the microservice isn't able to send back any response, so it's up to the App to time out after a reasonable amount of time.

sequenceDiagram
    App->>RabbitMQ: publish request
    RabbitMQ->>Microservice: consume & auto-ack request
    Note right of Microservice: unhandled failure
    Note left of App: timeout waiting for response

This works well when the App itself is bound by a request/response context that is expected to return within a reasonable amount of time. Other callers/requesters must set a response timeout and handle it either by resubmitting the request (themselves or via RabbitMQ's features) or by raising an error for the originating caller to handle.
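On the caller side, the timeout-then-resubmit-or-raise behavior could be sketched as follows. This assumes the caller represents each in-flight request as a `concurrent.futures.Future` that is completed when the matching response arrives; `send_request`, `timeout_s`, and `max_attempts` are hypothetical names, and the default values are placeholders rather than recommendations.

```python
from concurrent.futures import Future, TimeoutError as FutureTimeout
from typing import Any, Callable


def request_with_timeout(
    send_request: Callable[[], Future],
    timeout_s: float = 30.0,
    max_attempts: int = 2,
) -> Any:
    """Publish via send_request() and wait up to timeout_s for each attempt."""
    for _ in range(max_attempts):
        pending = send_request()  # publishes the request, returns its Future
        try:
            return pending.result(timeout=timeout_s)
        except FutureTimeout:
            # No response arrived in time: the microservice may have died
            # mid-request, so fall through and resubmit.
            continue
    # All attempts timed out: surface the failure to the originating caller.
    raise TimeoutError(f"no response after {max_attempts} attempts of {timeout_s}s each")
```

Setting `max_attempts=1` gives the "raise an error" behavior with no resubmission.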

Re-queuing asynchronous work

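When the microservice is responsible for work that finishes at some indeterminate time in the future, RabbitMQ's re-queuing of unacknowledged messages can be used. When the microservice consumes messages with explicit acks, RabbitMQ waits a certain amount of time (30 minutes by default) for the ack; if none arrives before the time-out, it re-queues and re-delivers the message. If the consumer's connection or channel closes first (for example, because its container was killed), RabbitMQ re-queues that consumer's unacknowledged messages right away rather than waiting for the time-out.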

sequenceDiagram
    participant App
    participant RabbitMQ
    participant MS1 as Microservice
    participant MS2 as Microservice
    App->>RabbitMQ: publish request
    RabbitMQ->>MS1: consume request
    Note right of MS1: unhandled failure
    Note right of RabbitMQ: timeout waiting for ack
    RabbitMQ->>MS2: consume request
    Note right of MS2: do some work
    MS2->>RabbitMQ: ack request

This sequence diagram shows the flow when a second microservice is available to consume the message.
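The flow in the diagram can be traced with a small in-memory model. This is a toy stand-in written for illustration, not RabbitMQ's implementation or API: `ToyBroker` and all of its methods are hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyBroker:
    """In-memory model of a queue with explicit acks (not RabbitMQ itself)."""

    ack_timeout_s: float = 30 * 60  # mirrors the 30-minute default described above
    ready: deque = field(default_factory=deque)
    unacked: dict = field(default_factory=dict)  # tag -> (message, ack deadline)
    next_tag: int = 1

    def publish(self, message):
        self.ready.append(message)

    def deliver(self, now):
        """Hand the next ready message to a consumer and start its ack timer."""
        if not self.ready:
            return None
        message = self.ready.popleft()
        tag = self.next_tag
        self.next_tag += 1
        self.unacked[tag] = (message, now + self.ack_timeout_s)
        return tag, message

    def ack(self, tag):
        """Consumer confirms the message; the broker can now forget it."""
        self.unacked.pop(tag, None)

    def requeue_expired(self, now):
        """Put messages whose ack deadline has passed back on the queue."""
        for tag, (message, deadline) in list(self.unacked.items()):
            if now >= deadline:
                del self.unacked[tag]
                self.ready.append(message)
```

With `ack_timeout_s=10`, a message delivered at `now=0` and never acked becomes deliverable again once `requeue_expired(now=10)` runs, at which point a second consumer can receive and ack it.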