DLQ Replay - sgajbi/portfolio-analytics-system GitHub Wiki

Overview

A Dead Letter Queue (DLQ) is used in the system to capture and retain events that fail processing in any microservice (e.g., malformed data, downstream DB errors). DLQs enable safe operational recovery and prevent data loss or pipeline blockages.

This guide covers:

  • How DLQs are used in the platform
  • How to replay DLQ messages using the provided tool
  • Troubleshooting failed events

DLQ in the Event Pipeline

  • Each service (calculator, persistence, etc.) writes failed events to its own DLQ Kafka topic (e.g., cost_calculator_dlq, cashflow_calculator_dlq).
  • DLQ messages include:
    • The original event payload
    • Error details / stack trace
    • Correlation ID for tracing

Replay Tool: tools/dlq_replayer.py

A CLI tool is provided for replaying failed events from any DLQ topic back to the main input topic after issues are fixed.

Usage

Prerequisites:

  • Ensure the root .env file is set up with KAFKA_BOOTSTRAP_SERVERS_HOST and any needed topics.

Run the replayer:

python tools/dlq_replayer.py \
  --dlq-topic cost_calculator_dlq \
  --main-topic raw_transactions_completed \
  --max-messages 10

Arguments:

  • --dlq-topic: The DLQ Kafka topic to read from.
  • --main-topic: The topic to replay messages to (usually the original input topic).
  • --max-messages: (optional) Number of messages to replay (default: all).

Common Replay Flow

  1. Identify and fix root cause in the target service (code, config, infra, or bad data).
  2. Replay failed events using the DLQ replayer tool.
  3. Monitor processing in logs and check that reprocessed events are handled correctly and removed from the DLQ.

Troubleshooting Checklist

  • Service keeps writing to DLQ:

    • Check logs for stack trace or error reason.
    • Use correlation ID to trace the event through all pipeline services.
    • Confirm DB schema matches expected input.
    • Validate event payload for schema mismatches.
    • If the failure is due to data, correct at source and replay.
  • Replay fails:

    • Ensure Kafka is healthy and accessible.
    • Validate topic names and permissions.
    • Check Python environment for missing packages.
  • Pipeline blocked:

    • If DLQ fills up or is not draining, consider scaling service or partitioning the backlog replay.

Operational Best Practices

  • Replay in small batches to avoid overwhelming downstream services.
  • Always monitor service logs and DLQ topic length during/after replay.
  • Do not delete DLQ topics without operational review; they provide valuable forensic evidence for root cause analysis.

Related Pages