Kafka - tech9tel/systemdesign GitHub Wiki

🌀 What is Apache Kafka?

Apache Kafka is an open-source, distributed event streaming platform used to build real-time data pipelines and streaming applications. It's designed for high-throughput, low-latency, and fault-tolerant messaging between systems.


📦 Core Concepts

  • Producer: Sends (publishes) data to Kafka.
  • Consumer: Reads data from Kafka.
  • Topic: Named data channel where messages are published.
  • Partition: An ordered, append-only slice of a topic. Topics are split into partitions for scalability and parallel consumption.
  • Broker: Kafka server that stores and serves data.
  • Offset: Sequential ID of a message in a partition.
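
To make the relationship between topics, partitions, and offsets concrete, here is a minimal in-memory sketch. It is not Kafka's API: `ToyTopic`, `send`, and `read` are hypothetical names, and `zlib.crc32` stands in for the murmur2 hash that Kafka's default partitioner applies to keyed records.

```python
import zlib


class ToyTopic:
    """In-memory stand-in for a topic: a list of append-only partition logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        """Append a record and return (partition, offset)."""
        # Same key -> same partition, so records for one key stay in order.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append((key, value))
        return partition, len(log) - 1  # offset = position within the partition

    def read(self, partition, offset):
        """Return every record in a partition from `offset` onward."""
        return self.partitions[partition][offset:]
```

Sending two records with the same key lands them in the same partition at consecutive offsets, and `read` can start from any earlier offset because nothing is removed when a record is read.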

🧠 Key Features

  • Log-based: Messages are stored in an append-only log.
  • Durable: Data persists on disk and can be replayed.
  • Scalable: Scales horizontally by adding brokers and partitions; well-provisioned clusters handle millions of messages per second.
  • Fault-tolerant: Replication across brokers ensures reliability.
  • Distributed: Works as a cluster for performance and resilience.

🚀 Use Cases

  • Real-time analytics (e.g., user behavior, fraud detection)
  • Log aggregation and monitoring
  • Event sourcing in microservices
  • Data ingestion pipelines (to S3, HDFS, DBs)
  • Stream processing (with Kafka Streams or Flink)

📊 Kafka is Best For:

| When You Need... | Kafka is a Great Fit ✅ |
| --- | --- |
| High-throughput streaming | ✅ |
| Real-time event-driven architecture | ✅ |
| Durable logs and replay capability | ✅ |
| Horizontal scalability | ✅ |

⚖️ Kafka is Not Ideal For:

| Not Ideal When... | Consider Alternatives |
| --- | --- |
| You need push-based message queues | RabbitMQ, ActiveMQ |
| You require strict ordering across all messages (Kafka orders only within a partition) | A single-partition topic, Pulsar, or custom queues |
| You're doing simple task queueing | Celery, SQS |

🌀 Kafka Terminologies – Quick Reference with Alternatives

| Kafka Term | Description | Also Known As (Other Tools/Models) |
| --- | --- | --- |
| Producer | Sends (publishes) messages to a Kafka topic | Publisher (Pub/Sub), Sender (RabbitMQ), Data Source |
| Consumer | Reads messages from Kafka topics | Subscriber (Pub/Sub), Listener (RabbitMQ), Reader |
| Topic | Logical channel where messages are published | Stream (Kinesis), Channel (Pub/Sub), Queue (RabbitMQ) |
| Partition | Ordered slice of a topic; topics are split into partitions for scalability | Shard (Kinesis, DynamoDB), Queue Partition |
| Offset | Sequential ID of a message within a partition | Sequence Number (Kinesis), Message Index |
| Broker | Kafka server that stores and serves messages | Broker (Pulsar), Node (RabbitMQ) |
| Cluster | Set of brokers working together | Same (common term across tools) |
| Consumer Group | Group of consumers sharing the workload of reading a topic | Subscription (Pulsar), Listener Pool |
| Leader | The replica that handles reads and writes for a partition | Primary (DB), Master (older term) |
| Follower | Replica that copies data from the partition leader | Replica, Secondary |
| Replication Factor | Number of copies of each partition kept across brokers | Redundancy Level, Data Replicas |
| ZooKeeper | External service for metadata and coordination (legacy; modern Kafka uses KRaft) | etcd (Kubernetes), Control Plane |
| KRaft | Kafka's native metadata quorum replacing ZooKeeper | Built-in Consensus, Kafka Controller Node |
| Producer Acknowledgment (acks) | How many replicas must confirm a write before the producer treats it as delivered | Durable Writes (DB), Delivery Modes (Pub/Sub) |
| Retention | How long (or how much) data Kafka keeps in a topic | TTL (Time to Live), Message Expiry |
| Lag | Gap between the latest offset in a partition and the consumer's committed offset | Backlog, Delay |
| Compaction | Retains at least the latest value for each key | Upsert (DB), Topic Compaction (Pulsar) |
| Kafka Streams | Library for real-time stream processing on Kafka | Apache Flink, Spark Streaming, ksqlDB |
| Connectors | Kafka Connect uses connectors for source/sink integrations | Integrations, Adapters (NiFi, Logstash, Airbyte) |
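
Three of the terms above (consumer group, lag, compaction) are easier to reason about with a few lines of code. The following is a simplified sketch with hypothetical function names, not Kafka's implementation: real group assignment is done by pluggable assignors, and real compaction runs in the background on log segments.

```python
def assign_partitions(partitions, consumers):
    """Spread a topic's partitions round-robin over the consumers in one group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        assignment[consumers[index % len(consumers)]].append(partition)
    return assignment


def consumer_lag(log_end_offsets, committed_offsets):
    """Lag per partition: next offset to be written minus next offset to be read."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


def compact(log):
    """Keep only the newest (offset, key, value) for each key, in offset order."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, key, value)  # later records overwrite earlier ones
    return sorted(latest.values())
```

Note that with more consumers than partitions, `assign_partitions` leaves some consumers with an empty list, which mirrors why adding consumers beyond the partition count does not add read parallelism.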

🧠 Summary

Kafka terminology maps well to distributed messaging, pub/sub, and streaming models. Most terms have equivalents in other systems, but Kafka's architecture is log-based, which sets it apart from traditional queuing systems like RabbitMQ.
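
The log-versus-queue distinction can be sketched in a few lines. This is an illustration only, using plain Python containers rather than either system's client library; the group names and the `poll` helper are made up for the example.

```python
from collections import deque

# Traditional queue: consuming a message removes it.
queue = deque(["a", "b", "c"])
first_reader = [queue.popleft() for _ in range(len(queue))]
second_reader = list(queue)  # nothing is left for a reader that arrives later

# Log: records stay in place; each consumer group only advances its own offset.
log = ["a", "b", "c"]
offsets = {"analytics": 0, "billing": 0}


def poll(group, max_records=10):
    """Return the next batch for a group and advance that group's offset."""
    start = offsets[group]
    batch = log[start:start + max_records]
    offsets[group] = start + len(batch)
    return batch
```

Each group reads the full log independently, and resetting a group's entry in `offsets` to 0 replays everything, which is the reprocessing capability a delete-on-consume queue does not offer.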

🔄 Kafka vs RabbitMQ – Messaging System Comparison

Apache Kafka and RabbitMQ are both widely-used messaging systems, but they serve different use cases and operate under different paradigms.


🧱 Overview

| Feature | Apache Kafka | RabbitMQ |
| --- | --- | --- |
| Type | Distributed Event Streaming Platform | General-purpose Message Broker (Queue-based) |
| Message Model | Pub/Sub with persistent logs | Queue-based (Push + Pub/Sub optional) |
| Use Case | High-throughput event streaming | Task queuing, transactional messaging |
| Delivery Model | Pull-based (consumers pull messages) | Push-based (broker pushes to consumers) |
| Protocol | Custom binary protocol over TCP (Kafka protocol) | AMQP 0-9-1 (default); MQTT, STOMP, HTTP via plugins |

⚙️ Architecture & Durability

| Feature | Kafka | RabbitMQ |
| --- | --- | --- |
| Persistence | Durable log storage on disk | Durable queues/messages (configurable) |
| Message Ordering | Ordered within a partition | Ordered within a single queue; no guarantee across queues |
| Retention | Time-based or size-based (log-style) | Deleted after acknowledgment (default) |
| Scalability | Horizontally scalable (partitions) | Vertical or cluster-based (less seamless) |
| Backpressure Handling | Handled at consumer level (consumers pull at their own pace) | Can drop messages or block producers if queues fill |
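
Time-based and size-based retention can be illustrated with a small sketch. The function and parameter names here are hypothetical, and the model is simplified to single records: Kafka itself removes whole log segments, and the corresponding topic settings are `retention.ms` and `retention.bytes`.

```python
def apply_retention(records, now, max_age_s=None, max_bytes=None):
    """Drop the oldest records until both limits hold.

    records: list of (timestamp_seconds, payload_bytes), oldest first.
    """
    kept = list(records)
    if max_age_s is not None:
        # Time-based: discard anything older than the retention window.
        kept = [r for r in kept if now - r[0] <= max_age_s]
    if max_bytes is not None:
        # Size-based: discard from the oldest end until the log fits.
        while kept and sum(len(payload) for _, payload in kept) > max_bytes:
            kept.pop(0)
    return kept
```

Unlike a queue that deletes on acknowledgment, nothing here depends on whether a record has been read; only age and total size decide what stays.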

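🚀 Performance & Throughput

| Feature | Kafka | RabbitMQ |
| --- | --- | --- |
| Throughput | Very high (millions of msg/sec on a well-sized cluster) | Moderate (roughly 100k–1M msg/sec, depending on cluster and workload) |
| Latency | Low (typically milliseconds), but not hard real-time | Low (good for transactional jobs) |
| Efficiency | High for large event volumes | Better for small, fast messages |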

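🔐 Security & Reliability

| Feature | Kafka | RabbitMQ |
| --- | --- | --- |
| Security | TLS, SASL, ACLs | TLS, OAuth2, fine-grained policies |
| High Availability | Partition replication across brokers | Replicated quorum queues (classic mirrored queues are deprecated) |
| Reliability | At least once (default); at most once or exactly once (idempotent producer plus transactions) configurable | At most once or at least once (consumer acks, publisher confirms); no built-in exactly once |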
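
The difference between at-most-once and at-least-once delivery comes down to whether the consumer commits its offset before or after processing a record. The simulation below is a sketch with a hypothetical `consume` function and a single simulated crash; it does not use any real client library.

```python
def consume(records, commit_before_processing, crash_at):
    """Process records, crash once at index `crash_at`, then resume from the commit."""
    processed = []
    committed = 0   # offset the consumer resumes from after a restart
    position = 0
    crashed = False
    while position < len(records):
        if commit_before_processing:
            committed = position + 1              # at most once: commit, then process
            if position == crash_at and not crashed:
                crashed, position = True, committed
                continue                          # this record is never processed
            processed.append(records[position])
        else:
            processed.append(records[position])   # at least once: process, then commit
            if position == crash_at and not crashed:
                crashed, position = True, committed
                continue                          # this record is processed again
            committed = position + 1
        position += 1
    return processed
```

Committing first loses the record that was in flight during the crash, while committing last delivers it twice, so at-least-once consumers generally need idempotent processing.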

🧰 Tooling & Ecosystem

| Feature | Kafka | RabbitMQ |
| --- | --- | --- |
| Tooling | Kafka Streams, Kafka Connect, ksqlDB | Shovel, Federation, Management UI |
| Monitoring | Prometheus, Confluent tools, Grafana | RabbitMQ Management Plugin, Prometheus |
| Client Libraries | Java, Python, Go, etc. (native protocol) | Java, Python, Go, Node, etc. (via AMQP) |

✅ Pros & Cons

Apache Kafka

✅ Pros:

  • High-throughput event streaming
  • Durable log storage and reprocessing
  • Scalable and partitioned architecture

❌ Cons:

  • More complex to operate and manage
  • Requires careful tuning and tooling

RabbitMQ

✅ Pros:

  • Easy to set up and operate
  • Flexible routing with exchanges
  • Supports multiple protocols and QoS

❌ Cons:

  • Limited for big data/streaming use cases
  • Harder to scale horizontally

🤔 When to Use What?

| Scenario | Recommended Tool |
| --- | --- |
| Event-driven, high-volume data pipelines | ✅ Kafka |
| Real-time user actions, IoT, logs, analytics | ✅ Kafka |
| Task/work queueing (e.g., email, image processing) | ✅ RabbitMQ |
| Transactional or low-latency message handling | ✅ RabbitMQ |
| You want log-style reprocessing & replay | ✅ Kafka |
| You need flexible routing & multi-protocol | ✅ RabbitMQ |