Kafka - tech9tel/systemdesign GitHub Wiki
What is Apache Kafka?
Apache Kafka is an open-source, distributed event streaming platform used to build real-time data pipelines and streaming applications. It's designed for high-throughput, low-latency, and fault-tolerant messaging between systems.
Core Concepts
- Producer: Sends (publishes) data to Kafka.
- Consumer: Reads data from Kafka.
- Topic: Named data channel where messages are published.
- Partition: A topic is split into multiple parts for scalability.
- Broker: Kafka server that stores and serves data.
- Offset: Sequential ID of a message in a partition.
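The concepts above can be sketched with a toy in-memory model. This is not the Kafka client API: the `Topic` class and its `publish` and `read` methods are hypothetical names, and CRC32 is only a stand-in for the hash a real client partitioner would use.

```python
import zlib


class Topic:
    """Toy topic: a fixed number of partitions, each an append-only list."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        """Producer side: pick a partition from the key, append, return (partition, offset)."""
        # Assumption: CRC32 stands in for the real partitioner hash.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append((key, value))
        return partition, len(log) - 1  # offset = position in that partition's log

    def read(self, partition, offset, max_records=10):
        """Consumer side: read records starting at an offset; nothing is removed."""
        return self.partitions[partition][offset:offset + max_records]


topic = Topic("orders", num_partitions=3)
p1, o1 = topic.publish("user-1", "created")
p2, o2 = topic.publish("user-1", "paid")
```

Records with the same key land in the same partition, so their relative order is preserved and their offsets increase by one per append.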
Key Features
- Log-based: Messages are stored in an append-only log.
- Durable: Data persists on disk and can be replayed.
- Scalable: Scales horizontally by adding brokers and partitions; well-provisioned clusters can handle millions of messages per second.
- Fault-tolerant: Replication across brokers ensures reliability.
- Distributed: Works as a cluster for performance and resilience.
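A rough sketch of how replication gives fault tolerance, under simplifying assumptions: every replica is updated synchronously on each append, and the lowest surviving broker ID becomes the new leader. Real Kafka followers fetch from the leader and a controller picks the new leader from the in-sync replicas; `ReplicatedPartition` and its methods are hypothetical names.

```python
class ReplicatedPartition:
    """Toy partition whose log is copied to several brokers."""

    def __init__(self, broker_ids):
        # One log copy per broker; the first broker starts as leader.
        self.replicas = {broker_id: [] for broker_id in broker_ids}
        self.leader = broker_ids[0]

    def append(self, record):
        """Append to every surviving replica and return the record's offset."""
        # Simplification: all replicas are written synchronously.
        for log in self.replicas.values():
            log.append(record)
        return len(self.replicas[self.leader]) - 1

    def fail_broker(self, broker_id):
        """Drop a broker; if it was the leader, promote a surviving replica."""
        del self.replicas[broker_id]
        if not self.replicas:
            raise RuntimeError("all replicas lost")
        if broker_id == self.leader:
            self.leader = min(self.replicas)  # hypothetical election rule

    def read_all(self):
        """Read the full log from the current leader."""
        return list(self.replicas[self.leader])


partition = ReplicatedPartition(broker_ids=[1, 2, 3])  # replication factor 3
partition.append("event-a")
partition.append("event-b")
partition.fail_broker(1)  # the leader goes down
```

After the failure, broker 2 leads the partition and still serves both records, which is the property the replication factor is meant to buy.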
Use Cases
- Real-time analytics (e.g., user behavior, fraud detection)
- Log aggregation and monitoring
- Event sourcing in microservices
- Data ingestion pipelines (to S3, HDFS, DBs)
- Stream processing (with Kafka Streams or Flink)
Kafka is Best For:
When You Need... | Kafka is a Great Fit |
---|---|
High-throughput streaming | Yes |
Real-time event-driven architecture | Yes |
Durable logs and replay capability | Yes |
Horizontal scalability | Yes |
Kafka is Not Ideal For:
Not Ideal When... | Consider Alternatives |
---|---|
You need push-based message queues | RabbitMQ, ActiveMQ |
You require strict ordering across all messages (Kafka orders only within a partition) | Pulsar or custom queues |
You're doing simple task queueing | Celery, SQS |
Learn More
Kafka Terminologies: Quick Reference with Alternatives
Kafka Term | Description | Also Known As (Other Tools/Models) |
---|---|---|
Producer | Sends (publishes) messages to a Kafka topic | Publisher (Pub/Sub), Sender (RabbitMQ), Data Source |
Consumer | Reads messages from Kafka topics | Subscriber (Pub/Sub), Listener (RabbitMQ), Reader |
Topic | Logical channel where messages are published | Stream (Kinesis), Channel (Pub/Sub), Queue (RabbitMQ) |
Partition | A topic is split into partitions for scalability | Shard (Kinesis, DynamoDB), Queue Partition |
Offset | Sequential ID of messages within a partition | Sequence Number (Kinesis), Message Index |
Broker | Kafka server that stores and serves messages | Broker (Pulsar), Node (RabbitMQ) |
Cluster | Set of brokers working together | Same (common term across tools) |
Consumer Group | Group of consumers sharing the workload of reading a topic; each partition is read by one member of the group | Subscription (Pulsar), Listener Pool |
Leader | The replica (and its broker) that handles reads and writes for a partition | Primary (DB), Master (older term) |
Follower | Replica broker that syncs from the leader | Replica, Secondary |
Replication Factor | Number of copies of each partition kept across brokers | Redundancy Level, Data Replicas |
ZooKeeper | External service for metadata and coordination (legacy) | etcd (Kubernetes), Control Plane (modern Kafka uses KRaft) |
KRaft | Kafka's native metadata quorum replacing ZooKeeper | Built-in Consensus, Kafka Controller Node |
Producer Acknowledgment (acks) | How many replica acknowledgments the producer waits for before a write counts as successful (0, 1, or all) | Durable Writes (DB), Delivery Modes (Pub/Sub) |
Retention | How long (or how much) data Kafka keeps in a topic before deleting it | TTL (Time to Live), Message Expiry |
Lag | How far a consumer's position trails the latest message in a partition, measured in offsets | Backlog, Delay |
Compaction | Retains at least the latest value for each key | Upsert (DB), Topic Compaction (Pulsar) |
Kafka Streams | Library for real-time stream processing in Kafka | Apache Flink, Spark Streaming, KSQL |
Connectors | Kafka Connect uses connectors for source/sink integrations | Integrations, Adapters (NiFi, Logstash, Airbyte) |
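Three of the terms in this table (consumer group, lag, compaction) are easier to see in code. The functions below are a minimal sketch on plain Python data, not Kafka's actual assignor or log cleaner; all names are hypothetical, and round-robin is only one possible assignment strategy.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over group members round-robin; each partition gets one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: next offset to be written minus next offset to be read."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


def compact(log):
    """Keep only the last record per key, preserving the order of the survivors."""
    last_index = {key: index for index, (key, _) in enumerate(log)}
    return [record for index, record in enumerate(log) if last_index[record[0]] == index]


assignment = assign_partitions([0, 1, 2, 3], ["c1", "c2", "c3"])
lag = consumer_lag({0: 120, 1: 80}, {0: 100, 1: 80})
compacted = compact([("k1", "a"), ("k2", "b"), ("k1", "c")])
```

With four partitions and three consumers, one consumer owns two partitions. Adding a fifth consumer would leave one idle, since a partition has at most one owner per group.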
Summary
Kafka terminology maps well to distributed messaging, pub/sub, and streaming models. Most terms have equivalents in other systems, but Kafka's architecture is log-based, which sets it apart from traditional queuing systems like RabbitMQ.
Kafka vs RabbitMQ: Messaging System Comparison
Apache Kafka and RabbitMQ are both widely-used messaging systems, but they serve different use cases and operate under different paradigms.
Overview
Feature | Apache Kafka | RabbitMQ |
---|---|---|
Type | Distributed Event Streaming Platform | General-purpose Message Broker (Queue-based) |
Message Model | Pub/Sub with persistent logs | Queue-based (Push + Pub/Sub optional) |
Use Case | High-throughput event streaming | Task queuing, transactional messaging |
Delivery Model | Pull-based (consumers pull messages) | Push-based (broker pushes to consumers) |
Protocol | Custom binary protocol over TCP | AMQP 0-9-1 (default); MQTT, STOMP, and HTTP via plugins |
Architecture & Durability
Feature | Kafka | RabbitMQ |
---|---|---|
Persistence | Durable log storage on disk | Durable queues/messages (configurable) |
Message Ordering | Ordered within a partition | No guarantee unless using one queue |
Retention | Time-based or size-based (log-style) | Deleted after consumption (default) |
Scalability | Horizontally scalable (partitions) | Vertical or cluster-based (less seamless) |
Backpressure Handling | Handled at consumer level | Can drop or block producers if queues fill |
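The retention and delivery differences in this table can be sketched with two toy classes. This assumes a simplified model in which a queue deletes a message as soon as it is delivered and a log never deletes on read; `LogPartition` and `WorkQueue` are hypothetical names, not APIs of either system.

```python
from collections import deque


class LogPartition:
    """Log-style storage: reads never delete; each reader tracks its own offset."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

    def poll(self, offset, max_records):
        """Pull model: the consumer asks for a batch starting at its own offset."""
        batch = self.records[offset:offset + max_records]
        return batch, offset + len(batch)


class WorkQueue:
    """Queue-style storage: a delivered message is removed from the queue."""

    def __init__(self):
        self.messages = deque()

    def publish(self, message):
        self.messages.append(message)

    def deliver(self):
        return self.messages.popleft() if self.messages else None


log = LogPartition()
queue = WorkQueue()
for event in ["e1", "e2", "e3"]:
    log.append(event)
    queue.publish(event)

analytics_batch, analytics_offset = log.poll(offset=0, max_records=2)
audit_batch, audit_offset = log.poll(offset=0, max_records=10)  # independent reader
first = queue.deliver()
second = queue.deliver()
```

Both log readers see the same records at their own pace, and the consumer decides how much to pull, which is how backpressure ends up at the consumer level. The queue hands each message to one receiver and then forgets it.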
Performance & Throughput
Feature | Kafka | RabbitMQ |
---|---|---|
Throughput | Very high (millions/sec) | Moderate (100k to 1M msg/sec range) |
Latency | Low for streaming (typically milliseconds), but not hard real-time | Low (good for transactional jobs) |
Efficiency | High for large event volumes | Better for small, fast messages |
Security & Reliability
Feature | Kafka | RabbitMQ |
---|---|---|
Security | TLS, SASL, ACLs | TLS, OAuth2, fine-grained policies |
High Availability | Partition replication across brokers | Quorum queues (classic mirrored queues are deprecated) |
Reliability | At-least-once by default; at-most-once and exactly-once (idempotent producer, transactions) configurable | At-most-once or at-least-once (consumer acks, publisher confirms) |
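The difference between at-most-once and at-least-once comes down to whether the consumer commits its offset before or after processing. The simulation below is a sketch of that idea only; `consume` and `run_with_restart` are hypothetical helpers, and the crash is simulated at a fixed offset.

```python
def consume(records, committed, crash_at, commit_first):
    """Process records from `committed` onward, crashing once at offset `crash_at`.

    Returns (processed_values, committed_offset) as they stand when the run ends.
    """
    processed = []
    for offset in range(committed, len(records)):
        if commit_first:
            committed = offset + 1  # at-most-once: commit, then process
            if offset == crash_at:
                return processed, committed  # crash before processing: record is lost
            processed.append(records[offset])
        else:
            processed.append(records[offset])  # at-least-once: process, then commit
            if offset == crash_at:
                return processed, committed  # crash before commit: record will repeat
            committed = offset + 1
    return processed, committed


def run_with_restart(records, commit_first):
    """Run until a simulated crash at offset 1, then restart from the committed offset."""
    first, committed = consume(records, 0, crash_at=1, commit_first=commit_first)
    second, _ = consume(records, committed, crash_at=None, commit_first=commit_first)
    return first + second


records = ["a", "b", "c"]
at_most_once = run_with_restart(records, commit_first=True)
at_least_once = run_with_restart(records, commit_first=False)
```

Committing first loses `"b"` across the crash, while committing last processes `"b"` twice. At-least-once consumers therefore usually need idempotent processing.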
Tooling & Ecosystem
Feature | Kafka | RabbitMQ |
---|---|---|
Tooling | Kafka Streams, Kafka Connect, ksqlDB | Shovel, Federation, Management UI |
Monitoring | Prometheus, Confluent tools, Grafana | RabbitMQ Management Plugin, Prometheus |
Client Libraries | Java, Python, Go, etc. (native protocol) | Java, Python, Go, Node, etc. (via AMQP) |
Pros & Cons
Apache Kafka
Pros:
- High-throughput event streaming
- Durable log storage and reprocessing
- Scalable and partitioned architecture
Cons:
- More complex to operate and manage
- Requires careful tuning and tooling
RabbitMQ
Pros:
- Easy to set up and operate
- Flexible routing with exchanges
- Supports multiple protocols and QoS
Cons:
- Limited for big data/streaming use cases
- Harder to scale horizontally
When to Use What?
Scenario | Recommended Tool |
---|---|
Event-driven, high-volume data pipelines | Kafka |
Real-time user actions, IoT, logs, analytics | Kafka |
Task/work queueing (e.g., email, image processing) | RabbitMQ |
Transactional or low-latency message handling | RabbitMQ |
You want log-style reprocessing & replay | Kafka |
You need flexible routing & multi-protocol | RabbitMQ |