Kafka - tech9tel/systemdesign GitHub Wiki

🌀 What is Apache Kafka?

Apache Kafka is an open-source, distributed event streaming platform used to build real-time data pipelines and streaming applications. It’s designed for high-throughput, low-latency, and fault-tolerant messaging between systems.

📦 Core Concepts

Producer: Sends (publishes) data to Kafka.
Consumer: Reads data from Kafka.
Topic: Named data channel where messages are published.
Partition: A topic is split into multiple parts for scalability.
Broker: Kafka server that stores and serves data.
Offset: Sequential ID of a message in a partition.
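
To make the relationship between topics, partitions, and offsets concrete, here is a minimal in-memory sketch. It is not the Kafka client API: the `Topic` class and `produce` method are hypothetical names, and CRC32 is only a stand-in for the partitioner hash (the Java client's default partitioner uses murmur2 for keyed records).

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Toy in-memory topic: a list of append-only partition logs."""
    name: str
    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self):
        self.partitions = [[] for _ in range(self.num_partitions)]

    def produce(self, key: str, value: str) -> tuple:
        """Append a record and return (partition, offset)."""
        # Assumption: CRC32 stands in for the real partitioner hash.
        partition = zlib.crc32(key.encode("utf-8")) % self.num_partitions
        log = self.partitions[partition]
        log.append((key, value))
        # The offset is simply the record's position in its partition log.
        return partition, len(log) - 1


topic = Topic("orders", num_partitions=3)
p1, o1 = topic.produce("user-42", "created")
p2, o2 = topic.produce("user-42", "paid")
```

Because both records share a key, they land in the same partition with consecutive offsets, which is what gives per-key ordering.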

🧠 Key Features

Log-based: Messages are stored in an append-only log.
Durable: Data persists on disk and can be replayed.
Scalable: Scales horizontally by adding brokers and partitions; a well-provisioned cluster can handle millions of messages per second.
Fault-tolerant: Replication across brokers ensures reliability.
Distributed: Works as a cluster for performance and resilience.
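
The fault-tolerance point can be illustrated with a toy replicated partition. This is a sketch under simplifying assumptions, not how brokers are implemented: the `Partition` class is hypothetical, and records are copied to every replica synchronously, whereas real followers fetch from the leader asynchronously.

```python
class Partition:
    """Toy replicated partition: one leader log plus follower copies."""

    def __init__(self, broker_ids):
        # One log per broker hosting a replica; the first broker starts as leader.
        self.logs = {broker_id: [] for broker_id in broker_ids}
        self.leader = broker_ids[0]

    def append(self, record):
        # Simplification: copy to every replica before acknowledging.
        for log in self.logs.values():
            log.append(record)
        return len(self.logs[self.leader]) - 1

    def fail_broker(self, broker_id):
        """Remove a broker; if it was the leader, promote a surviving replica."""
        del self.logs[broker_id]
        if broker_id == self.leader:
            self.leader = next(iter(self.logs))

    def read(self, offset):
        """Read one record from the current leader's log."""
        return self.logs[self.leader][offset]


partition = Partition(broker_ids=[1, 2, 3])  # replication factor 3
partition.append("order-created")
partition.append("order-paid")
partition.fail_broker(1)                     # the leader goes down
```

After the failure, broker 2 becomes leader and still serves both records, because each record was already stored on every replica.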

🚀 Use Cases

Real-time analytics (e.g., user behavior, fraud detection)
Log aggregation and monitoring
Event sourcing in microservices
Data ingestion pipelines (to S3, HDFS, DBs)
Stream processing (with Kafka Streams or Flink)

📊 Kafka is Best For:

When You Need...	Kafka is a Great Fit ✅
High-throughput streaming	✅
Real-time event-driven architecture	✅
Durable logs and replay capability	✅
Horizontal scalability	✅

⚖️ Kafka is Not Ideal For:

Not Ideal When...	Consider Alternatives
You need push-based message queues	RabbitMQ, ActiveMQ
You require strict ordering across all messages (Kafka orders only within a partition)	A single-partition topic, or a single-queue broker (at the cost of throughput)
You're doing simple task queueing	Celery, SQS

🌀 Kafka Terminologies – Quick Reference with Alternatives

Kafka Term	Description	Also Known As (Other Tools/Models)
Producer	Sends (publishes) messages to a Kafka topic	Publisher (Pub/Sub), Sender (RabbitMQ), Data Source
Consumer	Reads messages from Kafka topics	Subscriber (Pub/Sub), Listener (RabbitMQ), Reader
Topic	Logical channel where messages are published	Stream (Kinesis), Channel (Pub/Sub), Queue (RabbitMQ)
Partition	A topic is split into partitions for scalability	Shard (Kinesis, DynamoDB), Queue Partition
Offset	Sequential ID of messages within a partition	Sequence Number (Kinesis), Message Index
Broker	Kafka server that stores and serves messages	Broker (Pulsar), Node (RabbitMQ), Queue Manager (IBM MQ)
Cluster	Set of brokers working together	Same (common term across tools)
Consumer Group	Group of consumers sharing the workload of reading a topic; each partition is read by one member at a time	Subscription (Pulsar), Listener Pool
Leader	The partition replica that handles all writes and, by default, reads	Primary (DB), Master (older term)
Follower	Replica broker that syncs from the leader	Replica, Secondary
Replication Factor	Number of copies of each partition kept across brokers	Redundancy Level, Data Replicas
ZooKeeper	External service for metadata and coordination (legacy; modern Kafka uses KRaft)	etcd (Kubernetes), Control Plane
KRaft	Kafka's native metadata quorum replacing ZooKeeper	Built-in Consensus, Kafka Controller Node
Producer Acknowledgment (acks)	How many replica acknowledgments the producer waits for before treating a write as successful (acks=0, 1, or all)	Durable Writes (DB), Delivery Modes (Pub/Sub)
Retention	How long (or how much) data Kafka keeps in a topic before deleting it	TTL (Time to Live), Message Expiry
Lag	Number of messages between the latest offset in a partition and the consumer group's committed position	Backlog, Delay
Compaction	Retains at least the latest value for each key	Upsert (DB), Topic Compaction (Pulsar)
Kafka Streams	Library for real-time stream processing on Kafka topics	Apache Flink, Spark Streaming, ksqlDB
Connectors	Kafka Connect uses connectors for source/sink integrations	Integrations, Adapters (NiFi, Logstash, Airbyte)
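
Two of the terms above, lag and compaction, are easier to see as small calculations. The helpers below are a hypothetical sketch rather than Kafka APIs. The compaction function ignores details of the real log cleaner, such as tombstones and the fact that the active segment is not compacted.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag = log end offset minus the group's committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


def compact(records):
    """Keep only the latest value per key, preserving the order of survivors."""
    latest = {}
    for offset, (key, value) in enumerate(records):
        latest[key] = (offset, value)  # later records overwrite earlier ones
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_, value) in survivors]


lag = consumer_lag({0: 100, 1: 50}, {0: 90, 1: 50})
compacted = compact([("a", 1), ("b", 2), ("a", 3)])
```

Here partition 0 is 10 messages behind and partition 1 is caught up, and compaction drops the older value for key `a` while keeping the newest one in its original position.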

🧠 Summary

Kafka terminology maps well to distributed messaging, pub/sub, and streaming models. Most terms have equivalents in other systems, but Kafka’s architecture is log-based, which sets it apart from traditional queuing systems like RabbitMQ.
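
The difference between a log and a traditional queue can be sketched in a few lines. This is an illustrative toy, not either system's API: the `poll` helper and the group names are made up for the example.

```python
from collections import deque

# Queue semantics: consuming a message removes it.
queue = deque(["a", "b", "c"])
queue_seen = [queue.popleft() for _ in range(3)]

# Log semantics: records stay put; each consumer group tracks its own offset.
log = ["a", "b", "c"]
offsets = {"analytics": 0, "billing": 0}


def poll(group, max_records=10):
    """Return the next batch for a group and advance only that group's offset."""
    start = offsets[group]
    batch = log[start:start + max_records]
    offsets[group] = start + len(batch)
    return batch


analytics_seen = poll("analytics")
billing_seen = poll("billing")

offsets["billing"] = 0          # rewind: replay from the beginning
billing_replay = poll("billing")
```

Both groups read the full log independently, and rewinding an offset replays the same records, which a drained queue cannot do.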

🔄 Kafka vs RabbitMQ – Messaging System Comparison

Apache Kafka and RabbitMQ are both widely used messaging systems, but they serve different use cases and follow different paradigms.

🧱 Overview

Feature	Apache Kafka	RabbitMQ
Type	Distributed Event Streaming Platform	General-purpose Message Broker (Queue-based)
Message Model	Pub/Sub with persistent logs	Queue-based (Push + Pub/Sub optional)
Use Case	High-throughput event streaming	Task queuing, transactional messaging
Delivery Model	Pull-based (consumers pull messages)	Push-based (broker pushes to consumers)
Protocol	Custom binary protocol over TCP	AMQP (default), MQTT, STOMP, HTTP
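
The pull versus push distinction in the table can be sketched generically. Neither class below mirrors a real broker: `PushBroker` and `PullBroker` are hypothetical, and the push model shown delivers to every subscriber, which is simpler than RabbitMQ's exchange and queue routing.

```python
class PushBroker:
    """Broker-driven delivery: each published message is handed to callbacks."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        for callback in self.subscribers:
            callback(message)  # the consumer does not choose when this runs


class PullBroker:
    """Consumer-driven delivery: messages wait in the log until requested."""

    def __init__(self):
        self.log = []

    def publish(self, message):
        self.log.append(message)

    def poll(self, offset, max_records):
        return self.log[offset:offset + max_records]


pushed = []
push = PushBroker()
push.subscribe(pushed.append)
for i in range(5):
    push.publish(i)

pull = PullBroker()
for i in range(5):
    pull.publish(i)
first_batch = pull.poll(offset=0, max_records=2)
second_batch = pull.poll(offset=2, max_records=2)
```

With push, all five messages arrive as soon as they are published. With pull, the consumer decides the batch size and timing, so a slow consumer simply falls behind instead of being overwhelmed.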

⚙️ Architecture & Durability

Feature	Kafka	RabbitMQ
Persistence	Durable log storage on disk	Durable queues/messages (configurable)
Message Ordering	Ordered within a partition	Ordered within a single queue; no ordering across queues
Retention	Time-based or size-based (log-style)	Deleted after acknowledgment (default)
Scalability	Horizontally scalable (partitions)	Vertical or cluster-based (less seamless)
Backpressure Handling	Consumers pull at their own pace; unread data stays in the log	Flow control can block publishers; queue length limits can drop or reject messages
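
Time-based and size-based retention can be sketched as a filter over log segments. The function below is a simplified illustration with hypothetical names. In Kafka the corresponding topic settings are `retention.ms` and `retention.bytes`; the broker's exact deletion rules differ in detail (for example, the active segment is never deleted).

```python
def apply_retention(segments, now_ms, retention_ms=None, retention_bytes=None):
    """Return the segments that survive retention.

    segments: list of (last_timestamp_ms, size_bytes) tuples, oldest first.
    """
    kept = list(segments)
    if retention_ms is not None:
        # Time-based: drop segments whose newest record is older than the limit.
        kept = [s for s in kept if now_ms - s[0] <= retention_ms]
    if retention_bytes is not None:
        # Size-based: drop the oldest segments until the total fits the limit.
        while kept and sum(size for _, size in kept) > retention_bytes:
            kept.pop(0)
    return kept


segments = [(1000, 100), (2000, 100), (3000, 100)]
by_time = apply_retention(segments, now_ms=3500, retention_ms=2000)
by_size = apply_retention(segments, now_ms=3500, retention_bytes=150)
```

Retention is independent of consumption: a segment is removed because it is old or the log is too large, not because a consumer has read it.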

🚀 Performance & Throughput

Feature	Kafka	RabbitMQ
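Throughput	Very high (millions of msg/sec on a well-sized cluster)	Moderate (generally lower than Kafka; depends on workload, queue type, and durability settings)
Latency	Low (typically milliseconds); batching trades some latency for throughput	Very low at moderate load (good for transactional jobs)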
Efficiency	High for large event volumes	Better for small, fast messages

🔐 Security & Reliability

Feature	Kafka	RabbitMQ
Security	TLS, SASL, ACLs	TLS, OAuth2, fine-grained policies
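High Availability	Partition replication across brokers (leader/follower)	Quorum queues (Raft-based replication); classic mirrored queues are deprecated
Reliability	At-least-once by default; at-most-once and exactly-once (idempotent producer, transactions) configurable	At-most-once or at-least-once (consumer acks, publisher confirms); no built-in exactly-once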
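
The difference between at-most-once and at-least-once delivery comes down to whether the consumer commits its position before or after processing. The simulation below is a hypothetical sketch (the `run_consumer` function and its parameters are made up) that injects one crash and then restarts from the last committed offset.

```python
def run_consumer(log, commit_before_processing, crash_at_offset):
    """Consume a log with one simulated crash, then restart from the commit.

    Returns the records whose processing completed, in order.
    """
    committed = 0
    processed = []

    def consume(start, crash_at):
        nonlocal committed
        for offset in range(start, len(log)):
            if commit_before_processing:
                committed = offset + 1
                if offset == crash_at:
                    return  # crashed after commit, before processing: lost
                processed.append(log[offset])
            else:
                processed.append(log[offset])
                if offset == crash_at:
                    return  # crashed after processing, before commit: redone
                committed = offset + 1

    consume(0, crash_at_offset)
    consume(committed, None)  # restart from the last committed offset
    return processed


events = ["a", "b", "c"]
at_most_once = run_consumer(events, commit_before_processing=True, crash_at_offset=1)
at_least_once = run_consumer(events, commit_before_processing=False, crash_at_offset=1)
```

Committing first loses record `b`, while committing last processes it twice. At-least-once consumers therefore usually need idempotent processing or deduplication.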

🧰 Tooling & Ecosystem

Feature	Kafka	RabbitMQ
Tooling	Kafka Streams, Kafka Connect, ksqlDB	Shovel, Federation, Management UI
Monitoring	Prometheus, Confluent tools, Grafana	RabbitMQ Management Plugin, Prometheus
Client Libraries	Java, Python, Go, etc. (native protocol)	Java, Python, Go, Node, etc. (via AMQP)

✅ Pros & Cons

Apache Kafka

✅ Pros:

High-throughput event streaming
Durable log storage and reprocessing
Scalable and partitioned architecture

❌ Cons:

More complex to operate and manage
Requires careful tuning and tooling

RabbitMQ

✅ Pros:

Easy to set up and operate
Flexible routing with exchanges
Supports multiple protocols and QoS

❌ Cons:

Limited for big data/streaming use cases
Harder to scale horizontally

🤔 When to Use What?

Scenario	Recommended Tool
Event-driven, high-volume data pipelines	✅ Kafka
Real-time user actions, IoT, logs, analytics	✅ Kafka
Task/work queueing (e.g., email, image processing)	✅ RabbitMQ
Transactional or low-latency message handling	✅ RabbitMQ
You want log-style reprocessing & replay	✅ Kafka
You need flexible routing & multi-protocol	✅ RabbitMQ