Kafka

Kafka and its usage:

  • Kafka is an open-source distributed event streaming platform.
  • It provides durability and reliability: events are retained for as long as you configure (time-based, size-based, or indefinite retention).
  • Used for real-time data pipelines, stream processing, messaging, website activity tracking, and metrics and logging (aggregating and processing logs and metrics from distributed systems).


  • What is event streaming?

    • Event streaming is capturing data in real time from event sources such as databases, sensors, applications, and mobile devices.

  • How does Kafka work?

    • It consists of servers and clients that communicate over a TCP-based network protocol.

    • Servers: These are distributed and fall into two categories:

      • Broker Server:
        • Receives messages from producers and serves them to consumers.
        • Stores events/messages on disk and organizes them into topics and partitions, which distribute data to consumers.
        • One broker is usually elected as the controller, which oversees partition assignments and handles broker failures.
        • Brokers collectively form the storage layer of Kafka.
      • Kafka Connect Servers:
        • Integrate with external systems and other Kafka clusters.
        • They move data into and out of Kafka.
        • Kafka Connect can ingest entire databases or collect metrics from all your application servers into Kafka topics.
    • Clients:

      • These are external applications that connect to the Kafka cluster.
      • Producers and consumers are the clients. A minimal example of both is sketched below.
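
To make the client side concrete, here is a minimal sketch using the official Java kafka-clients library; the broker address, topic name, and group id are illustrative assumptions:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaClientSketch {
    public static void main(String[] args) {
        // Producer: connects to the cluster over TCP and writes to a topic.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("events", "key-1", "hello kafka"));
        }

        // Consumer: joins a consumer group and polls the same topic.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "demo-group"); // assumed group id
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```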

What are the core components of Kafka?

  • Publisher: A producer that writes data to a Kafka topic.
  • Subscriber: A consumer that reads data from a Kafka topic.
  • Topic: A named stream to which data is published and from which it is consumed.
  • Broker: A Kafka server that stores data and serves client requests.
  • Cluster: A group of Kafka brokers working together.
  • Offset: Each record within a partition has a unique ID called an offset, which helps consumers keep track of their position.
  • Partition: Topics are split into partitions, allowing Kafka to scale horizontally. Kafka topics and partitions are logical separations of data.
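
As a sketch of how these components fit together, the Java AdminClient below creates a topic with several partitions and a replication factor; the topic name, partition count, and broker address are assumptions for illustration:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions allow horizontal scaling across consumers;
            // replication factor 3 keeps a copy of each partition on three brokers.
            NewTopic topic = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(List.of(topic)).all().get(); // block until created
        }
    }
}
```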

Does Kafka maintain the ordering of events?

  • A partition key is used to avoid race conditions and maintain event ordering. For example, customer_id or quote_id can be used as the partition key; this ensures that data for the same customer or quote always goes to the same partition, which preserves processing order (see the sketch below).
  • When no key is provided, Kafka distributes messages evenly across partitions (round-robin in older clients, sticky partitioning in newer ones), but this does not guarantee ordering across partitions.
  • By default, a newly created topic is assigned one partition.
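
A minimal sketch of keyed production (topic name, key, and broker address are illustrative): records that share a key always hash to the same partition, so per-key ordering is preserved:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String customerId = "customer-42"; // partition key: same key -> same partition
            producer.send(new ProducerRecord<>("quotes", customerId, "quote created"));
            producer.send(new ProducerRecord<>("quotes", customerId, "quote updated"));
            // Both events land on the same partition, so a consumer sees them in order.
        }
    }
}
```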

  • Does Kafka support custom joins?

    • Kafka supports joins and other stream-processing operations, primarily through the Kafka Streams API (inner, left, outer, and foreign-key joins). Joins can also be implemented manually for custom use cases such as data enrichment, event correlation, and building real-time analytics or materialized views within distributed systems.
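
A hedged Kafka Streams sketch of a stream-table inner join, assuming an orders stream and a customers table both keyed by customer_id (all topic names are illustrative):

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class StreamJoinSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "join-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");     // keyed by customer_id
        KTable<String, String> customers = builder.table("customers"); // keyed by customer_id

        // Inner stream-table join: each order event is enriched with the customer record.
        orders.join(customers, (order, customer) -> order + " / " + customer)
              .to("orders-enriched");

        new KafkaStreams(builder.build(), props).start();
    }
}
```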

What is a Kafka cluster and how does it work?

  • A Kafka cluster is a group of interconnected Kafka brokers (servers) that work together to manage, store, and process streams of data in a distributed, scalable, and fault-tolerant way.
  • Single Kafka Cluster: All brokers, topics, and partitions are managed centrally within one cluster.
  • Multi-Cluster Architectures: Multiple clusters can be used for workload isolation, geo-redundancy, or disaster recovery (active-active or active-passive clusters).
  • It provides scalability, fault tolerance, and high throughput.
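
As an illustration, the Java AdminClient can inspect a running cluster and list its brokers and controller; the bootstrap address is an assumption:

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.apache.kafka.common.Node;

public class DescribeClusterSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed entry point into the cluster

        try (AdminClient admin = AdminClient.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            System.out.println("Cluster id: " + cluster.clusterId().get());
            System.out.println("Controller: " + cluster.controller().get());
            for (Node broker : cluster.nodes().get()) {
                System.out.printf("Broker %d at %s:%d%n", broker.id(), broker.host(), broker.port());
            }
        }
    }
}
```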

How does Kafka provide high availability?

  • It provides high availability through replication, leader election, and its distributed architecture.
  • Each partition in Kafka can be replicated across multiple brokers (servers) in the cluster. The replication factor determines how many copies of each partition exist (commonly set to 3).
  • One broker acts as the leader for each partition, while others are followers that replicate the leader’s data. If a broker fails, another broker with a replica can take over as leader, ensuring no data loss and high availability.
  • When a topic is created, Kafka’s controller assigns each partition a list of brokers that will hold its replicas. The first broker in the replica list is designated as the "preferred leader" for that partition.
  • Replication is asynchronous: followers continually fetch new data from the leader.
  • Producers can control durability via the acks setting:
    • acks=0: No acknowledgment; lowest durability.
    • acks=1: Leader broker acknowledges; moderate durability.
    • acks=all: All in-sync replicas acknowledge; highest durability.
  • When consumers read data, they commit their offsets; these commits are written to an internal Kafka topic, which is itself replicated to follower brokers.
  • Kafka maintains a set of in-sync replicas (ISR) for each partition; these are brokers that have fully caught up with the leader's data. Only replicas in the ISR are eligible to become the new leader.
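
A sketch of producer settings tuned for durability, combining acks=all with idempotence; the topic name and broker address are illustrative:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DurableProducerSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.ACKS_CONFIG, "all");              // wait for all in-sync replicas
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true); // avoid duplicates on retry
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The broker acknowledges only after all in-sync replicas have the record;
            // get() blocks until that acknowledgment arrives.
            producer.send(new ProducerRecord<>("payments", "key-1", "charged")).get();
        }
    }
}
```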



What is the controller and how does Kafka select it?

  • The Kafka controller is a special role assigned to one broker in the cluster. It coordinates with the other brokers and manages the cluster.
  • Kafka uses ZooKeeper (or KRaft in newer versions) for discovery, cluster coordination, metadata management, and leader election.
  • In ZooKeeper-based clusters, the election relies on default ZooKeeper behavior: each broker tries to create an ephemeral /controller node, and the first to succeed becomes the controller.
  • The controller elects partition leaders, handles replication, manages brokers, performs administrative tasks, and maintains cluster metadata (cluster health, etc.).
  • The controller broker also performs regular broker duties (handling partitions, producers, and consumers) alongside managing the cluster's metadata and coordination tasks.
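
To illustrate the election pattern (this is not Kafka's internal code), the sketch below uses the raw ZooKeeper client: each broker races to create an ephemeral /controller node, and only one create succeeds. The address and broker id are assumptions:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ControllerElectionSketch {
    public static void main(String[] args) throws Exception {
        // Assumed ZooKeeper address; the watcher callback is left empty for brevity.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> { });
        String brokerId = "1"; // illustrative broker id

        try {
            // Ephemeral: the node disappears if this broker's session dies,
            // which is what triggers a re-election in the real cluster.
            zk.create("/controller", brokerId.getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("Broker " + brokerId + " is now the controller");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("Another broker is already the controller");
        } finally {
            zk.close();
        }
    }
}
```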