kafka fail‐over - unix1998/technical_notes GitHub Wiki

Kafka is designed to handle failover and maintain high availability. Let's go over the concepts you've mentioned and clarify how Kafka manages data and failover:

1. Brokers and Partitions:

Brokers: Kafka brokers are the servers that store data and serve client requests. Each broker can host multiple partitions.
Partitions: A topic is divided into partitions, which are the basic unit of parallelism in Kafka. Each partition is an ordered, immutable sequence of records.

2. Replication:

Kafka provides high availability and fault tolerance through replication. Each partition of a topic can have multiple replicas, which are distributed across different brokers. One replica is the leader, and the others are followers.

3. Leader and Followers:

Leader: Handles all read and write requests for the partition.
Followers: Replicate the leader's data and can take over if the leader fails.

4. Failover Mechanism:

When a broker fails, Kafka's failover mechanism ensures that the system remains operational:

If a broker that hosts a leader partition goes down, one of the in-sync replicas (followers) on another broker is automatically promoted to be the new leader.
Clients will now interact with the new leader, and the system continues to function without data loss or downtime.

5. Replication Factor:

The replication factor (RF) determines the number of copies of data Kafka maintains. For example, with an RF of 3, each partition will have three copies distributed across three brokers. This ensures that even if one broker fails, there are still two copies of the data available.

Example Scenario:

If you have a topic with a replication factor of 3 and three brokers, the partitions for that topic are distributed across all brokers, with each broker holding both leader and follower replicas for different partitions. In case of a broker failure:

The data stored on the failed broker is still available through the replicas on the other brokers.
Kafka will elect a new leader for any partitions that had their leader on the failed broker, and clients will be directed to the new leader.

So, even if a broker fails, as long as the replication factor is greater than 1, there will be no data loss, and Kafka can continue to operate.