Kafka - fault tolerance

Create a topic with only one partition.

  1. Spin up the cluster: https://github.com/explorer436/programming-playground/tree/main/docker%20compose%20files/kafka/03-kafka-cluster-setup
  2. On any node of the cluster, create a topic with one partition and a replication factor of 3, then describe it.
    root@49adaa34c939:/learning# kafka-topics.sh --bootstrap-server localhost:9092 --create --replication-factor 3 --partitions 1 --topic first_kafka_topic
    WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
    Created topic first_kafka_topic.
    root@49adaa34c939:/learning# kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic first_kafka_topic
    Topic: first_kafka_topic	TopicId: Sm09WOyLTa-sU2q6BkE-Sg	PartitionCount: 1	ReplicationFactor: 3	Configs:
        Topic: first_kafka_topic	Partition: 0	Leader: 1	Replicas: 1,2,3	Isr: 1,2,3
  1. Start the consumer and note the partition leader in its logs (currentLeader shows the broker with id 1).
    15:10:12.482 [reactive-kafka-demo-group-1] INFO  o.a.k.c.c.i.SubscriptionState - [Consumer instanceId=1, clientId=consumer-demo-group-1, groupId=demo-group] Resetting offset for partition first_kafka_topic-0 to position FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[localhost:8081 (id: 1 rack: null)], epoch=0}}.
    
  2. Start the producer. The consumer starts picking up messages from the topic in real time as the producer keeps writing to it.
  3. Now kill the Docker container for broker 1, because it is the current leader of the partition.
  4. Notice the producer logs
    15:14:40.154 [kafka-producer-network-thread | producer-1] INFO  k.e.r.k.p.f.MyKafkaProducer - correlation id: 403
    15:14:40.201 [kafka-producer-network-thread | producer-1] INFO  o.apache.kafka.clients.NetworkClient - [Producer clientId=producer-1] Node -1 disconnected.
    15:14:40.204 [kafka-producer-network-thread | producer-1] INFO  o.apache.kafka.clients.NetworkClient - [Producer clientId=producer-1] Node 1 disconnected.
    15:14:40.204 [kafka-producer-network-thread | producer-1] INFO  o.apache.kafka.clients.NetworkClient - [Producer clientId=producer-1] Cancelled in-flight METADATA request with correlation id 406 due to node 1 being disconnected (elapsed time since creation: 1ms, elapsed time since send: 1ms, request timeout: 30000ms)
    15:14:40.204 [kafka-producer-network-thread | producer-1] INFO  k.e.r.k.p.f.MyKafkaProducer - correlation id: 404
    
    Even though the leader broker was killed, the producer does not stop producing: after the disconnect it refreshes its metadata and keeps sending to the newly elected leader.
  5. Notice the consumer logs
    15:14:40.257 [reactive-kafka-demo-group-1] INFO  o.apache.kafka.clients.NetworkClient - [Consumer instanceId=1, clientId=consumer-demo-group-1, groupId=demo-group] Node -1 disconnected.
    
    Even though the leader broker was killed, the consumer does not stop consuming: it fetches from the newly elected leader once its metadata is refreshed.
  6. Describe the topic again
    root@d8099c1dcb40:/learning# kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic first_kafka_topic
    Topic: first_kafka_topic	TopicId: Sm09WOyLTa-sU2q6BkE-Sg	PartitionCount: 1	ReplicationFactor: 3	Configs:
        Topic: first_kafka_topic	Partition: 0	Leader: 2	Replicas: 1,2,3	Isr: 2,3
    
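    The leader changed from broker 1 to broker 2. The replica assignment is still 1,2,3, but the in-sync replica set (Isr) shrank from 1,2,3 to 2,3 because broker 1 is down.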
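
To compare the two describe outputs above without reading them by eye, a small parser can help. This is only a sketch: the function names parse_partition_line and is_under_replicated are made up for this note, and it assumes the partition row has the Key: value layout shown above (other Kafka versions may print extra fields).

```python
import re


def parse_partition_line(line):
    """Turn one partition row of the describe output into a dict."""
    # Each field looks like "Key: value"; fields are separated by tabs or spaces.
    fields = dict(re.findall(r"(\w+):\s*(\S+)", line))

    def broker_ids(csv):
        return [int(b) for b in csv.split(",") if b]

    leader = fields["Leader"]
    return {
        "partition": int(fields["Partition"]),
        # Assumption: any non-numeric leader value means "no leader".
        "leader": int(leader) if leader.isdigit() else None,
        "replicas": broker_ids(fields["Replicas"]),
        "isr": broker_ids(fields["Isr"]),
    }


def is_under_replicated(row):
    """True when some assigned replica is missing from the ISR."""
    return set(row["isr"]) != set(row["replicas"])


# The two partition rows copied from the describe outputs in this note.
before = parse_partition_line(
    "Topic: first_kafka_topic\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3"
)
after = parse_partition_line(
    "Topic: first_kafka_topic\tPartition: 0\tLeader: 2\tReplicas: 1,2,3\tIsr: 2,3"
)

print("leader:", before["leader"], "->", after["leader"])
print("replica assignment unchanged:", before["replicas"] == after["replicas"])
print("under-replicated after the kill:", is_under_replicated(after))
```

Comparing the replicas and isr lists separately makes it clear which of the two actually moved when broker 1 went down.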

There are more failure combinations that we can try from here.

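  1. Kill broker 2
    1. The takeaway is that even with two of the three brokers down, the producer and the consumer keep working, because the last in-sync replica (broker 3) takes over as leader. This assumes min.insync.replicas is 1 (the default); with acks=all and a higher value, the producer's writes would be rejected.
  2. Kill broker 3
    1. If we kill the last standing broker in the cluster, the partition has no leader, so neither producing nor consuming is possible until a broker comes back. This is the scenario that needs to be addressed with priority.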
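
The kill sequence above can be walked through with a toy model. This is a simplified sketch, not Kafka's controller code: the Partition class and its methods are hypothetical names, it assumes the new leader is the first broker in the replica assignment that is still in the ISR, and it ignores details such as brokers rejoining or how the real controller records the ISR when the last replica dies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Partition:
    replicas: List[int]      # assigned replicas, in assignment order
    isr: List[int]           # replicas currently in sync
    leader: Optional[int]    # None means the partition is offline

    def kill(self, broker: int) -> None:
        """Remove a broker from the ISR and elect a new leader if needed."""
        self.isr = [b for b in self.isr if b != broker]
        if self.leader == broker:
            # Assumed rule: first assigned replica that is still in the ISR.
            self.leader = next((b for b in self.replicas if b in self.isr), None)

    def accepts_writes(self, min_insync_replicas: int = 1) -> bool:
        """Whether an acks=all write could succeed under the given setting."""
        return self.leader is not None and len(self.isr) >= min_insync_replicas


partition = Partition(replicas=[1, 2, 3], isr=[1, 2, 3], leader=1)
for broker in (1, 2, 3):
    partition.kill(broker)
    print(
        f"killed {broker}: leader={partition.leader} isr={partition.isr} "
        f"writable(min.isr=1)={partition.accepts_writes(1)} "
        f"writable(min.isr=2)={partition.accepts_writes(2)}"
    )
```

In this model the partition stays writable through the first two kills only when min.insync.replicas is 1; with a value of 2 it stops accepting acks=all writes as soon as a single broker is left.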
