Kafka - fault tolerance
Create a topic with only one partition.
- Spin up the cluster: https://github.com/explorer436/programming-playground/tree/main/docker%20compose%20files/kafka/03-kafka-cluster-setup
- On any one of the nodes in the cluster, create a topic with only one partition and a replication factor of 3.
root@49adaa34c939:/learning# kafka-topics.sh --bootstrap-server localhost:9092 --create --replication-factor 3 --partitions 1 --topic first_kafka_topic
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic first_kafka_topic.
root@49adaa34c939:/learning# kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic first_kafka_topic
Topic: first_kafka_topic TopicId: Sm09WOyLTa-sU2q6BkE-Sg PartitionCount: 1 ReplicationFactor: 3 Configs:
Topic: first_kafka_topic Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
root@49adaa34c939:/learning#
- Start the Consumer and notice the leader in the Consumer logs.
15:10:12.482 [reactive-kafka-demo-group-1] INFO o.a.k.c.c.i.SubscriptionState - [Consumer instanceId=1, clientId=consumer-demo-group-1, groupId=demo-group] Resetting offset for partition first_kafka_topic-0 to position FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[localhost:8081 (id: 1 rack: null)], epoch=0}}.
- Start the Producer. The Consumer starts picking up messages from the topic in real time as the Producer keeps writing to it.
- Now kill the Docker container running broker 1, because it is the current leader of the partition.
- Notice the producer logs
Even though the Leader broker is killed, the producer will not stop producing.
15:14:40.154 [kafka-producer-network-thread | producer-1] INFO k.e.r.k.p.f.MyKafkaProducer - correlation id: 403
15:14:40.201 [kafka-producer-network-thread | producer-1] INFO o.apache.kafka.clients.NetworkClient - [Producer clientId=producer-1] Node -1 disconnected.
15:14:40.204 [kafka-producer-network-thread | producer-1] INFO o.apache.kafka.clients.NetworkClient - [Producer clientId=producer-1] Node 1 disconnected.
15:14:40.204 [kafka-producer-network-thread | producer-1] INFO o.apache.kafka.clients.NetworkClient - [Producer clientId=producer-1] Cancelled in-flight METADATA request with correlation id 406 due to node 1 being disconnected (elapsed time since creation: 1ms, elapsed time since send: 1ms, request timeout: 30000ms)
15:14:40.204 [kafka-producer-network-thread | producer-1] INFO k.e.r.k.p.f.MyKafkaProducer - correlation id: 404
- Notice the consumer logs
Even though the Leader broker is killed, the consumer will not stop consuming.
15:14:40.257 [reactive-kafka-demo-group-1] INFO o.apache.kafka.clients.NetworkClient - [Consumer instanceId=1, clientId=consumer-demo-group-1, groupId=demo-group] Node -1 disconnected.
- Describe the topic again
The Leader has changed from 1 to 2. The replica assignment is still 1,2,3, but broker 1 has dropped out of the in-sync replica set (Isr: 2,3).
root@d8099c1dcb40:/learning# kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic first_kafka_topic
Topic: first_kafka_topic TopicId: Sm09WOyLTa-sU2q6BkE-Sg PartitionCount: 1 ReplicationFactor: 3 Configs:
Topic: first_kafka_topic Partition: 0 Leader: 2 Replicas: 1,2,3 Isr: 2,3
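To compare the two describe outputs without reading them by eye, the partition line can be parsed with a few lines of Python. This is a minimal sketch, not part of the Kafka tooling: the helper names `parse_partition_line` and `out_of_sync` are made up for illustration, and the parser assumes the exact line layout shown in these notes (numeric leader id, non-empty `Isr`).

```python
import re

def parse_partition_line(line):
    """Parse one 'Topic: ... Partition: ...' line from `kafka-topics.sh --describe`."""
    # Collect every "Key: value" pair on the line into a dict.
    fields = dict(re.findall(r"(\w+):\s*(\S+)", line))

    def to_ids(value):
        # "1,2,3" -> [1, 2, 3]
        return [int(x) for x in value.split(",") if x]

    return {
        "topic": fields["Topic"],
        "partition": int(fields["Partition"]),
        "leader": int(fields["Leader"]),  # assumes a numeric broker id
        "replicas": to_ids(fields["Replicas"]),
        "isr": to_ids(fields["Isr"]),
    }

def out_of_sync(partition):
    """Assigned replicas that are currently missing from the ISR."""
    return [b for b in partition["replicas"] if b not in partition["isr"]]

# The two partition lines captured before and after killing broker 1.
before = parse_partition_line(
    "Topic: first_kafka_topic Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3"
)
after = parse_partition_line(
    "Topic: first_kafka_topic Partition: 0 Leader: 2 Replicas: 1,2,3 Isr: 2,3"
)

print("leader moved:", before["leader"], "->", after["leader"])
print("out of sync:", out_of_sync(after))
```

A replica that appears in `Replicas` but not in `Isr` is still assigned to the partition; it is just not caught up (here, because its broker is down).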
There are many combinations that we can try here.
- Kill broker2
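- The takeaway is that, even with two of the three brokers down, the producer and the consumer keep working, because the one remaining in-sync replica takes over as leader. (For a producer using acks=all, this requires min.insync.replicas to be 1, which is the Kafka default.)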
- Kill broker3
- If we kill the last standing broker in the cluster, the partition has no leader left: the producer cannot write and the consumer cannot read until a broker comes back. This is a scenario that needs to be addressed with priority.
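The kill sequence above can be reasoned about with a small toy model. The sketch below is only an illustration under simplifying assumptions: the new leader is taken to be the first assigned replica that is both alive and in sync, and a write with acks=all is taken to succeed when a leader exists and the ISR size is at least `min_insync_replicas`. The function names and the `state` dictionary are hypothetical, and a real broker tracks more than this (for example, how the ISR is recorded once no replica is left).

```python
def elect_leader(replicas, isr, live):
    """Return the first assigned replica that is alive and in sync, else None."""
    for broker in replicas:
        if broker in live and broker in isr:
            return broker
    return None

def kill_broker(state, broker):
    """Return the (simplified) partition state after `broker` goes down."""
    live = state["live"] - {broker}
    isr = [b for b in state["isr"] if b in live]
    leader = state["leader"]
    if leader not in live:
        # The leader died, so pick a replacement from the remaining ISR.
        leader = elect_leader(state["replicas"], isr, live)
    return {"replicas": state["replicas"], "isr": isr, "live": live, "leader": leader}

def accepts_acks_all(state, min_insync_replicas=1):
    """Toy check: can a producer using acks=all still write?"""
    return state["leader"] is not None and len(state["isr"]) >= min_insync_replicas

# Same starting point as the topic in these notes: Replicas 1,2,3, Leader 1.
state = {"replicas": [1, 2, 3], "isr": [1, 2, 3], "live": {1, 2, 3}, "leader": 1}
for broker in (1, 2, 3):
    state = kill_broker(state, broker)
    print(
        f"killed {broker}: leader={state['leader']} "
        f"isr={state['isr']} writable={accepts_acks_all(state)}"
    )
```

Calling `accepts_acks_all(state, min_insync_replicas=2)` after the second kill returns False in this model, which is why the "two of three brokers down" result depends on that setting.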