Kafka

Key features

  1. Distributed event streaming platform
  2. Features
    1. High availability
    2. Horizontally scalable
    3. Ingest large volume of data
    4. High throughput
    5. Low latency
    6. Fault tolerance

Kafka as an application

Like a database, Kafka is a stateful application. It needs a directory structure, a place to store messages, and so on. This is where the key-value pairs in the property files come in: they tell Kafka where everything is supposed to live (inside the Docker container, if you are running Kafka in one).
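
As an illustrative sketch, a minimal server.properties might contain entries like the following (the property names are real Kafka broker settings; the values are made up):

```properties
# Unique id of this broker within the cluster
broker.id=0
# Directory where Kafka stores its log segments (the messages) on disk
log.dirs=/var/lib/kafka/data
# How clients reach this broker
listeners=PLAINTEXT://0.0.0.0:9092
# Where to find ZooKeeper (for pre-KRaft setups)
zookeeper.connect=zookeeper:2181
```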

Why Kafka?

Fast, resilient and scalable

With Apache Kafka, streaming data is organized into Kafka topics. Kafka offers the same high throughput and high performance as message queues, but with different functionality.

When to use Kafka?

Large amounts of streaming data that require scaling and high throughput

Kafka and scaling

Easy horizontal scaling thanks to built-in partitioning
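
A rough sketch of why partitioning enables horizontal scaling: records with the same key always hash to the same partition, so partitions (and their traffic) can be spread across brokers and consumers while per-key ordering is preserved. Kafka's real default partitioner uses murmur2 on the key; the hash below is only a stand-in.

```python
import hashlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    # Stand-in for Kafka's default partitioner (murmur2(key) % partitions).
    # The property that matters is determinism: the same key always lands
    # in the same partition, which is what preserves per-key ordering.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Same key -> same partition, every time:
p1 = choose_partition(b"order-42", 6)
p2 = choose_partition(b"order-42", 6)
assert p1 == p2 and 0 <= p1 < 6
```

Note that changing the partition count remaps keys, which is one reason partition counts are usually chosen up front.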

Replication

Kafka - replication

Fault tolerance

Kafka - fault tolerance

High level definitions

Kafka

Basically an event streaming platform. It enables users to collect, store, and process data to build real-time event-driven applications. It is written in Java and Scala, but you don’t have to know these to work with Kafka. There’s also a Python API.

  1. Open-source distributed event streaming platform
  2. Can be used for capturing any events in real-time and storing for later retrieval
  3. Can be used for capturing any events in real-time and processing the events in real-time
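
To make "capture events now, retrieve or process them later" concrete, here is a toy in-memory model of Kafka's append-only log. This is not the Kafka API, just a sketch of the storage idea: events get increasing offsets and can be replayed from any point.

```python
import json

class MiniLog:
    """Toy append-only log illustrating Kafka's core storage model:
    events are appended with increasing offsets and re-read later."""

    def __init__(self):
        self._records = []

    def append(self, event: dict) -> int:
        # Each appended event gets the next offset in the log.
        offset = len(self._records)
        self._records.append(json.dumps(event))
        return offset

    def read_from(self, offset: int):
        # A "consumer" can replay the log from any offset.
        for i in range(offset, len(self._records)):
            yield i, json.loads(self._records[i])

log = MiniLog()
log.append({"type": "page_view", "user": "alice"})
log.append({"type": "click", "user": "bob"})
events = list(log.read_from(0))
```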

Event

Event driven architecture - Event

Kafka streams

  1. A library for building streaming applications
  2. Input and output data are stored in Kafka
  3. Compute aggregations or join streams
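
Kafka Streams itself is a Java/Scala library, but the aggregation idea can be sketched in Python: consume keyed records from an input stream and maintain a running count per key, akin to a groupByKey().count() in the Streams DSL.

```python
from collections import defaultdict

def aggregate_counts(records):
    # Consume (key, value) records and keep a running count per key,
    # the way a Kafka Streams count() aggregation maintains a state store.
    counts = defaultdict(int)
    for key, _value in records:
        counts[key] += 1
    return dict(counts)

stream = [("user-1", "click"), ("user-2", "click"), ("user-1", "view")]
result = aggregate_counts(stream)  # {'user-1': 2, 'user-2': 1}
```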

Default producer and consumer behavior with leaders

  1. Kafka producers can only write to the leader broker for a partition.
  2. Kafka consumers by default will read from the leader broker for a partition.
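
A sketch of what this means in practice, using hypothetical cluster metadata (the broker names and layout are made up for illustration):

```python
# Hypothetical metadata: each partition has one leader and a replica list.
metadata = {
    0: {"leader": "broker-1", "replicas": ["broker-1", "broker-2", "broker-3"]},
    1: {"leader": "broker-2", "replicas": ["broker-2", "broker-3", "broker-1"]},
}

def broker_for_produce(partition):
    # Producers can only write to the partition's leader.
    return metadata[partition]["leader"]

def broker_for_consume(partition):
    # By default, consumers read from the leader as well.
    return metadata[partition]["leader"]

assert broker_for_produce(1) == broker_for_consume(1) == "broker-2"
```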

Kafka consumers - replica fetching (newer Kafka versions)

  1. Since Kafka 2.4, it is possible to configure consumers to read from the closest replica.
  2. This may help improve latency, and also decrease network costs if using the cloud.
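
This is KIP-392 ("fetch from followers"). A sketch of the configuration involved: the broker declares its rack and a rack-aware replica selector, and the consumer declares which rack it is in (the rack values are illustrative):

```properties
# Broker side (server.properties)
broker.rack=us-east-1a
replica.selector.class=org.apache.kafka.common.replication.RackAwareReplicaSelector

# Consumer side
client.rack=us-east-1a
```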

Kafka KRaft

  • In 2020, the Apache Kafka project started working to remove the ZooKeeper dependency (KIP-500)
  • ZooKeeper shows scaling issues when Kafka clusters have > 100,000 partitions.
  • By removing ZooKeeper, Apache Kafka can:
    • Scale to millions of partitions, and become easier to maintain and set up
    • Improve stability, and become easier to monitor, support, and administer
    • Use a single security model for the whole system
    • Start as a single process
    • Achieve faster controller shutdown and recovery times
  • Kafka 3.x implements the Raft protocol (KRaft) to replace ZooKeeper
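
In KRaft mode, the controller quorum is configured directly in the broker properties instead of pointing at ZooKeeper. An illustrative combined-mode configuration (property names are real KRaft settings; the values are made up):

```properties
# This node acts as both broker and controller (combined mode)
process.roles=broker,controller
node.id=1
# Voters in the Raft controller quorum, as id@host:port
controller.quorum.voters=1@localhost:9093
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
controller.listener.names=CONTROLLER
```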

Helpful resources

  1. https://www.conduktor.io/kafka - This is very good.
    1. https://www.conduktor.io/kafka/kafka-sdk-list/
  2. https://medium.com/@TimvanBaarsen/apache-kafka-cli-commands-cheat-sheet-a6f06eac01b
  3. https://www.gentlydownthe.stream/ - A cute children’s book explaining Kafka.

Use cases

Kafka use cases: https://kafka.apache.org/powered-by

Questions

  1. What is the relationship between throughput and topics?
  2. Depending on the traffic, how many pods should we set up for Kafka? e.g. 100,000 requests
  3. Kafka consumer topics - what are they?
  4. If there are 10 consumer instances and if there are more messages coming in the topics than the consumer instances can process, what happens?
  5. What is the relationship between the number of partitions and the number of consumers?
  6. What if something goes wrong with the consumer? What will happen to the messages in the partitions?
  7. Let's say we have 10 consumer instances and a hundred messages arrive in the topic. Explain in detail what happens.
  8. How do you write consumers/producers without using spring-cloud-stream? They can be written functionally. How would you write them?
  9. RabbitMQ vs Kafka messaging/streams - what are the differences, and when would we pick one over the other?
  10. If you deliver a message to a Kafka topic, all subscribers of the topic will receive it. But suppose there is a message you want only one specific subscriber to pick up, while the rest do not. How would you implement that?
  11. Queues/topics vs Kafka - what is the difference? What advantages does Kafka have over traditional queues?
  12. What are kafka topics?
  13. What would be a good scenario to use partitions?
  14. How do you determine how many partitions to use?

TODO

Problems with kafka streams : https://dzone.com/articles/problem-with-kafka-streams-1?fromrel=true

Kafka Racing: Know the Circuit : https://dzone.com/articles/kafka-racing-know-the-circuit

  1. What are Kafka containers?
  2. Cloudant - Kafka streams or queues
  3. How do you publish a topic from a Lambda to a Kafka stream, and from a Kafka stream to a Lambda?

Tags

  1. Stream Processing with Apache Kafka
  2. kafka and zookeeper
  3. Event driven architecture - Event
  4. Kafka - CLI and GUI tools
  5. Kafka - Clusters, Controllers and Brokers
  6. Kafka - Consumers and Consumer Groups
  7. Kafka - How are brokers, topics and partitions related
  8. Kafka - Messages
  9. Kafka - PartitionReassignment
  10. Kafka - Producers
  11. Kafka - Serialization and Deserialization
  12. Kafka - Topics, Partitions and Offsets
  13. Kafka - Idempotent Producers and Consumers
  14. Understanding reactor-kafka
  15. Kafka - Integration testing
  16. Kafka - security
  17. Kafka - Transactions
  18. Kafka - Delivery semantics
  19. Kafka - Stream Bridge
  20. Kafka - Fan in and Fan out