Tuesday, December 31, 2024

Apache Kafka for Developers #4: Kafka Producer, Acknowledgements and Idempotency

Producers are clients that publish messages to Kafka topics, distributing them across various partitions. They send data to the broker, which then stores it in the corresponding partition of the topic.

Each message or record that a producer sends includes a Key (optional), Value, Header (optional), and Timestamp.

Message/Record Key

Message keys in Kafka are optional but can be quite beneficial. When a key is provided, Kafka hashes the key to determine the partition. This guarantees that all messages with the same key are directed to the same partition, which is crucial for maintaining order in message processing.

If a Kafka producer does not provide a message key, Kafka distributes the messages evenly across the available partitions using a round-robin algorithm. However, this approach helps balance the load across partitions but does not guarantee order for related messages.

Acknowledgments (acks)

Acknowledgements (acks) in Kafka are a mechanism to ensure that messages are reliably stored in the Kafka topic partition.

Kafka producers only write data to the leader partition in the broker.

Kafka producers must also set the acknowledgment level (acks) to indicate whether a message needs to be written to a minimum number of replicas before it is considered successfully written.

acks=0

The producer sends messages without waiting for any acknowledgment from the broker. While this approach minimizes latency, it also carries a high risk of message loss since the producer doesn't receive confirmation that the message was successfully written.

Message Delivery Semantics

This follows the At Most Once delivery model, where messages are delivered once, and if a failure occurs, they may be lost and not redelivered. This approach is suitable for scenarios where occasional data loss is acceptable and low latency is crucial.

acks=0 // configuration



acks=1 

The producer waits for an acknowledgment from the leader partition only before responding to the client request. This setting provides a balance between latency and reliability. However, the producer does not wait for the all in-sync replicas to be updated with latest data. Therefore, there is a risk of data loss if the leader partition fails before the in-sync replicas are updated.


acks=all

The producer waits for an acknowledgment from the leader Partition and all in-sync partition replicas before responding to the client request. This setting provides highest level of reliability and ensures that the message is replicated across multiple brokers. However, it can increase latency because the producer waits for acknowledgments from multiple brokers.

Message Delivery Semantics

This follows the At Least Once delivery model, where messages are delivered one or more times, and if a failure occurs, messages are not lost but may be delivered more than once. This approach is suitable for scenarios where data loss is unacceptable, and duplicates can be handled.

acks=all, retries=Integer.MAX_VALUE // configuration

Example #1:

  • cluster size: 3
  • replication factor: 3 (including leader)
  • min.insync.replicas: 3 (including leader)

When a producer sends a message to a partition, the leader broker writes the message and waits for acknowledgments from at least two follower replicas (resulting in a total of three acknowledgments: one from the leader and two from followers). Once this condition is met, the message is considered successfully written.

Example #2:
  • cluster size: 3
  • replication factor: 3 (including leader)
  • min.insync.replicas: 2 (including leader)

When a producer sends a message to a partition, the leader broker writes the message and waits for acknowledgments from at least one follower replicas (resulting in a total of two acknowledgments: one from the leader and one from follower). Once this condition is met, the message is considered successfully written.

The widely used configuration is acks=all and min.insync.replicas=2 for ensuring data durability and availability

Idempotency

Idempotency in Kafka producers guarantees that each message is delivered exactly once. This prevents duplicate entries during retries and maintains message ordering.

Without Idempotency

Imagine a scenario where a producer sends a message, but due to a network issue, the acknowledgment from the broker is not received. The producer will retry sending the message. This leads to duplicate message commits.

With Idempotency

Imagine a scenario where a producer sends a message, but due to a network issue, the acknowledgment from the broker is not received. The producer will retry sending the message. Here, the broker will identify the duplicate and discard it, ensuring that only one copy of the message is stored.

Kafka performs the following steps internally to ensure idempotency
  • When idempotency is enabled, each producer is assigned a unique Producer ID (PID)
  • Each message sent by the producer is assigned a monotonically increasing sequence number that is unique to each partition.
  • The broker tracks the highest sequence number it has received from each producer for each partition. If it receives a message with a lower sequence number, it discards it as a duplicate.
Message Delivery Semantics

This follows the Exactly Once delivery model, where messages are exactly once, even in the case of retries. This ensures no duplicates and maintains message order. This approach is suitable for scenarios application requires strict data consistency and no duplicates.

acks=all, enable.idempotence=true // configuration

Monday, December 30, 2024

Apache Kafka for Developers #3: Kafka Topic Replication

Kafka topic replication ensures data durability and high availability by duplicating each partition across multiple brokers in a Kafka cluster.

Kafka follows a leader-follower Replica model in which each partition has single leader and multiple follower replicas. The leader handles all read and write operations, while the followers replicate the data from the leader.

Replication Factor

The replication factor is set at the topic level when the topic is created. It specifies how many copies of each partition will be stored across different brokers in a Kafka cluster

In-Sync Replicas (ISR)

The replicas that have fully synchronized with the leader for a specific partition are known as In-Sync Replicas. This means that all In-Sync Replicas and the leader contain the same data.

Single Broker with Replication Factor

In a Kafka setup with a single broker, the replication factor must be set to 1. This means each partition in Kafka has single leader and zero followers. The replication factor includes the total number of replicas including the leader. There is only one copy of each partition, and no replication occurs. This setup poses a significant risk of data loss and is not advisable for production environments.

 

Multiple Brokers with Replication Factor

In a multi-broker Kafka cluster, the replication factor can be set to a value greater than 1. This means that each partition will have multiple copies distributed across different brokers. It ensures fault tolerance, high availability and data durability.

 

Replication Factor 3 with three brokers and two partitions

Let’s consider a Kafka cluster with 3 brokers and a topic A with 2 partitions (0 and 1). The replication factor is set to 3, meaning each partition will have 3 replicas (including leader).

Kafka Setup
  • Brokers: Kafka cluster consists of 3 brokers, named Broker 1, Broker 2, and Broker 3.
  • Replication Factor: Set to 3, meaning each partition will have 3 replicas (including leader).
  • Topic: topic A with 2 partitions, named Partition 0 and Partition 1.
Partition and Replica Distribution
  • Broker1 contains Leader for Partition 0, Follower for Partition 1
  • Broker2 contains Follower for Partition 0, Leader for Partition 1
  • Broker3 contains Follower for both Partition 0 and Partition 1
Data Flow

When a producer sends data to Topic A with Partition 0, it is first stored into the leader of Partition 0 on Broker 1. The data is then replicated (In-Sync Replica) to the follower partitions on both Broker 2 and Broker 3.

If Broker 1 fails, Kafka will select a new leader for Partition 0 from the in-sync replicas (either Broker 2 or Broker 3). This ensures that the system continues to function properly, maintaining data availability and durability.

Best practices for Kafka replication
  • Starting with a replication factor of three and three brokers in a Kafka cluster is recommended to ensure even data distribution, fault tolerance and can survive the failure of up to two brokers.
  • A good rule of thumb is to have at least as many brokers as the replication factor to ensure even distribution and fault tolerance
  • Avoid setting up too high replication factor which lead to increased resource consumption and network traffic
  • Avoid setting up too low replication factor which can compromise data availability and fault tolerance

Sunday, December 29, 2024

Apache Kafka for Developers #2: Kafka Architecture and Components

Apache Kafka's architecture is designed to handle high-throughput, real-time data streams efficiently and at scale. Here are the key components:




Broker

Kafka operates as a cluster composed of one or more servers, known as brokers. These brokers are responsible for storing data and handling client requests. Each broker has a unique ID and can manage hundreds of thousands of read and write operations per second from thousands of clients.

Topic

A topic is a category where records are stored in the Kafka cluster. Topics are divided into multiple partitions, enabling parallel data processing.

It is like tables in the database.

Partition

A topic divided into multiple partitions for scalability and fault tolerance. Each partition is an ordered, immutable sequence of records that is continually appended to a commit log. Partitions allow Kafka to scale horizontally by distributing data across multiple servers.

Example
  • If a topic has 3 partitions and there are 3 brokers, each broker will have one partition.
  • If a topic has 3 partitions and there are 5 brokers, the first 3 brokers will each have one partition, while the remaining 2 brokers will not have any partitions for that specific topic.
  • If a topic has 3 partitions and there are 2 brokers, each broker will share more than partition which leads to unequal distribution of load.
Partition Offset

Each message within a partition is assigned a unique identifier called an offset. This offset acts as a position marker for the message within the partition.

Offsets are immutable and assigned in a sequential order as messages are produced to a partition.

Consumers use offsets to keep track of their position in a partition. By storing the offset of the last consumed message, consumers can resume reading from the correct position in the event of restart or failure.

Consumers can commit offset automatically at regular intervals or commit manually after processing each message.

Consumers can also start reading from a specific offset based on application needs.

Producer

Producers are clients that publish the messages to Kafka topics. Producers send data to the broker, which then stores it in the appropriate partition of the topic.

Consumer

Consumers are clients that read data from Kafka topics. They subscribe to one or more topics and process the data. Each consumer keeps track of its position in each partition using partition offset.

Consumer Group

A group of consumers work together to consume data from a topic. Each consumer in the group processes data from different partitions, allowing for parallel processing and load balancing.

KRaft

It is used to manage and coordinate the brokers, assisting with leader election for partitions, configuration management, and cluster metadata. In older versions of Kafka, ZooKeeper is utilized to perform these tasks.

Record

A Kafka record, also known as a message, is a unit of data in Kafka. It consists of

  • Key: It is optional and helps determine the partition for a record. Kafka uses a hashing algorithm to map the key to a specific partition, ensuring all records with the same key go to the same partition, maintaining their order.
  • Value: It contains actual data and can by type such as a string, JSON, or binary data, depending on the serialization format used.
  • Header: These are optional key-value pairs that can be included with a record. They provide additional metadata about the record
  • Timestamp: Each record has a timestamp to indicate when the record was published. This timestamp can be set by the producer or assigned by the Kafka broker when the record is received.
Partition Replicas

Each partition in a Kafka topic can have multiple replicas, which are distributed across different brokers in the cluster to ensure fault tolerance and high availability.

Leader

Each partition in a Kafka topic has a single leader. The leader is responsible for handling all read and write requests for that partition. This ensures that all data for a partition is processed in a consistent and orderly manner.

Follower 

Partitions have one or more followers. The Followers replicate the data from the leader using in-sync replicas (ISR) to ensure strong redundancy and durability.

Apache Kafka offers following five core Java APIs to facilitate cluster and client management.

1. Producer API: It allows applications to write stream of records to one or more Kafka topics.

2. Consumer API: It allows applications to read stream of records Kafka topics.

3. Kafka Streams API: It allows applications to read data from input topics, perform transformations, filtering, and aggregation, and then write the results back to output topics.

4. Kafka Connect API: It is used to develop and run reusable data connectors that pull data from external systems and send it to Kafka topics, or vice versa.

5. Admin API: It is used for managing Kafka clusters, brokers, topics and ACLs

Apache Kafka for Developers #1: Introduction to Kafka and Comparison with RabbitMQ


Apache Kafka is an is an open-source, distributed event streaming platform developed by LinkedIn. It is designed to handle real-time data feeds with high throughput and low latency.

It offers high throughput, fault tolerance, resilience, and scalability. It supports a range of use cases, including data integration from various data sources using data connectors, log aggregation, real-time stream processing, website activity tracking, event sourcing and publish-subscribe messaging.

Kafka's architecture is based on a distributed commit log, where data is partitioned and replicated across multiple servers to ensure fault tolerance and scalability. Producers send data to Kafka topics, which are split into partitions, and consumers read data from these partitions.

Key Characteristics of distributed commit log

  1. Append-Only: New records are always appended to the end of the log, ensuring that the order of events is preserved.
  2. Immutable Records: Once a record is written to the log, it cannot be changed or deleted. This immutability guarantees consistency and reliability.
  3. Sequential Reads: Records are read in the order they were written, which simplifies the process of replaying events.
  4. Replication: Data is replicated across multiple nodes to provide fault tolerance. If one node fails, the data can still be accessed from another node.
  5. Scalability: By partitioning the log across multiple nodes, the system can handle large volumes of data with high throughput
Generally, two major messaging models are used to facilitate communication between the multiple applications in decoupled way includes,

Point-to-Point messaging Model:

Messages are stored in a queue, where one or more consumers can access them. However, each message can only be consumed by a single consumer. Once a consumer reads a message, it is removed from the queue.

Publish-Subscribe messaging Model:

Messages are stored in a topic. consumers can subscribe to one or more topics and consume all the messages within those topics.

Kafka's topic partitioned log architecture enables it to support both the Queuing (Point-to-Point) and Publish-Subscribe messaging models.

Kafka Vs RabbitMQ

Kafka

RabbitMQ

It uses a log-based architecture where messages are stored in topics. These topics are divided into partitions to ensure scalability and fault tolerance. Producers send messages to these topics, and consumers read from them at their own pace.

It uses a queue-based architecture where producers send messages to exchanges. These exchanges route the messages to queues based on routing keys, and consumers then read the messages from these queues.

It delivers high throughput and low latency, capable of handling millions of messages per second.

It delivers low latency, capable of handling thousands of messages per second.

It is ideally suited for real-time data processing, event sourcing, log aggregation, website activity tracking and stream processing.

It is ideally suited for task queues, background job processing, communication between applications and complex routing logic.

It doesn’t support publishing messages based on priority order

It supports assigning priorities to messages and consuming them based on the highest priority.

It uses a pull-based model where consumers request messages from specific offsets, enabling message replay and batch processing.

It uses a push-based model, delivering messages to consumers as they arrive.

Messages are stored durably according to the specified retention period.

Messages are removed once they have been consumed by the consumers.

Multiple consumers can subscribe to the same topic in Kafka, as it supports same message can be consumed by different consumers using consumer groups.

Multiple consumers cannot all receive the same message, as messages are deleted once they are consumed.

It uses a binary protocol over TCP.

It uses AMQP, STOMP and MQTT protocols


Apache Kafka for Developers Journey:

Happy Coding :)