Apache Kafka is an open-source, distributed event streaming platform originally developed at LinkedIn. It is designed to handle real-time data feeds with high throughput and low latency.
It offers high throughput, fault tolerance, resilience, and scalability. It supports a range of use cases, including data integration from various sources using connectors, log aggregation, real-time stream processing, website activity tracking, event sourcing, and publish-subscribe messaging.
Kafka's architecture is based on a distributed commit log, where data is partitioned and replicated across multiple servers to ensure fault tolerance and scalability. Producers send data to Kafka topics, which are split into partitions, and consumers read data from these partitions. Key characteristics of this log-based design:
- Append-Only: New records are always appended to the end of the log, ensuring that the order of events is preserved.
- Immutable Records: Once a record is written to the log, it cannot be changed or deleted. This immutability guarantees consistency and reliability.
- Sequential Reads: Records are read in the order they were written, which simplifies the process of replaying events.
- Replication: Data is replicated across multiple nodes to provide fault tolerance. If one node fails, the data can still be accessed from another node.
- Scalability: By partitioning the log across multiple nodes, the system can handle large volumes of data with high throughput.
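To make the flow of producers, topics, and partitions concrete, here is a minimal producer sketch using Kafka's Java client. The broker address (localhost:9092), the topic name (website-activity), and the key/value strings are assumptions for illustration only.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ActivityProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address; replace with your cluster's bootstrap servers.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The record key ("user-42") determines the partition, so all events
            // for the same user land in the same partition and keep their order.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("website-activity", "user-42", "page_view:/home");

            producer.send(record, (RecordMetadata metadata, Exception e) -> {
                if (e == null) {
                    System.out.printf("Appended to %s-%d at offset %d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
        }
    }
}
```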
Queuing (Point-to-Point) Messaging Model:
Messages are stored in a queue, and one or more consumers can read from it. However, each message can only be consumed by a single consumer; once a consumer reads a message, it is removed from the queue.
Publish-Subscribe Messaging Model:
Messages are stored in a topic. Consumers can subscribe to one or more topics and consume all the messages within those topics.
Kafka's partitioned, log-based topic architecture enables it to support both the Queuing (Point-to-Point) and the Publish-Subscribe messaging models.
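A sketch of how consumer groups provide both models with the Java client, assuming the same website-activity topic and a hypothetical group id analytics-service: consumers that share a group.id divide the topic's partitions between them (queuing behaviour), while consumers in different groups each receive every message (publish-subscribe).

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ActivityConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");       // assumed broker address
        props.put("group.id", "analytics-service");             // consumers sharing this id split the partitions
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("auto.offset.reset", "earliest");              // start from the beginning if no committed offset exists

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("website-activity"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```

Running several copies of this consumer with the same group.id spreads the partitions across them, so each message is processed once per group, like a queue; starting a copy with a different group.id gives that group its own complete copy of the stream, like publish-subscribe.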
| Kafka | RabbitMQ |
| --- | --- |
| It uses a log-based architecture where messages are stored in topics. These topics are divided into partitions to ensure scalability and fault tolerance. Producers send messages to these topics, and consumers read from them at their own pace. | It uses a queue-based architecture where producers send messages to exchanges. These exchanges route the messages to queues based on routing keys, and consumers then read the messages from these queues. |
| It delivers high throughput and low latency, capable of handling millions of messages per second. | It delivers low latency, capable of handling thousands of messages per second. |
| It is ideally suited for real-time data processing, event sourcing, log aggregation, website activity tracking and stream processing. | It is ideally suited for task queues, background job processing, communication between applications and complex routing logic. |
| It doesn’t support publishing messages based on priority order. | It supports assigning priorities to messages and consuming them based on the highest priority. |
| It uses a pull-based model where consumers request messages from specific offsets, enabling message replay and batch processing. | It uses a push-based model, delivering messages to consumers as they arrive. |
| Messages are stored durably according to the specified retention period. | Messages are removed once they have been consumed by the consumers. |
| Multiple consumers can subscribe to the same topic, and the same message can be consumed by different consumers through consumer groups. | Multiple consumers cannot all receive the same message, as messages are deleted once they are consumed. |
| It uses a binary protocol over TCP. | It uses the AMQP, STOMP and MQTT protocols. |
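The pull-based, offset-driven consumption noted in the table is what enables message replay. A minimal sketch, again assuming the website-activity topic from the earlier examples: the consumer is assigned partition 0 explicitly and rewound to the oldest retained offset, so everything still within the retention period is read again.

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");           // assumed broker address
        props.put("group.id", "replay-tool");                       // assumed group id
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("website-activity", 0);
            consumer.assign(Collections.singletonList(partition));   // manual assignment, no group rebalancing
            consumer.seekToBeginning(Collections.singletonList(partition)); // rewind to the oldest retained offset

            consumer.poll(Duration.ofSeconds(1))
                    .forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
        }
    }
}
```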
- Apache Kafka for Developers #1: Introduction to Kafka and Comparison with RabbitMQ
- Apache Kafka for Developers #2: Kafka Architecture and Components
- Apache Kafka for Developers #3: Kafka Topic Replication
- Apache Kafka for Developers #4: Kafka Producer and Acknowledgements
- Apache Kafka for Developers #5: Kafka Consumer and Consumer Group