In Kafka, messages are transferred among producers, brokers, and consumers in a standardized binary message format. Serialization is the process of converting data into a stream of bytes for transmission. Deserialization is the reverse process of converting byte arrays back into the desired data format. Custom serializers are used at the producer end to tell the producer how to convert a message into a byte array. Deserializers are used at the consumer end to convert the byte arrays back into messages.
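As a rough illustration of the round trip described above, here is a minimal sketch that uses JSON encoded as UTF-8 for the wire format. The function names `serialize` and `deserialize` and the sample record are assumptions made for this example; they are not part of any Kafka client API.

```python
import json

def serialize(record: dict) -> bytes:
    """Producer side: turn a record into bytes (JSON encoded as UTF-8)."""
    return json.dumps(record, sort_keys=True).encode("utf-8")

def deserialize(payload: bytes) -> dict:
    """Consumer side: turn the received bytes back into a record."""
    return json.loads(payload.decode("utf-8"))

# Hypothetical record used only for illustration.
order = {"order_id": 42, "item": "book"}
wire_bytes = serialize(order)       # what would travel to the broker
restored = deserialize(wire_bytes)  # what the consumer would rebuild
```

In a real client, functions like these would be plugged in as the producer's serializer and the consumer's deserializer, so both ends must agree on the format.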
Posted Date:- 2021-11-12 14:20:44
No, Kafka does not currently support reducing the number of partitions for a topic. The partitions can be increased, but not decreased.
Posted Date:- 2021-11-12 14:20:05
The retention time can be configured per topic in Kafka. The default retention time for a topic is 7 days. The retention time can be set when a new topic is created. log.retention.hours is the broker property used to set the retention time when a topic is created. However, to change the configuration of an existing topic, a command-line tool has to be used.
The correct command depends on the version of Kafka that is in use.
Up to 0.8.2, use kafka-topics.sh --alter.
From 0.9.0 onward, use kafka-configs.sh --alter.
Posted Date:- 2021-11-12 14:19:28
An OutOfMemoryException can occur if consumers are receiving large messages, or if there is a spike in the number of messages such that the consumer receives messages faster than downstream processing can handle them. This causes the message queue to fill up, taking up memory.
Posted Date:- 2021-11-12 14:18:39
BufferExhaustedException is thrown when the producer cannot allocate memory to a record due to the buffer being too full. The exception is thrown if the producer is in non-blocking mode and the rate of production exceeds the rate at which data is sent from the buffer for long enough for the allocated buffer to be exhausted.
Posted Date:- 2021-11-12 14:17:30
The Kafka broker does not keep track of which messages have been read by consumers. It simply keeps all of the messages in its queue for a fixed period of time, known as the retention time, after which the messages are deleted. It is the responsibility of the consumer to pull the messages from the queue. Hence, Kafka is said to have a "smart-client, dumb-broker" architecture.
Posted Date:- 2021-11-12 14:16:50
Currently, Kafka does not allow you to reduce the number of partitions for a topic. The partitions can be expanded but not shrunk. The alter command in Apache Kafka allows you to change the behavior of a topic and its associated configurations. To add extra partitions, use the alter command.
To increase the number of partitions to five, use the following command:
./bin/kafka-topics.sh --alter --zookeeper localhost:2181 --topic sample-topic --partitions 5
Posted Date:- 2021-11-12 14:15:38
Log compaction is a method by which Kafka ensures that at least the last known value for each message key within the log of data is retained for a single topic partition. This makes it possible to restore state after an application crashes, or in cases of a system failure. It allows reloading caches once an application restarts during any operational maintenance. Log compaction guarantees that any consumer processing the log from the start can see at least the final state of all records in the order that they were written.
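The idea can be sketched in a few lines: keep only the most recent record for each key, without changing the offsets of the records that survive. This is a simplified model with made-up keys and values. It does not reflect how the broker implements compaction, which works on log segments in the background.

```python
def compact(log):
    """Keep only the latest record per key, preserving offsets and order.

    `log` is a list of (key, value) pairs; the list index is the offset.
    """
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset  # later records overwrite earlier ones
    return [
        (offset, key, value)
        for offset, (key, value) in enumerate(log)
        if latest_offset[key] == offset
    ]

# Hypothetical log: user-1 changes their email once.
log = [
    ("user-1", "a@example.com"),
    ("user-2", "b@example.com"),
    ("user-1", "c@example.com"),
]
compacted = compact(log)
# Only offsets 1 and 2 survive; user-1's older value at offset 0 is dropped.
```

A consumer replaying `compacted` from the start still sees the final value for every key, in the order the surviving records were written.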
Posted Date:- 2021-11-12 14:14:45
The Kafka cluster automatically detects any broker shutdown or failure. When this happens, new leaders are chosen for the partitions previously led by that broker. This can happen as a result of a server failure, or when a server is shut down for maintenance or configuration changes. When a server is taken down on purpose, Kafka provides a graceful method for stopping the server rather than killing it.
When a server is shut down gracefully:
* It syncs all of its logs to disk, so that no log recovery is needed when Kafka is restarted. Because log recovery takes time, this speeds up intentional restarts.
* Before shutting down, it moves all partitions for which it is the leader to its replicas. This makes the leadership transfer faster and reduces the time each partition is unavailable to a few milliseconds.
Posted Date:- 2021-11-12 14:14:05
Yes, if the number of partitions is greater than the number of consumers in a consumer group, then a consumer will have to read more than one partition from a topic.
Posted Date:- 2021-11-12 14:12:38
Log compaction is a way through which Kafka assures that for each topic partition, at least the last known value for each message key within the log of data is kept. This allows for the restoration of state following an application crash or a system failure. During any operational maintenance, it allows refreshing caches after an application restarts. Any consumer processing the log from the beginning will be able to see at least the final state of all records in the order in which they were written, because of the log compaction.
Posted Date:- 2021-11-12 14:08:34
Producers often transmit data to brokers in JSON format. JSON stores data in string form, which can result in a lot of duplicated data being stored in the Kafka topic and increases the amount of disk space used. To save disk space, data is therefore compressed, optionally after a short delay that lets more records accumulate, before messages are delivered to Kafka. Because message compression is performed on the producer side, no changes to the consumer or broker setup are required.
Message compression has the following advantages:
* It decreases the latency of messages transmitted to Kafka by reducing their size.
* Producers can send more messages to the broker using less bandwidth.
* When Kafka data is stored on a paid cloud platform, it can reduce costs.
* It reduces the amount of data stored on disk, allowing for faster read and write operations.
Message compression has the following disadvantages:
* Producers must spend some CPU cycles compressing messages.
* Consumers must spend some CPU cycles decompressing messages.
* Compression and decompression together place a higher load on the CPU.
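The size saving on repetitive JSON can be sketched with Python's standard gzip module. The sensor records below are invented for illustration. In a real producer, compression is normally switched on through the `compression.type` producer setting rather than done by hand like this.

```python
import gzip
import json

# Hypothetical batch of JSON records with heavily repeated field names.
records = [
    {"sensor": "temp-1", "status": "OK", "reading": 20 + i % 3}
    for i in range(200)
]
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")

compressed = gzip.compress(raw)         # costs CPU on the producer side
restored = gzip.decompress(compressed)  # costs CPU on the consumer side

ratio = len(compressed) / len(raw)      # fraction of the original size
```

Because the field names repeat in every record, the compressed payload is much smaller than the raw one, which is the trade of CPU time for bandwidth and disk space described above.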
Posted Date:- 2021-11-12 14:07:54
* Producers end up using some CPU cycles for compression.
* Consumers use some CPU cycles for decompression.
* Compression and decompression result in greater CPU demand.
Posted Date:- 2021-11-12 14:04:54
In a smart producer/dumb broker design, the broker does not attempt to track which messages have been read by consumers. It simply retains all messages for the configured retention period, whether or not they have been read.
Posted Date:- 2021-11-12 14:04:07
An acknowledgement or ack is sent to the producer by a broker to acknowledge receipt of the message. Ack level defines the number of acknowledgements that the producer requires before considering a request complete.
Posted Date:- 2021-11-12 14:03:37
The Kafka ecosystem is somewhat difficult to configure and requires implementation knowledge. It lacks a complete set of monitoring tools, and no wildcard option is available for selecting topics.
Posted Date:- 2021-11-12 14:03:05
Kafka can be deployed as a multi-tenant solution. Multi-tenancy is enabled by configuring which topics can produce or consume data.
Posted Date:- 2021-11-12 14:02:11
If the consumer is not located in the same data center as the broker, it requires tuning the socket buffer size to amortize the long network latency.
Posted Date:- 2021-11-12 13:57:29
Kafka provides replication tools for stronger durability and higher availability. They include:
* Create Topic Tool
* List Topic Tool
* Add Partition Tool
Posted Date:- 2021-11-12 13:56:59
In Kafka, replication provides fault tolerance by ensuring that published messages are not permanently lost. Even if messages are lost on one node because of a program error, a machine error, or a software upgrade, a replica on another node can be used to recover them.
Posted Date:- 2021-11-12 13:55:02
When a consumer wants to join a group, it sends a JoinGroup request to the group coordinator. The first consumer to join the group becomes the group leader. The leader receives a list of all consumers in the group from the group coordinator and is responsible for assigning a subset of partitions to each consumer. It uses an implementation of PartitionAssignor to decide which partitions should be handled by which consumer.
After deciding on the partition assignment, the consumer group leader sends the list of assignments to the group coordinator, which sends this information to all the consumers. Each consumer sees only its own assignment. The leader is the only client process that has the full list of consumers in the group and their assignments. This process repeats every time a rebalance happens.
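The leader's job of splitting partitions among group members can be sketched as follows. This is a simplified round-robin style split with hypothetical consumer names. It is not the actual PartitionAssignor implementation shipped with Kafka.

```python
def assign_partitions(consumers, partitions):
    """Spread partitions over consumers as evenly as possible."""
    members = sorted(consumers)
    assignment = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        # Hand out partitions in turn: member 0, member 1, ..., then wrap.
        assignment[members[index % len(members)]].append(partition)
    return assignment

# Hypothetical group: two consumers sharing a five-partition topic.
assignment = assign_partitions(["consumer-a", "consumer-b"], [0, 1, 2, 3, 4])
# consumer-a gets partitions 0, 2, 4; consumer-b gets partitions 1, 3.
```

With more partitions than consumers, at least one consumer ends up reading several partitions, as in the example above.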
Posted Date:- 2021-11-12 13:53:56
Kafka, being a distributed publish–subscribe system, has the following advantages:
* Fast: A single Kafka broker can serve thousands of clients, handling megabytes of reads and writes per second.
* Scalable: Data is partitioned and spread over a cluster of machines to handle large volumes of data.
* Durable: Messages are persisted and replicated within the cluster to prevent record loss.
* Distributed by design: It provides fault-tolerance and robustness.
Posted Date:- 2021-11-12 13:53:24
Every time a message or record is assigned to a partition in Kafka, it is given an offset. The offset denotes the position of the record in that partition. A record can be uniquely identified within a partition by its offset value. The partition offset only carries meaning within that particular partition. Records are always appended to the end of a partition, so older records have lower offsets.
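A minimal sketch of this behaviour, using an in-memory list as a stand-in for a partition. The `Partition` class and its method names are assumptions made for this example only.

```python
class Partition:
    """Append-only list of records; the list index serves as the offset."""

    def __init__(self):
        self.records = []

    def append(self, value):
        offset = len(self.records)  # next free position at the end
        self.records.append(value)
        return offset

    def read(self, offset):
        return self.records[offset]

partition = Partition()
first = partition.append("created")   # oldest record, lowest offset
second = partition.append("updated")  # newer record, higher offset
```

Two different partitions would each start counting from 0, which is why an offset is only meaningful together with the partition it belongs to.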
Posted Date:- 2021-11-12 13:52:46
The log cleaner is enabled by default, and it starts the pool of cleaner threads. To enable log cleaning on a particular topic, add: log.cleanup.policy=compact. This can be done either with the alter topic command or at topic creation time.
Posted Date:- 2021-11-12 13:52:22
Apache Flume is a dependable, distributed, and available software for aggregating, collecting, and transporting massive amounts of log data quickly and efficiently. Its architecture is versatile and simple, based on streaming data flows. It's written in the Java programming language. It features its own query processing engine, allowing it to alter each fresh batch of data before sending it to its intended sink. It is designed to be adaptable.
Posted Date:- 2021-11-12 13:51:47
When new disks or nodes are added to an existing cluster, partitions are not automatically rebalanced. If the number of nodes hosting a topic already equals the replication factor, adding disks will not help with rebalancing. Instead, running the kafka-reassign-partitions command is recommended after adding new hosts.
Posted Date:- 2021-11-12 09:28:50
This is not possible from a class behaving as a producer because, as in most queue systems, its role is to fire and forget the messages. As a message consumer, you get the offset from a Kafka broker.
Posted Date:- 2021-11-12 09:28:22
By default, the maximum size of a message that the Kafka server can receive is approximately 1000000 bytes (about 1 MB).
Posted Date:- 2021-11-12 09:27:45
The QueueFullException occurs when the producer sends messages to the broker at a pace that the broker cannot handle. A solution here is to add more brokers to handle the pace of messages coming in from the producer.
Posted Date:- 2021-11-12 09:27:10
Messages sent to Kafka are retained, irrespective of whether they have been consumed or not, for a fixed period of time referred to as the retention period. The retention period can be configured for a topic. The default retention time is 7 days.
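Time-based retention can be modelled as a filter on record age. The sketch below is a simplification with made-up timestamps. A real broker removes whole log segments rather than individual records.

```python
DAY_MS = 24 * 60 * 60 * 1000
DEFAULT_RETENTION_MS = 7 * DAY_MS  # the 7-day default mentioned above

def apply_retention(records, now_ms, retention_ms=DEFAULT_RETENTION_MS):
    """Keep records no older than the retention period.

    `records` is a list of (timestamp_ms, value) pairs. Whether a record
    was ever consumed plays no part in the decision.
    """
    return [(ts, value) for ts, value in records if now_ms - ts <= retention_ms]

# Hypothetical log: one record written on day 0, one on day 6.
records = [(0, "old"), (6 * DAY_MS, "recent")]
kept = apply_retention(records, now_ms=8 * DAY_MS)
# On day 8 the first record is 8 days old and is dropped; the second is kept.
```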
Posted Date:- 2021-11-12 09:26:47
Simply said, this means that the Follower cannot acquire data at the same rate as the Leader.
Posted Date:- 2021-11-12 09:25:42
Kafka is not explicitly developed for Hadoop. Using it for writing and reading data is trickier than it is with Flume. However, Kafka is a highly reliable and scalable system used to connect multiple systems like Hadoop.
Posted Date:- 2021-11-12 09:25:08
Load balancing in Kafka is handled by the producers. The message load is spread across the partitions of a topic, while message order is maintained within each partition. By default, the producer selects the next partition for a message using a round-robin approach. If an approach other than round-robin is needed, users can also specify the exact partition for a message.
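The round-robin choice, with an optional explicit partition, can be sketched like this. The class below is a hypothetical illustration, not the partitioner used by Kafka clients, which may also take the message key into account.

```python
from itertools import count

class RoundRobinPartitioner:
    """Pick partitions in turn unless the caller names one explicitly."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._counter = count()  # 0, 1, 2, ...

    def partition(self, explicit_partition=None):
        if explicit_partition is not None:
            return explicit_partition  # the caller pinned the message
        return next(self._counter) % self.num_partitions

partitioner = RoundRobinPartitioner(num_partitions=3)
chosen = [partitioner.partition() for _ in range(5)]
pinned = partitioner.partition(explicit_partition=2)
# chosen cycles through 0, 1, 2 and wraps around; pinned is always 2.
```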
Posted Date:- 2021-11-12 09:24:34
In a ZooKeeper-based deployment, it is not possible to bypass ZooKeeper and connect directly to the Kafka server. Hence, the answer is no. If ZooKeeper is down for any reason, it will not be possible to service any client requests.
Posted Date:- 2021-11-12 09:24:08
Apache Kafka is tuned by tuning its individual components:
1. Tuning Kafka Producers
2. Tuning Kafka Brokers
3. Tuning Kafka Consumers
Posted Date:- 2021-11-12 09:23:29
Messages are retained for a considerable amount of time in Kafka, so consumers have the flexibility to read them at their convenience.
However, if Kafka is configured to keep messages for 24 hours and a consumer is down for longer than 24 hours, the consumer may lose those messages.
If the consumer is down for only 60 minutes, on the other hand, it can still read those messages from its last known offset. Kafka does not keep state on what consumers are reading from a topic.
Posted Date:- 2021-11-12 09:22:34
We view the log as the partitions. A data source writes messages to the log. One advantage is that one or more consumers can read from the log at any time, from whichever position they select.
For example, while the data source is writing to the log, different consumers can be reading it at different offsets.
Posted Date:- 2021-11-12 09:22:11
A feature-wise comparison of traditional queuing systems and Apache Kafka:
* Message retention
Traditional queuing systems: Messages are typically deleted from the end of the queue as soon as they have been processed.
Apache Kafka: Messages persist even after being processed. They are not removed as consumers receive them.
* Logic-based processing
Traditional queuing systems: They do not permit processing logic based on similar messages or events.
Apache Kafka: It permits processing logic based on similar messages or events.
Posted Date:- 2021-11-12 09:21:13
You cannot do that from a class that behaves as a producer because, as in most queue systems, its role is to fire and forget the messages. The broker does the rest of the work, such as handling the appropriate metadata with IDs, offsets, and so on.
As a consumer of the message, you can get the offset from a Kafka broker. If you look at the SimpleConsumer class, you will notice that it fetches MultiFetchResponse objects that include offsets as a list. In addition, when you iterate over the Kafka messages, you get MessageAndOffset objects that include both the offset and the message sent.
Posted Date:- 2021-11-12 09:20:30
One alternative to Apache Kafka is RabbitMQ. Here is how the two compare:
i. Features
Apache Kafka: Kafka is distributed, durable, and highly available. The data is shared as well as replicated.
RabbitMQ: There are no such features in RabbitMQ.
ii. Performance rate
Apache Kafka: On the order of 100,000 messages/second.
RabbitMQ: Around 20,000 messages/second.
Posted Date:- 2021-11-12 09:20:09
MirrorMaker is a standalone utility for copying data from one Apache Kafka cluster to another. It reads data from topics in the source cluster and writes it to topics with the same names in the destination cluster. The source and destination clusters are separate entities that can have different partition counts and offset values.
Posted Date:- 2021-11-12 09:19:49
The Consumer API permits an application to subscribe to one or more topics and to process the stream of records produced to them.
Posted Date:- 2021-11-12 09:19:16
Kafka can easily be deployed as a multi-tenant solution. Multi-tenancy is enabled by configuring which topics can produce or consume data. Kafka also provides operations support for quotas.
Posted Date:- 2021-11-12 09:19:03
Kafka MirrorMaker offers geo-replication for a cluster. With MirrorMaker, messages are replicated across multiple data centers or cloud regions.
It can be used in active/passive scenarios for backup and recovery, to place data closer to users, or to support data locality requirements.
Posted Date:- 2021-11-12 09:18:47
If the preferred replica is not in the ISR, the controller will fail to move leadership to the preferred replica.
Posted Date:- 2021-11-12 09:18:31
If a replica remains out of ISR for an extended time, it indicates that the follower is unable to fetch data as fast as data accumulated at the leader.
Posted Date:- 2021-11-12 09:18:03
Replication of messages in Kafka ensures that any published message is not lost and can still be consumed in the event of a machine error, a program error, or, more commonly, a software upgrade.
Posted Date:- 2021-11-12 09:17:43
To get exactly-once messaging from Kafka, you have to do two things: avoid duplicates during data consumption and avoid duplicates during data production.
Here are two ways to get exactly-once semantics during data production:
• Use a single writer per partition, and every time you get a network error, check the last message in that partition to see whether your last write succeeded.
• Include a primary key (a UUID or something similar) in the message and de-duplicate on the consumer.
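The second approach, de-duplicating on the consumer by message ID, can be sketched as follows. The class name, the message layout with `id` and `payload` fields, and the in-memory set are all assumptions for illustration. A real consumer would need durable storage for the IDs it has seen.

```python
import uuid

class DeduplicatingConsumer:
    """Process each message ID at most once, ignoring redeliveries."""

    def __init__(self):
        self.seen_ids = set()
        self.processed = []

    def handle(self, message):
        if message["id"] in self.seen_ids:
            return False  # duplicate delivery, skip it
        self.seen_ids.add(message["id"])
        self.processed.append(message["payload"])
        return True

# The producer attaches a UUID once; a retry resends the same message.
message = {"id": str(uuid.uuid4()), "payload": "charge card"}
consumer = DeduplicatingConsumer()
results = [consumer.handle(message), consumer.handle(message)]
# The first delivery is processed; the retried copy is ignored.
```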
Posted Date:- 2021-11-12 09:17:30
The main differences between Kafka and Flume are:
Type of tool
Apache Kafka: Kafka is a general-purpose tool for multiple producers and consumers.
Apache Flume: Flume is considered a special-purpose tool for specific applications.
Replication feature
Apache Kafka: Kafka can replicate events.
Apache Flume: Flume does not replicate events.
Posted Date:- 2021-11-12 09:16:50
Two major metrics are taken into account when tuning for optimal performance: latency, which is the amount of time it takes to process one event, and throughput, which is the number of events that can be processed in a given length of time. Most systems are tuned for either latency or throughput, whereas Kafka can be tuned for both.
The following stages are involved in optimizing Kafka's performance:
>> Tuning Kafka producers: Data that producers must send to brokers is collected in a batch. The producer transmits the batch to the broker when it is ready. To tune producers for latency and throughput, two parameters must be considered: batch size and linger time. The batch size must be chosen with care. If the producer is constantly sending messages, a bigger batch size is recommended to maximize throughput. However, if the batch size is set to a very large value, the batch may never fill up or may take a long time to do so, which hurts latency. The batch size should be selected based on the volume of messages the producer sends. The linger time adds a delay while more records are added to the batch, allowing larger batches to be transmitted. A longer linger time lets more messages be sent in one batch, but latency may suffer as a result. A shorter linger time, on the other hand, sends fewer messages sooner, resulting in lower latency but also lower throughput.
>> Tuning Kafka brokers: Each partition in a topic has a leader, and each leader has zero or more followers. It is critical that the leaders are appropriately balanced, so that some nodes are not overworked in comparison to others.
>> Tuning Kafka consumers: To ensure that consumers keep up with producers, the number of partitions for a topic should be equal to the number of consumers. The partitions are divided among the consumers in the same consumer group.
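The producer-side trade-off between batch size and linger time can be explored with a small simulation. This is a simplified model built on stated assumptions: the batch size is counted in messages rather than bytes, and a batch is closed when the next message arrives after the linger window has passed. The function name and arrival times are invented for this sketch.

```python
def batch_messages(arrivals_ms, batch_size, linger_ms):
    """Group message arrival times into batches.

    A batch is closed when it holds `batch_size` messages, or when the next
    message arrives more than `linger_ms` after the batch was opened.
    """
    batches, current = [], []
    for arrival in arrivals_ms:
        if current and arrival - current[0] > linger_ms:
            batches.append(current)  # linger window expired
            current = []
        current.append(arrival)
        if len(current) == batch_size:
            batches.append(current)  # batch is full
            current = []
    if current:
        batches.append(current)      # flush whatever is left
    return batches

# Hypothetical arrival times in milliseconds.
arrivals = [0, 1, 2, 3, 50, 51]
short_linger = batch_messages(arrivals, batch_size=4, linger_ms=0)
long_linger = batch_messages(arrivals, batch_size=4, linger_ms=5)
# No linger: every message goes out alone (more requests, lower latency).
# 5 ms linger: messages are grouped (fewer requests, some added delay).
```

In an actual producer the corresponding settings are `batch.size`, which is measured in bytes, and `linger.ms`.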
Posted Date:- 2021-11-12 09:16:29