Kafka as a message queue for long-running tasks

Using Kafka as a job queue for scheduling long-running processes is not a good idea: Kafka is not a queue in the strictest sense, and its semantics for failure handling and retries are limited. You might be able to reach a compromise by tuning certain rebalance or timeout settings, but the design is likely to remain brittle. The simple answer is that Kafka was not designed for this kind of use case.

The idea of max.poll.interval.ms is to prevent a livelock situation (see), but in your case the consumer will register as a false positive with the Kafka broker and trigger a rebalance, because there is no way to distinguish between a livelock and a legitimately long-running process.
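To make the false positive concrete, here is a deliberately simplified model of the check, not the client's actual implementation. The function name and the timestamps are made up for illustration; the 300000 ms default is the Java client's default for max.poll.interval.ms.

```python
# Simplified model (an assumption, not Kafka's real code): the only signal
# available is how long it has been since the consumer last called poll().
DEFAULT_MAX_POLL_INTERVAL_MS = 300_000  # Java client default (5 minutes)


def exceeds_poll_interval(last_poll_s, now_s,
                          max_poll_interval_ms=DEFAULT_MAX_POLL_INTERVAL_MS):
    """Return True if the gap since the last poll() is over the limit."""
    elapsed_ms = (now_s - last_poll_s) * 1000
    return elapsed_ms > max_poll_interval_ms


# A consumer stuck in an infinite loop for 10 minutes...
livelocked = exceeds_poll_interval(last_poll_s=0, now_s=600)
# ...and a healthy consumer 10 minutes into a legitimate job look identical.
long_job = exceeds_poll_interval(last_poll_s=0, now_s=600)
```

Both cases exceed the limit, so both lead to a rebalance. Raising max.poll.interval.ms only moves the threshold; it does not give the check any way to tell the two cases apart.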

You should weigh the tradeoff between living with the negative consequences you mentioned and introducing a new technology that lets you model a job queue in a better way. For a more complex use case, check out how Slack is doing it.


The way we got around the issues we were having was as suggested in the comments. We decided to decouple the message processing from the consumer polling.

On each worker/consumer there were two threads: one doing the actual processing and the other phoning home to Kafka periodically.
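A minimal sketch of that two-thread layout, not our production code. StubConsumer is a hypothetical stand-in that only counts calls so the example runs without a broker, and run_worker and its parameters are likewise invented for illustration. With a real client you would pause the assigned partitions and keep calling poll(), which the Java KafkaConsumer supports through pause() and resume().

```python
import time
from concurrent.futures import ThreadPoolExecutor


class StubConsumer:
    """Hypothetical stand-in for a Kafka consumer; it only records calls."""

    def __init__(self, messages):
        self._messages = list(messages)
        self.paused = False
        self.poll_calls = 0
        self.committed = []

    def poll(self):
        self.poll_calls += 1
        if self.paused or not self._messages:
            return None  # a paused consumer returns no records
        return self._messages.pop(0)

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False

    def commit(self, message):
        self.committed.append(message)


def run_worker(consumer, handle, idle_polls=3, poll_interval=0.01):
    """Keep polling on this thread while a second thread does the slow work."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        idle = 0
        while idle < idle_polls:  # stop after a few empty polls (demo only)
            message = consumer.poll()
            if message is None:
                idle += 1
                time.sleep(poll_interval)
                continue
            idle = 0
            consumer.pause()                       # stop fetching new work
            future = pool.submit(handle, message)  # slow job runs elsewhere
            while not future.done():
                consumer.poll()                    # keep phoning home
                time.sleep(poll_interval)
            results.append(future.result())        # re-raises handler errors
            consumer.commit(message)               # commit only after success
            consumer.resume()
    return results
```

Pausing before handing off the job means the polling thread never picks up a second message while the first is still in flight, and committing only after the handler returns keeps a crashed job eligible for redelivery.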

We also did some work to reduce the processing times for messages. However, some messages still take several minutes to process. This has worked for us for some time now with no issues.

Thanks for the suggestions in the comments, @Donal.