Celery SQS + Duplication of tasks + SQS visibility timeout

Generally it's not a good idea to have tasks with very long ETAs.

First of all, there is the "visibility_timeout" issue. You probably don't want a very large visibility timeout: if the worker crashes 1 minute before the task is about to run, the queue will still wait for the rest of the visibility_timeout to expire before sending the task to another worker, and I guess you don't want that wait to be as long as a month.

From the Celery docs:

Note that Celery will redeliver messages at worker shutdown, so having a long visibility timeout will only delay the redelivery of ‘lost’ tasks in the event of a power failure or forcefully terminated workers.
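For reference, with the SQS transport the timeout is set through `broker_transport_options`. Below is a minimal sketch of a `celeryconfig.py` fragment. The one-hour value is only an illustrative assumption, and `worst_case_redelivery_delay` is a hypothetical helper (not part of Celery) that just spells out the arithmetic behind the crash scenario above:

```python
# celeryconfig.py (sketch): the values here are illustrative assumptions.
broker_url = "sqs://"  # credentials are taken from the environment

broker_transport_options = {
    # Seconds SQS waits for an ack before making the message visible again.
    "visibility_timeout": 3600,
}


def worst_case_redelivery_delay(visibility_timeout, seconds_held_before_crash):
    """Hypothetical helper: seconds until SQS makes a message visible again
    when the worker holding it dies without acking or returning it."""
    return max(visibility_timeout - seconds_held_before_crash, 0)
```

With a one-hour timeout, a worker that dies 10 minutes after receiving a message leaves it invisible for another 50 minutes; with a one-month timeout the same crash could leave it invisible for weeks.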

In addition, SQS caps the number of messages that can be received but not yet acknowledged at any one time.

SQS calls these "in-flight messages". From http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html:

A message is considered to be in flight after it's received from a queue by a consumer, but not yet deleted from the queue.

For standard queues, there can be a maximum of 120,000 inflight messages per queue. If you reach this limit, Amazon SQS returns the OverLimit error message. To avoid reaching the limit, you should delete messages from the queue after they're processed. You can also increase the number of queues you use to process your messages.

For FIFO queues, there can be a maximum of 20,000 inflight messages per queue. If you reach this limit, Amazon SQS returns no error messages.
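To see how close a queue is to those limits, you can read the `ApproximateNumberOfMessagesNotVisible` queue attribute, which is SQS's approximate count of in-flight messages. A rough sketch follows; `inflight_headroom` is a hypothetical helper, the limits are the ones quoted above, and the boto3 call is shown only as a comment because it assumes boto3 is installed and configured and that you have a queue URL:

```python
# Limits quoted above from the SQS developer guide.
INFLIGHT_LIMITS = {"standard": 120_000, "fifo": 20_000}


def inflight_headroom(attributes, queue_type="standard"):
    """Return how many more messages can be in flight before the limit.

    `attributes` is the "Attributes" dict of a GetQueueAttributes response;
    SQS returns the counts as strings, hence the int() conversion.
    """
    in_flight = int(attributes.get("ApproximateNumberOfMessagesNotVisible", "0"))
    return INFLIGHT_LIMITS[queue_type] - in_flight


# With boto3 (assumed available; queue_url is a placeholder you supply):
#   import boto3
#   sqs = boto3.client("sqs")
#   resp = sqs.get_queue_attributes(
#       QueueUrl=queue_url,
#       AttributeNames=["ApproximateNumberOfMessagesNotVisible"],
#   )
#   print(inflight_headroom(resp["Attributes"]))
```

The count is approximate, so treat the result as an early warning rather than an exact figure.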

I see two possible solutions: either use RabbitMQ instead, which doesn't rely on visibility timeouts (there are "RabbitMQ as a service" offerings if you don't want to manage your own), or change your code to use really small ETAs (best practice).
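One way to get small ETAs, sketched here under assumptions rather than as the only approach: store the real due time yourself and let the task re-enqueue itself in short hops, so that no single message stays unacknowledged for longer than the visibility timeout. `next_countdown` and the 15-minute `MAX_HOP` are hypothetical names and values chosen for illustration:

```python
import math
from datetime import datetime, timedelta, timezone

# Assumption: keep each hop well below the broker's visibility timeout.
MAX_HOP = timedelta(minutes=15)


def next_countdown(due_at, now=None, max_hop=MAX_HOP):
    """Seconds to wait before the task should look at this job again.

    Returns 0 when the job is due, otherwise the remaining time capped at
    `max_hop`. `due_at` and `now` must both be timezone-aware datetimes.
    """
    now = now or datetime.now(timezone.utc)
    remaining = due_at - now
    if remaining <= timedelta(0):
        return 0
    return math.ceil(min(remaining, max_hop).total_seconds())


# Hypothetical use inside a bound Celery task (not runnable without an app):
#   @app.task(bind=True)
#   def send_reminder(self, reminder_id, due_at_iso):
#       delay = next_countdown(datetime.fromisoformat(due_at_iso))
#       if delay > 0:
#           # Not due yet: enqueue a fresh message with a short countdown.
#           self.apply_async(args=(reminder_id, due_at_iso), countdown=delay)
#           return
#       ...  # the job is due, do the actual work here
```

Each hop is a new, short-lived message, so the visibility timeout can stay small; the trade-off is extra queue traffic for jobs that are far in the future.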

These are my 2 cents; maybe @asksol can provide some extra insights.

Tags:

Python

Celery