Celery never retries your tasks, unless you tell it to. Here is how you can tell her to:

The Basics

There are a few basic terms and procedures that you need to know about Celery, to understand this article. All of them can be found here:

Celery: A few gotchas explained

All available settings

Here is a list of everything you can set and unset to change the retry-behavior of Celery.

All the retry-related settings:

  • : the number of seconds to wait before the next retry.
  • : the number of times to retry, the default is 3. If this is set to 5, then the task will run up to 6 times: the first time + 5 retries.
  • : a list/tuple of exceptions. When one of these exceptions occurs, the task will be retried. If a different exception occurs, the task will NOT be retried.
  • : either or an integer. When this is set, it makes sure that the interval between retries is increasing after every attempt. The number represents the number of seconds to wait until the next retry, the backoff is calculated like this .
  • : the max value for the number of minutes to wait between retries. This exists to curb the exponential growth of .
  • : is or . It is used only together with . If this is set to , then the number of seconds to wait between retries will be randomly chosen on the interval between 0 and the number that produces: .

How it works

By default, Celery executes each task only once. Retrying is fully supported, but it has to be manually triggered ( ) or explicitly setup.

1. A retry is triggered.

A new ETA is calculated. The ETA can be explicitly provided in the retry command: or or it will be calculated from the settings of the task. After this, the task is assigned to the same worker as has executed the task last time and is then put into the ETA queue. This queue is not really a queue, but more an ordered list of tasks with ETA. Each worker has a list like that (if there are any ETA tasks, of course). In the broker, this list is stored under the key .

To inspect this queue in Redis, call:

redis-cli HGETAL unacked

2. The ETA time arrives

The task will still need to wait for its worker to have free processes/threads to execute it. This means that there can sometimes be a big difference between the ETA time and the actual execution start. You might want to keep an eye on that difference. The way we did it is that we:

  • Hook into the signal to set the execution start time. This signal is rung just before the task is executed.
 from celery import signals, Task
from datetime import datetime, timezone

@signals.task_prerun.connect
def task_prerun(task: Task, *args, **kwargs):
"""Dispatched before a task is executed. Sender is the task object being executed."""
setattr(task.request, EXEC_START, datetime.now(tz=timezone.utc).isoformat())
  • Hook into the signal to send the diff between ETA and the EXEC_START time to :
   from celery import signals, Task
from datetime import datetime, timezone
from dateutil.parser import isoparse

@signals.task_postrun.connect
def task_postrun(task: Task, state, *args, **kwargs):

exec_start_str = getattr(task.request, EXEC_START, None)
eta_str = headers.get("eta")
if exec_start_str and eta_str:
exec_start = isoparse(getattr(task.request, EXEC_START, None))
eta = isoparse(eta_str)

diff_ms = int(exec_start - eta).total_seconds() * 1000)

logger.info(f"Diff between execution time and ETA is {diff_ms} ms for {task}.")
statsd.timing(f"celery.task.{key}", value, tags=[f"task:{task.name}"])

3. A worker finds a free process.

The worker starts executing the task and either finishes it with the status (no exception and no retry occur) or (an exception occurs, but no retry) or a new retry.

4. The max retries is reached.

If a retry is triggered, but it turns out that there are no more retries. The task’s status is set to and is raised.

Common settings

Explicit manual retry

This task will only retry if that -statement’s body is reached. At that point the task will stop, the behaves like a statement. An ETA will be created for seconds into the future. The task will be run up to 4 times: the first time + 3 retries.

@celery_app.task(default_retry_delay=30, max_retries=3)
def generate_and_send_email(user_id: int, event_name: str):
...
if something_went_wrong_temporarily:
generate_and_send_email.retry()

Retry on Exception

This task will be retried whenever either the exception or occurs (or any of its subclasses). We could just say and the task would retry for EVERY exception. But my experience showed me that most exceptions are not worth a task retry. Most of the time the exception that occurs will not go away with a new retry. Think of exceptions like because the code is trying to do . We can re-try many times, but as long as the task’s code or the input don’t change, the task will end with a .

After a retry is triggered, the new ETA will be set to seconds into the future. The task will be run up to 6 times: the first time + 5 retries.

@celery_app.task(autoretry_for=(EmailServerNotReachable, SomeEmailException, ),default_retry_delay=45,  max_retries=5)
def generate_and_send_email(user_id: int, event_name: str):
...

Retry with backoff

This scenario is especially common when we are dealing with an HTTP request in a task. We want to make the HTTP request again because we weren’t able to get data from the remote service, but we don’t want to overwhelm the remote service. To make the task user friendly, we might want to retry the HTTP request fairly soon: let’s say in 30 seconds. But if the 2nd attempt fails as well, then something bigger is probably wrong. We want to wait a bit before we ping them again: let’s say 60 seconds. If this attempt fails again, then let’s wait even longer: maybe 2 minutes. And so on. This is called an “exponential backoff”.

To set it up, you can do this:

@celery_app.task(
autoretry_for=(EmailServerNotReachable, SomeEmailException, ),
retry_backoff=30, # <- wait 30s before the 1st retry
max_retries=10,
retry_jitter=False,
)
def generate_and_send_email(user_id: int, event_name: str):
...

This task will now run up to 11 times and will exponentially increase the interval. The interval is calculated like this:

retry_backoff∗2retry=30s∗2retry

After the first fail, the new ETA will be set to 30s into the future. The second time it will be 60s, the third it will be 2 mins, then 4 mins, then 8 mins and so on. I would like to emphasize again that this is an exponential backoff. It seems to me that exponential functions are so unintuitive for humans that they get us by surprise no matter how much we think we understand them.

The numbers start pretty harmless, but then they grow suddenly and crazily: . The last number: is equal to . Here is a graph of these intervals. Attention, the numbers are in minutes, not seconds, because we can visualize them better.

The green series shows how the intervals are increasing to pretty big numbers. Between retry 7 and 8, the task waits for about half an hour (=32mins=1920s). Before retries 8 and 9 it waits for about 1 hour (=64mins=3840s).

But the purple series is the real shocker. The purple series shows us how long after the task was started, is the task still running (or waiting to be run). Several hours after we started this one-off task, that task might still be running. If the task will run all 11 times, then it will only be marked as succeeded or failed after 8.5h. Or in other words, if we trigger it at around lunch, let’s say at 12:00. It will still be in the queue 6h later, at 18:00. It will only be done at 20:30 in the evening. Or in yet other words, if you are looking at the queue now, the workers might be idle, but the queue is filled with tasks that were triggered many, many hours ago. Hopefully, the users are not waiting for those results. Hopefully, you don’t have other things lined up to happen after the task is done, like… sending out a Slack notification about the event.

Image for post
Image for post
Created with https://www.canva.com/

I think most developers don’t make this calculation most of the time. And why would they, if we had stopped this task after 5 retries, the task would have completely finished after about 15 minutes + the execution time. And that seems reasonable. And very much not dangerous. But you know, a simple change of to gets us into trouble.

To save us, there is .

Always retry with

To avoid retry-backoff spinning out of control, use . This example limits the backoff-interval to 5 min. The intervals will go like this: .

@celery_app.task(
autoretry_for=(EmailServerNotReachable, SomeEmailException, ),
retry_backoff=30,
max_retries=10,
retry_jitter=False,
retry_backoff_max=5*60,
)
def generate_and_send_email(user_id: int, event_name: str):
...

This completely changes the graph. Even if this task runs all 11 times, it is reasonable to expect it will be finished in less than 40mins (+ execution time).

Image for post
Image for post
Created with https://www.canva.com/

Retrying with jitter

This setting is useful for a very specific type of situation: when you have several clients starting your task at the same time. This task has an HTTP request and the task has to be retried. If all of your clients started the task at the same time, then they will all try to retry it at the same time, which is not optimal for your resources. This is where we introduce . By setting , we allow randomness into the calculation of the next ETA. The next ETA won’t be the exact number as calculated by , instead, it will be a random number of the interval between 0 and the calculated number: .

@celery_app.task(
autoretry_for=(EmailServerNotReachable, SomeEmailException, ),
retry_backoff=30,
max_retries=10,
retry_jitter=False,
retry_backoff_max=5*60,
retry_jitter=True,
)
def generate_and_send_email(user_id: int, event_name: str):
...

Here is a graph with the min(=0) and max(=the result of exponential backoff) “num of minutes” in light red and 2 examples of what the actual “num of minutes” will be.

Image for post
Image for post
Created with https://www.canva.com/

The jitter is only used together with exponential backoff, it has no effect next to .

And here is the source code for calculating the next ETA: https://github.com/celery/celery/blob/8c5e9888ae10288ae1b2113bdce6a4a41c47354b/celery/utils/time.py#L374-L388.

Software Developer by profession, Explorer by mind. The more I know, the more I understand. http://www.ines-panker.com/

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store