Celery + RabbitMQ common issues

Best practices to improve your broker infrastructure reliability

After running Celery + RabbitMQ in production for a while, you face several issues that make you learn more about both technologies. In this article, I’ll try to share some of the most common issues and some best practices that can help with them.

What’s Celery?

According to the Celery Project website:

Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.

Basic Celery architecture visualization by Imaginea

Celery supports several brokers as backends, including RabbitMQ, which is widely used across multiple companies as a message broker solution and is the one described in this article.

What’s RabbitMQ?

According to a CloudAMQP blog post:

RabbitMQ is a message-queueing software also known as a message broker or queue manager. Simply said; it is software where queues are defined, to which applications connect in order to transfer a message or messages.

RabbitMQ overview architecture by Cloud AMQP blog

Common Issues

The most common issues I’ve experienced were related to broker connections, memory usage, and long-running tasks consuming too many resources. So, let’s talk about these issues and the proposed solutions…

Connection to broker lost

RabbitMQ uses AMQP as a messaging protocol, which can be defined as an application and network layer protocol that allows client applications to talk to the server and interact with it. CloudAMQP has a very detailed article about What is AMQP and why is it used in RabbitMQ?, which is a recommended read.

Some time ago, I faced several connection errors between Celery workers and RabbitMQ. Those errors happened in a very random way, without any explicit cause, making the application stop consuming messages and breaking things. The only workaround available was to restart the message broker service to force the workers to reopen their channels.

After some time digging into the logs, I found the error below, which appeared every time the application stopped consuming messages:

consumer: Connection to broker lost. Trying to re-establish the connection...

ConnectionError: Bad frame read
consumer: Cannot connect to amqp://username:**@broker.rabbitmqhost.com:5672//: Error opening socket: a socket error occurred.
Trying again in 2.00 seconds...
consumer: Cannot connect to amqp://username:**@broker.rabbitmqhost.com:5672//: Error opening socket: a socket error occurred.
Trying again in 4.00 seconds...
consumer: Cannot connect to amqp://username:**@broker.rabbitmqhost.com:5672//: Error opening socket: a socket error occurred.
Trying again in 6.00 seconds...
Connected to amqp://username:**@broker.rabbitmqhost.com:5672//

The problem is that sometimes the automatic connection retry doesn’t work and the workaround described previously needs to be executed.

Even though the workaround was bad, it bought some time to research the issue. Looking into Celery’s GitHub issues, I found some recommendations to replace librabbitmq with pyamqp, as we can see in issue #54. So, changing the connection string to use pyamqp helped with the issue…
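The transport swap boils down to changing the broker URL scheme. A minimal sketch is below; the host and credentials are placeholders matching the redacted ones from the logs above, not real values:

```python
# Switching from the librabbitmq C transport to the pure-Python pyamqp
# transport is done by changing the scheme of the broker URL.
# "amqp://" lets Celery pick librabbitmq when it is installed;
# "pyamqp://" forces the pyamqp transport explicitly.
broker_url_old = "amqp://username:password@broker.rabbitmqhost.com:5672//"
broker_url_new = "pyamqp://username:password@broker.rabbitmqhost.com:5672//"

# With a Celery app instance this would be passed as the broker, e.g.:
#   app = Celery("tasks", broker=broker_url_new)
```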

Missed heartbeat messages

During a normal workday, you can face a lot of log messages like the ones below in your production worker environment.

missed heartbeat from celery@worker-01
missed heartbeat from celery@worker-02
missed heartbeat from celery@worker-03
missed heartbeat from celery@worker-04
missed heartbeat from celery@worker-05
missed heartbeat from celery@worker-06
missed heartbeat from celery@worker-07
missed heartbeat from celery@worker-08

Those messages indicate that Celery workers weren’t able to get a keepalive message from the broker. When a heartbeat is missed several times, the worker can attempt reconnection using BROKER_FAILOVER_STRATEGY, which can lead us back to the previously described issue. It’s also important to say that Celery has its own heartbeat implementation, which sends a keepalive message every 2 seconds as described by BROKER_HEARTBEAT_CHECKRATE; with that, we can face “flood” behavior. This was described in this StackOverflow discussion.

In summary, Celery’s default configuration sends a lot of “unnecessary” messages to the broker. To avoid this flood behavior, you can add parameters like --without-heartbeat, --without-gossip, and --without-mingle to the worker’s startup command to decrease the number of messages and avoid unexpected reconnection behavior.
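Since these are command-line flags, one way to keep them alongside the rest of the Python configuration is to embed them in the worker arguments. A sketch, assuming a Celery app object exists elsewhere:

```python
# Worker startup arguments that reduce broker chatter.
# These flags are real Celery worker options; assembling them in a
# list like this is just one way to launch the worker from Python.
argv = [
    "worker",
    "--loglevel=INFO",
    "--without-gossip",     # skip worker-to-worker event gossip
    "--without-mingle",     # skip startup synchronization with other workers
    "--without-heartbeat",  # disable the worker's own heartbeat events
]
# With a Celery app instance this could be run as:
#   app.worker_main(argv)
```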

These configurations are recommended by several sources, like this CloudAMQP blog post, the CloudAMQP docs, and also this issue from the Celery repository.

High memory usage and a large number of queues

A Celery worker can send a message whenever an event happens, and these events can be captured by tools like Flower to monitor the cluster.

When you have events enabled and a high task throughput, you’ll have a lot of event messages, and if you don’t have expiration rules configured, you can reach a scenario like this:

RabbitMQ celery event queue size

To avoid this, add two expiration settings: one for the event queues, CELERY_EVENT_QUEUE_EXPIRES, and another for the event messages, CELERY_EVENT_QUEUE_TTL. More information can be found in the celery/celery #2387 issue.
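As a sketch, the two settings could look like this; the numeric values are illustrative placeholders, not tuned recommendations:

```python
# Hedged sketch: expiration settings for Celery's event queues,
# using the old-style uppercase names used in this article.
event_settings = {
    # Let RabbitMQ delete an idle celeryev.* queue after 60 seconds
    # without consumers (x-expires on the queue).
    "CELERY_EVENT_QUEUE_EXPIRES": 60.0,
    # Drop individual event messages older than 5 seconds
    # (x-message-ttl on the queue).
    "CELERY_EVENT_QUEUE_TTL": 5.0,
}
# With a Celery app: app.conf.update(**event_settings)
```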

Tasks time limits

When you have long-running tasks, it’s a good practice to add CELERY_TASK_SOFT_TIME_LIMIT and also CELERY_TASK_TIME_LIMIT to protect the broker from long-running tasks executing forever and consuming a lot of resources from your infrastructure.
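A minimal sketch of the two limits working together; the 60- and 90-second values are hypothetical and should be tuned to your workload:

```python
# Hedged sketch of task time limits. The soft limit raises
# SoftTimeLimitExceeded inside the task, giving it a chance to clean
# up; the hard limit forcefully terminates the task process.
time_limit_settings = {
    "CELERY_TASK_SOFT_TIME_LIMIT": 60,  # seconds; graceful interruption
    "CELERY_TASK_TIME_LIMIT": 90,       # seconds; hard kill
}
# The soft limit should be lower than the hard limit so the task
# gets its cleanup window before being killed.
# With a Celery app: app.conf.update(**time_limit_settings)
```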

Conclusion

Celery’s default configuration isn’t optimal for all environments, and you’ll need to make adjustments to give your application workload the best performance. For that, Celery has very good documentation with optimization recommendations; see User Guide — Optimizing.

Sysadmin trying to be a Site Reliability Engineer @ Loggi