Celery + RabbitMQ common issues
After running Celery + RabbitMQ in production for a while, you face several issues that make you learn more about both technologies. In this article, I’ll try to share some of the most common issues and some best practices that can help with them.
According to the Celery Project website…
Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.
According to a CloudAMQP blog post…
RabbitMQ is message-queueing software, also known as a message broker or queue manager. Simply said, it is software where queues are defined, to which applications connect in order to transfer a message or messages.
The most common issues I’ve experienced were related to broker connections, memory usage, and long-running tasks consuming too many resources. So, let’s talk about these issues and the proposed solutions…
Connection to broker lost
RabbitMQ uses AMQP as its messaging protocol, which can be defined as an application- and network-layer protocol that allows client applications to talk to the server and interact with it. CloudAMQP has a very detailed article, “What is AMQP and why is it used in RabbitMQ?”, which is a recommended read.
Some time ago, I faced several connection errors between Celery workers and RabbitMQ. Those errors happened in a very random way, without any explicit cause… the application stopped consuming messages and things started breaking. The only workaround available was to restart the message broker service to force workers to reopen their channels.
After some time digging into the logs, I found the errors below, which appeared every time the application stopped consuming messages:
consumer: Connection to broker lost. Trying to re-establish the connection...
ConnectionError: Bad frame read
consumer: Cannot connect to amqp://username:**@broker.rabbitmqhost.com:5672//: Error opening socket: a socket error occurred.
Trying again in 2.00 seconds...
consumer: Cannot connect to amqp://username:**@broker.rabbitmqhost.com:5672//: Error opening socket: a socket error occurred.
Trying again in 4.00 seconds...
consumer: Cannot connect to amqp://username:**@broker.rabbitmqhost.com:5672//: Error opening socket: a socket error occurred.
Trying again in 6.00 seconds...
Connected to amqp://username:**@broker.rabbitmqhost.com:5672//
The problem is that sometimes the automatic connection retry doesn’t work, and the workaround described previously needs to be executed.
Even with only a bad workaround available, this gave me some time to research the issue. Looking into Celery’s GitHub issues, I found recommendations to switch the transport to pyamqp, as we can see in issue #54. Replacing the connection string to use pyamqp helped with the issue…
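As a minimal sketch of that change, the fix is only in the URL scheme of the broker connection string (the host and credentials below are placeholders, not from the original setup):

```python
# Celery broker setting; host and credentials are illustrative placeholders.
# Before: the generic 'amqp://' scheme, which may resolve to a different
# AMQP transport depending on what is installed.
broker_url = "amqp://username:password@broker.rabbitmqhost.com:5672//"

# After: explicitly select the pure-Python py-amqp transport.
broker_url = "pyamqp://username:password@broker.rabbitmqhost.com:5672//"
```

No other code changes are needed; the workers pick up the new transport on restart.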
Missed heartbeat messages
During a normal workday, you can see a lot of log messages like the ones below in your production worker environment.
missed heartbeat from celery@worker-01
missed heartbeat from celery@worker-02
missed heartbeat from celery@worker-03
missed heartbeat from celery@worker-04
missed heartbeat from celery@worker-05
missed heartbeat from celery@worker-06
missed heartbeat from celery@worker-07
missed heartbeat from celery@worker-08
Those messages indicate that Celery workers weren’t able to get a keepalive message from the broker. When heartbeats are missed several times, the worker can attempt a reconnection following
BROKER_FAILOVER_STRATEGY, which can lead us back to the previously described issue. It’s also important to say that Celery has its own heartbeat implementation, which sends keepalive messages at a frequency controlled by
BROKER_HEARTBEAT_CHECKRATE; with that, we can face a “flood” behavior. This was described in this StackOverflow discussion…
In summary, Celery’s default configuration sends a lot of “unnecessary” messages to the broker. To avoid this flood behavior, you can add flags like
--without-heartbeat and
--without-mingle to the worker’s startup command to decrease the number of messages and avoid unexpected reconnection behaviors.
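The flags above go straight on the worker command line. A sketch of the startup command, assuming an application module named myapp (the app name and log level are placeholders):

```shell
# Disable Celery's own heartbeat and the startup mingle step to cut
# down on control-plane traffic to the broker.
celery -A myapp worker \
  --without-heartbeat \
  --without-mingle \
  --loglevel=info
```

Note that these flags only affect Celery’s worker-level chatter; the AMQP-level heartbeat negotiated with RabbitMQ is configured separately via the broker settings.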
High memory usage and a large number of queues
Celery workers have the ability to send a message whenever some event happens, and these events can be captured by tools like Flower to monitor the cluster.
When you have events enabled and a high task throughput, you’ll have a lot of event messages, and if you don’t have expiration rules configured, you can end up with a large number of event queues consuming the broker’s memory.
To avoid this, add two expiration settings: one for the event queues,
CELERY_EVENT_QUEUE_EXPIRES, and a second for the event messages,
CELERY_EVENT_QUEUE_TTL. More information can be found in the celery/celery issue #2387.
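A minimal sketch of those two settings in a Celery configuration module, using the old-style uppercase setting names referenced above. The values in seconds are illustrative assumptions, not recommendations from the original article:

```python
# Celery configuration module (old-style uppercase setting names).
# Both values are in seconds and are illustrative assumptions.

# x-expires on event queues: RabbitMQ deletes an event queue after it
# has been unused for this long (e.g. after a Flower instance goes away).
CELERY_EVENT_QUEUE_EXPIRES = 60.0

# x-message-ttl on event messages: RabbitMQ drops event messages that
# sit in a queue longer than this, keeping queue depth bounded.
CELERY_EVENT_QUEUE_TTL = 5.0
```

With these in place, stale event queues and old event messages are cleaned up by the broker itself instead of accumulating.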
Celery’s default configuration isn’t optimal for all environments, and you’ll need to make adjustments to get the best performance out of your application’s workload. For that, Celery has very good documentation with optimization recommendations; see User Guide — Optimizing.