Executing code in the background is a very common thing in our industry.
This opens up a lot of options for how to do it, and we’ve historically used Celery, which lets you achieve this via Celery tasks.
One thing we also leverage from this framework is the ability to run these tasks at a specific time, or at least not before it.
This is done by setting an ETA, and it’s something we rely on a lot.
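For illustration, here’s a minimal sketch of what enqueuing a task with an ETA looks like using Celery’s standard `apply_async` API (the task and broker URL are made up):

```python
from datetime import datetime, timedelta, timezone

from celery import Celery

# Broker URL and task are purely illustrative.
app = Celery("tasks", broker="amqp://guest:guest@localhost//")


@app.task
def send_reminder(user_id):
    print(f"Reminding user {user_id}")


# Publish the task now, but don't execute it before the given point in time.
send_reminder.apply_async(args=[42], eta=datetime.now(timezone.utc) + timedelta(hours=1))
```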
Celery requires you to choose a broker to publish tasks to, and we have been using RabbitMQ v3.13.x, which we needed to upgrade to v4.x.
This upgrade came with the removal of two critical features:
- The removal of global QoS meant that tasks with ETAs would now block Celery workers.
- The removal of classic queue mirroring meant that we had to switch to quorum queues if we ever wanted to keep using RabbitMQ’s queue replication.
And since RabbitMQ doesn’t allow you to change a queue’s definition “in-place”, we had to transfer all the messages from each classic queue to a new quorum queue.
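To make that concrete, here’s a rough pika sketch of declaring a quorum queue (the queue name and connection details are illustrative). The queue type is fixed at declaration time via the x-queue-type argument, which is why an existing classic queue can’t simply be re-declared as a quorum one:

```python
import pika

# Connection details and queue name are illustrative.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# The queue type is baked in when the queue is declared; re-declaring an
# existing classic queue with a different x-queue-type is rejected by the
# broker, so the messages have to be moved to a brand new quorum queue.
channel.queue_declare(
    queue="celery-quorum",
    durable=True,
    arguments={"x-queue-type": "quorum"},
)
```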
To give you an idea of the scale, some environments can peak at 8M messages per day.
However, transferring these messages wasn’t as straightforward as a copy-paste operation, because:
- Celery behaves differently when enqueuing tasks with ETAs on quorum queues, as it uses what’s called Native Delayed Delivery (see the configuration sketch after this list).
- We had to do this with zero downtime per our SLAs.
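As a hedged sketch of what the Celery side can look like (assuming Celery 5.5+, where quorum queue support and Native Delayed Delivery landed; the exact behaviour should be checked against the Celery docs for your version), the quorum queue type can be set through the queue’s arguments:

```python
from celery import Celery
from kombu import Exchange, Queue

# Broker URL, vhost and queue names are illustrative.
app = Celery("tasks", broker="amqp://guest:guest@localhost/quorum-vhost")

app.conf.task_queues = [
    # Declare the task queue as a quorum queue via its queue arguments.
    Queue(
        "celery",
        Exchange("celery"),
        routing_key="celery",
        queue_arguments={"x-queue-type": "quorum"},
    ),
]

# When recent Celery versions detect quorum queues, ETA tasks are routed
# through a delayed-delivery exchange topology on the broker instead of
# being held in worker memory, which is what changes how ETA messages look.
```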
Considering all of this, we had to come up with a migration strategy that would allow us to move forward cautiously.
In a nutshell, this involved:
- Creating another virtual host (vhost) that would host the new quorum queues
- Choosing migration windows per environment, when incoming message traffic would be at its lowest
- Transferring all the messages from the classic queues to their “sister” quorum queues in the new vhost
- Applying a special transformation to tasks that had ETAs (a rough sketch of this transfer step follows below)
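To make the last two steps more concrete, here’s a rough sketch (using pika) of draining a classic queue and re-publishing its messages to the quorum counterpart. The vhost and queue names are made up, and transform_eta_message is a hypothetical placeholder for the real ETA transformation described in the full write-up:

```python
import pika

OLD_VHOST, NEW_VHOST = "celery", "celery-quorum"  # hypothetical vhost names
QUEUE = "celery"                                   # hypothetical queue name


def transform_eta_message(props, queue):
    # Hypothetical placeholder: the real migration rewrites the message so the
    # quorum side's Native Delayed Delivery honours the remaining ETA. Here we
    # pass it through unchanged just to keep the sketch runnable.
    return "", queue, props


src_conn = pika.BlockingConnection(pika.ConnectionParameters(virtual_host=OLD_VHOST))
dst_conn = pika.BlockingConnection(pika.ConnectionParameters(virtual_host=NEW_VHOST))
src, dst = src_conn.channel(), dst_conn.channel()

# The "sister" queue in the new vhost has to be a quorum queue from the start.
dst.queue_declare(queue=QUEUE, durable=True, arguments={"x-queue-type": "quorum"})

while True:
    method, props, body = src.basic_get(queue=QUEUE, auto_ack=False)
    if method is None:
        break  # classic queue drained

    headers = (props.headers or {}) if props else {}
    if headers.get("eta"):
        # With Celery's v2 task protocol the ETA lives in the message headers;
        # such tasks need the extra transformation before re-publishing.
        exchange, routing_key, props = transform_eta_message(props, QUEUE)
    else:
        exchange, routing_key = "", QUEUE  # default exchange -> same queue name

    dst.basic_publish(exchange=exchange, routing_key=routing_key, body=body, properties=props)
    src.basic_ack(delivery_tag=method.delivery_tag)
```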
I’ve written a detailed technical breakdown on my blog, covering the full migration strategy, code examples, and lessons learned.
There was a lot of trial and error, of course, but after the lessons learned during the testing phase we were able to safely and seamlessly migrate environments at a rate of ~340k messages per minute.