We deploy over 100 new versions of our core product to production every working day, to more than 25 clients in at least 10 countries.
Here’s how we do it.
Scale
The core of Kraken (our CRM product for utilities to accelerate the clean energy transition) is a monolithic application mainly in a single git repository (mono-repo) with around 9 million lines of Python code (and about 250,000 lines of TypeScript as well). It also contains the documentation, to ensure it stays aligned with the code. We typically deploy over 100 new versions of this a day to over 25 client environments in more than 10 countries, simultaneously.
The number of environments is more than doubled if you include test instances too. However, we ship straight to production.
Once the big green merge button has been pressed on a PR, it is deployed to all client production and test environments at the same time. When things are running smoothly, this takes less than half an hour and no manual intervention is required.
So, how do we minimise the risks of breaking things when moving this fast?
Review and approval
Every PR must be approved by another developer before it can be merged. There is also lots of automated testing (more on this below) and formatting / linting checks that must pass (including for documentation). For a PR, a subset of these checks is run based on what has changed. We also have a system for visually tracking trends over time, such as the number of tests and linting / import violations.
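As a rough illustration of selecting checks based on what has changed, the sketch below maps changed paths to CI jobs. The directory names and job identifiers are hypothetical, not our actual pipeline configuration.

```python
# Hypothetical sketch of path-based check selection for a PR.
# The directory prefixes and job names are illustrative only.
import subprocess

PATH_RULES = {
    "docs/": {"docs-build", "link-check"},
    "frontend/": {"typescript-lint", "frontend-tests"},
    "kraken/": {"python-lint", "python-tests"},
}

def changed_files(base_ref: str = "origin/main") -> list[str]:
    """List files changed on this branch relative to main."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base_ref}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()

def checks_to_run(files: list[str]) -> set[str]:
    """Select only the CI jobs relevant to the changed paths."""
    jobs: set[str] = set()
    for path in files:
        for prefix, prefix_jobs in PATH_RULES.items():
            if path.startswith(prefix):
                jobs |= prefix_jobs
    # Fall back to a default set if nothing matched.
    return jobs or {"python-lint", "python-tests"}

if __name__ == "__main__":
    print(sorted(checks_to_run(changed_files())))
```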
However, with such a fast-moving codebase it is impossible for every branch to keep up with the latest head of main, so in practice you end up merging a branch that is slightly out of date.
Merge conflicts are blocked but implicit conflicts (where the code can technically merge but ends up broken) are a risk. An implicit conflict could be that a file has been moved and references to it updated, but a PR was branched before this and still has old references.
We have custom tooling to remind developers to rebase their branch if it is too out-of-date and a mechanism to force a rebase if a significant or breaking change has been made in another PR. If your PR is likely to break others (e.g. moving a lot of files around) then you can update a specific “high watermark” file. When merged, this will trigger a recheck of all open PRs and block merging them until they are rebased.
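Our actual tooling is internal, but a minimal sketch of the "high watermark" idea looks something like the following: compare where a branch diverged from main against the last commit that bumped the watermark file, and require a rebase if the branch predates it. The file name and policy here are assumptions.

```python
# Hypothetical sketch of a "high watermark" rebase check.
# The file name and exact policy are assumptions; the real tooling is internal.
import subprocess

WATERMARK_FILE = "ci/high_watermark.txt"  # bumped by PRs that make breaking changes

def git(*args: str) -> str:
    return subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout.strip()

def branch_needs_rebase(branch: str = "HEAD", main: str = "origin/main") -> bool:
    """Return True if the branch was cut before the last breaking change."""
    # Last commit on main that bumped the watermark file.
    watermark_commit = git("log", "-1", "--format=%H", main, "--", WATERMARK_FILE)
    if not watermark_commit:
        return False
    # Point where the branch diverged from main.
    merge_base = git("merge-base", branch, main)
    # If the watermark commit is *not* an ancestor of the merge-base,
    # the branch predates the breaking change and must be rebased first.
    result = subprocess.run(
        ["git", "merge-base", "--is-ancestor", watermark_commit, merge_base]
    )
    return result.returncode != 0
```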
Manual testing
Developers can manually test locally and can claim a test environment to test a branch on if it is a high-risk change. However, at our scale it is not possible to test all branches this way with the number of test environments that we have. So this step is optional and not required to deploy code to production.
Additionally, test environments are not particularly representative of production. At our scale this would be difficult (and expensive!). This limits their usefulness in detecting performance issues in particular, although they can check for functionality and integrations. However, they are very useful for testing things such as moving from an x86 to an ARM processor architecture (for cost and carbon savings) to check that your (AVX) SIMD optimisations still perform well.
Results of local testing are included in PR descriptions, including screenshots or videos for UI changes. New tests are expected for any functionality added, to prove that it works and prevent later regressions.
Automated testing
Once a PR has been merged then the entire test suite is run against it before deployment. Over 100,000 (unit, integration and functional) tests are run on every version and if they pass then it is allowed to go live. This all takes less than half an hour and the version is then packaged up for deployment.
However, sometimes a failing test or broken code makes it into the production CI pipeline (perhaps due to an implicit merge conflict). Some tests may also be flakey and fail only at certain times or when run alongside other tests. If a failure has made it into the main branch then we have tooling to alert us so that we can quickly unblock it. Bots post in Slack (tagging the developer) to alert them and others, so that the change can be reverted or a test temporarily skipped.
We also have tooling to allow developers to rebase on the latest commit of the main branch that has passed CI. It automatically pulls the commit hash of the latest good build from an API and rebases on that. This helps keep things moving and prevents holding up development while the blockage is cleared.
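A minimal sketch of that "rebase onto the latest green build" workflow is below. The API URL and response shape are assumptions; the real endpoint is internal.

```python
# Hypothetical sketch of rebasing onto the latest commit that passed CI.
# The API URL and JSON shape are assumptions, not our real internal endpoint.
import json
import subprocess
import urllib.request

GOOD_BUILD_API = "https://ci.internal.example.com/api/latest-good-build"

def latest_good_commit() -> str:
    """Fetch the commit hash of the most recent main build that passed CI."""
    with urllib.request.urlopen(GOOD_BUILD_API) as response:
        return json.load(response)["commit"]

def rebase_onto_good_build() -> None:
    commit = latest_good_commit()
    subprocess.run(["git", "fetch", "origin"], check=True)
    subprocess.run(["git", "rebase", commit], check=True)

if __name__ == "__main__":
    rebase_onto_good_build()
```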
You can additionally add an auto-rebase label to your PRs and a custom bot will rebase your PR for you once production CI is green. It will even comment on the PR to let you know.
Once CI has passed, the version is automatically deployed to all environments that haven’t been claimed for testing or pinned to a specific version. We may manually pin an environment version if a problem is detected with a particular version for some clients or globally (more on this below).
Deployment
Kraken is deployed with Kubernetes (K8s). A new set of pods is created with the new version and these are gradually swapped with the current ones (a rolling update rather than a big bang blue-green deployment). If there is a catastrophic failure with the new version then the K8s pods health-checks will fail and the new set of pods won’t be switched with the old ones. This avoids deploying an application that won’t even start up and run.
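The health checks themselves are not shown here, but a readiness endpoint of the kind a K8s probe might hit could look roughly like the sketch below, assuming a Django-style application. This is an assumption about the shape of such a check, not Kraken's actual code.

```python
# Minimal sketch of a readiness endpoint that a Kubernetes probe could hit.
# An assumption about the shape of such a check, not Kraken's actual code.
from django.db import connection
from django.http import JsonResponse

def readiness(request):
    """Return 200 only if the app can serve traffic (e.g. the DB is reachable).

    If this fails on the new pods, the rolling update never shifts traffic to them.
    """
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
    except Exception:
        return JsonResponse({"status": "unhealthy"}, status=503)
    return JsonResponse({"status": "ok"})
```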
Database migrations are also run to add any new columns and tables etc. Great care has to be taken with migrations and some must be performed in multiple steps across different deployments. We have custom bots to assist with this and warn if there are potential issues by commenting on PRs.
The main consideration is that the system does not deploy atomically. Different parts of the system deploy at different times. This is important due to rolling deployments but especially so if there are database migrations. Changes must be backwards and forwards compatible. The way to do this is with lots of tiny changes, one small step at a time. E.g. you have to make a change to the DB separately, before adding code that uses it.
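For example, a column addition might be split across deployments so that each step is compatible with both the old and new code, roughly as sketched below (assuming Django-style migrations; the app, model and field names are hypothetical).

```python
# Sketch of splitting a schema change across deployments so each step is
# backwards and forwards compatible (assuming Django-style migrations).
#
# Deploy 1: add the new column as nullable - old code simply ignores it.
# Deploy 2: start writing to the column and backfill existing rows.
# Deploy 3: once everything reads the new column, enforce NOT NULL and
#           remove the old code path.
from django.db import migrations, models

class Migration(migrations.Migration):
    dependencies = [("billing", "0042_previous")]  # hypothetical app / migration names

    operations = [
        # Step 1 only: nullable, so rows written by the old code remain valid
        # while old and new pods run side by side during the rolling update.
        migrations.AddField(
            model_name="account",
            name="debt_reviewed_at",
            field=models.DateTimeField(null=True),
        ),
    ]
```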
We have custom web applications to monitor the versions on all environments and alert developers once their change has been deployed to the environments that they care about.
Decoupling release from deployment
The biggest lever for reducing risk is to separate deployment and release. Simply because code has been deployed and shipped to an environment does not mean that a feature is released and live.
Feature flags
The main way that we decouple releases from deployments is with feature flags. These let us enable features only on certain environments or for specific users.
For example, we can enable a feature on a test environment and give it a kick. Afterwards, we can enable it in production but only for friendly testers.
There are lots of ways we can do this, but the two main ones are environment variables and database-backed settings.
We can set the flag in our configuration repository. This gives lots of flexibility over which instances, and even which parts of the application, are affected. PR approval is required to change this config and it is picked up on the next release.
We also have DB-backed application settings that can be changed almost instantly. There is a little bit of caching to limit load on the DB but the queries are pretty cheap. These are useful when quicker changes are needed and can be helpful if clients want to self-serve and check things themselves. For example, we may want to experiment with a new form of caching or auth and turn it off quickly if it breaks things. We can base features on “campaign” tags, which operate in a similar way but work at an account level rather than globally. Features can also be based on the active product agreement for a supply point.
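A minimal sketch of a DB-backed flag with light caching is below. The model, cache key and helper names are illustrative, not Kraken's real API.

```python
# Hypothetical sketch of a DB-backed feature flag with light caching.
# The model, cache key and helper names are illustrative only.
from django.core.cache import cache
from django.db import models

class FeatureFlag(models.Model):
    name = models.CharField(max_length=100, unique=True)
    enabled = models.BooleanField(default=False)

def flag_is_enabled(name: str) -> bool:
    """Check a flag, caching briefly so the DB isn't hit on every request."""
    cache_key = f"feature-flag:{name}"
    value = cache.get(cache_key)
    if value is None:
        value = FeatureFlag.objects.filter(name=name, enabled=True).exists()
        cache.set(cache_key, value, timeout=30)  # near-instant changes, cheap queries
    return value

# Usage: gate the risky new code path behind the flag.
# if flag_is_enabled("new_cache_backend"):
#     result = new_cache_lookup(key)
# else:
#     result = old_cache_lookup(key)
```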
Once we are confident in a new feature, ideally the flag is removed and the code tidied up. The end result is as if a flag had never been used, just arrived at in small steps. It's easy to let unused flags languish, so we have command line tools to check values across all environments and when they were last updated. We're working on other ways to track "zombie" flags in the codebase so that they can be removed.
Note that feature flags are not the same as configuration. Configuration is permanent (even if the values change), whereas feature flags should only ever be transient. They are simply a tool to make releases less risky and are intended to be removed once we are confident things are working. However, this does not always reflect reality!
Experimentation
Another way we test changes in production is with parallel experimentation. This is particularly helpful when testing new implementations with real-world messy data.
We will run the old and new way of doing something in parallel, using the old result but comparing it against the new result for differences. We have an experimentation framework that uses the Python laboratory library to do this. We run it in production to find any differences that would appear across our millions of customer accounts if we switched to the new system.
For example, this has been used to prevent incorrect data about customer debt being calculated in Kraken when refactoring. The refactored functions passed all tests and review, but since debt positions are such sensitive data about customers, we wanted to be certain that no behaviour would change.
If a difference is found then the new method can be fixed, or maybe we discover the old way had an error and that needs to be corrected.
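A minimal sketch of such an experiment, following the pattern the laboratory library documents (record a control and a candidate, return the control's value, publish any mismatches), might look like this. The debt-calculation functions are hypothetical stand-ins, not our actual code.

```python
# Sketch of a parallel experiment using the laboratory library.
# The debt-calculation functions are hypothetical stand-ins.
import laboratory

def calculate_debt_legacy(account):
    ...  # existing implementation (stub here)

def calculate_debt_refactored(account):
    ...  # refactored implementation under test (stub here)

def calculate_debt_position(account):
    experiment = laboratory.Experiment()
    with experiment.control() as control:
        control.record(calculate_debt_legacy(account))        # value actually used
    with experiment.candidate() as candidate:
        candidate.record(calculate_debt_refactored(account))  # value only compared, never returned
    # conduct() returns the control result; mismatches can be published
    # (e.g. to metrics) by overriding Experiment.publish().
    return experiment.conduct()
```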
Incidents
Despite all of the above precautions, incidents are inevitable. When they happen, we respond quickly and fix the issue. We prefer to fix the issue “forward” rather than rolling back to a previous working version. However, as usual, it depends.
We have systems integrated into Slack to manage incidents. We write up incident reports and post-mortems so we learn from mistakes. If the issue is easy to fix and of minor impact then we will fix forward but if everything is broken then we will roll back (unless a DB migration prevents this). If the problematic change has not deployed to all environments yet then we will pin it to pause the rollout.
We have tooling that allows us to pin environments to specific versions if one proves to be problematic. When the issue has been fixed in a new version then the environments are unpinned and normal service resumes.
We don't schedule deployments out-of-hours just to avoid disruption; we would rather deploy when people are around to fix things. However, as we have follow-the-sun support, there are rarely times when we are not deploying.
If our Europe or North America teams are asleep then our Asia or Oceania teams can pick things up. We don’t really do change freezes, although we may increase the number of reviewers required to merge a PR over holiday periods.
The important thing to remember is that incidents are not always caused by changes you make. They are often triggered by things outside of your control and you need to react rapidly.
Perhaps a security software vendor ships an untested update globally and crashes all your servers (fortunately we don’t run MS Windows). Or maybe you are targeted by a DDoS attack or other unexpected external event.
Staying agile allows you to respond to incidents. If you are shipping continuously then you can fix things quickly.