Django, Kubernetes Health Checks and Continuous Delivery

This post is written from the perspective of Kraken Customer, our main customer management platform.

Back in 2016, we wrote about how we deploy Django applications using ELB health checks and continuous delivery. That system worked well for years — immutable infrastructure, fail-fast health checks, zero-downtime deployments. The core ideas were solid.

But as we grew, the cracks started to show. Deployments slowed as we added more environments. Configuration management became a liability. Long-running migrations started causing capacity gaps. Jobs got killed mid-execution during deploys. The tools that got us to 20 environments weren’t going to scale to what we needed next.

So we spent three years migrating to Kubernetes, while moving towards a GitOps approach. The migration finished a couple of years ago now, and we’ve been running on this stack ever since.

Our deployments are pretty complex, and Kubernetes covered our use cases well. This post is about the problems we needed to solve, and how we solved them.

The problems we needed to solve

We had to make a lot of technology decisions along the way — Docker vs other container runtimes, ArgoCD vs Flux, Mozilla SOPS vs Vault vs Secrets Manager, the EKS built-in CNI vs Cilium. Nobody in the company had previous experience with these tools, so we had to learn and evaluate them as we went. The upskilling was significant, but each decision was an answer to a real problem we were hitting:

Deployment speed hitting a wall: Terraform deployments had limited concurrency. As we added more environments, the whole deployment system got slower and slower. With 20+ environments, this was becoming painful. Our Terraform Enterprise account costs were also climbing, to the point where we could no longer ignore them.

Monolithic worker architecture: All Celery workers ran on the same set of EC2 instances, processing all queues. We couldn’t scale billing workers separately from messaging workers, or give smart meter data processing workers more resources without over-provisioning everything else.

Configuration with no safety net: Consul had no audit logs. Bad data in Consul could crash an entire environment, and we had no way to track who changed what or when.

Capacity gaps during deployments: HTTP workers had health checks that kept old instances running during migrations, but Celery workers didn’t. Long migrations meant zero Celery capacity.

Jobs tied to deployment cycles: Scheduled tasks and management commands ran on Celery worker instances. When we deployed, those instances got terminated — including any long-running jobs.

Infrastructure too bespoke: Our EC2 infrastructure was custom-built for the Kraken Django codebase. We wanted a more generic runner architecture that we could reuse across old and new products. This meant we needed proper network segmentation — different parts of the runner should only have access to specific parts of the network.

The migration came with its own challenges: networking, data integrity, shifting traffic progressively between old and new infrastructure. That’s probably a good subject for another post.

Deployment and orchestration

The pipeline: From Terraform to GitOps

Before:

CircleCI → Packer (AMI) → Terraform → AWS Auto Scaling Group

CircleCI ran tests, Packer built an AMI, and Terraform applied changes directly to AWS. This worked, but it meant our CI system was directly scheduling multiple Terraform pipelines. Rollbacks were awkward — you could manually set the AMI ID in Terraform variables and trigger a run, but that required special permissions to operate Terraform Cloud and knowledge of a tool many engineers weren’t familiar with. Our state files were also big, which did not help.

The bigger problem: Terraform’s concurrency limits combined with our deployment strategy. We deployed each PR in order — not just the latest, but every single merge. Each deployment took 10 to 15 minutes to roll out all the new EC2 instances. As the number of PRs merged increased, deployment queues grew long. Deploying to all environments became a serialized bottleneck. We’d sometimes have to manually skip ahead in the queue if we needed a particular change to go live urgently.

Now:

CircleCI → Docker (ECR) → GitOps repo → ArgoCD → Kubernetes

CircleCI builds a Docker image and pushes it to ECR, then updates a YAML file in our GitOps repository with the new image tag. Each environment has its own ArgoCD instance watching its own section of the repo. They all deploy independently and in parallel — deployment time is now O(1) relative to the number of environments.
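
To make that concrete, here is a minimal sketch of the kind of values file CircleCI rewrites in the GitOps repo. The path, image name and tag format are hypothetical rather than our exact layout:

# environments/production/api/values.yaml   (illustrative path)
image:
  repository: <ecr-registry>/kraken-api     # hypothetical image name
  tag: "git-abc1234"                        # CircleCI bumps only this line on each build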

We also abandoned the deploy-every-PR-in-order approach. Now we only deploy the latest version — if multiple PRs merge while a deployment is running, we skip straight to the most recent.

This separation turns out to be really useful:

  • Audit trail: Every deployment is a Git commit. You can see exactly what changed and when.
  • Easy rollback: git revert the commit, ArgoCD does the rest. No need to re-run pipelines.
  • Separation of concerns: CI builds and tests. CD deploys. They don’t need to know about each other.
  • Self-healing: If someone manually changes something in the cluster, ArgoCD notices and reverts it.
  • Parallel deployments: Each environment deploys independently. Adding more environments doesn’t slow anything down.

Rolling updates with zero downtime

Before, we used blue-green deployments: launch a new ASG, wait for instances to become healthy, switch the load balancer, terminate the old ASG.

Now, Kubernetes handles rolling updates:

strategy:
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 50%

  • Old pods stay running until new pods report Ready status
  • If new pods fail health checks, rollout pauses automatically
  • Graceful shutdown with terminationGracePeriodSeconds and preStop hooks ensures in-flight requests complete

The maxUnavailable: 0 guarantee means traffic is never dropped — old pods keep serving while new ones come up.
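
For the graceful shutdown part, the relevant pod spec fields look roughly like this; the values and the sleep-based preStop hook are illustrative rather than our exact configuration:

spec:
  terminationGracePeriodSeconds: 60   # illustrative value
  containers:
  - name: api
    lifecycle:
      preStop:
        exec:
          # Give the load balancer time to stop routing new requests
          # before the container receives SIGTERM
          command: ["sleep", "10"]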

Application health and safety

Configuration: From Consul to ConfigMaps

Before: Multiple Consul servers (one per environment), with access credentials shared via 1Password vaults. The Consul features we relied on were also being deprecated.

Now: Configuration comes from ConfigMaps and Secrets:

envFrom:
  - configMapRef:
      name: myapp-environment
  - secretRef:
      name: myapp-secrets

The same principle remains — Django settings raise ImproperlyConfigured if required values are missing. But now:

  • Config is versioned in Git (via the GitOps repo) with full audit history
  • Changes trigger pod recreation (checksum annotations ensure pods restart when config changes; see the sketch after this list)
  • Validation happens at application initialization (during Helm hooks or deployment rotation) — bad config prevents pods from starting
  • No more shared credentials for config access
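
The checksum annotation mentioned above is a standard Helm idiom rather than anything bespoke. A minimal sketch, with hypothetical template names:

# deployment.yaml (Helm template, illustrative)
spec:
  template:
    metadata:
      annotations:
        # Hash the rendered ConfigMap into the pod template: any config change
        # produces a new pod template hash and triggers a rolling update
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}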

Health checks: Three probes, one endpoint

Before, the ELB hit /health/ and checked for an HTTP 200 response. The view checked that migrations had been applied, rendered test pages, and validated that the application was working.

Kubernetes provides three distinct probe types:

Probe            Question                                Failure Response
startupProbe     “Has the app finished initialising?”    Keep waiting (don’t restart yet)
livenessProbe    “Is the container still alive?”         Restart the container
readinessProbe   “Should this pod receive traffic?”      Remove from Service endpoints

In the current setup, we still have a single /health/ endpoint, now used by all probe types, that:

  • On first run: validates critical subsystems (ORM connectivity, cache, Celery broker, uWSGI workers accepting connections)
  • After first success: always succeeds

A pod that reports ready will keep reporting that — the startup check is the critical gate.

There’s also an implicit health signal we get for free: if all uWSGI workers are busy, the health endpoint can’t respond in time. The readinessProbe fails first (it has a lower threshold), and the pod gets removed from the load balancer. This gives overloaded pods a chance to catch up — they stop receiving new traffic while they drain the request buffers in Nginx (we run one Nginx reverse proxy per pod, which buffers requests before forwarding to uWSGI workers).
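
Put together, the probe configuration looks roughly like this. The port and thresholds are illustrative; the key point is that the readinessProbe is tuned to fail before the livenessProbe:

startupProbe:
  httpGet:
    path: /health/
    port: 8000            # illustrative port
  periodSeconds: 5
  failureThreshold: 60    # allow a slow first start before giving up
readinessProbe:
  httpGet:
    path: /health/
    port: 8000
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 2     # trips first: stop sending traffic to the pod
livenessProbe:
  httpGet:
    path: /health/
    port: 8000
  periodSeconds: 10
  failureThreshold: 6     # trips later: restart the container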

What happens when things go wrong

Kubernetes provides clear, predictable responses to different failure modes:

Failure                        Kubernetes Response
Missing required env var       Container crash → CrashLoopBackOff
ORM/Cache/Celery unavailable   startupProbe fails → Pod never Ready
Django hangs                   livenessProbe fails → Container restarted
Slow responses                 readinessProbe fails → Traffic routed elsewhere
Bad deployment                 Rolling update pauses, old pods remain
Migration fails                Helm hook fails → ArgoCD blocks deployment

Unlike Consul, where bad configuration could bring down an environment, Kubernetes prevents misconfigured pods from ever receiving traffic.

Data and state management

Database migrations: From capacity gaps to guaranteed ordering

The old problem: All workers attempted to acquire a distributed lock (stored in Memcache) on startup. The first worker to acquire the lock ran migrations; others waited.

HTTP deployments had a migration health check gate — new instances wouldn’t become healthy until migrations completed, so the old ASG stayed up. But Celery workers had no such gate. Their ASGs would scale down the old workers while the new ones were blocked waiting for the lock — resulting in zero Celery capacity during long migrations.

Back then, we didn’t have large database tables or heavy database traffic. Migrations were normally quick — the window of zero Celery capacity was brief and rarely noticed.

With a much bigger team and a lot more database traffic, long-running migrations are now common. The old system just wouldn’t cope — we’d regularly have no Celery capacity during deployments.

The solution: Migrations are now Kubernetes Jobs with Helm hook annotations:

apiVersion: batch/v1
kind: Job
metadata:
  name: migration-runner
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "-5"

The pre-install,pre-upgrade annotation means this Job runs before any deployment changes. With Helm pre-upgrade hooks, nothing gets touched until migrations finish. Only then does Kubernetes start the rolling update — and this applies to everything, not just HTTP workers.
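
The interesting part of the Job is the annotations above; the rest of the spec is an ordinary pod running the migration command. A rough sketch, with a hypothetical image reference and command:

spec:
  backoffLimit: 0                 # don't retry a failed migration automatically
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: migrate
        image: <ecr-registry>/kraken:<tag>            # hypothetical image reference
        command: ["python", "manage.py", "migrate"]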

Issue                             Before (During ASG Transition)   Now (Helm Hooks)
When do migrations run?           During ASG scale-down/up         Before any deployment changes
Old HTTP pods during migration    Stayed up (had health gate)      Unchanged — still serving traffic
Old Celery pods during migration  Scaled to zero → no capacity     Unchanged — still processing tasks

Note: we still use the distributed lock inside the migration command — we’ve had instances of ArgoCD starting a new set of Helm hooks while the previous ones were still executing.

Jobs that survive deployments

Before: Management commands and scheduled tasks ran on Celery worker EC2 instances. The most significant problem: jobs could not outlive a deployment cycle. When a new ASG started deploying, the old instances were terminated — including any long-running jobs mid-execution.

This meant all jobs had to complete within the typical CI/CD cycle time. Long-running tasks (data migrations, report generation, batch processing) were either impossible or had to be carefully scheduled around deployments.

Now: CronJobs and Jobs are first-class Kubernetes resources with their own lifecycle:

  • Survive deployments: Job pods are independent of Deployments — a release can roll out while a job keeps running
  • Resource isolation: Jobs get their own CPU/memory allocation
  • No impact on task processing: Scheduled independently from Celery workers

Long-running jobs are now possible without worrying about deployment timing.
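
A scheduled task now looks roughly like this; the schedule, names, image and command are all illustrative:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report          # hypothetical job
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid     # don't start a new run while the previous one is still going
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: report
            image: <ecr-registry>/kraken:<tag>   # hypothetical image reference
            command: ["python", "manage.py", "generate_reports"]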

Scaling and efficiency

The problem: I/O-bound workloads don’t show up in CPU metrics

Auto Scaling Groups scaled on CPU utilization. For many workloads this makes sense — if CPU is high, you need more capacity. When we first moved to Kubernetes, we tried using Horizontal Pod Autoscalers (HPAs) with CPU-based scaling. HPAs are built into Kubernetes, they’re well-understood, and they work well for CPU-bound workloads.

But our Django stack is synchronous and I/O bound. Our workers spend most of their time waiting — on database queries, on external API calls, on cache lookups. While a worker waits on I/O, it consumes almost no CPU.

This created a visibility problem:

  • High load → many workers blocked on I/O → low CPU
  • ASG sees low CPU → doesn’t scale up
  • Meanwhile, all workers are occupied and requests are queuing

Scaling pods: KEDA on application metrics

We had to throw away the HPAs and adopt KEDA.

KEDA (Kubernetes Event-Driven Autoscaler) wasn’t part of our original plan. We added it after realising we needed to scale based on metrics that actually reflect demand. We looked at how other companies had solved this problem and borrowed their ideas:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-scaler
spec:
  scaleTargetRef:
    name: api             # the Deployment to scale
  minReplicaCount: 10
  maxReplicaCount: 240
  triggers:
  # How many connections are waiting to be served?
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090   # illustrative address
      threshold: "250"
      query: sum(nginx_connections_waiting{app="api"})

  # What proportion of workers are occupied (even if just waiting on I/O)?
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090   # illustrative address
      threshold: "0.70"
      query: avg(uwsgi_worker_busy{app="api"})

  # Pre-scale for predictable traffic patterns
  - type: cron
    metadata:
      timezone: Europe/London
      start: 0 8 * * *
      end: 0 17 * * *
      desiredReplicas: "20"

Why these metrics work for I/O-bound workloads:

Metric                      What It Measures
nginx_connections_waiting   Requests queuing because no worker is free to handle them
uwsgi_worker_busy           Workers occupied — whether doing CPU work or waiting on I/O
RabbitMQ queue depth        Messages waiting because Celery workers are all busy

A worker blocked on a 2-second database query shows as “busy” in uWSGI stats, even though CPU is near zero. That’s the metric we need.

Worker type distribution: Another big improvement is that we now run separate Deployments for different worker types — billing workers target billing queues, messaging workers target messaging queues, and so on. Before, all workers ran on the same EC2 instances processing all queues. Now we can scale each type independently and allocate resources based on their needs. Billing workers get more memory, payment workers get priority scheduling, and workers handling live call data can scale independently during peak demand.
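
As a sketch of what separate Deployments per worker type means in practice (the queue, Celery app name and resource values are hypothetical):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: celery-billing
spec:
  selector:
    matchLabels:
      app: celery-billing
  template:
    metadata:
      labels:
        app: celery-billing
    spec:
      containers:
      - name: worker
        image: <ecr-registry>/kraken:<tag>   # hypothetical image reference
        # hypothetical Celery app and queue name
        command: ["celery", "-A", "kraken", "worker", "-Q", "billing"]
        resources:
          requests:
            memory: "2Gi"    # e.g. billing workers get more memory
            cpu: "500m"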

Scaling nodes: Karpenter for dynamic provisioning

Each Auto Scaling Group was configured with a specific EC2 instance type. Changing instance types meant updating Terraform and redeploying the ASG. All instances in an ASG were identical.

When we moved to Kubernetes, we initially used Cluster Autoscaler. But it uses ASGs under the hood — so we inherited the same problems: fixed instance types and slow node start-up.

Once it became clear these issues were hurting us — both in terms of how fast we could scale and how much we were spending — we switched to Karpenter. Instead of pre-defined ASGs with fixed instance types, Karpenter:

  • Right-sizes nodes on demand: It looks at what pods are waiting to be scheduled and spins up appropriately-sized EC2 instances
  • Mixes instance types: It can pick from a range of instance families based on what’s available and what’s cheapest
  • Starts faster: It talks directly to EC2 rather than going through ASG scaling
  • Packs workloads efficiently: When demand drops, it consolidates pods onto fewer nodes

Later, we also enabled spot instance support — workloads can run on spot instances for cost savings, with automatic fallback to on-demand when spot capacity is unavailable.
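
A trimmed-down sketch of what this looks like in Karpenter's NodePool API (the requirements and the referenced EC2NodeClass are illustrative):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                     # illustrative EC2NodeClass
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]     # allow spot, with on-demand as fallback
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["c", "m", "r"]           # let Karpenter mix instance families
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized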

What we learned

The underlying philosophy hasn’t changed: fail fast, fail safely, and never touch running infrastructure. But Kubernetes and GitOps give us much better control over how we do that.

Some problems were invisible when we were smaller — like the Celery capacity gaps during migrations. Others became impossible to ignore — like Terraform’s concurrency limits. It was a long journey, but it solved problems that were either holding us back or about to.

Aspect                     Before                                 Now
Load balancing             ELB                                    ALB
Compute orchestration      EC2 Auto Scaling Groups                Kubernetes Deployments + Karpenter
Node instance types        Fixed per ASG                          Dynamic, any size/type
Spot instances             Not supported                          Supported alongside on-demand
Immutable infrastructure   AMIs                                   Docker images
Config management          Consul                                 ConfigMaps/Secrets
Config validation          Runtime only                           ArgoCD validates before apply
Health checks              ELB health check                       startup/liveness/readiness probes
Migrations                 Lock + wait (HTTP only)                Helm pre-upgrade hook (all workloads)
Scaling pods               ASG on CPU (poor for I/O-bound)        KEDA on application metrics
Worker distribution        Monolithic (all types together)        Separate Deployments per type
Scaling nodes              ASG, then Cluster Autoscaler           Karpenter
Jobs                       On Celery workers, killed on deploy    Dedicated pods, survive deployments
Deployment speed           Serialized (Terraform)                 Parallel (ArgoCD per environment)
Deployment trigger         CircleCI to Terraform                  Git commit to ArgoCD
Rollback                   Re-run pipeline                        git revert

Would you like to help us build a better, greener and fairer future?

We're hiring smart people who are passionate about our mission.

Posted by Federico Marani, Staff Platform Engineer, on Feb 11, 2026