Zero-Downtime Deployments with Kubernetes
Implementing rolling updates and blue-green deployments for seamless production releases.
Why Zero-Downtime Matters
In the early days of software deployment, taking a system down for maintenance was just part of the process. You'd schedule a maintenance window, notify users, take the system offline, deploy the new version, and bring it back up. For critical systems, this meant waking up at 2 AM on a Sunday.
Modern SaaS platforms can't afford downtime. When you're running a global service with users across time zones, there's no "good time" for an outage. Every minute of downtime means lost revenue, frustrated users, and damaged trust. For many businesses, even a few seconds of downtime during peak hours can cost thousands of dollars.
Kubernetes provides powerful primitives for zero-downtime deployments, but they need to be used correctly. In this article, we'll explore how to implement rolling updates, blue-green deployments, and canary releases—all with zero user impact.
Understanding Kubernetes Rolling Updates
Rolling updates are Kubernetes' default deployment strategy. Instead of taking down all pods at once and starting new ones, Kubernetes gradually replaces old pods with new ones, ensuring that your service remains available throughout the deployment.
Here's how it works:
- Kubernetes creates a new pod with the updated version
- It waits for the new pod to become ready (passes health checks)
- Only after the new pod is healthy does it terminate an old pod
- This process repeats until all pods are updated
The key parameters that control this behavior are:
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # Max number of extra pods during the update
      maxUnavailable: 0    # Never allow pods to be unavailable
Setting maxUnavailable: 0 is critical for zero downtime: it guarantees that Kubernetes never terminates an old pod until a new one is fully ready to handle traffic.
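With this strategy in place, pushing a new image tag is enough to kick off a rolling update. A minimal sketch of driving and watching one from the command line (the deployment, container, and image names follow the examples later in this article):

# Trigger a rolling update by changing the container image
kubectl set image deployment/myapp myapp=myapp:v2.5.0

# Watch the rollout; this blocks until it succeeds or times out
kubectl rollout status deployment/myapp --timeout=10m

# Confirm the strategy the Deployment is actually using
kubectl get deployment myapp -o jsonpath='{.spec.strategy}'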
The Critical Role of Readiness Probes
Rolling updates only work if Kubernetes knows when a pod is actually ready to serve traffic. This is where readiness probes come in. A readiness probe tells Kubernetes whether a pod should receive traffic.
Here's a production-ready readiness probe configuration:
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  successThreshold: 1
  failureThreshold: 3
Your /health/ready endpoint should verify:
- Database connections: Can the app connect to the database?
- Critical dependencies: Are Redis, message queues, etc. accessible?
- Initialization: Has the app finished loading configuration, warming caches?
- Resource availability: Does the app have the resources it needs?
A common mistake is making the readiness probe too simple (just returning 200 OK) or identical to the liveness probe. The readiness probe should be more comprehensive—it's okay for a pod to be alive but not ready to serve traffic.
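You can watch readiness gating traffic directly: a pod that fails its readiness probe is marked not ready and is removed from the Service's endpoints. A quick check, assuming the pods are labeled app: myapp and the Service is named myapp as in the examples below:

# Pods failing their readiness probe show 0/1 in the READY column
kubectl get pods -l app=myapp

# Only ready pods are listed here, i.e. only they receive Service traffic
kubectl get endpoints myapp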
Handling Database Migrations Safely
Database migrations are one of the trickiest aspects of zero-downtime deployments. You can't just run migrations as part of your application startup because:
- Multiple pods might try to run migrations simultaneously
- Migrations might not be backward compatible with the old code
- Long-running migrations could block deployment
Here's a safe pattern using Kubernetes Jobs:
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration-v2.5.0
spec:
  backoffLimit: 0  # Don't retry failed migrations
  template:
    spec:
      restartPolicy: Never
      initContainers:
        - name: wait-for-db
          image: postgres:15-alpine
          command: ['sh', '-c', 'until pg_isready -h $DB_HOST -U $DB_USER; do sleep 2; done']
          env:  # The init container needs the same credentials it references above
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: host
            - name: DB_USER
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: username  # key name assumed; match your secret
      containers:
        - name: migrate
          image: myapp:v2.5.0
          command: ["./run-migrations.sh"]
          env:
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: host
The workflow:
- Run migration as a Kubernetes Job before deploying the new version
- Wait for migration to complete successfully
- Deploy the new application version
- Only after new version is stable, clean up old code paths
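In script form, the first three steps might look like this (file and resource names are illustrative):

# Run the migration Job and wait for it to finish before touching the app
kubectl apply -f db-migration-job.yaml
kubectl wait --for=condition=complete --timeout=15m job/db-migration-v2.5.0

# Only then roll out the new application version
kubectl apply -f deployment.yaml
kubectl rollout status deployment/myapp --timeout=10m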
For complex migrations, use the expand-contract pattern:
- Expand: Add new columns/tables without removing old ones
- Deploy: Deploy code that writes to both old and new schema
- Migrate: Backfill data from old to new schema
- Contract: Remove old columns/tables in a later release
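As a hypothetical sketch of the pattern, here is what renaming a users.username column to handle could look like, with each phase shipped in a separate release (the schema and psql invocations are illustrative only):

# Expand (release N): add the new column; the old code keeps working untouched
psql "$DATABASE_URL" -c 'ALTER TABLE users ADD COLUMN handle TEXT;'

# Deploy (release N): the application now writes to both username and handle

# Migrate (release N, as a backfill job): copy existing data across
psql "$DATABASE_URL" -c 'UPDATE users SET handle = username WHERE handle IS NULL;'

# Contract (release N+1, once nothing reads username anymore)
psql "$DATABASE_URL" -c 'ALTER TABLE users DROP COLUMN username;'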
Blue-Green Deployments with Kubernetes
Rolling updates are great, but sometimes you need even more control. Blue-green deployments maintain two complete environments: blue (current production) and green (new version). Traffic is switched instantly from blue to green once the green environment is validated.
In Kubernetes, you implement blue-green with labels and service selectors:
# Blue deployment (current production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
        - name: myapp
          image: myapp:v1.0.0
---
# Green deployment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
        - name: myapp
          image: myapp:v2.0.0
---
# Service points to blue initially
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue  # Change to 'green' to switch traffic
  ports:
    - port: 80
      targetPort: 8080
The deployment process:
- Deploy the green environment alongside blue
- Run smoke tests against green (without production traffic)
- Switch the service selector from version: blue to version: green (see the kubectl sketch after this list)
- Monitor for issues
- If problems occur, roll back instantly by switching back to blue
- After validation, tear down the blue environment
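The selector switch itself can be a single patch. A sketch against the Service defined above, where the reverse patch is the instant rollback:

# Send live traffic to green
kubectl patch service myapp -p '{"spec":{"selector":{"app":"myapp","version":"green"}}}'

# Instant rollback: point the selector back at blue
kubectl patch service myapp -p '{"spec":{"selector":{"app":"myapp","version":"blue"}}}'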
The advantage? Instant traffic switch and instant rollback. The disadvantage? You need 2x the resources during deployment.
Canary Releases: Gradual Risk Mitigation
Canary releases combine the gradualism of rolling updates with the control of blue-green. You deploy the new version to a small percentage of users, monitor for issues, and gradually increase traffic if everything looks good.
While Kubernetes doesn't have native canary support, you can implement it with multiple deployments and weighted traffic splitting:
# Stable version (90% of traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: myapp
      track: stable
  template:
    metadata:
      labels:
        app: myapp
        track: stable
    spec:
      containers:
        - name: myapp
          image: myapp:v1.0.0  # current stable release
---
# Canary version (10% of traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
      track: canary
  template:
    metadata:
      labels:
        app: myapp
        track: canary
    spec:
      containers:
        - name: myapp
          image: myapp:v2.0.0  # new release under test
---
# Service routes to both
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp  # Matches both stable and canary
  ports:
    - port: 80
      targetPort: 8080
With 9 stable replicas and 1 canary replica, approximately 10% of traffic goes to the new version. Gradually increase canary replicas while decreasing stable replicas:
- Start: 9 stable, 1 canary (10% canary traffic)
- After 1 hour: 7 stable, 3 canary (30% canary traffic)
- After 4 hours: 5 stable, 5 canary (50% canary traffic)
- After 8 hours: 0 stable, 10 canary (100% canary traffic)
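Each step of that schedule is just a pair of scale operations. A sketch using the deployments above:

# Move from 10% to 30% canary traffic
kubectl scale deployment/myapp-canary --replicas=3
kubectl scale deployment/myapp-stable --replicas=7

# If metrics degrade, retire the canary immediately
kubectl scale deployment/myapp-canary --replicas=0
kubectl scale deployment/myapp-stable --replicas=10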
For more sophisticated traffic splitting (exact percentages, header-based routing, etc.), use a service mesh like Istio or Linkerd.
Connection Draining and Graceful Shutdown
When Kubernetes terminates a pod, your application needs to handle it gracefully. Without proper shutdown handling, in-flight requests will fail, causing errors for your users.
Kubernetes sends a SIGTERM signal before killing a pod. Your application should:
- Stop accepting new requests (fail health checks)
- Complete existing requests (with a timeout)
- Close database connections cleanly
- Flush logs and metrics
- Exit with code 0
Here's a Node.js example:
const express = require('express');
// `db` stands in for your application's database client; any object with an
// async close() method works here (this module path is hypothetical)
const db = require('./db');

const app = express();
const server = app.listen(8080);

let isShuttingDown = false;

// Health check endpoint: start failing once shutdown begins
app.get('/health/ready', (req, res) => {
  if (isShuttingDown) {
    res.status(503).send('Shutting down');
  } else {
    res.status(200).send('OK');
  }
});

// Graceful shutdown handler
process.on('SIGTERM', () => {
  console.log('SIGTERM received, starting graceful shutdown');
  isShuttingDown = true;

  // Stop accepting new connections; in-flight requests are allowed to finish
  server.close(() => {
    console.log('HTTP server closed');

    // Close database connections
    db.close().then(() => {
      console.log('Database connections closed');
      process.exit(0);
    });
  });

  // Force shutdown after 30 seconds
  setTimeout(() => {
    console.error('Forced shutdown after timeout');
    process.exit(1);
  }, 30000);
});
Configure Kubernetes to give your app enough time:
spec:
  terminationGracePeriodSeconds: 60  # Give the app 60s to shut down
  containers:
    - name: myapp
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]  # Let k8s update endpoints
The preStop hook with a 5-second sleep is crucial. It gives Kubernetes time to remove the pod from service endpoints before your app stops accepting connections. Without this, you might still receive traffic after shutdown has started.
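It's worth exercising this path before relying on it: delete a single pod while steady traffic is flowing and confirm the error rate stays flat. A quick sketch, assuming the app: myapp label from the earlier examples:

# Pick one pod and delete it while a load test runs against the Service
POD=$(kubectl get pods -l app=myapp -o jsonpath='{.items[0].metadata.name}')
kubectl delete pod "$POD"

# The pod should drain within terminationGracePeriodSeconds with zero failed requests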
Monitoring and Rollback Strategies
Zero-downtime deployment isn't complete without proper monitoring and automated rollback. You need to detect issues quickly and roll back automatically.
Key metrics to monitor during deployment:
- Error rate: 5xx responses, application errors
- Latency: p50, p95, p99 response times
- Throughput: requests per second
- Resource usage: CPU, memory, connection pools
- Business metrics: conversion rate, checkout success, etc.
Implement automated rollback with Kubernetes rollout status:
#!/bin/bash

# Deploy new version
kubectl apply -f deployment.yaml

# Wait for rollout
if ! kubectl rollout status deployment/myapp --timeout=5m; then
  echo "Deployment failed, rolling back"
  kubectl rollout undo deployment/myapp
  exit 1
fi

# Monitor error rate for 5 minutes
for i in {1..30}; do
  ERROR_RATE=$(curl -s "http://metrics/api/error-rate?service=myapp")
  if (( $(echo "$ERROR_RATE > 1.0" | bc -l) )); then
    echo "Error rate too high ($ERROR_RATE%), rolling back"
    kubectl rollout undo deployment/myapp
    exit 1
  fi
  sleep 10
done

echo "Deployment successful"
Real-World Gotchas and Lessons Learned
1. Load Balancer Connection Draining
If you're using a cloud load balancer in front of your Service (for example, an AWS ELB created by a Service of type LoadBalancer, or a GCP load balancer), configure connection draining. The load balancer needs to stop sending traffic to a pod before Kubernetes terminates it. For an AWS ELB, that's a pair of Service annotations:
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout: "60"
2. Session Affinity Issues
If your app uses sticky sessions, rolling updates can break user sessions. Solutions:
- Use an external session store (Redis) instead of in-memory sessions
- Implement session migration logic
- Make sessions optional (graceful degradation)
3. WebSocket Connections
WebSocket connections don't automatically drain. You need to:
- Send a close message to clients before shutdown
- Implement client-side reconnection logic
- Use a longer terminationGracePeriodSeconds (120 seconds or more; see the patch sketch below)
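A sketch of bumping the grace period on an existing deployment (the 120-second value is a starting point, not a rule):

# Give WebSocket-heavy pods more time to drain connections before SIGKILL
kubectl patch deployment myapp -p \
  '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":120}}}}'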
4. Distributed Locks and Leader Election
If your app uses leader election (only one pod processes certain tasks), ensure proper handoff during deployment. Use Kubernetes leases or distributed lock implementations that handle node changes gracefully.
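If your election is backed by Kubernetes coordination leases, you can watch the handoff during a rollout. A minimal sketch, assuming a hypothetical lease named myapp-leader in the production namespace:

# Show which pod currently holds the leader lease
kubectl get lease myapp-leader -n production -o jsonpath='{.spec.holderIdentity}'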
Putting It All Together
Zero-downtime deployments with Kubernetes require careful orchestration of multiple components:
- Choose your strategy: Rolling update for most cases, blue-green for critical releases, canary for gradual rollouts
- Implement proper health checks: Readiness and liveness probes that accurately reflect application state
- Handle migrations carefully: Use jobs for migrations, implement expand-contract pattern
- Implement graceful shutdown: Handle SIGTERM, drain connections, set appropriate timeouts
- Monitor actively: Track error rates, latency, and business metrics during deployment
- Automate rollback: Don't rely on manual intervention during incidents
The initial investment in setting up zero-downtime deployments pays off immediately. You can deploy multiple times per day without worrying about user impact, respond to incidents faster, and sleep better at night knowing your deployments won't wake you up with production outages.
Need help implementing zero-downtime deployments for your infrastructure? Let's talk. We help teams design robust deployment pipelines, migrate to Kubernetes, and build reliable production systems.