Zero-Downtime Deployments with Kubernetes
Implementing rolling updates and blue-green deployments for seamless production releases.
Why Zero-Downtime Matters
In the early days of software deployment, taking a system down for maintenance was just part of the process. You'd schedule a maintenance window, notify users, take the system offline, deploy the new version, and bring it back up. For critical systems, this meant waking up at 2 AM on a Sunday.
Modern SaaS platforms can't afford downtime. When you're running a global service with users across time zones, there's no "good time" for an outage. Every minute of downtime means lost revenue, frustrated users, and damaged trust. For many businesses, even a few seconds of downtime during peak hours can cost thousands of dollars.
Kubernetes provides powerful primitives for zero-downtime deployments, but they need to be used correctly. In this article, we'll explore how to implement rolling updates, blue-green deployments, and canary releases—all with zero user impact.
Understanding Kubernetes Rolling Updates
Rolling updates are Kubernetes' default deployment strategy. Instead of taking down all pods at once and starting new ones, Kubernetes gradually replaces old pods with new ones, ensuring that your service remains available throughout the deployment.
Here's how it works:
- Kubernetes creates a new pod with the updated version
- It waits for the new pod to become ready (passes health checks)
- Only after the new pod is healthy does it terminate an old pod
- This process repeats until all pods are updated
The key parameters that control this behavior are:
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # Max number of extra pods during the update
      maxUnavailable: 0    # Never allow pods to be unavailable
Setting maxUnavailable: 0 is critical for zero downtime: it guarantees that Kubernetes never terminates an old pod until a new one is fully ready to handle traffic.
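With this strategy in place, pushing a new image tag is enough to kick off a rolling update. A minimal sketch of driving and watching one from the command line (the deployment, container, and image names follow the examples later in this article):

# Trigger a rolling update by changing the container image
kubectl set image deployment/myapp myapp=myapp:v2.5.0

# Watch the rollout; this blocks until it succeeds or times out
kubectl rollout status deployment/myapp --timeout=10m

# Confirm the strategy the Deployment is actually using
kubectl get deployment myapp -o jsonpath='{.spec.strategy}'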
The Critical Role of Readiness Probes
Rolling updates only work if Kubernetes knows when a pod is actually ready to serve traffic. This is where readiness probes come in. A readiness probe tells Kubernetes whether a pod should receive traffic.
Here's a production-ready readiness probe configuration:
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  successThreshold: 1
  failureThreshold: 3
Your /health/ready endpoint should verify:
- Database connections: Can the app connect to the database?
- Critical dependencies: Are Redis, message queues, etc. accessible?
- Initialization: Has the app finished loading configuration, warming caches?
- Resource availability: Does the app have the resources it needs?
A common mistake is making the readiness probe too simple (just returning 200 OK) or identical to the liveness probe. The readiness probe should be more comprehensive—it's okay for a pod to be alive but not ready to serve traffic.
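You can watch readiness gating traffic directly: a pod that fails its readiness probe is marked not ready and is removed from the Service's endpoints. A quick check, assuming the pods are labeled app: myapp and the Service is named myapp as in the examples below:

# Pods failing their readiness probe show 0/1 in the READY column
kubectl get pods -l app=myapp

# Only ready pods are listed here, i.e. only they receive Service traffic
kubectl get endpoints myapp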
Handling Database Migrations Safely
Database migrations are one of the trickiest aspects of zero-downtime deployments. You can't just run migrations as part of your application startup because:
- Multiple pods might try to run migrations simultaneously
- Migrations might not be backward compatible with the old code
- Long-running migrations could block deployment
Here's a safe pattern using Kubernetes Jobs:
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration-v2.5.0
spec:
  backoffLimit: 0  # Don't retry failed migrations
  template:
    spec:
      restartPolicy: Never
      initContainers:
        - name: wait-for-db
          image: postgres:15-alpine
          command: ['sh', '-c', 'until pg_isready -h $DB_HOST -U $DB_USER; do sleep 2; done']
          env:  # The init container needs the same credentials it references above
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: host
            - name: DB_USER
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: username  # key name assumed; match your secret
      containers:
        - name: migrate
          image: myapp:v2.5.0
          command: ["./run-migrations.sh"]
          env:
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: host
The workflow:
- Run migration as a Kubernetes Job before deploying the new version
- Wait for migration to complete successfully
- Deploy the new application version
- Only after new version is stable, clean up old code paths
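In script form, the first three steps might look like this (file and resource names are illustrative):

# Run the migration Job and wait for it to finish before touching the app
kubectl apply -f db-migration-job.yaml
kubectl wait --for=condition=complete --timeout=15m job/db-migration-v2.5.0

# Only then roll out the new application version
kubectl apply -f deployment.yaml
kubectl rollout status deployment/myapp --timeout=10m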
For complex migrations, use the expand-contract pattern:
- Expand: Add new columns/tables without removing old ones
- Deploy: Deploy code that writes to both old and new schema
- Migrate: Backfill data from old to new schema
- Contract: Remove old columns/tables in a later release
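As a hypothetical sketch of the pattern, here is what renaming a users.username column to handle could look like, with each phase shipped in a separate release (the schema and psql invocations are illustrative only):

# Expand (release N): add the new column; the old code keeps working untouched
psql "$DATABASE_URL" -c 'ALTER TABLE users ADD COLUMN handle TEXT;'

# Deploy (release N): the application now writes to both username and handle

# Migrate (release N, as a backfill job): copy existing data across
psql "$DATABASE_URL" -c 'UPDATE users SET handle = username WHERE handle IS NULL;'

# Contract (release N+1, once nothing reads username anymore)
psql "$DATABASE_URL" -c 'ALTER TABLE users DROP COLUMN username;'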
Blue-Green Deployments with Kubernetes
Rolling updates are great, but sometimes you need even more control. Blue-green deployments maintain two complete environments: blue (current production) and green (new version). Traffic is switched instantly from blue to green once the green environment is validated.
In Kubernetes, you implement blue-green with labels and service selectors:
# Blue deployment (current production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
        - name: myapp
          image: myapp:v1.0.0
---
# Green deployment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
        - name: myapp
          image: myapp:v2.0.0
---
# Service points to blue initially
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue  # Change to 'green' to switch traffic
  ports:
    - port: 80
      targetPort: 8080
The deployment process:
- Deploy the green environment alongside blue
- Run smoke tests against green (without production traffic)
- Switch the service selector from version: blue to version: green (see the kubectl sketch after this list)
- Monitor for issues
- If problems occur, roll back instantly by switching back to blue
- After validation, tear down the blue environment
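The selector switch itself can be a single patch. A sketch against the Service defined above, where the reverse patch is the instant rollback:

# Send live traffic to green
kubectl patch service myapp -p '{"spec":{"selector":{"app":"myapp","version":"green"}}}'

# Instant rollback: point the selector back at blue
kubectl patch service myapp -p '{"spec":{"selector":{"app":"myapp","version":"blue"}}}'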
The advantage? Instant traffic switch and instant rollback. The disadvantage? You need 2x the resources during deployment.
Canary Releases: Gradual Risk Mitigation
Canary releases combine the gradualism of rolling updates with the control of blue-green. You deploy the new version to a small percentage of users, monitor for issues, and gradually increase traffic if everything looks good.
While Kubernetes doesn't have native canary support, you can implement it with multiple deployments and weighted traffic splitting:
# Stable version (90% of traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: myapp
      track: stable
  template:
    metadata:
      labels:
        app: myapp
        track: stable
    spec:
      containers:
        - name: myapp
          image: myapp:v1.0.0  # current stable release
---
# Canary version (10% of traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
      track: canary
  template:
    metadata:
      labels:
        app: myapp
        track: canary
    spec:
      containers:
        - name: myapp
          image: myapp:v2.0.0  # new release under test
---
# Service routes to both
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp  # Matches both stable and canary
  ports:
    - port: 80
      targetPort: 8080
With 9 stable replicas and 1 canary replica, approximately 10% of traffic goes to the new version. Gradually increase canary replicas while decreasing stable replicas:
- Start: 9 stable, 1 canary (10% canary traffic)
- After 1 hour: 7 stable, 3 canary (30% canary traffic)
- After 4 hours: 5 stable, 5 canary (50% canary traffic)
- After 8 hours: 0 stable, 10 canary (100% canary traffic)
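Each step of that schedule is just a pair of scale operations. A sketch using the deployments above:

# Move from 10% to 30% canary traffic
kubectl scale deployment/myapp-canary --replicas=3
kubectl scale deployment/myapp-stable --replicas=7

# If metrics degrade, retire the canary immediately
kubectl scale deployment/myapp-canary --replicas=0
kubectl scale deployment/myapp-stable --replicas=10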
For more sophisticated traffic splitting (exact percentages, header-based routing, etc.), use a service mesh like Istio or Linkerd.
Connection Draining and Graceful Shutdown
When Kubernetes terminates a pod, your application needs to handle it gracefully. Without proper shutdown handling, in-flight requests will fail, causing errors for your users.
Kubernetes sends a SIGTERM signal before killing a pod. Your application should:
- Stop accepting new requests (fail health checks)
- Complete existing requests (with a timeout)
- Close database connections cleanly
- Flush logs and metrics
- Exit with code 0
Here's a Node.js example:
const express = require('express');
// `db` stands in for your application's database client; any object with an
// async close() method works here (this module path is hypothetical)
const db = require('./db');

const app = express();
const server = app.listen(8080);

let isShuttingDown = false;

// Health check endpoint: start failing once shutdown begins
app.get('/health/ready', (req, res) => {
  if (isShuttingDown) {
    res.status(503).send('Shutting down');
  } else {
    res.status(200).send('OK');
  }
});

// Graceful shutdown handler
process.on('SIGTERM', () => {
  console.log('SIGTERM received, starting graceful shutdown');
  isShuttingDown = true;

  // Stop accepting new connections; in-flight requests are allowed to finish
  server.close(() => {
    console.log('HTTP server closed');

    // Close database connections
    db.close().then(() => {
      console.log('Database connections closed');
      process.exit(0);
    });
  });

  // Force shutdown after 30 seconds
  setTimeout(() => {
    console.error('Forced shutdown after timeout');
    process.exit(1);
  }, 30000);
});
Configure Kubernetes to give your app enough time:
spec:
  terminationGracePeriodSeconds: 60  # Give the app 60s to shut down
  containers:
    - name: myapp
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]  # Let k8s update endpoints
The preStop hook with a 5-second sleep is crucial. It gives Kubernetes time to remove the pod from service endpoints before your app stops accepting connections. Without this, you might still receive traffic after shutdown has started.
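It's worth exercising this path before relying on it: delete a single pod while steady traffic is flowing and confirm the error rate stays flat. A quick sketch, assuming the app: myapp label from the earlier examples:

# Pick one pod and delete it while a load test runs against the Service
POD=$(kubectl get pods -l app=myapp -o jsonpath='{.items[0].metadata.name}')
kubectl delete pod "$POD"

# The pod should drain within terminationGracePeriodSeconds with zero failed requests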
Monitoring and Rollback Strategies
Zero-downtime deployment isn't complete without proper monitoring and automated rollback. You need to detect issues quickly and roll back automatically.
Key metrics to monitor during deployment:
- Error rate: 5xx responses, application errors
- Latency: p50, p95, p99 response times
- Throughput: requests per second
- Resource usage: CPU, memory, connection pools
- Business metrics: conversion rate, checkout success, etc.
Implement automated rollback with Kubernetes rollout status:
#!/bin/bash

# Deploy new version
kubectl apply -f deployment.yaml

# Wait for rollout
if ! kubectl rollout status deployment/myapp --timeout=5m; then
  echo "Deployment failed, rolling back"
  kubectl rollout undo deployment/myapp
  exit 1
fi

# Monitor error rate for 5 minutes
for i in {1..30}; do
  ERROR_RATE=$(curl -s "http://metrics/api/error-rate?service=myapp")
  if (( $(echo "$ERROR_RATE > 1.0" | bc -l) )); then
    echo "Error rate too high ($ERROR_RATE%), rolling back"
    kubectl rollout undo deployment/myapp
    exit 1
  fi
  sleep 10
done

echo "Deployment successful"
Real-World Gotchas and Lessons Learned
1. Load Balancer Connection Draining
If you're using a cloud load balancer in front of your Service (for example, an AWS ELB created by a Service of type LoadBalancer, or a GCP load balancer), configure connection draining. The load balancer needs to stop sending traffic to a pod before Kubernetes terminates it. For an AWS ELB, that's a pair of Service annotations:
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout: "60"
2. Session Affinity Issues
If your app uses sticky sessions, rolling updates can break user sessions. Solutions:
- Use an external session store (Redis) instead of in-memory sessions
- Implement session migration logic
- Make sessions optional (graceful degradation)
3. WebSocket Connections
WebSocket connections don't automatically drain. You need to:
- Send a close message to clients before shutdown
- Implement client-side reconnection logic
- Use a longer terminationGracePeriodSeconds (120 seconds or more; see the patch sketch below)
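A sketch of bumping the grace period on an existing deployment (the 120-second value is a starting point, not a rule):

# Give WebSocket-heavy pods more time to drain connections before SIGKILL
kubectl patch deployment myapp -p \
  '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":120}}}}'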
4. Distributed Locks and Leader Election
If your app uses leader election (only one pod processes certain tasks), ensure proper handoff during deployment. Use Kubernetes leases or distributed lock implementations that handle node changes gracefully.
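If your election is backed by Kubernetes coordination leases, you can watch the handoff during a rollout. A minimal sketch, assuming a hypothetical lease named myapp-leader in the production namespace:

# Show which pod currently holds the leader lease
kubectl get lease myapp-leader -n production -o jsonpath='{.spec.holderIdentity}'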
Putting It All Together
Zero-downtime deployments with Kubernetes require careful orchestration of multiple components:
- Choose your strategy: Rolling update for most cases, blue-green for critical releases, canary for gradual rollouts
- Implement proper health checks: Readiness and liveness probes that accurately reflect application state
- Handle migrations carefully: Use jobs for migrations, implement expand-contract pattern
- Implement graceful shutdown: Handle SIGTERM, drain connections, set appropriate timeouts
- Monitor actively: Track error rates, latency, and business metrics during deployment
- Automate rollback: Don't rely on manual intervention during incidents
The initial investment in setting up zero-downtime deployments pays off immediately. You can deploy multiple times per day without worrying about user impact, respond to incidents faster, and sleep better at night knowing your deployments won't wake you up with production outages.
Need help implementing zero-downtime deployments for your infrastructure? Let's talk. We help teams design robust deployment pipelines, migrate to Kubernetes, and build reliable production systems.