API Outage Playbook

Severity: P1 - Critical
Target Time to Mitigate: < 30 minutes

Symptoms

  • Health check endpoint returns non-200 status
  • Users report "Service Unavailable" errors
  • Monitoring shows API pods down or crashing

Initial Response (First 5 minutes)

1. Confirm the Outage

# Check API health
curl -I https://api.blueberry.example.com/health

# Check pod status
kubectl get pods -n blueberry | grep api

# Check recent events
kubectl get events -n blueberry --sort-by='.lastTimestamp' | head -20
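The manual health check above can be wrapped in a small helper for repeated use during the incident. This is a minimal sketch; `check_health` and the `API_HEALTH_URL` variable are illustrative names, not part of an existing tooling set.

```shell
# Assumed default; override API_HEALTH_URL for staging or other environments.
API_HEALTH_URL="${API_HEALTH_URL:-https://api.blueberry.example.com/health}"

# check_health: prints the endpoint state and returns 0 only on HTTP 200.
check_health() {
  local status
  status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$API_HEALTH_URL")
  if [ "$status" = "200" ]; then
    echo "healthy ($status)"
    return 0
  else
    echo "unhealthy ($status)"
    return 1
  fi
}
```

Run it in a loop (`watch -n 10 check_health` won't see shell functions, so use `while sleep 10; do check_health; done`) to track the outage while you work through diagnostics.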

2. Declare Incident

Post in #blueberry-incidents:

🚨 INCIDENT DECLARED 🚨
Severity: P1
Impact: API is down - all environment operations failing
Status: Investigating
IC: @[your-name]
Thread: 👇

3. Quick Diagnostics

# Check logs from crashed pods
kubectl logs -n blueberry deployment/blueberry-api --previous --tail=100

# Check for resource issues
kubectl describe pod -n blueberry -l app=blueberry-api

# Check ingress
kubectl get ingress -n blueberry

Resolution Steps

Option 1: Simple Restart (Try First)

# Restart the deployment
kubectl rollout restart deployment/blueberry-api -n blueberry

# Watch the rollout
kubectl rollout status deployment/blueberry-api -n blueberry

# Verify pods are running
kubectl get pods -n blueberry -l app=blueberry-api

Option 2: Roll Back to Previous Version

# Check rollout history
kubectl rollout history deployment/blueberry-api -n blueberry

# Roll back to the previous revision
kubectl rollout undo deployment/blueberry-api -n blueberry

# Or roll back to a specific revision
kubectl rollout undo deployment/blueberry-api -n blueberry --to-revision=<number>

Option 3: Scale and Debug

# Scale down to 0
kubectl scale deployment/blueberry-api -n blueberry --replicas=0

# Clear any PVCs if needed (WARNING: this permanently deletes the data on those volumes)
kubectl delete pvc -n blueberry -l app=blueberry-api

# Scale back up
kubectl scale deployment/blueberry-api -n blueberry --replicas=3

Option 4: Emergency Direct Fix

If ArgoCD sync is failing:

# Temporarily disable auto-sync
argocd app set blueberry --sync-policy none

# Apply emergency fix directly
kubectl set image deployment/blueberry-api -n blueberry \
  blueberry-api=gcr.io/blueberry-project/blueberry:last-known-good

# Re-enable sync after fix
argocd app set blueberry --sync-policy automated

Verification

After applying a fix:

# Check pods are running
kubectl get pods -n blueberry -l app=blueberry-api

# Test health endpoint
curl -s https://api.blueberry.example.com/health | jq .

# Check logs for errors
kubectl logs -n blueberry deployment/blueberry-api --tail=50

# Test basic functionality
curl -H "Authorization: Bearer $TOKEN" \
  https://api.blueberry.example.com/api/environments
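Rather than re-running the health check by hand, the verification can be scripted as a bounded polling loop. A minimal sketch; `wait_for_healthy` is an illustrative name, and the retry count and delay are assumptions to tune for your environment.

```shell
# wait_for_healthy URL [RETRIES] [DELAY_SECONDS]
# Polls the health endpoint until it returns HTTP 200 or the retry
# budget is exhausted; returns 0 on success, 1 on failure.
wait_for_healthy() {
  local url="$1" retries="${2:-10}" delay="${3:-6}"
  local i code
  for i in $(seq 1 "$retries"); do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url" || echo 000)
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "still unhealthy after $retries attempts (last status: $code)"
  return 1
}

# Example:
# wait_for_healthy https://api.blueberry.example.com/health 10 6
```

A non-zero exit makes this easy to chain: `wait_for_healthy "$URL" && echo "post resolved message"`.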

Common Root Causes

Cause            Indicators                      Fix
OOM Kill         OOMKilled in pod status         Increase memory limits
Bad Config       Logs show config errors         Roll back or fix ConfigMap
Database Issue   Connection timeout errors       Check Firestore/Redis
Bad Deploy       Started after a recent deploy   Roll back deployment
Resource Limits  Pods stuck in Pending           Scale cluster or reduce replicas
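The mapping in the table above can be turned into a quick triage hint. This is a heuristic sketch, not a diagnosis; `triage_hint` is an illustrative name, and the waiting/terminated reasons are standard Kubernetes status values.

```shell
# triage_hint REASON: given a pod's status reason, print the likely cause
# from the root-cause table. Unrecognized reasons fall through to a
# generic prompt to inspect logs and events.
triage_hint() {
  case "$1" in
    OOMKilled)        echo "Likely OOM kill: increase memory limits" ;;
    CrashLoopBackOff) echo "Likely bad config or bad deploy: check logs, consider rollback" ;;
    Pending)          echo "Likely resource limits: scale cluster or reduce replicas" ;;
    *)                echo "Unknown state '$1': inspect logs and events" ;;
  esac
}

# Example: feed it the waiting reason of the first container, e.g.
# triage_hint "$(kubectl get pod <pod-name> -n blueberry \
#   -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}')"
```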

Communication Updates

Every 15 minutes until resolved:

UPDATE: [Time]
Status: [Investigating/Implementing Fix/Monitoring]
Next: [What you're doing next]
ETA: [Best estimate]

Resolution message:

✅ RESOLVED
Duration: [Time]
Root Cause: [Brief description]
Fix: [What fixed it]
Follow-up: Postmortem scheduled for [date/time]

Post-Incident

  1. Update status page to "Operational"
  2. Notify stakeholders of resolution
  3. Schedule postmortem within 24 hours
  4. Create follow-up tickets for:
     • Permanent fixes
     • Monitoring improvements
     • Runbook updates

Document ID: workflows/operations/incident-response/playbooks/api-outage