API Outage Playbook

Severity: P1 - Critical
Target Time to Mitigate: < 30 minutes

Symptoms

  • Health check endpoint returns non-200 status
  • Users report "Service Unavailable" errors
  • Monitoring shows API pods down or crashing

Initial Response (First 5 minutes)

1. Confirm the Outage

# Check API health
curl -I https://api.blueberry.example.com/health

# Check pod status
kubectl get pods -n blueberry | grep api

# Check recent events
kubectl get events -n blueberry --sort-by='.lastTimestamp' | head -20
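The manual health check above can be wrapped in a small helper for repeated use during the incident. This is a minimal sketch; `check_health` and the `API_HEALTH_URL` variable are illustrative names, not part of an existing tooling set.

```shell
# Assumed default; override API_HEALTH_URL for staging or other environments.
API_HEALTH_URL="${API_HEALTH_URL:-https://api.blueberry.example.com/health}"

# check_health: prints the endpoint state and returns 0 only on HTTP 200.
check_health() {
  local status
  status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$API_HEALTH_URL")
  if [ "$status" = "200" ]; then
    echo "healthy ($status)"
    return 0
  else
    echo "unhealthy ($status)"
    return 1
  fi
}
```

Run it in a loop (`watch -n 10 check_health` won't see shell functions, so use `while sleep 10; do check_health; done`) to track the outage while you work through diagnostics.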

2. Declare Incident

Post in #blueberry-incidents:

🚨 INCIDENT DECLARED 🚨
Severity: P1
Impact: API is down - all environment operations failing
Status: Investigating
IC: @[your-name]
Thread: 👇

3. Quick Diagnostics

# Check logs from crashed pods
kubectl logs -n blueberry deployment/blueberry-api --previous --tail=100

# Check for resource issues
kubectl describe pod -n blueberry -l app=blueberry-api

# Check ingress
kubectl get ingress -n blueberry

Resolution Steps

Option 1: Simple Restart (Try First)

# Restart the deployment
kubectl rollout restart deployment/blueberry-api -n blueberry

# Watch the rollout
kubectl rollout status deployment/blueberry-api -n blueberry

# Verify pods are running
kubectl get pods -n blueberry -l app=blueberry-api

Option 2: Roll Back to Previous Version

# Check rollout history
kubectl rollout history deployment/blueberry-api -n blueberry

# Roll back to the previous revision
kubectl rollout undo deployment/blueberry-api -n blueberry

# Or roll back to a specific revision
kubectl rollout undo deployment/blueberry-api -n blueberry --to-revision=<number>

Option 3: Scale and Debug

# Scale down to 0
kubectl scale deployment/blueberry-api -n blueberry --replicas=0

# Clear any PVCs if needed (WARNING: this permanently deletes the data on those volumes)
kubectl delete pvc -n blueberry -l app=blueberry-api

# Scale back up
kubectl scale deployment/blueberry-api -n blueberry --replicas=3

Option 4: Emergency Direct Fix

If ArgoCD sync is failing:

# Temporarily disable auto-sync
argocd app set blueberry --sync-policy none

# Apply emergency fix directly
kubectl set image deployment/blueberry-api -n blueberry \
  blueberry-api=gcr.io/blueberry-project/blueberry:last-known-good

# Re-enable sync after fix
argocd app set blueberry --sync-policy automated

Verification

After applying a fix:

# Check pods are running
kubectl get pods -n blueberry -l app=blueberry-api

# Test health endpoint
curl -s https://api.blueberry.example.com/health | jq .

# Check logs for errors
kubectl logs -n blueberry deployment/blueberry-api --tail=50

# Test basic functionality
curl -H "Authorization: Bearer $TOKEN" \
  https://api.blueberry.example.com/api/environments
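Rather than re-running the health check by hand, the verification can be scripted as a bounded polling loop. A minimal sketch; `wait_for_healthy` is an illustrative name, and the retry count and delay are assumptions to tune for your environment.

```shell
# wait_for_healthy URL [RETRIES] [DELAY_SECONDS]
# Polls the health endpoint until it returns HTTP 200 or the retry
# budget is exhausted; returns 0 on success, 1 on failure.
wait_for_healthy() {
  local url="$1" retries="${2:-10}" delay="${3:-6}"
  local i code
  for i in $(seq 1 "$retries"); do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url" || echo 000)
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "still unhealthy after $retries attempts (last status: $code)"
  return 1
}

# Example:
# wait_for_healthy https://api.blueberry.example.com/health 10 6
```

A non-zero exit makes this easy to chain: `wait_for_healthy "$URL" && echo "post resolved message"`.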

Common Root Causes

Cause            Indicators                      Fix
OOM Kill         OOMKilled in pod status         Increase memory limits
Bad Config       Logs show config errors         Roll back or fix ConfigMap
Database Issue   Connection timeout errors       Check Firestore/Redis
Bad Deploy       Started after a recent deploy   Roll back deployment
Resource Limits  Pods stuck in Pending           Scale cluster or reduce replicas
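The mapping in the table above can be turned into a quick triage hint. This is a heuristic sketch, not a diagnosis; `triage_hint` is an illustrative name, and the waiting/terminated reasons are standard Kubernetes status values.

```shell
# triage_hint REASON: given a pod's status reason, print the likely cause
# from the root-cause table. Unrecognized reasons fall through to a
# generic prompt to inspect logs and events.
triage_hint() {
  case "$1" in
    OOMKilled)        echo "Likely OOM kill: increase memory limits" ;;
    CrashLoopBackOff) echo "Likely bad config or bad deploy: check logs, consider rollback" ;;
    Pending)          echo "Likely resource limits: scale cluster or reduce replicas" ;;
    *)                echo "Unknown state '$1': inspect logs and events" ;;
  esac
}

# Example: feed it the waiting reason of the first container, e.g.
# triage_hint "$(kubectl get pod <pod-name> -n blueberry \
#   -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}')"
```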

Communication Updates

Every 15 minutes until resolved:

UPDATE: [Time]
Status: [Investigating/Implementing Fix/Monitoring]
Next: [What you're doing next]
ETA: [Best estimate]

Resolution message:

✅ RESOLVED
Duration: [Time]
Root Cause: [Brief description]
Fix: [What fixed it]
Follow-up: Postmortem scheduled for [date/time]

Post-Incident

  1. Update status page to "Operational"
  2. Notify stakeholders of resolution
  3. Schedule postmortem within 24 hours
  4. Create follow-up tickets for:
     • Permanent fixes
     • Monitoring improvements
     • Runbook updates

Document ID: workflows/operations/incident-response/playbooks/api-outage