API Outage Playbook
Severity: P1 - Critical
Time to Mitigate: < 30 minutes
Symptoms
- Health check endpoint returns non-200 status
- Users report "Service Unavailable" errors
- Monitoring shows API pods down or crashing
Initial Response (First 5 minutes)
1. Confirm the Outage
# Check API health
curl -I https://api.blueberry.example.com/health
# Check pod status
kubectl get pods -n blueberry | grep api
# Check recent events
kubectl get events -n blueberry --sort-by='.lastTimestamp' | tail -20
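The health check above can be reduced to a single verdict for use in scripts. This is a minimal sketch, assuming the health endpoint answers GET requests; `classify_health` is a hypothetical helper name, not part of any existing tooling. curl prints `000` for `%{http_code}` when it receives no HTTP response at all.

```shell
# Map an HTTP status code (as printed by curl's %{http_code}) to a verdict.
# classify_health is a hypothetical helper introduced for illustration.
classify_health() {
  case "$1" in
    200) echo "healthy" ;;
    000) echo "unreachable" ;;   # curl received no HTTP response
    *)   echo "unhealthy" ;;
  esac
}

# Usage (not run here):
#   code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
#     https://api.blueberry.example.com/health)
#   classify_health "$code"
```

An "unreachable" result points toward ingress, DNS, or network problems rather than a crashing pod, which may help decide which diagnostic to run first.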
2. Declare Incident
Post in #blueberry-incidents:
🚨 INCIDENT DECLARED 🚨
Severity: P1
Impact: API is down - all environment operations failing
Status: Investigating
IC: @[your-name]
Thread: 👇
3. Quick Diagnostics
# Check logs from crashed pods
kubectl logs -n blueberry deployment/blueberry-api --previous --tail=100
# Check for resource issues
kubectl describe pod -n blueberry -l app=blueberry-api
# Check ingress
kubectl get ingress -n blueberry
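When many pods are listed, it can help to filter down to the ones that are not fully ready. The sketch below assumes the default `kubectl get pods --no-headers` column order (NAME, READY, STATUS, ...); `not_ready_pods` is a hypothetical helper name.

```shell
# Print "NAME STATUS READY" for every pod that is not fully ready.
# Reads `kubectl get pods --no-headers` output on stdin.
# not_ready_pods is a hypothetical helper introduced for illustration.
not_ready_pods() {
  awk '{
    split($2, ready, "/")    # READY column looks like "1/2"
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $3, $2
  }'
}

# Usage (not run here):
#   kubectl get pods -n blueberry -l app=blueberry-api --no-headers | not_ready_pods
```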
Resolution Steps
Option 1: Simple Restart (Try First)
# Restart the deployment
kubectl rollout restart deployment/blueberry-api -n blueberry
# Watch the rollout
kubectl rollout status deployment/blueberry-api -n blueberry
# Verify pods are running
kubectl get pods -n blueberry -l app=blueberry-api
Option 2: Rollback to Previous Version
# Check rollout history
kubectl rollout history deployment/blueberry-api -n blueberry
# Rollback to previous version
kubectl rollout undo deployment/blueberry-api -n blueberry
# Or rollback to specific revision
kubectl rollout undo deployment/blueberry-api -n blueberry --to-revision=<number>
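Before rolling back, it may be worth confirming which revision a plain `rollout undo` would land on. The sketch below assumes `kubectl rollout history` lists revisions in ascending order with the revision number in the first column; `previous_revision` is a hypothetical helper name.

```shell
# Print the second-highest revision number from `kubectl rollout history`
# output read on stdin. Prints nothing if fewer than two revisions exist.
# previous_revision is a hypothetical helper introduced for illustration.
previous_revision() {
  awk '$1 ~ /^[0-9]+$/ { prev = last; last = $1 }
       END { if (prev != "") print prev }'
}

# Usage (not run here):
#   rev=$(kubectl rollout history deployment/blueberry-api -n blueberry | previous_revision)
#   kubectl rollout history deployment/blueberry-api -n blueberry --revision="$rev"
```

The second command prints the pod template for that revision, so the image tag can be checked before the rollback is applied.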
Option 3: Scale and Debug
# Scale down to 0
kubectl scale deployment/blueberry-api -n blueberry --replicas=0
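# Clear any PVCs if needed (destructive: permanently deletes the volume data;
# only do this when the stored state is known to be corrupt or disposable)
kubectl delete pvc -n blueberry -l app=blueberry-api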
# Scale back up (use the deployment's normal replica count)
kubectl scale deployment/blueberry-api -n blueberry --replicas=3
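Rather than hard-coding the replica count, the scale-down and scale-up steps can record the current value and restore it afterwards. This is a sketch under the assumption that no autoscaler is managing the deployment; `with_scaled_down` is a hypothetical helper name.

```shell
# Scale a deployment to zero, run a command, then restore the replica count
# that was in place beforehand.
# Arguments: NAMESPACE DEPLOYMENT COMMAND...
# with_scaled_down is a hypothetical helper introduced for illustration.
with_scaled_down() {
  _ns=$1; _deploy=$2; shift 2
  # Remember the current desired replica count.
  _replicas=$(kubectl get deployment "$_deploy" -n "$_ns" \
    -o jsonpath='{.spec.replicas}') || return 1
  kubectl scale "deployment/$_deploy" -n "$_ns" --replicas=0 || return 1
  # Run the debugging command; keep its exit status for the caller.
  _status=0
  "$@" || _status=$?
  kubectl scale "deployment/$_deploy" -n "$_ns" --replicas="$_replicas" || return 1
  return "$_status"
}

# Usage (not run here):
#   with_scaled_down blueberry blueberry-api sleep 60
```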
Option 4: Emergency Direct Fix
If ArgoCD sync is failing:
# Temporarily disable auto-sync
argocd app set blueberry --sync-policy none
# Apply emergency fix directly
kubectl set image deployment/blueberry-api -n blueberry \
  blueberry-api=gcr.io/blueberry-project/blueberry:last-known-good
# Re-enable sync after fix
argocd app set blueberry --sync-policy automated
Verification
After applying a fix:
# Check pods are running
kubectl get pods -n blueberry -l app=blueberry-api
# Test health endpoint
curl -s https://api.blueberry.example.com/health | jq .
# Check logs for errors
kubectl logs -n blueberry deployment/blueberry-api --tail=50
# Test basic functionality
curl -H "Authorization: Bearer $TOKEN" \
  https://api.blueberry.example.com/api/environments
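Pods often need a short while to become ready after a restart or rollback, so a single health check can fail spuriously. The sketch below retries any command a fixed number of times; `wait_until_ok` is a hypothetical helper name, and the attempt count and delay in the usage line are assumptions to tune.

```shell
# Retry a command until it succeeds or the attempts run out.
# Arguments: MAX_ATTEMPTS DELAY_SECONDS COMMAND...
# wait_until_ok is a hypothetical helper introduced for illustration.
wait_until_ok() {
  _max=$1; _delay=$2; shift 2
  _attempt=1
  while [ "$_attempt" -le "$_max" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only between attempts, not after the final failure.
    if [ "$_attempt" -lt "$_max" ]; then
      sleep "$_delay"
    fi
    _attempt=$((_attempt + 1))
  done
  return 1
}

# Usage (not run here): up to 10 attempts, 5 seconds apart.
#   wait_until_ok 10 5 curl -fsS -o /dev/null https://api.blueberry.example.com/health
```

With `-f`, curl exits non-zero on HTTP 4xx/5xx responses, so the loop only stops early once the endpoint returns a success status.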
Common Root Causes
Cause | Indicators | Fix |
---|---|---|
OOM Kill | OOMKilled in pod status | Increase memory limits |
Bad Config | Logs show config errors | Rollback or fix ConfigMap |
Database Issue | Connection timeout errors | Check Firestore/Redis |
Bad Deploy | Started after recent deploy | Rollback deployment |
Resource Limits | Pending pods | Scale cluster or reduce replicas |
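The OOM Kill indicator can be checked directly from the last termination reason recorded on each container. A minimal sketch, assuming pod names and reasons are emitted as tab-separated lines by the jsonpath query in the usage comment; `oom_killed_pods` is a hypothetical helper name.

```shell
# Print the names of pods whose last container termination reason was
# OOMKilled. Reads "NAME<TAB>REASONS" lines on stdin, where REASONS is a
# space-separated list with one entry per container.
# oom_killed_pods is a hypothetical helper introduced for illustration.
oom_killed_pods() {
  awk -F '\t' '$2 ~ /(^| )OOMKilled( |$)/ { print $1 }'
}

# Usage (not run here):
#   kubectl get pods -n blueberry -l app=blueberry-api \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
#     | oom_killed_pods
```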
Communication Updates
Every 15 minutes until resolved:
UPDATE: [Time]
Status: [Investigating/Implementing Fix/Monitoring]
Next: [What you're doing next]
ETA: [Best estimate]
Resolution message:
✅ RESOLVED
Duration: [Time]
Root Cause: [Brief description]
Fix: [What fixed it]
Follow-up: Postmortem scheduled for [date/time]
Post-Incident
- Update status page to "Operational"
- Notify stakeholders of resolution
- Schedule postmortem within 24 hours
- Create follow-up tickets for:
  - Permanent fixes
  - Monitoring improvements
  - Runbook updates
Related Documentation
Document ID: workflows/operations/incident-response/playbooks/api-outage