Table of Contents
Incident Response
Procedures and playbooks for handling incidents in the Blueberry IDP.
Incident Severity Levels
Severity | Description | Response Time | Examples |
---|---|---|---|
P1 - Critical | Complete service outage or data loss | < 15 min | API down, cluster failure |
P2 - Major | Significant degradation affecting users | < 30 min | Slow performance, partial outage |
P3 - Minor | Limited impact, workaround available | < 2 hours | Single environment failure |
P4 - Low | No immediate user impact | Next business day | UI glitch, non-critical bug |
Incident Response Process
1. Detect & Declare
- Monitoring alert fires OR user reports issue
- Determine severity level
- Declare incident in #blueberry-incidents
2. Assess & Communicate
🚨 INCIDENT DECLARED 🚨
Severity: P[1-4]
Impact: [What's broken]
Status: Investigating
IC: @[your-name]
Thread: 👇
3. Respond & Resolve
- Follow relevant playbook
- Update status every 15-30 min
- Coordinate in incident thread
4. Document & Learn
- Create postmortem for P1/P2
- Update runbooks if needed
- Share learnings
Key Roles
- Incident Commander (IC): Leads response, coordinates team
- Operations Lead: Executes technical fixes
- Communications Lead: Updates stakeholders (P1/P2 only)
Directory Structure
playbooks/
Specific response procedures by incident type:
- API Outage
- ArgoCD Failure
- Environment Creation Failures
- GKE Cluster Issues
- Authentication Problems
- Performance Degradation
postmortems/
Past incident analyses:
- Template: postmortem-template.md
- Example: 2024-01-api-outage.md
Quick Reference
Emergency Contacts
- On-Call: Check PagerDuty
- Escalation: Escalation Policy
- GCP Support: Support Guide
Critical Commands
# Check cluster status
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running
# Check ArgoCD
kubectl get applications -n argocd
# Check API logs
kubectl logs -n blueberry deployment/blueberry-api --tail=100
# Force sync ArgoCD app
argocd app sync <app-name> --force
Recovery Procedures
Incident Metrics
Track these KPIs:
- MTTD (Mean Time to Detect): < 5 min
- MTTA (Mean Time to Acknowledge): < 15 min
- MTTR (Mean Time to Resolve): < 2 hours
- Postmortem Completion: Within 48 hours
On-Call Resources
Document ID: workflows/operations/incident-response/README