Incident Response

Procedures and playbooks for handling incidents in the Blueberry IDP.

Incident Severity Levels

| Severity      | Description                             | Response Time     | Examples                         |
|---------------|-----------------------------------------|-------------------|----------------------------------|
| P1 - Critical | Complete service outage or data loss    | < 15 min          | API down, cluster failure        |
| P2 - Major    | Significant degradation affecting users | < 30 min          | Slow performance, partial outage |
| P3 - Minor    | Limited impact, workaround available    | < 2 hours         | Single environment failure       |
| P4 - Low      | No immediate user impact                | Next business day | UI glitch, non-critical bug      |

Incident Response Process

1. Detect & Declare

  • Monitoring alert fires OR user reports issue
  • Determine severity level
  • Declare incident in #blueberry-incidents

2. Assess & Communicate

Confirm the impact, then post a declaration in #blueberry-incidents using this template:

🚨 INCIDENT DECLARED 🚨
Severity: P[1-4]
Impact: [What's broken]
Status: Investigating
IC: @[your-name]
Thread: 👇
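
To keep declarations consistent under pressure, the template above could be filled in by a small shell function. This is a minimal sketch, not existing project tooling: the function name `declare_incident` and its argument order are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the Blueberry tooling): render the
# declaration template from three arguments.
# Usage: declare_incident <severity 1-4> <impact> <ic-handle>
declare_incident() {
  local sev="$1" impact="$2" ic="$3"

  # Only P1 through P4 are defined severity levels.
  case "$sev" in
    1|2|3|4) ;;
    *) echo "severity must be 1-4, got: ${sev}" >&2; return 1 ;;
  esac

  # One template line per argument to printf.
  printf '%s\n' \
    "🚨 INCIDENT DECLARED 🚨" \
    "Severity: P${sev}" \
    "Impact: ${impact}" \
    "Status: Investigating" \
    "IC: @${ic}" \
    "Thread: 👇"
}
```

Calling `declare_incident 1 "API down" your-name` prints the six template lines, which can then be pasted into the incident channel.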

3. Respond & Resolve

  • Follow relevant playbook
  • Update status every 15-30 min
  • Coordinate in incident thread

4. Document & Learn

  • Create postmortem for P1/P2
  • Update runbooks if needed
  • Share learnings

Key Roles

  • Incident Commander (IC): Leads response, coordinates team
  • Operations Lead: Executes technical fixes
  • Communications Lead: Updates stakeholders (P1/P2 only)

Directory Structure

playbooks/

Specific response procedures by incident type:
- API Outage
- ArgoCD Failure
- Environment Creation Failures
- GKE Cluster Issues
- Authentication Problems
- Performance Degradation

postmortems/

Past incident analyses:
- Template: postmortem-template.md
- Example: 2024-01-api-outage.md

Quick Reference

Emergency Contacts

Critical Commands

# Check cluster status
kubectl get nodes
# List pods whose status is not Running (the header row and Completed pods are also shown)
kubectl get pods --all-namespaces | grep -v Running

# Check ArgoCD
kubectl get applications -n argocd

# Check API logs
kubectl logs -n blueberry deployment/blueberry-api --tail=100

# Force sync ArgoCD app
argocd app sync <app-name> --force
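
The `grep -v Running` filter above also lists Completed pods and misses pods that are Running but not ready. A slightly stricter filter is sketched below, assuming the default `kubectl get pods --all-namespaces` column order (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE); the function name `unhealthy_pods` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical filter: reads `kubectl get pods --all-namespaces` output on
# stdin and keeps the header plus pods that look unhealthy.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk '
    NR == 1 { print; next }            # keep the header row
    $4 == "Completed" { next }         # finished Jobs are expected, skip them
    $4 != "Running" { print; next }    # Pending, CrashLoopBackOff, Error, ...
    {
      split($3, ready, "/")            # READY looks like "1/2"
      if (ready[1] != ready[2]) print  # Running, but not all containers ready
    }
  '
}

# Usage against a live cluster:
#   kubectl get pods --all-namespaces | unhealthy_pods
```

Reading from stdin keeps the filter separate from cluster access, so it can be tried on saved output before being used during an incident.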

Recovery Procedures

Incident Metrics

Track these KPIs:
- MTTD (Mean Time to Detect): < 5 min
- MTTA (Mean Time to Acknowledge): < 15 min
- MTTR (Mean Time to Resolve): < 2 hours
- Postmortem Completion: Within 48 hours
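
MTTR is a mean across incidents, but each incident's duration can be checked against the 2-hour target on its own. The sketch below assumes GNU `date` (the `-d` flag; BSD/macOS `date` takes different options), and the function names `minutes_between` and `within_mttr_target` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: whole minutes elapsed between two timestamps.
# Assumes GNU date, which parses strings such as "2024-01-15 10:00:00 UTC".
minutes_between() {
  local start end
  start="$(date -u -d "$1" +%s)" || return 1
  end="$(date -u -d "$2" +%s)" || return 1
  echo $(( (end - start) / 60 ))
}

# Hypothetical check: succeeds when one incident was resolved in under
# 120 minutes, the MTTR target listed above.
# Usage: within_mttr_target <detected-at> <resolved-at>
within_mttr_target() {
  local mins
  mins="$(minutes_between "$1" "$2")" || return 1
  [ "$mins" -lt 120 ]
}
```

The same `minutes_between` helper could be reused for MTTD and MTTA by passing the relevant pair of timestamps and comparing against 5 and 15 minutes.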

On-Call Resources

Document ID: workflows/operations/incident-response/README