Incident Response

Procedures and playbooks for handling incidents in the Blueberry IDP.

Incident Severity Levels

Severity	Description	Response Time	Examples
P1 - Critical	Complete service outage or data loss	< 15 min	API down, cluster failure
P2 - Major	Significant degradation affecting users	< 30 min	Slow performance, partial outage
P3 - Minor	Limited impact, workaround available	< 2 hours	Single environment failure
P4 - Low	No immediate user impact	Next business day	UI glitch, non-critical bug
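The table above can be encoded as a small lookup for use in on-call scripts. This is a minimal sketch only: the function name `response_target` is hypothetical and not part of any existing Blueberry tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a severity label (P1-P4) to the response-time
# target listed in the severity table. Unknown labels return exit code 1.
response_target() {
  case "$1" in
    P1) echo "< 15 min" ;;
    P2) echo "< 30 min" ;;
    P3) echo "< 2 hours" ;;
    P4) echo "Next business day" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

response_target P2   # prints "< 30 min"
```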

Incident Response Process

1. Detect & Declare

- A monitoring alert fires or a user reports an issue
- Determine the severity level
- Declare the incident in #blueberry-incidents

2. Assess & Communicate

Post the declaration in this format:

🚨 INCIDENT DECLARED 🚨
Severity: P[1-4]
Impact: [What's broken]
Status: Investigating
IC: @[your-name]
Thread: 👇
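To avoid typos under pressure, the declaration template can be filled in by a small script. The sketch below only prints the message so it can be pasted into #blueberry-incidents; the function name `declare_incident` and its argument order are assumptions, and posting to chat automatically is out of scope.

```shell
#!/usr/bin/env bash
# Hypothetical helper: render the declaration template for a severity,
# a one-line impact summary, and the Incident Commander's handle.
declare_incident() {
  local severity="$1" impact="$2" ic="$3"
  case "$severity" in
    P1|P2|P3|P4) ;;
    *) echo "severity must be P1-P4, got: $severity" >&2; return 1 ;;
  esac
  printf '%s\n' \
    "🚨 INCIDENT DECLARED 🚨" \
    "Severity: $severity" \
    "Impact: $impact" \
    "Status: Investigating" \
    "IC: @$ic" \
    "Thread: 👇"
}

# Example values are illustrative only.
declare_incident P1 "API returning 503s" alice
```

Validating the severity up front keeps malformed declarations out of the channel.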

3. Respond & Resolve

- Follow the relevant playbook
- Post a status update every 15-30 min
- Coordinate in the incident thread

4. Document & Learn

- Create a postmortem for P1/P2 incidents
- Update runbooks if needed
- Share learnings

Key Roles

- Incident Commander (IC): Leads the response and coordinates the team
- Operations Lead: Executes technical fixes
- Communications Lead: Updates stakeholders (P1/P2 only)

Directory Structure

playbooks/

Specific response procedures by incident type:
- API Outage
- ArgoCD Failure
- Environment Creation Failures
- GKE Cluster Issues
- Authentication Problems
- Performance Degradation

postmortems/

Past incident analyses:
- Template: postmortem-template.md
- Example: 2024-01-api-outage.md

Quick Reference

Emergency Contacts

On-Call: Check PagerDuty
Escalation: Escalation Policy
GCP Support: Support Guide

Critical Commands

# Check cluster status
kubectl get nodes
# List pods that are neither Running nor Completed (the header row is kept)
kubectl get pods --all-namespaces | grep -Ev 'Running|Completed'

# Check ArgoCD
kubectl get applications -n argocd

# Check API logs
kubectl logs -n blueberry deployment/blueberry-api --tail=100

# Force sync ArgoCD app
argocd app sync <app-name> --force

Recovery Procedures

Incident Metrics

Track these KPIs:
- MTTD (Mean Time to Detect): < 5 min
- MTTA (Mean Time to Acknowledge): < 15 min
- MTTR (Mean Time to Resolve): < 2 hours
- Postmortem Completion: Within 48 hours
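As an illustration of how these KPIs might be checked, the sketch below averages per-incident durations in minutes and compares the result with a target. The function names `mean_minutes` and `within_target` are hypothetical, the sample durations are made up, and integer division rounds the mean down.

```shell
#!/usr/bin/env bash
# Hypothetical helper: mean of per-incident durations given in minutes.
# Uses integer arithmetic, so the result is rounded down.
mean_minutes() {
  local total=0 count=0 d
  for d in "$@"; do
    total=$((total + d))
    count=$((count + 1))
  done
  if [ "$count" -eq 0 ]; then
    echo "no durations given" >&2
    return 1
  fi
  echo $((total / count))
}

# Hypothetical check: succeeds when the mean is strictly below the target,
# matching the "<" thresholds in the KPI list.
within_target() {
  local mean="$1" target="$2"
  [ "$mean" -lt "$target" ]
}

# Made-up resolve times for three incidents; the MTTR target is 2 hours.
mttr="$(mean_minutes 45 90 150)"
if within_target "$mttr" 120; then
  echo "MTTR ${mttr} min: within target"
else
  echo "MTTR ${mttr} min: over target"
fi
# prints "MTTR 95 min: within target"
```

The same two functions apply to MTTD and MTTA by passing detection or acknowledgement times and the matching target (5 or 15 minutes).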

On-Call Resources

Document ID: workflows/operations/incident-response/README
