On-Call Guide

Standard operating procedures for Blueberry IDP on-call engineers.

On-Call Responsibilities

Primary On-Call

  • Response Time: < 15 minutes for P1/P2
  • Hours: 24/7 coverage for one week
  • Handoff: Mondays at 10 AM

Backup On-Call

  • Response Time: < 30 minutes if primary unavailable
  • Escalation: Triggered after 30 minutes without a response from the primary

On-Call Schedule

Before Your Shift

1. Access Check

  • [ ] kubectl access to production cluster
  • [ ] GCP console access
  • [ ] ArgoCD UI access
  • [ ] PagerDuty mobile app installed
  • [ ] Slack notifications enabled

2. Review Recent Issues

  • [ ] Read last week's handoff notes
  • [ ] Review recent incidents
  • [ ] Check ongoing issues in #blueberry-ops
  • [ ] Review any scheduled maintenance

3. Environment Setup

# Set up kubectl context
gcloud container clusters get-credentials blueberry-prod --region us-central1

# Verify access
kubectl get nodes
kubectl get pods -n blueberry

# Test ArgoCD CLI
argocd app list
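
The commands above assume that gcloud, kubectl, and argocd are already installed. A minimal sketch of a preflight helper that reports any of them missing from PATH follows; the function names `missing_tools` and `preflight` are hypothetical, and the tool list is only inferred from the commands in this section.

```shell
#!/usr/bin/env bash
# Pre-shift preflight sketch: report which required CLIs are missing from PATH.
# Assumption: the required tools are exactly the ones used in this guide.

# Print the name of every given command that cannot be found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the guide's CLIs; returns non-zero if any are missing.
preflight() {
  local missing
  missing="$(missing_tools gcloud kubectl argocd)"
  if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing"
    return 1
  fi
  echo "All required CLIs found"
}
```

Running `preflight` only confirms that the binaries exist. It does not verify credentials, so the `kubectl get nodes` and `argocd app list` checks are still needed.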

During Your Shift

Daily Tasks

  1. Morning Check (9 AM)
       • Run daily health check
       • Review overnight alerts
       • Check Slack for issues

  2. Afternoon Check (3 PM)
       • Monitor active environments
       • Check resource usage
       • Review any degraded services
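
This guide does not define what the daily health check runs. One possible ingredient, sketched here under that caveat, is a filter that flags pods that are not fully ready. The function name `unhealthy_pods` is hypothetical, and the input is assumed to be the output of `kubectl get pods --no-headers` (columns: NAME, READY, STATUS, ...).

```shell
#!/usr/bin/env bash
# Read `kubectl get pods --no-headers` output on stdin and print
# "NAME STATUS READY" for every pod that needs attention.
unhealthy_pods() {
  awk '
    NF < 3 { next }                 # skip blank or malformed lines
    $3 == "Completed" { next }      # finished jobs are not a problem
    {
      split($2, ready, "/")         # READY column looks like "1/1"
      # Flag anything not Running, or Running with unready containers.
      if ($3 != "Running" || ready[1] != ready[2]) print $1, $3, $2
    }
  '
}
```

A typical call would be `kubectl get pods -n blueberry --no-headers | unhealthy_pods`; empty output means every pod is Running and fully ready.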

Incident Response

P1 - Critical (Service Down)

  1. Acknowledge within 5 minutes
  2. Declare incident in #blueberry-incidents
  3. Follow playbook for specific issue
  4. Update status every 15 minutes
  5. Escalate if not resolved in 30 min

P2 - Major (Service Degraded)

  1. Acknowledge within 15 minutes
  2. Investigate root cause
  3. Communicate in #blueberry-ops
  4. Fix or escalate within 1 hour

P3/P4 - Minor/Low

  1. Acknowledge when convenient
  2. Create ticket for tracking
  3. Fix or delegate to team

Escalation Path

  1. Primary On-Call → You
  2. Backup On-Call → Check schedule
  3. Team Lead → @teamlead
  4. Director → @director
  5. VP Engineering → For P1 incidents unresolved after 2 hours
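
The time limits in the incident response steps and the escalation path can be collected into one small helper. This is a sketch only: the function name `escalation_action` and its output words are hypothetical, and treating a limit as reached at exactly 30 or 60 minutes is an assumption about how the boundaries are meant.

```shell
#!/usr/bin/env bash
# Print the suggested action for a severity (P1..P4) and the number of
# minutes since the incident began, using the thresholds from this guide:
#   P1: escalate at 30 min, involve VP Engineering after 2 hours
#   P2: fix or escalate within 1 hour
#   P3/P4: track in a ticket
escalation_action() {
  local severity="$1" minutes="$2"
  case "$severity" in
    P1)
      if [ "$minutes" -gt 120 ]; then
        echo "escalate-vp"
      elif [ "$minutes" -ge 30 ]; then
        echo "escalate"
      else
        echo "continue"
      fi
      ;;
    P2)
      if [ "$minutes" -ge 60 ]; then echo "escalate"; else echo "continue"; fi
      ;;
    P3|P4)
      echo "ticket"
      ;;
    *)
      echo "unknown severity: $severity" >&2
      return 2
      ;;
  esac
}
```

For example, `escalation_action P1 45` prints `escalate`. The helper only restates the written policy and does not page anyone.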

Common Issues & Quick Fixes

🔥 API Down

# Quick restart
kubectl rollout restart deployment/blueberry-api -n blueberry

# Check logs
kubectl logs -n blueberry deployment/blueberry-api --tail=100

Full playbook →

🔥 Environment Stuck

# Check ArgoCD app
argocd app get env-<name>

# Force sync
argocd app sync env-<name> --force

# Last resort: delete the ArgoCD Application if the sync cannot be recovered
kubectl delete application -n argocd env-<name>

🔥 High Memory Usage

# Check usage
kubectl top pods -n blueberry

# Increase limits (temporary). This edits the pod template, so it triggers a rolling restart.
kubectl set resources deployment/blueberry-api -n blueberry --limits=memory=2Gi

# Schedule restart during low traffic
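
To spot which pods are driving memory usage, the `kubectl top pods` output can be filtered against a threshold. A minimal sketch, assuming memory is reported in Mi (lines in other units are skipped); the function name `pods_over_memory` and the threshold in the example are hypothetical.

```shell
#!/usr/bin/env bash
# Read `kubectl top pods --no-headers` output on stdin (NAME CPU MEMORY)
# and print "NAME MEMORY" for pods using more than the given number of Mi.
pods_over_memory() {
  awk -v limit="$1" '
    $3 !~ /Mi$/ { next }            # assumption: only Mi values are handled
    {
      mem = $3
      sub(/Mi$/, "", mem)           # strip the unit to compare numerically
      if (mem + 0 > limit + 0) print $1, $3
    }
  '
}
```

A typical call would be `kubectl top pods -n blueberry --no-headers | pods_over_memory 1024`, which lists pods above 1024Mi.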

🔥 Database Connection Issues

# Check that the Firestore API is reachable (lists long-running operations)
gcloud firestore operations list

# Check service account
kubectl describe secret -n blueberry google-service-account

# Restart to reconnect
kubectl delete pod -n blueberry -l app=blueberry-api

Communication Guidelines

Status Updates

Use this template:

🔄 UPDATE [HH:MM UTC]
Issue: [Brief description]
Impact: [User impact]
Status: [Investigating/Implementing fix/Monitoring]
Next: [Next action]
ETA: [Resolution estimate]
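
Filling the template by hand every 15 minutes is error-prone under pressure. A small sketch of a formatter follows; the function name `status_update` and its argument order are hypothetical, and the optional sixth argument exists only so the timestamp can be fixed for testing.

```shell
#!/usr/bin/env bash
# Print a status update in the template's layout.
# Arguments: issue, impact, status, next action, ETA, [HH:MM override].
# Without the override, the current UTC time is used.
status_update() {
  local now="${6:-$(date -u +%H:%M)}"
  printf '🔄 UPDATE [%s UTC]\n' "$now"
  printf 'Issue: %s\n' "$1"
  printf 'Impact: %s\n' "$2"
  printf 'Status: %s\n' "$3"
  printf 'Next: %s\n' "$4"
  printf 'ETA: %s\n' "$5"
}
```

For example, `status_update "API returning 503s" "All users" "Investigating" "Check recent deploy" "30 min"` prints a block ready to paste into #blueberry-incidents.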

Handoff Notes

Include:
  • Ongoing issues
  • Recent incidents
  • Scheduled maintenance
  • Things to watch
  • Any special instructions

End of Shift

Handoff Checklist

  • [ ] Write handoff notes
  • [ ] Update on-call calendar
  • [ ] Close resolved tickets
  • [ ] Brief incoming on-call
  • [ ] Update team in #blueberry-ops

Handoff Template

# On-Call Handoff - [Date]

## Summary
- Quiet week / Busy week
- X incidents (P1: X, P2: X)

## Ongoing Issues
- [Issue 1] - [Status]
- [Issue 2] - [Status]

## Recent Changes
- [Deployment on DATE]
- [Config change on DATE]

## Watch Items
- [Thing to monitor]
- [Potential issue]

## Scheduled Maintenance
- [Date/Time] - [Description]

## Notes for Next On-Call
- [Any special instructions]

Tools & Resources

Monitoring

Documentation

Communication

  • Slack: #blueberry-ops, #blueberry-incidents
  • PagerDuty: Schedule
  • Email: oncall@blueberry.example.com

Self-Care

  • Take breaks when possible
  • Ask for help when needed
  • Hand off if you're sick
  • Document to reduce future stress
  • Celebrate successful mitigations!

Post-Incident

After any P1/P2 incident:
  1. Schedule postmortem within 48 hours
  2. Create postmortem doc from template
  3. Lead blameless discussion
  4. Track action items
  5. Share learnings with team

Document ID: guides/best-practices/on-call-guide