On-Call Guide
Standard operating procedures for Blueberry IDP on-call engineers.
On-Call Responsibilities
Primary On-Call
- Response Time: < 15 minutes for P1/P2
- Hours: 24/7 coverage for one week
- Handoff: Mondays at 10 AM
Backup On-Call
- Response Time: < 30 minutes if primary unavailable
- Escalation: After 30 min of no primary response
On-Call Schedule
- Tool: PagerDuty
- Rotation: Weekly, Monday to Monday
- Schedule: PagerDuty Schedule
Before Your Shift
1. Access Check
- [ ] kubectl access to production cluster
- [ ] GCP console access
- [ ] ArgoCD UI access
- [ ] PagerDuty mobile app installed
- [ ] Slack notifications enabled
2. Review Recent Issues
- [ ] Read last week's handoff notes
- [ ] Review recent incidents
- [ ] Check ongoing issues in #blueberry-ops
- [ ] Review any scheduled maintenance
3. Environment Setup
# Set up kubectl context
gcloud container clusters get-credentials blueberry-prod --region us-central1
# Verify access
kubectl get nodes
kubectl get pods -n blueberry
# Test ArgoCD CLI
argocd app list
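The setup commands above fail in confusing ways if one of the CLIs is not installed. A minimal sketch of a pre-shift tool check follows; the function name `missing_tools` is hypothetical and not part of any Blueberry tooling, and it only confirms that the binaries are on your `PATH`, not that your credentials work.

```shell
# Print each of the given CLI tools that is not found on PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
  local rc=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# Example call for the three CLIs this guide uses:
# missing_tools gcloud kubectl argocd || echo "install the tools listed above"
```

Run the check first, then the access commands, so a missing binary is not mistaken for a permissions problem.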
During Your Shift
Daily Tasks
- Morning Check (9 AM)
  - Run daily health check
  - Review overnight alerts
  - Check Slack for issues
- Afternoon Check (3 PM)
  - Monitor active environments
  - Check resource usage
  - Review any degraded services
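This guide does not spell out what the daily health check runs. One plausible piece of it, sketched here as an assumption rather than the team's actual check, is listing pods that are not fully ready. The helper name `unhealthy_pods` is hypothetical; it only filters the text that `kubectl get pods --no-headers` prints.

```shell
# Read `kubectl get pods --no-headers` output on stdin (NAME READY STATUS ...).
# Print name, ready count, and status for every pod that is neither Completed
# nor Running with all of its containers ready.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")
    if ($3 != "Completed" && ($3 != "Running" || ready[1] != ready[2]))
      print $1, $2, $3
  }'
}

# Usage, assuming the blueberry namespace from the setup step:
# kubectl get pods -n blueberry --no-headers | unhealthy_pods
```

Empty output means every pod is either running with all containers ready or has completed.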
Incident Response
P1 - Critical (Service Down)
- Acknowledge within 5 minutes
- Declare incident in #blueberry-incidents
- Follow playbook for specific issue
- Update status every 15 minutes
- Escalate if not resolved in 30 min
P2 - Major (Service Degraded)
- Acknowledge within 15 minutes
- Investigate root cause
- Communicate in #blueberry-ops
- Fix or escalate within 1 hour
P3/P4 - Minor/Low
- Acknowledge when convenient
- Create ticket for tracking
- Fix or delegate to team
Escalation Path
- Primary On-Call → You
- Backup On-Call → Check schedule
- Team Lead → @teamlead
- Director → @director
- VP Engineering → For P1 incidents lasting more than 2 hours
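The time limits in the incident response and escalation lists can be written as a small lookup, which is handy as a sanity check mid-incident. This is a sketch: the function name `next_action` is hypothetical, and it assumes the clock starts when the incident begins, which the guide does not state.

```shell
# Suggest the next step for an open incident.
# $1 = severity (P1, P2, or anything else), $2 = whole minutes since the incident began.
next_action() {
  local severity="$1" minutes="$2"
  case "$severity" in
    P1)
      if [ "$minutes" -gt 120 ]; then
        echo "involve VP Engineering"
      elif [ "$minutes" -ge 30 ]; then
        echo "escalate to the next level"
      else
        echo "keep working; post a status update every 15 minutes"
      fi ;;
    P2)
      if [ "$minutes" -ge 60 ]; then
        echo "escalate to the next level"
      else
        echo "keep investigating; communicate in #blueberry-ops"
      fi ;;
    *)
      echo "no timed escalation; track in a ticket" ;;
  esac
}
```

For example, `next_action P1 45` prints `escalate to the next level`.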
Common Issues & Quick Fixes
🔥 API Down
# Quick restart
kubectl rollout restart deployment/blueberry-api -n blueberry
# Check logs
kubectl logs -n blueberry deployment/blueberry-api --tail=100
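A restart returns as soon as it is requested, so it helps to wait for the rollout to finish before declaring the API healthy. The sketch below wraps the restart with `kubectl rollout status`. The function name `restart_api`, the `KUBECTL` override, and the 120-second timeout are illustrative assumptions.

```shell
# Restart the API deployment, then block until the rollout completes or times out.
# Set KUBECTL=echo to print the commands instead of running them.
restart_api() {
  local kubectl="${KUBECTL:-kubectl}"
  "$kubectl" rollout restart deployment/blueberry-api -n blueberry &&
    "$kubectl" rollout status deployment/blueberry-api -n blueberry --timeout=120s
}
```

A non-zero exit means the restart was rejected or the rollout did not finish in time. In that case go to the logs rather than restarting again.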
🔥 Environment Stuck
# Check ArgoCD app
argocd app get env-<name>
# Force sync
argocd app sync env-<name> --force
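# Delete the Application as a last resort (if it carries a resources finalizer, Argo CD also deletes the resources it manages)
kubectl delete application -n argocd env-<name>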
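Before deleting anything, it is usually worth confirming whether the force sync actually brought the environment back. A sketch that chains the sync with `argocd app wait` follows. The function name `sync_env`, the `ARGOCD` override, and the 300-second timeout are assumptions for illustration.

```shell
# Force-sync one environment's Argo CD app, then wait until it reports healthy.
# $1 = environment name (the <name> part of env-<name>).
# Set ARGOCD=echo to print the commands instead of running them.
sync_env() {
  local argocd="${ARGOCD:-argocd}"
  if [ -z "$1" ]; then
    echo "usage: sync_env <name>" >&2
    return 2
  fi
  "$argocd" app sync "env-$1" --force &&
    "$argocd" app wait "env-$1" --health --timeout 300
}
```

If the wait times out, the app is still unhealthy and deletion becomes a reasonable next step.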
🔥 High Memory Usage
# Check usage
kubectl top pods -n blueberry
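# Increase limits (temporary)
kubectl set resources deployment/blueberry-api -n blueberry --limits=memory=2Gi
# The command above changes the pod template, so it starts a rolling restart right away; when the situation allows, run it during low traffic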
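With many pods, picking out the heavy ones from `kubectl top` by eye is error-prone. The sketch below filters that output by a memory threshold. The function name `pods_over_memory` is hypothetical, and it assumes memory is reported in `Mi`, skipping rows in any other unit.

```shell
# Read `kubectl top pods --no-headers` output on stdin (NAME CPU MEMORY).
# Print name and memory for pods using at least $1 Mi. Rows whose memory
# column does not end in "Mi" are ignored.
pods_over_memory() {
  awk -v limit="$1" '$3 ~ /Mi$/ {
    mem = $3
    sub(/Mi$/, "", mem)
    if (mem + 0 >= limit + 0) print $1, $3
  }'
}

# Usage; 1536 Mi is an example threshold (75% of the 2Gi limit above):
# kubectl top pods -n blueberry --no-headers | pods_over_memory 1536
```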
🔥 Database Connection Issues
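# Test Firestore API access (lists long-running operations; success shows the API is reachable with your own credentials, not the pods')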
gcloud firestore operations list
# Check service account
kubectl describe secret -n blueberry google-service-account
# Restart to reconnect
kubectl delete pod -n blueberry -l app=blueberry-api
Communication Guidelines
Status Updates
Use this template:
🔄 UPDATE [HH:MM UTC]
Issue: [Brief description]
Impact: [User impact]
Status: [Investigating/Implementing fix/Monitoring]
Next: [Next action]
ETA: [Resolution estimate]
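To keep updates consistent under pressure, the template can be filled in by a small helper that stamps the current UTC time. This is a sketch; the function name `status_update` and the example values are hypothetical.

```shell
# Print a status update in the template's format with the current UTC time.
# Arguments, in order: issue, impact, status, next action, ETA.
status_update() {
  printf '🔄 UPDATE [%s UTC]\n' "$(date -u +%H:%M)"
  printf 'Issue: %s\nImpact: %s\nStatus: %s\nNext: %s\nETA: %s\n' "$1" "$2" "$3" "$4" "$5"
}

# Example with made-up values:
# status_update "API returning 5xx" "Logins failing" "Investigating" "Check rollout" "30 min"
```

Paste the output into the incident channel as is.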
Handoff Notes
Include:
- Ongoing issues
- Recent incidents
- Scheduled maintenance
- Things to watch
- Any special instructions
End of Shift
Handoff Checklist
- [ ] Write handoff notes
- [ ] Update on-call calendar
- [ ] Close resolved tickets
- [ ] Brief incoming on-call
- [ ] Update team in #blueberry-ops
Handoff Template
# On-Call Handoff - [Date]
## Summary
- Quiet week / Busy week
- X incidents (P1: X, P2: X)
## Ongoing Issues
- [Issue 1] - [Status]
- [Issue 2] - [Status]
## Recent Changes
- [Deployment on DATE]
- [Config change on DATE]
## Watch Items
- [Thing to monitor]
- [Potential issue]
## Scheduled Maintenance
- [Date/Time] - [Description]
## Notes for Next On-Call
- [Any special instructions]
Tools & Resources
Monitoring
Documentation
Communication
- Slack: #blueberry-ops, #blueberry-incidents
- PagerDuty: Schedule
- Email: oncall@blueberry.example.com
Self-Care
- Take breaks when possible
- Ask for help when needed
- Hand off if you're sick
- Document to reduce future stress
- Celebrate successful mitigations!
Post-Incident
After any P1/P2 incident:
1. Schedule postmortem within 48 hours
2. Create postmortem doc from template
3. Lead blameless discussion
4. Track action items
5. Share learnings with team
Document ID: guides/best-practices/on-call-guide