On-Call Guide

Standard operating procedures for Blueberry IDP on-call engineers.

On-Call Responsibilities

Primary On-Call

Response Time: < 15 minutes for P1/P2
Hours: 24/7 coverage for one week
Handoff: Mondays at 10 AM

Backup On-Call

Response Time: < 30 minutes if primary unavailable
Escalation: After 30 min of no primary response

On-Call Schedule

Tool: PagerDuty
Rotation: Weekly, Monday to Monday
Schedule: PagerDuty Schedule

Before Your Shift

1. Access Check

[ ] kubectl access to production cluster
[ ] GCP console access
[ ] ArgoCD UI access
[ ] PagerDuty mobile app installed
[ ] Slack notifications enabled

2. Review Recent Issues

[ ] Read last week's handoff notes
[ ] Review recent incidents
[ ] Check ongoing issues in #blueberry-ops
[ ] Review any scheduled maintenance

3. Environment Setup

# Set up kubectl context
gcloud container clusters get-credentials blueberry-prod --region us-central1

# Verify access
kubectl get nodes
kubectl get pods -n blueberry

# Test ArgoCD CLI
argocd app list
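
Before running the commands above, it can help to confirm that the CLIs they rely on are installed. The following is a minimal sketch of such a preflight check; `missing_tools` is a hypothetical helper name, not part of any Blueberry tooling.

```shell
# Print every named tool that is not found on PATH, one per line.
# `missing_tools` is a hypothetical helper used for illustration only.
# The function body runs in a subshell so the loop variable does not leak.
missing_tools() (
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
)
```

Running `missing_tools gcloud kubectl argocd` prints nothing when all three CLIs are available, and otherwise lists the ones still to install.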

During Your Shift

Daily Tasks

Morning Check (9 AM)
- Run daily health check
- Review overnight alerts
- Check Slack for issues

Afternoon Check (3 PM)
- Monitor active environments
- Check resource usage
- Review any degraded services
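
This guide does not specify what the daily health check runs. As one possible starting point, the sketch below filters `kubectl get pods` output for pods that are not healthy. It assumes the default column layout (NAME, READY, STATUS, ...), and `unhealthy_pods` is a hypothetical helper name.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print pods that
# are neither Completed nor fully ready and Running.
# Assumes the default columns: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk '{ split($2, ready, "/") }
       $3 != "Completed" && ($3 != "Running" || ready[1] != ready[2]) {
         print $1, $2, $3
       }'
}
```

A typical invocation would be `kubectl get pods -n blueberry --no-headers | unhealthy_pods`; empty output means every pod is either Running with all containers ready or Completed.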

Incident Response

P1 - Critical (Service Down)

1. Acknowledge within 5 minutes
2. Declare incident in #blueberry-incidents
3. Follow playbook for specific issue
4. Update status every 15 minutes
5. Escalate if not resolved in 30 min

P2 - Major (Service Degraded)

1. Acknowledge within 15 minutes
2. Investigate root cause
3. Communicate in #blueberry-ops
4. Fix or escalate within 1 hour

P3/P4 - Minor/Low

1. Acknowledge when convenient
2. Create ticket for tracking
3. Fix or delegate to team

Escalation Path

1. Primary On-Call → You
2. Backup On-Call → Check schedule
3. Team Lead → @teamlead
4. Director → @director
5. VP Engineering → For P1 > 2 hours
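
The timing rules from Incident Response and the escalation path can be sketched as a small lookup. The thresholds come from this guide (P1: escalate at 30 minutes, VP Engineering after 2 hours; P2: escalate at 1 hour). The function name `next_step` and its output strings are hypothetical.

```shell
# Suggest the next step for an open incident.
#   $1 = priority (P1, P2, P3, P4)
#   $2 = whole minutes since the incident was opened
# `next_step` is a hypothetical helper; only the thresholds are from the guide.
next_step() (
  priority=$1
  minutes=$2
  case "$priority" in
    P1)
      if [ "$minutes" -gt 120 ]; then
        echo "involve VP Engineering"
      elif [ "$minutes" -ge 30 ]; then
        echo "escalate"
      else
        echo "keep working, update status every 15 minutes"
      fi ;;
    P2)
      if [ "$minutes" -ge 60 ]; then
        echo "escalate"
      else
        echo "keep investigating"
      fi ;;
    P3|P4)
      echo "create ticket" ;;
    *)
      echo "unknown priority: $priority" >&2
      exit 1 ;;
  esac
)
```

For example, `next_step P1 45` prints `escalate`, while `next_step P2 45` prints `keep investigating`.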

Common Issues & Quick Fixes

🔥 API Down

# Quick restart
kubectl rollout restart deployment/blueberry-api -n blueberry

# Check logs
kubectl logs -n blueberry deployment/blueberry-api --tail=100

Full playbook →
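
A restart is only useful if the new pods actually come up. As a hedged sketch, the restart can be paired with `kubectl rollout status`, which blocks until the rollout finishes or the timeout expires. The function name, the `KUBECTL` override, and the 120s default timeout are assumptions made for illustration.

```shell
# Restart the API deployment and wait for the rollout to finish.
#   $1 = optional timeout (defaults to 120s, an assumed value)
# Set KUBECTL=echo for a dry run that only prints the commands.
restart_api_and_wait() {
  "${KUBECTL:-kubectl}" rollout restart deployment/blueberry-api -n blueberry &&
    "${KUBECTL:-kubectl}" rollout status deployment/blueberry-api -n blueberry \
      --timeout="${1:-120s}"
}
```

A non-zero exit status means the restart was rejected or the rollout did not complete in time, which is a reasonable cue to check the logs and move on to the full playbook.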

🔥 Environment Stuck

# Check ArgoCD app
argocd app get env-<name>

# Force sync
argocd app sync env-<name> --force

# Delete if needed
kubectl delete application -n argocd env-<name>
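
Because the last command above deletes a resource, it is worth validating the environment name before substituting it into `env-<name>`. The sketch below assumes environment names contain only lowercase letters, digits, and hyphens (this guide does not state the naming rule), and `env_app_name` is a hypothetical helper.

```shell
# Build the ArgoCD application name for an environment.
# Rejects empty names and anything outside [a-z0-9-] (an assumed naming rule),
# so a typo cannot expand into an unexpected resource name.
env_app_name() {
  case "$1" in
    ''|*[!a-z0-9-]*)
      echo "invalid environment name: '$1'" >&2
      return 1 ;;
    *)
      printf 'env-%s\n' "$1" ;;
  esac
}
```

It could be used as `app=$(env_app_name demo) && argocd app get "$app"`, so that nothing runs when the name is missing or malformed.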

🔥 High Memory Usage

# Check usage
kubectl top pods -n blueberry

# Increase limits (temporary)
# Note: this changes the pod template, so it starts a rolling update right away
kubectl set resources deployment/blueberry-api -n blueberry --limits=memory=2Gi

# Schedule any further restart during low traffic
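
To spot which pods are close to their limit, the output of `kubectl top pods` can be filtered by memory. This is a sketch that assumes the memory column is reported in Mi, as is typical; the function name and the example threshold are hypothetical.

```shell
# Read `kubectl top pods --no-headers` output on stdin and print pods whose
# memory usage exceeds the given threshold.
#   $1 = threshold in Mi
# Assumes columns NAME CPU MEMORY, with memory expressed in Mi.
pods_over_memory() {
  awk -v limit="$1" '$3 ~ /Mi$/ {
    mem = $3
    sub(/Mi$/, "", mem)
    if (mem + 0 > limit + 0) print $1, $3
  }'
}
```

For example, `kubectl top pods -n blueberry --no-headers | pods_over_memory 1500` lists pods using more than 1500Mi.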

🔥 Database Connection Issues

# Test Firestore
gcloud firestore operations list

# Check service account
kubectl describe secret -n blueberry google-service-account

# Restart to reconnect
kubectl delete pod -n blueberry -l app=blueberry-api

Communication Guidelines

Status Updates

Use this template:

🔄 UPDATE [HH:MM UTC]
Issue: [Brief description]
Impact: [User impact]
Status: [Investigating/Implementing fix/Monitoring]
Next: [Next action]
ETA: [Resolution estimate]
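
Filling in the template by hand during an incident is error-prone, so a small helper can stamp the UTC time automatically. This is a minimal sketch; `status_update` and its argument order are hypothetical.

```shell
# Print a status update in the template format.
#   $1 issue, $2 impact, $3 status, $4 next action, $5 ETA
#   $6 optional HH:MM override (defaults to the current UTC time)
status_update() (
  timestamp=${6:-$(date -u +%H:%M)}
  printf '🔄 UPDATE [%s UTC]\n' "$timestamp"
  printf 'Issue: %s\nImpact: %s\nStatus: %s\nNext: %s\nETA: %s\n' \
    "$1" "$2" "$3" "$4" "$5"
)
```

Calling `status_update "API 5xx" "Logins failing" "Investigating" "Check logs" "30 min"` prints a block ready to paste into the incident channel.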

Handoff Notes

Include:
- Ongoing issues
- Recent incidents
- Scheduled maintenance
- Things to watch
- Any special instructions

End of Shift

Handoff Checklist

[ ] Write handoff notes
[ ] Update on-call calendar
[ ] Close resolved tickets
[ ] Brief incoming on-call
[ ] Update team in #blueberry-ops

Handoff Template

# On-Call Handoff - [Date]

## Summary
- Quiet week / Busy week
- X incidents (P1: X, P2: X)

## Ongoing Issues
- [Issue 1] - [Status]
- [Issue 2] - [Status]

## Recent Changes
- [Deployment on DATE]
- [Config change on DATE]

## Watch Items
- [Thing to monitor]
- [Potential issue]

## Scheduled Maintenance
- [Date/Time] - [Description]

## Notes for Next On-Call
- [Any special instructions]

Tools & Resources

Monitoring

Documentation

Communication

Slack: #blueberry-ops, #blueberry-incidents
PagerDuty: Schedule
Email: oncall@blueberry.example.com

Self-Care

Take breaks when possible
Ask for help when needed
Hand off if you're sick
Document to reduce future stress
Celebrate successful mitigations!

Post-Incident

After any P1/P2 incident:
1. Schedule postmortem within 48 hours
2. Create postmortem doc from template
3. Lead blameless discussion
4. Track action items
5. Share learnings with team

Document ID: guides/best-practices/on-call-guide
