Daily Health Check Runbook
Last Updated: 2024-01-10
Estimated Time: 15-20 minutes
Risk Level: Low
Prerequisites
- [ ] kubectl access to GKE cluster
- [ ] Access to ArgoCD UI
- [ ] Access to GCP Console
- [ ] Access to monitoring dashboards
Procedure
Step 1: Check Kubernetes Cluster Health
# Check node status
kubectl get nodes
# Check system pods
kubectl get pods -n kube-system
kubectl get pods -n argocd
kubectl get pods -n blueberry
# Check for any pending or failed pods
kubectl get pods --all-namespaces | grep -E "Pending|Error|CrashLoopBackOff"
Expected Output:
- All nodes should be in "Ready" state
- System pods should be "Running"
- No pods in error states
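The grep in Step 1 matches only three status strings, so a pod in ImagePullBackOff or a node shown as Ready,SchedulingDisabled would not be reported. A minimal sketch of a stricter filter follows, assuming the default column layout of kubectl get with --no-headers. The helper names not_ready_nodes and unhealthy_pods are hypothetical and not part of this repository.

```shell
# Hypothetical helpers (not part of the repository).

# Print nodes whose STATUS column is anything other than exactly "Ready".
# Expects `kubectl get nodes --no-headers` output on stdin.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 " " $2 }'
}

# Print pods whose STATUS column is neither "Running" nor "Completed".
# Expects `kubectl get pods --all-namespaces --no-headers` output on stdin
# (columns: NAMESPACE NAME READY STATUS ...).
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Usage:
#   kubectl get nodes --no-headers | not_ready_nodes
#   kubectl get pods --all-namespaces --no-headers | unhealthy_pods
```

Empty output from both helpers corresponds to the expected state described above.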
Step 2: Verify ArgoCD Applications
# Check ArgoCD app status
kubectl get applications -n argocd
# Check for out-of-sync apps
kubectl get applications -n argocd -o json | jq '.items[] | select(.status.sync.status != "Synced") | .metadata.name'
Alternatively, check the ArgoCD UI at https://argocd.blueberry.example.com
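An application can be Synced and still unhealthy, so it may help to look at the health status alongside the sync status. The sketch below reads the same JSON as the command in Step 2 and relies on the status.sync.status and status.health.status fields of the ArgoCD Application resource. The function name apps_needing_attention is hypothetical.

```shell
# Hypothetical helper (not part of the repository).
# Print name, sync status, and health status (tab-separated) for every
# application that is not both Synced and Healthy.
# Expects `kubectl get applications -n argocd -o json` output on stdin.
apps_needing_attention() {
  jq -r '.items[]
    | select(.status.sync.status != "Synced" or .status.health.status != "Healthy")
    | [.metadata.name, .status.sync.status, .status.health.status]
    | @tsv'
}

# Usage:
#   kubectl get applications -n argocd -o json | apps_needing_attention
```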
Step 3: Check Blueberry Application Health
# Check API health endpoint
curl -s https://blueberry.florenciacomuzzi.com/api/health | jq .
# Check recent logs for errors
kubectl logs -n blueberry deployment/blueberry-api --since=1h | grep -i error | tail -20
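The verification checklist expects the health endpoint to return 200, but the curl command in Step 3 prints only the response body. A minimal sketch that reports the status code is shown here; the function name api_healthy and the 10-second timeout are assumptions, not values taken from this runbook.

```shell
# Hypothetical helper (not part of the repository).
# Succeed only when the given URL answers with HTTP 200.
api_healthy() {
  url=$1
  # -w '%{http_code}' prints the status code; curl reports 000 when no
  # HTTP response was received (for example, connection refused or timeout).
  code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$url" || true)
  echo "HTTP $code from $url"
  [ "$code" = "200" ]
}

# Usage:
#   api_healthy https://blueberry.florenciacomuzzi.com/api/health
```

The exit status makes the helper usable in scripts, for example `api_healthy "$URL" || echo "health check failed"`.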
Step 4: Review Resource Usage
# Check cluster resource usage
kubectl top nodes
# Check namespace resource usage
kubectl top pods -n blueberry
kubectl top pods -n argocd
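To compare node usage against the 80% limit in the verification checklist without reading the table by eye, the output of kubectl top nodes can be filtered. This sketch assumes the usual column order (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%); the function name nodes_over_threshold is hypothetical.

```shell
# Hypothetical helper (not part of the repository).
# Print nodes whose CPU% or MEMORY% is at or above the threshold (default 80).
# Expects `kubectl top nodes --no-headers` output on stdin.
nodes_over_threshold() {
  threshold=${1:-80}
  awk -v t="$threshold" '{
    cpu = $3; mem = $5
    sub(/%/, "", cpu); sub(/%/, "", mem)   # strip the percent signs
    if (cpu + 0 >= t + 0 || mem + 0 >= t + 0) print $1, "cpu=" $3, "mem=" $5
  }'
}

# Usage:
#   kubectl top nodes --no-headers | nodes_over_threshold 80
```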
Step 5: Check Environment Status
# List active environments
kubectl get namespaces | grep "pr-"
# Check for stale environments (older than 7 days)
python scripts/check-stale-environments.py
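The contents of scripts/check-stale-environments.py are not shown in this runbook. As a rough illustration of the 7-day rule only, the sketch below computes namespace age from metadata.creationTimestamp. It mirrors the grep above by matching any namespace whose name contains "pr-"; the function name stale_namespaces and its arguments are hypothetical.

```shell
# Hypothetical helper (not the repository's Python script).
# Print namespaces containing "pr-" that are older than max_age_days.
# Expects `kubectl get namespaces -o json` output on stdin.
# Arguments: [now_epoch_seconds] [max_age_days], defaulting to the
# current time and 7 days.
stale_namespaces() {
  now=${1:-$(date +%s)}
  max_age_days=${2:-7}
  jq -r --argjson now "$now" --argjson days "$max_age_days" '
    .items[]
    | select(.metadata.name | contains("pr-"))
    | select((.metadata.creationTimestamp | fromdateiso8601) < ($now - $days * 86400))
    | .metadata.name'
}

# Usage:
#   kubectl get namespaces -o json | stale_namespaces
```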
Step 6: Verify External Services
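# Check Firestore API access (lists recent operations using your gcloud credentials)
gcloud firestore operations list --limit=5
# Check Redis connectivity; redis-cli defaults to 127.0.0.1:6379, so add -h <redis-host> if Redis runs outside the API pod
kubectl exec -n blueberry deployment/blueberry-api -- redis-cli ping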
Verification
- [ ] All nodes are healthy
- [ ] No pods in error state
- [ ] ArgoCD apps are synced
- [ ] API health check returns 200
- [ ] Node CPU and memory usage are below 80%
- [ ] No stale environments present
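The checklist above could be tallied by a small wrapper that runs one command per item and reports pass or fail. This is a sketch only: run_check and the failures counter are hypothetical names, and the commands in the usage comments are examples of what might be passed in.

```shell
# Hypothetical helper (not part of the repository).
# Run a command quietly and print PASS or FAIL with a label.
# The global counter `failures` records how many checks failed.
failures=0
run_check() {
  name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $name"
  else
    echo "FAIL  $name"
    failures=$((failures + 1))
  fi
}

# Usage:
#   run_check "API health returns 2xx" \
#     curl -fsS -o /dev/null https://blueberry.florenciacomuzzi.com/api/health
#   run_check "Nodes listable" kubectl get nodes
#   echo "$failures check(s) failed"
```

Because the counter is a shell variable, call run_check directly rather than inside a command substitution or pipeline, where the increment would be lost in a subshell.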
Alerts to Check
- Check Slack #blueberry-alerts channel
- Review any overnight PagerDuty incidents
- Check GCP monitoring alerts
Troubleshooting
| Issue | Solution |
|---|---|
| Node NotReady | Check the node in the GKE console; it may need to be cordoned and drained |
| Pods in CrashLoopBackOff | Check the previous container's logs: kubectl logs <pod> -n <namespace> -p |
| ArgoCD out of sync | Trigger a manual sync, or check for Git webhook issues |
| High resource usage | Scale the cluster or clean up unused resources |
| API health check fails | Check pod logs and ingress configuration |
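For the Node NotReady case, the cordon and drain sequence is disruptive, so it can be worth printing the commands for review before running them. The sketch below only echoes the commands. The function name print_drain_plan is hypothetical, and the drain flags shown are common choices rather than settings confirmed for this cluster.

```shell
# Hypothetical helper (not part of the repository).
# Print, without executing, a cordon/drain/uncordon sequence for one node.
print_drain_plan() {
  node=$1
  echo "kubectl cordon $node"
  # Assumed flags: skip DaemonSet-managed pods and allow eviction of pods
  # that use emptyDir volumes (their local data is deleted).
  echo "kubectl drain $node --ignore-daemonsets --delete-emptydir-data"
  # Run only once the node is healthy again.
  echo "kubectl uncordon $node"
}

# Usage:
#   print_drain_plan <node-name>
```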
Escalation
If any critical issues are found:
1. Post in #blueberry-ops Slack channel
2. Create incident if service is degraded
3. Page on-call if service is down
Related Documentation
Document ID: workflows/operations/runbooks/daily/health-check