Daily Health Check Runbook

Last Updated: 2024-01-10
Estimated Time: 15-20 minutes
Risk Level: Low

Prerequisites

  • [ ] kubectl access to GKE cluster
  • [ ] Access to ArgoCD UI
  • [ ] Access to GCP Console
  • [ ] Access to monitoring dashboards

Procedure

Step 1: Check Kubernetes Cluster Health

# Check node status
kubectl get nodes

# Check system pods
kubectl get pods -n kube-system
kubectl get pods -n argocd
kubectl get pods -n blueberry

# Check for any pending or failed pods
kubectl get pods --all-namespaces | grep -E "Pending|Error|CrashLoopBackOff"

Expected Output:
- All nodes should be in "Ready" state
- System pods should be "Running"
- No pods in error states
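
The expected-output checks above can be made less manual with two small filters. This is a minimal sketch, not part of the original procedure: the function names `not_ready_nodes` and `problem_pods` are hypothetical, and the column positions assume the default `kubectl get ... --no-headers` layout.

```shell
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# (Hypothetical helper; assumes STATUS is the second column.)
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Print "namespace/pod status" for pods that are neither Running nor Completed.
# Reads `kubectl get pods --all-namespaces --no-headers` output on stdin.
# (Hypothetical helper; assumes STATUS is the fourth column.)
problem_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Usage (requires cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
#   kubectl get pods --all-namespaces --no-headers | problem_pods
```

Empty output from both filters corresponds to the expected state. A cordoned node shows as "Ready,SchedulingDisabled" and is therefore listed, and a pod that is Running but not fully ready is not caught by this filter.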

Step 2: Verify ArgoCD Applications

# Check ArgoCD app status
kubectl get applications -n argocd

# Check for out-of-sync apps
kubectl get applications -n argocd -o json | jq '.items[] | select(.status.sync.status != "Synced") | .metadata.name'

Alternative: Check ArgoCD UI at https://argocd.blueberry.example.com
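
An app can be Synced and still be unhealthy, so it may be worth checking both fields at once. The sketch below is one possible approach that avoids jq; the function name `argocd_problem_apps` is hypothetical, and it assumes the input has been reduced to three columns (name, sync status, health status).

```shell
# Print apps that are not both Synced and Healthy.
# Reads three whitespace-separated columns on stdin: NAME SYNC HEALTH.
# (Hypothetical helper, not part of ArgoCD or kubectl.)
argocd_problem_apps() {
  awk '$2 != "Synced" || $3 != "Healthy" { print $1, $2, $3 }'
}

# Usage (requires cluster access):
#   kubectl get applications -n argocd --no-headers \
#     -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status \
#     | argocd_problem_apps
```

Empty output means every application is both Synced and Healthy.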

Step 3: Check Blueberry Application Health

# Check API health endpoint
curl -s https://blueberry.florenciacomuzzi.com/api/health | jq .

# Check recent logs for errors
kubectl logs -n blueberry deployment/blueberry-api --since=1h | grep -i error | tail -20
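
The health command above prints the response body but not the HTTP status that the verification checklist asks about. A minimal sketch of a status check follows; the function name `check_health` and the 10-second timeout are assumptions, not part of the original runbook.

```shell
# Return 0 if the URL answers with HTTP 200, non-zero otherwise.
# Prints the status code that was seen ("000" means curl got no response).
# (Hypothetical helper; the 10-second timeout is an arbitrary choice.)
check_health() {
  url=$1
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
  echo "HTTP $code"
  [ "$code" = "200" ]
}

# Usage:
#   check_health https://blueberry.florenciacomuzzi.com/api/health
```

Because the function returns a non-zero exit status on anything other than 200, it can be chained with `||` to print a reminder to escalate.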

Step 4: Review Resource Usage

# Check cluster resource usage
kubectl top nodes

# Check namespace resource usage
kubectl top pods -n blueberry
kubectl top pods -n argocd
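
To compare node usage against the 80% limit used in the verification checklist, the output of `kubectl top nodes` can be filtered. This is a sketch under assumptions: the function name `nodes_over_threshold` is hypothetical, and it expects the default column order NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%.

```shell
# Print nodes whose CPU% or MEMORY% exceeds the threshold (default 80).
# Reads `kubectl top nodes --no-headers` output on stdin.
# (Hypothetical helper; assumes CPU% is column 3 and MEMORY% is column 5.)
nodes_over_threshold() {
  awk -v limit="${1:-80}" '{
    cpu = $3; mem = $5
    sub(/%/, "", cpu); sub(/%/, "", mem)   # strip the percent sign
    if (cpu + 0 > limit + 0 || mem + 0 > limit + 0) print $1, $3, $5
  }'
}

# Usage (requires cluster access and metrics-server):
#   kubectl top nodes --no-headers | nodes_over_threshold 80
```

Empty output means every node is at or below the threshold for both CPU and memory.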

Step 5: Check Environment Status

# List active environments
kubectl get namespaces | grep "pr-"

# Check for stale environments (older than 7 days)
python scripts/check-stale-environments.py
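
If the Python script is unavailable, a rough equivalent can be sketched in shell. This is not a description of what `scripts/check-stale-environments.py` does: it assumes that each environment is a namespace whose name starts with `pr-` and that age is measured from the namespace creation timestamp. The function name `stale_pr_namespaces` is hypothetical.

```shell
# Print pr-* namespaces created before the given cutoff timestamp.
# Reads two columns on stdin: NAME CREATED (UTC, e.g. 2024-01-02T03:04:05Z).
# Timestamps in this fixed-width UTC form sort lexicographically, so a
# plain string comparison is enough.
# (Hypothetical helper; the pr- prefix and age definition are assumptions.)
stale_pr_namespaces() {
  awk -v cutoff="$1" '$1 ~ /^pr-/ && $2 < cutoff { print $1, $2 }'
}

# Usage (requires cluster access; `date -d` is GNU date syntax):
#   cutoff=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ)
#   kubectl get namespaces --no-headers \
#     -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp \
#     | stale_pr_namespaces "$cutoff"
```

Passing the cutoff as an argument keeps the filter independent of the local `date` implementation.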

Step 6: Verify External Services

# Check Firestore API access (lists recent admin operations)
gcloud firestore operations list --limit=5

# Check Redis connectivity
kubectl exec -n blueberry deployment/blueberry-api -- redis-cli ping
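
A healthy Redis server answers `PING` with `PONG`. The sketch below wraps the command above so the result can be used as an exit status; the function name `redis_ping_ok` is hypothetical, and it assumes, as the original command does, that `redis-cli` is installed in the API container and can reach Redis with its default settings.

```shell
# Return 0 only if redis-cli inside the API pod answers PONG.
# (Hypothetical helper; assumes redis-cli exists in the blueberry-api image.)
redis_ping_ok() {
  reply=$(kubectl exec -n blueberry deployment/blueberry-api -- redis-cli ping 2>/dev/null)
  [ "$reply" = "PONG" ]
}

# Usage (requires cluster access):
#   redis_ping_ok && echo "Redis OK" || echo "Redis check failed"
```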

Verification

  • [ ] All nodes are healthy
  • [ ] No pods in error state
  • [ ] ArgoCD apps are synced
  • [ ] API health check returns 200
  • [ ] Node CPU and memory usage are below 80%
  • [ ] No stale environments present

Alerts to Check

  • Check Slack #blueberry-alerts channel
  • Review any overnight PagerDuty incidents
  • Check GCP monitoring alerts

Troubleshooting

Issue                   Solution
----------------------  ------------------------------------------------
Node NotReady           Check GKE console, may need to cordon and drain
Pods in CrashLoop       Check logs: kubectl logs <pod> -p
ArgoCD out of sync      Manual sync or check for Git webhook issues
High resource usage     Scale cluster or clean up unused resources
API health check fails  Check pod logs and ingress configuration

Escalation

If any critical issues are found:
1. Post in #blueberry-ops Slack channel
2. Create incident if service is degraded
3. Page on-call if service is down

Document ID: workflows/operations/runbooks/daily/health-check