Troubleshooting Guide
Quick reference for diagnosing and resolving common issues in the Blueberry IDP.
Quick Diagnostics
System Health Check
# Overall cluster health
kubectl get nodes
kubectl get pods --all-namespaces | grep -Ev 'Running|Completed'
# Blueberry components
kubectl get all -n blueberry
kubectl get all -n argocd
# Recent errors
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | grep Warning
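Filtering on the STATUS column alone hides pods that report Running while their containers are not Ready. A minimal sketch of a stricter filter follows. It assumes the default `kubectl get pods --all-namespaces` column order (NAMESPACE, NAME, READY, STATUS, ...), and `unhealthy_pods` is a hypothetical helper name, not part of Blueberry.

```shell
# Reads `kubectl get pods --all-namespaces` output on stdin and prints
# only the pods that need attention.
unhealthy_pods() {
  awk '
    NR == 1 { print; next }        # keep the header row
    $4 == "Completed" { next }     # finished Job pods are fine
    {
      split($3, ready, "/")        # READY column, e.g. "1/2"
      if ($4 != "Running" || ready[1] != ready[2]) print
    }
  '
}

# Usage (requires cluster access):
#   kubectl get pods --all-namespaces | unhealthy_pods
```

Comparing the two halves of the READY column catches pods that are failing readiness probes, which a plain status filter would skip.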
Common Issues
🔴 Environment Creation Failures
Symptoms:
- Environment stuck in "Creating" state
- No namespace created
- ArgoCD app missing
Quick Checks:
# Check if namespace exists
kubectl get ns | grep <env-name>
# Check ArgoCD application
kubectl get application -n argocd env-<env-name>
# Check API logs
kubectl logs -n blueberry deployment/blueberry-api | grep <env-name>
Common Causes & Fixes:
1. GitLab sync issues → Check webhook, manually sync ArgoCD
2. Resource quotas → Check namespace quotas, clean up old envs
3. Invalid config → Validate ConfigSet values
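The quick checks above can be combined into one helper so that a stuck environment is triaged in a single call. This is a sketch under assumptions: `check_env` is a hypothetical name, the ArgoCD application is assumed to be named `env-<env-name>` as in the commands above, and the namespace is matched by substring because its exact naming scheme is not stated here.

```shell
# Runs the three quick checks for one environment and reports what is missing.
check_env() {
  local env_name="$1"

  # Namespace: substring match, since the naming scheme is an assumption.
  if kubectl get ns -o name | grep -- "$env_name" >/dev/null; then
    echo "namespace: found"
  else
    echo "namespace: MISSING"
  fi

  # ArgoCD application: assumed to be named env-<env-name>.
  if kubectl get application -n argocd "env-${env_name}" >/dev/null 2>&1; then
    echo "argocd app: found"
  else
    echo "argocd app: MISSING"
  fi

  # Last 20 API log lines from the past hour that mention the environment.
  echo "recent API log lines mentioning ${env_name}:"
  kubectl logs -n blueberry deployment/blueberry-api --since=1h \
    | grep -- "$env_name" | tail -n 20 || true
}

# Usage (requires cluster access):
#   check_env my-feature-env
# Possible follow-ups, depending on the result:
#   argocd app sync env-<env-name>
#   kubectl describe resourcequota -n <namespace>
```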
🔴 API Performance Issues
Symptoms:
- Slow response times
- Timeouts on API calls
- High latency
Quick Checks:
# Check pod resources
kubectl top pods -n blueberry
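# Check Redis (assumes redis-cli is available in the API container; add -h <redis-host> if Redis is not on localhost)
kubectl exec -n blueberry deployment/blueberry-api -- redis-cli ping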
# Check Firestore latency
# (Check GCP Console metrics)
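To put a number on "slow", request time can be measured from the command line with curl's `--write-out` timer. The sketch below is illustrative: `classify_latency` and `probe_latency` are hypothetical helpers, the 2-second default mirrors the alert threshold listed under Critical Alerts to Watch, and the URL in the usage line is an assumption.

```shell
# Prints "SLOW" if a request took longer than the threshold, else "OK".
# Arguments: elapsed seconds (e.g. 0.412), threshold in seconds (default 2).
classify_latency() {
  awk -v t="$1" -v limit="${2:-2}" \
    'BEGIN { print ((t + 0 > limit + 0) ? "SLOW" : "OK") }'
}

# Measures the total request time for one URL and classifies it.
probe_latency() {
  local url="$1" elapsed
  elapsed=$(curl -s -o /dev/null --max-time 10 -w '%{time_total}' "$url")
  echo "${url} ${elapsed}s $(classify_latency "$elapsed")"
}

# Usage (URL is an assumption; substitute a real endpoint):
#   probe_latency https://api.blueberry.example.com/health
```

Running the probe both from inside the cluster and from outside helps separate API slowness from ingress or network latency.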
🔴 Authentication Failures
Symptoms:
- 401/403 errors
- "Invalid token" messages
- Login loops
Quick Checks:
# Check Firebase config
kubectl get configmap -n blueberry blueberry-config -o yaml | grep FIREBASE
# Test token validation
curl -H "Authorization: Bearer <token>" https://api.blueberry.example.com/api/me
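When the API rejects a token, it often helps to look at the token's claims before digging into server logs. The sketch below assumes the token is a standard three-part JWT (as Firebase ID tokens are); `jwt_payload` is a hypothetical helper. It only decodes the payload and does not verify the signature.

```shell
# Prints the decoded JSON payload (claims) of a JWT.
# Does NOT verify the signature; for inspection only.
jwt_payload() {
  local payload
  # Take the middle segment and convert base64url to standard base64.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the padding that base64url encoding strips.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do
    payload="${payload}="
  done
  printf '%s' "$payload" | base64 -d
}

# Usage:
#   jwt_payload "<token>"
# Compare the "exp" claim with the current time from `date +%s`, and check
# that "aud" matches the Firebase project the API is configured for.
```

On older macOS releases the decode flag is `base64 -D` rather than `-d`.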
🔴 ArgoCD Sync Issues
Symptoms:
- Apps stuck "OutOfSync"
- Sync failures
- Resources not updating
Quick Checks:
# List out-of-sync apps
argocd app list | grep OutOfSync
# Check specific app
argocd app get <app-name>
# Force sync (--force deletes and recreates resources that cannot be patched; try a plain sync first)
argocd app sync <app-name> --force
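Out-of-sync applications can also be listed through the Kubernetes API, which matches on the status field exactly and works without an `argocd` CLI login. This is a sketch that assumes the Application resources live in the `argocd` namespace, as in the checks earlier in this guide; `out_of_sync_apps` is a hypothetical helper name.

```shell
# Prints the names of ArgoCD Applications whose sync status is OutOfSync.
out_of_sync_apps() {
  kubectl get applications.argoproj.io -n argocd --no-headers \
    -o custom-columns='NAME:.metadata.name,SYNC:.status.sync.status' \
    | awk '$2 == "OutOfSync" { print $1 }'
}

# Usage: inspect each app before deciding whether to sync it.
#   for app in $(out_of_sync_apps); do argocd app get "$app"; done
```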
Troubleshooting by Component
Blueberry API
ArgoCD
Kubernetes/GKE
External Services
Debug Commands Cheatsheet
Logs
# API logs (last hour)
kubectl logs -n blueberry deployment/blueberry-api --since=1h
# All containers in a pod
kubectl logs -n blueberry <pod-name> --all-containers
# Previous container logs (after crash)
kubectl logs -n blueberry <pod-name> --previous
Describe Resources
# Pod details and events
kubectl describe pod -n blueberry <pod-name>
# Node issues
kubectl describe node <node-name>
# Service endpoints
kubectl describe svc -n blueberry blueberry-api
Network Debugging
# Test internal DNS
kubectl run test-pod --rm -it --restart=Never --image=busybox -- nslookup blueberry-api.blueberry
# Test service connectivity
kubectl run test-pod --rm -it --restart=Never --image=curlimages/curl -- curl http://blueberry-api.blueberry/health
# Check ingress
kubectl get ingress -n blueberry -o yaml
Resource Issues
# Check resource usage
kubectl top nodes
kubectl top pods -n blueberry
# Check resource requests/limits
kubectl describe deployment -n blueberry blueberry-api | grep -A5 "Limits\|Requests"
# Find pods in trouble
kubectl get pods --all-namespaces | grep -E "Evicted|Error|CrashLoop"
Emergency Procedures
🚨 Complete API Outage
- Check API Outage Playbook
- Scale to 0 and back:
kubectl scale deployment/blueberry-api -n blueberry --replicas=0
kubectl scale deployment/blueberry-api -n blueberry --replicas=<previous-count>
- Check recent changes in Git
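The scale-to-zero step can be wrapped so that the previous replica count is captured and restored automatically. The sketch below is one possible way to do it, not an official procedure: `bounce_api` is a hypothetical helper, the fallback of 1 replica is an assumption, and it presumes nothing else (an autoscaler, or ArgoCD self-heal) resets the replica count in between.

```shell
# Scales the API deployment to zero and back to its previous replica count.
bounce_api() {
  local ns="blueberry" deploy="deployment/blueberry-api" replicas
  replicas=$(kubectl get "$deploy" -n "$ns" -o jsonpath='{.spec.replicas}') || return 1
  # Assumption: fall back to 1 replica if the deployment is already at zero.
  if [ -z "$replicas" ] || [ "$replicas" -eq 0 ]; then
    replicas=1
  fi
  kubectl scale "$deploy" -n "$ns" --replicas=0 || return 1
  kubectl scale "$deploy" -n "$ns" --replicas="$replicas" || return 1
  # Block until the new pods are ready, or give up after two minutes.
  kubectl rollout status "$deploy" -n "$ns" --timeout=120s
}

# Usage (requires cluster access):
#   bounce_api
```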
🚨 Database Unreachable
- Check Firestore status in GCP Console
- Verify service account permissions
- Test with gcloud CLI
🚨 Mass Environment Failures
- Check ArgoCD controller (use statefulset/argocd-application-controller if the controller is not installed as a Deployment):
kubectl logs -n argocd deployment/argocd-application-controller
- Pause environment creation
- Fix root cause before resuming
Monitoring & Alerts
Key Dashboards
Critical Alerts to Watch
- API response time > 2s
- Error rate > 5%
- Pod restarts > 3 in 5 min
- Disk usage > 80%
Getting Help
Internal Resources
- Slack: #blueberry-help
- On-call: Check PagerDuty
- Wiki: Internal documentation
External Resources
Escalation Path
1. Try troubleshooting guides
2. Ask in #blueberry-help
3. Page on-call for P1/P2
4. Escalate to platform team lead
Document ID: guides/troubleshooting/README