Troubleshooting Guide

Quick reference for diagnosing and resolving common issues in the Blueberry IDP.

Quick Diagnostics

System Health Check

# Overall cluster health
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running

# Blueberry components
kubectl get all -n blueberry
kubectl get all -n argocd

# Recent errors
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | grep Warning
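
The `grep -v Running` filter above also lists Completed job pods and misses pods that are Running but not ready. A minimal sketch of a stricter filter follows, assuming the default `kubectl get pods --all-namespaces --no-headers` column order (NAMESPACE NAME READY STATUS ...). The function name `unhealthy_pods` is hypothetical.

```shell
# Sketch: print pods that are neither Completed nor fully ready.
# Reads `kubectl get pods --all-namespaces --no-headers` output on stdin.
# Assumed columns: NAMESPACE NAME READY STATUS RESTARTS AGE
unhealthy_pods() {
  awk '{
    if ($4 == "Completed") next          # finished jobs are not a problem
    split($3, ready, "/")                # READY looks like "1/2"
    if ($4 != "Running" || ready[1] != ready[2])
      print $1 "/" $2, $4, $3            # namespace/name status ready
  }'
}
```

Usage: kubectl get pods --all-namespaces --no-headers | unhealthy_pods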

Common Issues

🔴 Environment Creation Failures

Symptoms:
- Environment stuck in "Creating" state
- No namespace created
- ArgoCD app missing

Quick Checks:

# Check if namespace exists
kubectl get ns | grep <env-name>

# Check ArgoCD application
kubectl get application -n argocd env-<env-name>

# Check API logs
kubectl logs -n blueberry deployment/blueberry-api | grep <env-name>

Common Causes & Fixes:
1. GitLab sync issues → Check webhook, manually sync ArgoCD
2. Resource quotas → Check namespace quotas, clean up old envs
3. Invalid config → Validate ConfigSet values
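
The first two quick checks can be bundled into one helper. This is a sketch only: the function name `check_env` is hypothetical, and it assumes the naming shown in the commands above (a namespace whose name contains the environment name, and an ArgoCD application called `env-<env-name>`).

```shell
# Sketch: report whether the namespace and ArgoCD application exist
# for one environment. Assumes the env-<env-name> application naming
# used in the quick checks above.
check_env() {
  env_name="$1"
  if kubectl get ns -o name | grep -q -- "$env_name"; then
    echo "namespace: found"
  else
    echo "namespace: missing"
  fi
  if kubectl get application -n argocd "env-$env_name" >/dev/null 2>&1; then
    echo "argocd app: found"
  else
    echo "argocd app: missing"
  fi
}
```

Usage: check_env <env-name>. A missing ArgoCD application alongside a missing namespace points toward cause 1 or 3 rather than a quota problem.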

Full Guide →

🔴 API Performance Issues

Symptoms:
- Slow response times
- Timeouts on API calls
- High latency

Quick Checks:

# Check pod resources
kubectl top pods -n blueberry

# Check Redis (assumes redis-cli is installed in the API container image)
kubectl exec -n blueberry deployment/blueberry-api -- redis-cli ping

# Check Firestore latency
# (Check GCP Console metrics)
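
To spot CPU-bound pods quickly, the `kubectl top` output can be filtered against a threshold. A minimal sketch, assuming CPU is reported in millicores (for example `850m`). The function name `hot_pods` and the 500m threshold in the usage line are illustrative, not values from this guide.

```shell
# Sketch: print pods whose CPU usage exceeds a millicore threshold.
# Reads `kubectl top pods --no-headers` output on stdin.
# Assumed columns: NAME CPU(cores) MEMORY(bytes)
hot_pods() {
  awk -v limit="$1" '{
    cpu = $2
    sub(/m$/, "", cpu)                   # strip the millicore suffix
    if (cpu + 0 > limit + 0) print $1, $2
  }'
}
```

Usage: kubectl top pods -n blueberry --no-headers | hot_pods 500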

Full Guide →

🔴 Authentication Failures

Symptoms:
- 401/403 errors
- "Invalid token" messages
- Login loops

Quick Checks:

# Check Firebase config
kubectl get configmap -n blueberry blueberry-config -o yaml | grep FIREBASE

# Test token validation
curl -H "Authorization: Bearer <token>" https://api.blueberry.example.com/api/me
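
When the API reports "Invalid token", it can help to look at the token's claims (for example `exp` and `aud`) before digging further. The sketch below assumes the token is a JWT, as Firebase ID tokens are. It decodes the payload only and does not verify the signature. The function name `jwt_payload` is hypothetical, and `base64 -d` assumes GNU coreutils or a compatible `base64`.

```shell
# Sketch: decode the payload (second dot-separated segment) of a JWT.
# The signature is NOT verified; this only exposes the claims for inspection.
jwt_payload() {
  # base64url uses '-' and '_' in place of '+' and '/'
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url drops padding; restore it to a multiple of 4 characters
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d
}
```

Usage: jwt_payload "<token>". An `exp` value in the past indicates an expired token rather than a configuration problem.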

Full Guide →

🔴 ArgoCD Sync Issues

Symptoms:
- Apps stuck "OutOfSync"
- Sync failures
- Resources not updating

Quick Checks:

# List out-of-sync apps
argocd app list | grep OutOfSync

# Check specific app
argocd app get <app-name>

# Force sync (uses a force apply, which may delete and re-create resources; try a plain sync first)
argocd app sync <app-name> --force

Full Guide →

Troubleshooting by Component

Blueberry API

ArgoCD

Kubernetes/GKE

External Services

Debug Commands Cheatsheet

Logs

# API logs (last hour)
kubectl logs -n blueberry deployment/blueberry-api --since=1h

# All containers in a pod
kubectl logs -n blueberry <pod-name> --all-containers

# Previous container logs (after crash)
kubectl logs -n blueberry <pod-name> --previous

Describe Resources

# Pod details and events
kubectl describe pod -n blueberry <pod-name>

# Node issues
kubectl describe node <node-name>

# Service endpoints
kubectl describe svc -n blueberry blueberry-api

Network Debugging

# Test internal DNS
kubectl run test-pod --rm -it --restart=Never --image=busybox -- nslookup blueberry-api.blueberry

# Test service connectivity
kubectl run test-pod --rm -it --restart=Never --image=curlimages/curl -- curl http://blueberry-api.blueberry/health

# Check ingress
kubectl get ingress -n blueberry -o yaml

Resource Issues

# Check resource usage
kubectl top nodes
kubectl top pods -n blueberry

# Check resource requests/limits
kubectl describe deployment -n blueberry blueberry-api | grep -A5 "Limits\|Requests"

# Find pods in trouble
kubectl get pods --all-namespaces | grep -E "Evicted|Error|CrashLoop"

Emergency Procedures

🚨 Complete API Outage

  1. Check the API Outage Playbook
  2. Scale to 0 and back: kubectl scale deployment/blueberry-api -n blueberry --replicas=0, then run it again with --replicas=<replica-count>
  3. Check recent changes in Git

🚨 Database Unreachable

  1. Check Firestore status in GCP Console
  2. Verify service account permissions
  3. Test Firestore access with the gcloud CLI

🚨 Mass Environment Failures

  1. Check ArgoCD controller: kubectl logs -n argocd deployment/argocd-application-controller
  2. Pause environment creation
  3. Fix root cause before resuming

Monitoring & Alerts

Key Dashboards

Critical Alerts to Watch

  • API response time > 2s
  • Error rate > 5%
  • Pod restarts > 3 in 5 min
  • Disk usage > 80%
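
As a worked example of the thresholds above, the sketch below takes four readings and prints which alerts they would breach. How the readings are collected is out of scope here. The function name `check_alerts` and its argument order are hypothetical.

```shell
# Sketch: print the alert thresholds breached by a set of readings.
# Arguments: API response time (s), error rate (%),
#            pod restarts in the last 5 min, disk usage (%).
check_alerts() {
  awk -v lat="$1" -v err="$2" -v rst="$3" -v disk="$4" 'BEGIN {
    if (lat  > 2)  print "API response time > 2s"
    if (err  > 5)  print "Error rate > 5%"
    if (rst  > 3)  print "Pod restarts > 3 in 5 min"
    if (disk > 80) print "Disk usage > 80%"
  }'
}
```

Usage: check_alerts 2.5 1 4 80 reports the response-time and pod-restart alerts. A disk reading of exactly 80 does not breach the threshold, because the comparison is strictly greater than.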

Getting Help

Internal Resources

  • Slack: #blueberry-help
  • On-call: Check PagerDuty
  • Wiki: Internal documentation

External Resources

Escalation Path

  1. Try the troubleshooting guides
  2. Ask in #blueberry-help
  3. Page on-call for P1/P2
  4. Escalate to platform team lead