Troubleshooting Guide

Quick reference for diagnosing and resolving common issues in the Blueberry IDP.

Quick Diagnostics

System Health Check

# Overall cluster health
kubectl get nodes
# Pods that are neither Running nor Completed (finished Jobs are not failures)
kubectl get pods --all-namespaces | grep -vE 'Running|Completed'

# Blueberry components
kubectl get all -n blueberry
kubectl get all -n argocd

# Recent warning events, oldest first
kubectl get events --all-namespaces --field-selector type=Warning --sort-by='.lastTimestamp'
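
The checks above can be condensed into a couple of counts for a quick glance. This is a minimal sketch, not part of the Blueberry tooling: the function names are made up for illustration, and it assumes the default column layout of `kubectl get` (STATUS in column 2 for nodes, column 4 for pods listed with `--all-namespaces`).

```shell
# Count nodes whose STATUS column is not exactly "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") are counted too.
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 != "Ready"' | wc -l | tr -d ' '
}

# Count pods that are neither Running nor Completed.
unhealthy_pods() {
  kubectl get pods --all-namespaces --no-headers \
    | awk '$4 != "Running" && $4 != "Completed"' | wc -l | tr -d ' '
}

# Print both counts on two lines.
health_summary() {
  echo "nodes not ready: $(not_ready_nodes)"
  echo "unhealthy pods:  $(unhealthy_pods)"
}

# Usage: health_summary
```

Two zeros suggest the cluster itself is fine and the problem is more likely in a specific component.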

Common Issues

🔴 Environment Creation Failures

Symptoms:
- Environment stuck in "Creating" state
- No namespace created
- ArgoCD app missing

Quick Checks:

# Check if namespace exists
kubectl get ns | grep <env-name>

# Check ArgoCD application
kubectl get application -n argocd env-<env-name>

# Check API logs
kubectl logs -n blueberry deployment/blueberry-api | grep <env-name>

Common Causes & Fixes:
1. GitLab sync issues → Check that the webhook is being delivered, then sync the ArgoCD app manually
2. Resource quotas → Check namespace quotas, clean up old envs
3. Invalid config → Validate ConfigSet values
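
A rough sketch of how the first two causes could be checked in one pass. The `diagnose_env` helper is hypothetical, and the namespace is passed in explicitly because this guide does not state how environment namespaces are named. The `env-<env-name>` application name follows the quick check above.

```shell
# diagnose_env ENV_NAME NAMESPACE
diagnose_env() {
  local env_name="$1" namespace="$2"

  if ! kubectl get namespace "$namespace" >/dev/null 2>&1; then
    echo "namespace $namespace: missing"
  else
    echo "namespace $namespace: present"
    # Quota usage against hard limits (cause 2).
    kubectl describe resourcequota -n "$namespace"
  fi

  # Sync status of the ArgoCD Application (cause 1); empty if the app does not exist.
  local sync
  sync=$(kubectl get application -n argocd "env-${env_name}" \
    -o jsonpath='{.status.sync.status}' 2>/dev/null)
  echo "argocd app env-${env_name}: ${sync:-missing}"
}

# Usage: diagnose_env <env-name> <namespace>
```

A missing namespace together with a missing application suggests the request never reached ArgoCD, so the API logs are the next place to look.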

Full Guide →

🔴 API Performance Issues

Symptoms:
- Slow response times
- Timeouts on API calls
- High latency

Quick Checks:

# Check pod resources
kubectl top pods -n blueberry

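# Check Redis (assumes redis-cli is installed in the API container and Redis
# answers on the default address; otherwise add -h <redis-host>)
kubectl exec -n blueberry deployment/blueberry-api -- redis-cli ping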

# Check Firestore latency
# (Check GCP Console metrics)
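
To put a number on "slow", request latency can be sampled from the command line. This is a sketch under assumptions: `measure_latency` and `average` are illustrative helpers, and the URL in the usage line is the endpoint used elsewhere in this guide, which will answer 401 without a token but still gives a timing.

```shell
# measure_latency URL [COUNT]: print total request time in seconds, one per line.
measure_latency() {
  local url="$1" count="${2:-5}" i=0
  while [ "$i" -lt "$count" ]; do
    curl -s -o /dev/null -w '%{time_total}\n' "$url"
    i=$((i + 1))
  done
}

# average: mean of the numbers read from stdin, three decimals.
average() {
  awk '{ sum += $1 } END { if (NR) printf "%.3f\n", sum / NR }'
}

# Usage: measure_latency https://api.blueberry.example.com/api/me 10 | average
```

Comparing the average from outside the cluster with one taken from a pod inside it helps separate ingress latency from API latency.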

Full Guide →

🔴 Authentication Failures

Symptoms:
- 401/403 errors
- "Invalid token" messages
- Login loops

Quick Checks:

# Check Firebase config
kubectl get configmap -n blueberry blueberry-config -o yaml | grep FIREBASE

# Test token validation
curl -H "Authorization: Bearer <token>" https://api.blueberry.example.com/api/me
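
Firebase ID tokens are JWTs, so the claims (`exp`, `aud`, `iss`) can be inspected locally before blaming the API. A minimal sketch follows; `jwt_payload` is an illustrative helper that only decodes the payload and does not verify the signature. It assumes a `base64` that accepts `-d`.

```shell
# jwt_payload TOKEN: print the decoded JSON payload (second segment) of a JWT.
jwt_payload() {
  local payload
  # base64url uses '-' and '_' where standard base64 uses '+' and '/'.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the '=' padding that base64url omits.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do
    payload="${payload}="
  done
  printf '%s' "$payload" | base64 -d
}

# Usage: jwt_payload "<token>"
```

An `exp` in the past points to a client that is not refreshing its token. An unexpected `aud` usually means the token was issued for a different Firebase project than the one in `blueberry-config`.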

Full Guide →

🔴 ArgoCD Sync Issues

Symptoms:
- Apps stuck "OutOfSync"
- Sync failures
- Resources not updating

Quick Checks:

# List out-of-sync apps
argocd app list | grep OutOfSync

# Check specific app
argocd app get <app-name>

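# Force sync (try a plain `argocd app sync <app-name>` first;
# --force applies with force and may delete and recreate resources)
argocd app sync <app-name> --force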
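
If the `argocd` CLI is not logged in, the same information can be read from the Application resources with `kubectl`. A sketch, assuming the applications live in the `argocd` namespace as elsewhere in this guide; `out_of_sync_apps` is an illustrative name.

```shell
# List ArgoCD Applications whose sync status is anything other than "Synced".
out_of_sync_apps() {
  kubectl get applications -n argocd \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.sync.status}{"\t"}{.status.health.status}{"\n"}{end}' \
    | awk -F'\t' '$2 != "Synced" { print $1 " sync=" $2 " health=" $3 }'
}

# Usage: out_of_sync_apps
```

Printing health next to sync status helps tell an app that is merely behind Git from one that is also degraded.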

Full Guide →

Troubleshooting by Component

Blueberry API

ArgoCD

Kubernetes/GKE

External Services

Debug Commands Cheatsheet

Logs

# API logs (last hour)
kubectl logs -n blueberry deployment/blueberry-api --since=1h

# All containers in a pod
kubectl logs -n blueberry <pod-name> --all-containers

# Previous container logs (after crash)
kubectl logs -n blueberry <pod-name> --previous

Describe Resources

# Pod details and events
kubectl describe pod -n blueberry <pod-name>

# Node issues
kubectl describe node <node-name>

# Service endpoints
kubectl describe svc -n blueberry blueberry-api

Network Debugging

# Test internal DNS (--restart=Never runs a bare pod that --rm can clean up)
kubectl run test-pod --rm -it --restart=Never --image=busybox -- nslookup blueberry-api.blueberry

# Test service connectivity
kubectl run test-pod --rm -it --restart=Never --image=curlimages/curl -- curl http://blueberry-api.blueberry/health

# Check ingress
kubectl get ingress -n blueberry -o yaml
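
DNS can resolve and the ingress can look correct while the Service has no ready pods behind it. A small sketch for that case; `service_has_endpoints` is an illustrative helper built on the standard Endpoints resource.

```shell
# service_has_endpoints NAMESPACE SERVICE
# Succeeds if at least one ready address backs the Service.
service_has_endpoints() {
  local ips
  ips=$(kubectl get endpoints -n "$1" "$2" \
    -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)
  [ -n "$ips" ]
}

# Usage:
#   service_has_endpoints blueberry blueberry-api \
#     || echo "no ready pods behind blueberry-api"
```

An empty result usually means failing readiness probes or a selector that no longer matches the pod labels.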

Resource Issues

# Check resource usage
kubectl top nodes
kubectl top pods -n blueberry

# Check resource requests/limits
kubectl describe deployment -n blueberry blueberry-api | grep -A5 "Limits\|Requests"

# Find pods in trouble
kubectl get pods --all-namespaces | grep -E "Evicted|Error|CrashLoop"
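
Pods that keep restarting can show `Running` between crashes and slip past a status grep. The sketch below filters on the RESTARTS column instead. `restarting_pods` and its default threshold of 5 are illustrative choices, and column 5 is assumed to hold the restart count, as it does with `--all-namespaces`.

```shell
# restarting_pods [MIN_RESTARTS]
# Print pods restarted at least MIN_RESTARTS times (default 5).
restarting_pods() {
  kubectl get pods --all-namespaces --no-headers \
    | awk -v min="${1:-5}" \
        '$5 + 0 >= min + 0 { print $1 "/" $2, "restarts=" $5, "status=" $4 }'
}

# Usage: restarting_pods 10
```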

Emergency Procedures

🚨 Complete API Outage

1. Check the API Outage Playbook
2. Restart the API by scaling to 0 and back: kubectl scale deployment/blueberry-api -n blueberry --replicas=0, then run the same command with --replicas=<previous-count>
3. Check recent changes in Git
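
Scaling to zero is only safe if the original replica count is restored afterwards. The sketch below records it first. `bounce_deployment` is a hypothetical helper, the 120s timeout is arbitrary, and it assumes no autoscaler is managing the deployment.

```shell
# bounce_deployment NAMESPACE DEPLOYMENT
# Scale a deployment to 0 and back to its previous replica count.
bounce_deployment() {
  local ns="$1" deploy="$2" replicas
  replicas=$(kubectl get deployment "$deploy" -n "$ns" \
    -o jsonpath='{.spec.replicas}') || return 1

  # Nothing sensible to restore if the count is unknown or already 0.
  if [ -z "$replicas" ] || [ "$replicas" -eq 0 ]; then
    echo "refusing to bounce: current replica count is '${replicas}'" >&2
    return 1
  fi

  kubectl scale "deployment/$deploy" -n "$ns" --replicas=0
  kubectl scale "deployment/$deploy" -n "$ns" --replicas="$replicas"
  # Block until the new pods are ready or the timeout expires.
  kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=120s
}

# Usage: bounce_deployment blueberry blueberry-api
```

When a brief full stop is not actually required, `kubectl rollout restart deployment/blueberry-api -n blueberry` replaces the pods without dropping to zero.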

🚨 Database Unreachable

1. Check Firestore status in GCP Console
2. Verify service account permissions
3. Test with gcloud CLI
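
One way to verify the service account from the command line is to list the project roles bound to it. This is a sketch under assumptions: the helper names are made up, the project ID and service account email are placeholders, and it only looks for `roles/datastore.*` (the role family that covers Firestore), so access granted through broader roles such as Owner will not be detected.

```shell
# sa_roles PROJECT_ID SERVICE_ACCOUNT_EMAIL
# Print the project-level IAM roles bound to the service account.
sa_roles() {
  gcloud projects get-iam-policy "$1" \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:$2" \
    --format="value(bindings.role)"
}

# has_firestore_role PROJECT_ID SERVICE_ACCOUNT_EMAIL
# Succeeds if any roles/datastore.* role is bound to the account.
has_firestore_role() {
  sa_roles "$1" "$2" | grep -q '^roles/datastore\.'
}

# Usage:
#   has_firestore_role <project-id> <sa-email> || echo "no Firestore role bound"
```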

🚨 Mass Environment Failures

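1. Check ArgoCD controller: kubectl logs -n argocd deployment/argocd-application-controller (use statefulset/argocd-application-controller if your ArgoCD version runs the controller as a StatefulSet)
2. Pause environment creation
3. Fix root cause before resuming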
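
To gauge how widespread the failure is before and after a fix, the ArgoCD applications can be tallied by health status. A minimal sketch; `app_health_counts` is an illustrative name and it assumes the applications live in the `argocd` namespace.

```shell
# Print "<health-status> <count>" for every ArgoCD Application health status in use.
app_health_counts() {
  kubectl get applications -n argocd \
    -o jsonpath='{range .items[*]}{.status.health.status}{"\n"}{end}' \
    | awk 'NF { count[$1]++ } END { for (s in count) print s, count[s] }' \
    | sort
}

# Usage: app_health_counts
```

Running it again after the fix shows whether the Degraded count is actually falling before environment creation is resumed.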
