Readiness Checks Improvements
Problem Statement
The readiness checks were too aggressive, marking environments as FAILED too quickly during normal provisioning in GKE Autopilot:
Environment status updated from provisioning to EnvironmentStatus.FAILED.
Environment not ready - 4 check(s) failed: ArgoCD Sync Status, ArgoCD Health Status, Service URL Health, Service URL Health
Root Causes
- Short grace period: Only 2 minutes before health checks started
- Strict failure criteria: Normal provisioning states (OutOfSync, Missing resources) were treated as failures
- No retry logic: First failure immediately marked environment as FAILED
- Aggressive frontend polling: Status checks every 30 seconds from the start
Solutions Implemented
1. Extended Grace Period
- Increased from 120 seconds (2 minutes) to 300 seconds (5 minutes)
- Gives GKE Autopilot more time to provision nodes and start pods
- During the grace period, checks return only a WARNING status, never FAILED
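The grace-period rule can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names and the status strings are hypothetical, and only the 300-second value comes from this guide.

```python
from datetime import datetime, timedelta, timezone

# Mirrors environmentMonitoringGracePeriodSeconds (300 seconds = 5 minutes).
GRACE_PERIOD_SECONDS = 300


def in_grace_period(created_at: datetime, now: datetime,
                    grace_seconds: int = GRACE_PERIOD_SECONDS) -> bool:
    """Return True while the environment is younger than the grace period."""
    return (now - created_at) < timedelta(seconds=grace_seconds)


def status_during_provisioning(created_at: datetime, now: datetime) -> str:
    # Hypothetical return values: "WARNING" stands for "still provisioning,
    # do not fail yet"; "RUN_CHECKS" means the real readiness checks apply.
    if in_grace_period(created_at, now):
        return "WARNING"
    return "RUN_CHECKS"


if __name__ == "__main__":
    created = datetime(2024, 1, 1, tzinfo=timezone.utc)
    # At 2 minutes the old 120-second grace period would already have ended.
    print(status_during_provisioning(created, created + timedelta(minutes=2)))
```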
2. Smarter Status Determination
- WARNING vs FAILED: Transient issues now result in WARNING status
- Non-recoverable errors only: Only mark as FAILED for:
  - "not found" errors (ArgoCD app doesn't exist)
  - Permission errors
  - Other unrecoverable issues
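One way to express the WARNING versus FAILED split is a small classifier over error messages. The sketch below is an assumption about how such matching might look: the `CheckStatus` enum, the `classify_error` function, and the marker strings are all hypothetical, and the guide does not say which messages the real checks match on.

```python
from enum import Enum


class CheckStatus(Enum):
    PASSED = "passed"
    WARNING = "warning"
    FAILED = "failed"


# Hypothetical substrings that identify non-recoverable errors.
# The real list of unrecoverable conditions is not given in this guide.
NON_RECOVERABLE_MARKERS = ("not found", "permission denied", "forbidden")


def classify_error(message: str) -> CheckStatus:
    """Map an error message to FAILED only when it looks non-recoverable."""
    lowered = message.lower()
    if any(marker in lowered for marker in NON_RECOVERABLE_MARKERS):
        return CheckStatus.FAILED
    # Everything else is treated as transient during provisioning.
    return CheckStatus.WARNING


if __name__ == "__main__":
    print(classify_error("application 'demo' not found"))
    print(classify_error("connection refused"))
```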
3. ArgoCD Check Improvements
Sync Status Check
- OutOfSync and Unknown changed from FAILED to WARNING
- These are normal states during initial provisioning
- Only truly problematic sync states result in failure
Health Status Check
- Missing changed from FAILED to WARNING
- Resources may not exist yet during provisioning
- Only Degraded status results in FAILED
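The sync and health rules above amount to two lookup tables. The following sketch assumes plain string results and hypothetical function names. The fallbacks for values this guide does not mention (an unrecognized sync status fails, any health status other than Degraded warns) are assumptions, not documented behavior.

```python
# Sync status -> check result, per the rules in this guide.
SYNC_STATUS_RESULTS = {
    "Synced": "PASSED",
    "OutOfSync": "WARNING",  # normal during initial provisioning
    "Unknown": "WARNING",    # normal during initial provisioning
}

# Health status -> check result, per the rules in this guide.
HEALTH_STATUS_RESULTS = {
    "Healthy": "PASSED",
    "Missing": "WARNING",    # resources may not exist yet
    "Degraded": "FAILED",    # the only health status that fails
}


def evaluate_sync(status: str) -> str:
    # Assumption: a sync status outside the table is treated as a failure.
    return SYNC_STATUS_RESULTS.get(status, "FAILED")


def evaluate_health(status: str) -> str:
    # Assumption: since only Degraded fails, other statuses default to WARNING.
    return HEALTH_STATUS_RESULTS.get(status, "WARNING")


if __name__ == "__main__":
    print(evaluate_sync("OutOfSync"), evaluate_health("Missing"))
```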
4. URL Health Check Age Awareness
- Environments younger than 10 minutes get WARNING for unreachable URLs
- Older environments get FAILED for unreachable URLs
- Recognizes that services take time to become available
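The age-aware URL rule can be sketched as a single function. The function name and the string results are hypothetical. The 10-minute threshold is the one stated above, and an environment that is exactly 10 minutes old is treated here as no longer young.

```python
from datetime import timedelta

# Environments younger than this get WARNING instead of FAILED.
YOUNG_ENVIRONMENT_THRESHOLD = timedelta(minutes=10)


def url_check_status(reachable: bool, environment_age: timedelta) -> str:
    """Return the URL health result, taking the environment's age into account."""
    if reachable:
        return "PASSED"
    if environment_age < YOUNG_ENVIRONMENT_THRESHOLD:
        # Services take time to become available; do not fail yet.
        return "WARNING"
    return "FAILED"


if __name__ == "__main__":
    print(url_check_status(False, timedelta(minutes=4)))
    print(url_check_status(False, timedelta(minutes=25)))
```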
5. Frontend Progressive Backoff
- First 3 status checks: every 30 seconds
- After that: every 60 seconds
- Reduces load on the system during provisioning
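The backoff schedule is simple enough to state as a function of how many checks have already run. The sketch below is written in Python for consistency with the test script, even though the real logic lives in the frontend. Both function names are hypothetical.

```python
from itertools import accumulate


def polling_interval_seconds(checks_done: int) -> int:
    """Seconds to wait before the next status check."""
    # First 3 checks are 30 seconds apart, later checks 60 seconds apart.
    return 30 if checks_done < 3 else 60


def check_times(count: int) -> list:
    """Elapsed seconds at which each of the first `count` checks fires."""
    return list(accumulate(polling_interval_seconds(i) for i in range(count)))


if __name__ == "__main__":
    print(check_times(5))  # [30, 60, 90, 150, 210]
```

With a flat 30-second interval, the same 210 seconds would contain seven checks instead of five.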
Configuration
The following settings control readiness check behavior:
# charts/blueberry/values.yaml
config:
  environmentMonitoringGracePeriodSeconds: '300'  # 5 minutes
  environmentMonitoringIntervalSeconds: '30'      # How often backend checks
  environmentProvisioningTimeoutHours: '2.0'      # When to give up
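The values are quoted strings in the chart, so whatever consumes them has to convert them to numbers. How the backend actually receives and parses these settings is not shown in this guide. The sketch below only illustrates the conversion, with a hypothetical `MonitoringConfig` class and a plain dictionary standing in for the loaded `config` block.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringConfig:
    grace_period_seconds: int
    interval_seconds: int
    provisioning_timeout_hours: float

    @classmethod
    def from_values(cls, config: dict) -> "MonitoringConfig":
        """Convert the string values from the chart into numeric settings."""
        return cls(
            grace_period_seconds=int(config["environmentMonitoringGracePeriodSeconds"]),
            interval_seconds=int(config["environmentMonitoringIntervalSeconds"]),
            provisioning_timeout_hours=float(config["environmentProvisioningTimeoutHours"]),
        )


if __name__ == "__main__":
    # Stand-in for the `config` block of charts/blueberry/values.yaml.
    values = {
        "environmentMonitoringGracePeriodSeconds": "300",
        "environmentMonitoringIntervalSeconds": "30",
        "environmentProvisioningTimeoutHours": "2.0",
    }
    print(MonitoringConfig.from_values(values))
```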
Testing
Use the test script to verify behavior:
python scripts/test_readiness_checks.py
Expected output:
- Young environments stay in grace period
- Typical provisioning issues result in WARNING (stay in PROVISIONING)
- Only non-recoverable errors result in FAILED status
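The expected outcomes correspond to a simple aggregation over individual check results. This sketch is not taken from scripts/test_readiness_checks.py. The function name and the "READY" status are hypothetical, and it assumes each check reports one of "PASSED", "WARNING", or "FAILED".

```python
from typing import Sequence


def overall_status(check_results: Sequence[str]) -> str:
    """Combine individual check results into one environment status."""
    if "FAILED" in check_results:
        # Any non-recoverable failure marks the environment as FAILED.
        return "FAILED"
    if "WARNING" in check_results:
        # Transient issues keep the environment in PROVISIONING.
        return "PROVISIONING"
    return "READY"  # hypothetical name for the fully ready state


if __name__ == "__main__":
    print(overall_status(["PASSED", "WARNING", "WARNING"]))
```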
Impact
- Environments have more time to provision successfully
- Fewer false-positive failures during normal GKE Autopilot provisioning
- Better user experience with appropriate status messages
- Reduced system load from less aggressive polling
Monitoring
Watch for:
- Environments stuck in PROVISIONING for too long (> 2 hours)
- Actual failures being missed (should be rare)
- Grace period effectiveness in your environment
Adjust environmentMonitoringGracePeriodSeconds based on your typical provisioning times.
Document ID: guides/monitoring/readiness-checks-improvements