Readiness Checks Improvements

Problem Statement

The readiness checks were too aggressive: during normal provisioning on GKE Autopilot, they marked environments as FAILED before the workloads had time to come up. A typical log excerpt:

Environment status updated from provisioning to EnvironmentStatus.FAILED.
Environment not ready - 4 check(s) failed: ArgoCD Sync Status, ArgoCD Health Status, Service URL Health, Service URL Health

Root Causes

- Short grace period: health checks started after only 2 minutes.
- Strict failure criteria: normal provisioning states (OutOfSync, Missing resources) were treated as failures.
- No retry logic: the first failed check immediately marked the environment as FAILED.
- Aggressive frontend polling: status checks ran every 30 seconds from the start.

Solutions Implemented

1. Extended Grace Period

- Increased from 120 seconds (2 minutes) to 300 seconds (5 minutes).
- Gives GKE Autopilot more time to provision nodes and start pods.
- During the grace period, the checks return only a WARNING status.
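
The grace-period rule can be sketched as follows. This is a minimal illustration, not the actual backend code: the function names and the use of `None` to mean "run the real checks" are assumptions, and only the 300-second value comes from this guide.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Mirrors environmentMonitoringGracePeriodSeconds (300 seconds in this guide).
GRACE_PERIOD = timedelta(seconds=300)


def in_grace_period(created_at: datetime, now: datetime,
                    grace_period: timedelta = GRACE_PERIOD) -> bool:
    """Return True while the environment is younger than the grace period."""
    return now - created_at < grace_period


def readiness_during_startup(created_at: datetime, now: datetime) -> Optional[str]:
    """Return "WARNING" inside the grace period.

    Returns None once the grace period has elapsed, which in this sketch
    (hypothetical convention) means the real readiness checks should run.
    """
    if in_grace_period(created_at, now):
        return "WARNING"
    return None


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(readiness_during_startup(created, created + timedelta(seconds=120)))
```

With the previous 120-second grace period, an environment at the 2-minute mark would already have been subject to the full checks; with 300 seconds it still reports WARNING.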

2. Smarter Status Determination

- WARNING vs FAILED: transient issues now result in a WARNING status.
- Non-recoverable errors only: an environment is marked FAILED only for:
  - "not found" errors (the ArgoCD app doesn't exist)
  - Permission errors
  - Other unrecoverable issues
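
One way to picture this split is a small classifier over check error messages. The sketch below is an assumption-laden illustration: the marker substrings and the function name are hypothetical, and the real checks may detect non-recoverable errors differently (for example by exception type or HTTP status code). The "other unrecoverable issues" category is not specified in this guide, so it is not modeled here.

```python
# Hypothetical substrings that signal a non-recoverable error.
# The real implementation may use different signals entirely.
NON_RECOVERABLE_MARKERS = ("not found", "permission denied", "forbidden")


def classify_check_error(message: str) -> str:
    """Map a check error message to FAILED (non-recoverable) or WARNING (transient)."""
    lowered = message.lower()
    if any(marker in lowered for marker in NON_RECOVERABLE_MARKERS):
        return "FAILED"
    # Everything else is assumed to be transient and worth retrying.
    return "WARNING"


print(classify_check_error("ArgoCD application 'env-1' not found"))
print(classify_check_error("connection timed out"))
```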

3. ArgoCD Check Improvements

Sync Status Check

- OutOfSync and Unknown changed from FAILED to WARNING.
- These are normal states during initial provisioning.
- Only truly problematic sync states result in failure.

Health Status Check

- Missing changed from FAILED to WARNING.
- Resources may not exist yet during provisioning.
- Only a Degraded status results in FAILED.
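
The two ArgoCD mappings could look roughly like the sketch below. The function names and the "PASSED" result label are hypothetical. Two choices are assumptions not stated in this guide: health states other than Healthy, Missing, and Degraded (such as Progressing) are treated as WARNING, and unrecognized sync states are treated as FAILED.

```python
def sync_check_status(sync_status: str) -> str:
    """Map an ArgoCD sync status to a readiness check result."""
    if sync_status == "Synced":
        return "PASSED"
    if sync_status in ("OutOfSync", "Unknown"):
        # Normal during initial provisioning.
        return "WARNING"
    # Assumption: anything unrecognized counts as a problematic state.
    return "FAILED"


def health_check_status(health_status: str) -> str:
    """Map an ArgoCD health status to a readiness check result."""
    if health_status == "Healthy":
        return "PASSED"
    if health_status == "Degraded":
        # The only health status that results in FAILED.
        return "FAILED"
    # Missing, plus (by assumption) Progressing and other in-between states.
    return "WARNING"


print(sync_check_status("OutOfSync"), health_check_status("Missing"))
```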

4. URL Health Check Age Awareness

- Environments younger than 10 minutes get WARNING for unreachable URLs.
- Older environments get FAILED for unreachable URLs.
- This recognizes that services take time to become available.
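
The age-aware rule reduces to a small decision function. In this sketch the function name, the constant name, and the "PASSED" label are hypothetical; only the 10-minute threshold and the WARNING/FAILED outcomes come from this guide.

```python
from datetime import timedelta

# Threshold from this guide: unreachable URLs are tolerated for 10 minutes.
URL_CHECK_LENIENCY = timedelta(minutes=10)


def url_check_status(reachable: bool, environment_age: timedelta) -> str:
    """Return the URL health check result for an environment of the given age."""
    if reachable:
        return "PASSED"
    if environment_age < URL_CHECK_LENIENCY:
        # Young environment: the service may still be starting.
        return "WARNING"
    return "FAILED"


print(url_check_status(False, timedelta(minutes=4)))
print(url_check_status(False, timedelta(minutes=25)))
```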

5. Frontend Progressive Backoff

- First 3 status checks: every 30 seconds.
- After that: every 60 seconds.
- Reduces load on the system during provisioning.
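
The polling schedule can be expressed as a function of how many checks have already run. The frontend is presumably not written in Python, so this is only an illustration of the schedule; the function name is hypothetical.

```python
def polling_interval_seconds(completed_checks: int) -> int:
    """Delay before the next status check, given how many checks have already run."""
    # The first 3 checks are spaced 30 seconds apart; later ones 60 seconds apart.
    return 30 if completed_checks < 3 else 60


# Intervals preceding the first five checks.
schedule = [polling_interval_seconds(n) for n in range(5)]
print(schedule)  # [30, 30, 30, 60, 60]
```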

Configuration

The following settings control readiness check behavior:

# charts/blueberry/values.yaml
config:
  environmentMonitoringGracePeriodSeconds: '300'  # 5 minutes
  environmentMonitoringIntervalSeconds: '30'      # How often backend checks
  environmentProvisioningTimeoutHours: '2.0'      # When to give up
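
Note that the values above are quoted strings, so whatever consumes them has to convert them to numbers. The sketch below shows one way to do that; how the chart actually hands these values to the backend (environment variables, a mounted file, or something else) is not covered in this guide, so the `raw_config` dictionary and the function name are assumptions.

```python
from datetime import timedelta

# Hypothetical: the string values as the application might receive them.
raw_config = {
    "environmentMonitoringGracePeriodSeconds": "300",
    "environmentMonitoringIntervalSeconds": "30",
    "environmentProvisioningTimeoutHours": "2.0",
}


def parse_monitoring_config(raw: dict) -> dict:
    """Convert the string settings into timedelta values."""
    return {
        "grace_period": timedelta(seconds=int(raw["environmentMonitoringGracePeriodSeconds"])),
        "check_interval": timedelta(seconds=int(raw["environmentMonitoringIntervalSeconds"])),
        "provisioning_timeout": timedelta(hours=float(raw["environmentProvisioningTimeoutHours"])),
    }


settings = parse_monitoring_config(raw_config)
# Number of backend checks that fit into the provisioning timeout.
print(settings["provisioning_timeout"] / settings["check_interval"])  # 240.0
```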

Testing

Use the test script to verify behavior:

python scripts/test_readiness_checks.py

Expected output:
- Young environments stay in grace period
- Typical provisioning issues result in WARNING (stay in PROVISIONING)
- Only non-recoverable errors result in FAILED status

Impact

- Environments have more time to provision successfully.
- Fewer false-positive failures during normal GKE Autopilot provisioning.
- Better user experience with appropriate status messages.
- Reduced system load from less aggressive polling.

Monitoring

Watch for:
- Environments stuck in PROVISIONING for too long (> 2 hours)
- Actual failures being missed (should be rare)
- Grace period effectiveness in your environment

Adjust environmentMonitoringGracePeriodSeconds based on your typical provisioning times.

Document ID: guides/monitoring/readiness-checks-improvements
