Readiness Checks Improvements

Problem Statement

The readiness checks were too aggressive: during normal provisioning on GKE Autopilot they marked environments as FAILED too quickly, producing log output such as:

Environment status updated from provisioning to EnvironmentStatus.FAILED.
Environment not ready - 4 check(s) failed: ArgoCD Sync Status, ArgoCD Health Status, Service URL Health, Service URL Health

Root Causes

  1. Short grace period: Only 2 minutes before health checks started
  2. Strict failure criteria: Normal provisioning states (OutOfSync, Missing resources) were treated as failures
  3. No retry logic: First failure immediately marked environment as FAILED
  4. Aggressive frontend polling: Status checks every 30 seconds from the start

Solutions Implemented

1. Extended Grace Period

  • Increased from 120 seconds (2 minutes) to 300 seconds (5 minutes)
  • Gives GKE Autopilot more time to provision nodes and start pods
  • During grace period, only a warning status is returned
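
The grace-period rule above can be sketched as follows. This is a minimal illustration, not the actual implementation: the function name `grace_period_status` and the convention of returning `None` once the grace period has elapsed are assumptions made for this example; only the 300-second value comes from this guide.

```python
from datetime import datetime, timedelta, timezone

# Mirrors environmentMonitoringGracePeriodSeconds (300 seconds = 5 minutes).
GRACE_PERIOD_SECONDS = 300


def grace_period_status(created_at, now=None, grace_seconds=GRACE_PERIOD_SECONDS):
    """Return "WARNING" while the environment is inside the grace period.

    Returns None once the grace period has elapsed, which signals the
    (hypothetical) caller to run the real readiness checks.
    """
    now = now or datetime.now(timezone.utc)
    if now - created_at < timedelta(seconds=grace_seconds):
        return "WARNING"
    return None


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
# Two minutes in: the old 120-second limit would already have expired.
print(grace_period_status(created, now=created + timedelta(seconds=120)))
# Five minutes in: the grace period is over, so the full checks should run.
print(grace_period_status(created, now=created + timedelta(seconds=300)))
```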

2. Smarter Status Determination

  • WARNING vs FAILED: Transient issues now result in WARNING status
  • Non-recoverable errors only: an environment is marked FAILED only for:
      • "not found" errors (ArgoCD app doesn't exist)
      • Permission errors
      • Other unrecoverable issues
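
One way to express the WARNING vs FAILED split is a small classifier over error messages. The sketch below is illustrative only: the marker strings other than "not found" ("permission denied", "forbidden") and the function name `classify_error` are assumptions, not values taken from the real checks.

```python
# Substrings that indicate a non-recoverable error. Only "not found" is named
# in this guide; the permission-related markers are assumed for illustration.
NON_RECOVERABLE_MARKERS = ("not found", "permission denied", "forbidden")


def classify_error(message):
    """Map an error message to FAILED (non-recoverable) or WARNING (transient)."""
    text = message.lower()
    if any(marker in text for marker in NON_RECOVERABLE_MARKERS):
        return "FAILED"
    # Everything else is treated as transient and may resolve on its own.
    return "WARNING"


print(classify_error("application 'demo' not found"))  # non-recoverable
print(classify_error("connection refused"))            # transient
```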

3. ArgoCD Check Improvements

Sync Status Check

  • OutOfSync and Unknown changed from FAILED to WARNING
  • These are normal states during initial provisioning
  • Only truly problematic sync states result in failure

Health Status Check

  • Missing changed from FAILED to WARNING
  • Resources may not exist yet during provisioning
  • Only Degraded status results in FAILED
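
The two ArgoCD mappings above can be summarized in a short sketch. The result name "PASSED" and the function names are assumptions for illustration, as is the choice to treat an unrecognized sync value as FAILED; the status values themselves (Synced, OutOfSync, Unknown, Healthy, Missing, Degraded) are standard ArgoCD statuses.

```python
def sync_check_result(sync_status):
    """Map an ArgoCD sync status to a check result."""
    if sync_status == "Synced":
        return "PASSED"
    if sync_status in ("OutOfSync", "Unknown"):
        # Normal states during initial provisioning.
        return "WARNING"
    # Assumption: any unrecognized sync state is treated as problematic.
    return "FAILED"


def health_check_result(health_status):
    """Map an ArgoCD health status to a check result."""
    if health_status == "Healthy":
        return "PASSED"
    if health_status == "Degraded":
        # The only health status that results in FAILED.
        return "FAILED"
    # Missing (and any other non-Degraded status) may still resolve.
    return "WARNING"


print(sync_check_result("OutOfSync"), health_check_result("Missing"))
print(sync_check_result("Synced"), health_check_result("Degraded"))
```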

4. URL Health Check Age Awareness

  • Environments younger than 10 minutes get WARNING for unreachable URLs
  • Older environments get FAILED for unreachable URLs
  • Recognizes that services take time to become available
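
The age-aware rule can be sketched as a single function. The names `url_check_result` and `URL_LENIENCY_WINDOW` and the "PASSED" result are hypothetical; the 10-minute threshold is the one stated above, and an environment that is exactly 10 minutes old is treated here as no longer "younger than 10 minutes".

```python
from datetime import timedelta

# Environments younger than this get WARNING instead of FAILED.
URL_LENIENCY_WINDOW = timedelta(minutes=10)


def url_check_result(reachable, environment_age):
    """Return the URL health result, taking the environment's age into account."""
    if reachable:
        return "PASSED"
    if environment_age < URL_LENIENCY_WINDOW:
        # The service may simply not be up yet.
        return "WARNING"
    return "FAILED"


print(url_check_result(False, timedelta(minutes=4)))   # young environment
print(url_check_result(False, timedelta(minutes=25)))  # older environment
```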

5. Frontend Progressive Backoff

  • First 3 status checks: every 30 seconds
  • After that: every 60 seconds
  • Reduces load on the system during provisioning
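
The backoff schedule is simple enough to state as a function. The real frontend is presumably not written in Python; this is a language-neutral sketch in Python for consistency with the test script, and the function name `polling_interval_seconds` is an assumption.

```python
def polling_interval_seconds(check_number):
    """Return the wait before status check `check_number` (1-based).

    The first 3 checks are 30 seconds apart; later checks are 60 seconds apart.
    """
    return 30 if check_number <= 3 else 60


# Intervals for the first five checks.
print([polling_interval_seconds(n) for n in range(1, 6)])
```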

Configuration

The following settings control readiness check behavior:

# charts/blueberry/values.yaml
config:
  environmentMonitoringGracePeriodSeconds: '300'  # Grace period before checks count (5 minutes)
  environmentMonitoringIntervalSeconds: '30'      # How often the backend runs checks
  environmentProvisioningTimeoutHours: '2.0'      # When to give up on provisioning
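
Note that the values are quoted, so they arrive as strings and need to be converted before use. A minimal sketch of that conversion follows; the `MonitoringConfig` class and `parse_monitoring_config` function are hypothetical and do not describe how the backend actually loads its settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringConfig:
    grace_period_seconds: int
    interval_seconds: int
    provisioning_timeout_hours: float


def parse_monitoring_config(raw):
    """Convert the string values from the `config:` mapping into numbers."""
    return MonitoringConfig(
        grace_period_seconds=int(raw["environmentMonitoringGracePeriodSeconds"]),
        interval_seconds=int(raw["environmentMonitoringIntervalSeconds"]),
        provisioning_timeout_hours=float(raw["environmentProvisioningTimeoutHours"]),
    )


config = parse_monitoring_config({
    "environmentMonitoringGracePeriodSeconds": "300",
    "environmentMonitoringIntervalSeconds": "30",
    "environmentProvisioningTimeoutHours": "2.0",
})
print(config)
```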

Testing

Use the test script to verify behavior:

python scripts/test_readiness_checks.py

Expected output:

  • Young environments stay in grace period
  • Typical provisioning issues result in WARNING (stay in PROVISIONING)
  • Only non-recoverable errors result in FAILED status

Impact

  • Environments have more time to provision successfully
  • Fewer false-positive failures during normal GKE Autopilot provisioning
  • Better user experience with appropriate status messages
  • Reduced system load from less aggressive polling

Monitoring

Watch for:

  • Environments stuck in PROVISIONING for too long (> 2 hours)
  • Actual failures being missed (should be rare)
  • Grace period effectiveness in your environment

Adjust environmentMonitoringGracePeriodSeconds based on your typical provisioning times.

Document ID: guides/monitoring/readiness-checks-improvements