Blueberry IDP System Architecture - Monitoring & Background Tasks
Overview
This document uses architectural diagrams to explain the monitoring, readiness-check, and background-task systems in Blueberry IDP.
1. High-Level System Architecture
The first diagram shows how readiness checks, monitoring, and background tasks interact:
Key Components:
- Readiness Check System: Validates environment health once a 5-minute grace period has elapsed
- Background Tasks: Automated processes running on schedules
- Monitoring & Metrics: Performance tracking and observability
- Status Decision Logic: Derives the environment status from check results, distinguishing warnings from non-recoverable errors
Important Flows:
- Grace Period: New environments get 5 minutes before health checks start
- Progressive Status: Warnings do not fail an environment; only non-recoverable errors do
- Continuous Monitoring: Every 30 seconds for provisioning environments
- Cleanup Automation: Every 6 hours for expired/stuck environments
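The progressive status rule can be sketched as a small function. This is a minimal illustration, not Blueberry IDP's actual implementation: the `CheckResult` enum, the `decide_status` name, and the handling of an empty result list are all assumptions.

```python
from enum import Enum


class CheckResult(Enum):
    PASS = "pass"
    WARNING = "warning"  # recoverable, e.g. a sync still in progress
    ERROR = "error"      # non-recoverable


def decide_status(results):
    """Map a list of CheckResult values to an environment status string."""
    if any(r is CheckResult.ERROR for r in results):
        return "FAILED"
    if not results or any(r is CheckResult.WARNING for r in results):
        # Assumption: with no results yet, keep waiting rather than declare READY.
        return "PROVISIONING"
    return "READY"
```

With this shape, a single warning never fails an environment; it only delays the transition to READY.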
2. Monitoring Flow Details
The second diagram breaks down the monitoring and cleanup task flows:
Environment Monitoring (Every 30s):
- Finds all PROVISIONING environments
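- Checks each environment's age to determine whether it is still in the grace period
- Runs readiness checks once the grace period has elapsed
- Sets the status from the check results: warnings keep the environment in PROVISIONING, non-recoverable errors fail it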
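One pass of the monitoring loop might look like the sketch below. It assumes environments are plain dictionaries with `id`, `status`, and `created_at` keys, and that the readiness checks are supplied as a callable returning the new status; these names are hypothetical and not taken from the Blueberry IDP codebase.

```python
from datetime import datetime, timedelta, timezone

# Mirrors environmentMonitoringGracePeriodSeconds: 300
GRACE_PERIOD = timedelta(seconds=300)


def in_grace_period(created_at, now):
    """True while the environment is younger than the grace period."""
    return now - created_at < GRACE_PERIOD


def monitor_pass(environments, run_readiness_checks, now):
    """Return {env_id: status} for every PROVISIONING environment."""
    decisions = {}
    for env in environments:
        if env["status"] != "PROVISIONING":
            continue  # only provisioning environments are monitored
        if in_grace_period(env["created_at"], now):
            # Too young to judge: skip the checks and leave the status alone.
            decisions[env["id"]] = "PROVISIONING"
        else:
            decisions[env["id"]] = run_readiness_checks(env)
    return decisions
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the grace-period logic easy to test.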
Cleanup Tasks (Every 6 hours):
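- Stuck Provisioning: Environments in PROVISIONING for more than 2 hours are marked FAILED
- Expired Environments: Environments past their TTL are marked TERMINATED
- Old Terminated: Environments terminated more than 7 days ago are hard deleted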
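The three cleanup rules can be expressed as a single classification function. The sketch below assumes dictionary-shaped environments with `created_at`, `expires_at`, and `terminated_at` timestamps; the field names, the action strings, and the order in which the rules are evaluated are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

PROVISIONING_TIMEOUT = timedelta(hours=2)  # provisioningTimeoutHours
TERMINATED_RETENTION = timedelta(days=7)   # terminatedEnvironmentDays


def cleanup_action(env, now):
    """Return the cleanup action for one environment, or None."""
    status = env["status"]
    if status == "PROVISIONING" and now - env["created_at"] > PROVISIONING_TIMEOUT:
        return "mark_failed"
    if status == "TERMINATED":
        terminated_at = env.get("terminated_at")
        if terminated_at is not None and now - terminated_at > TERMINATED_RETENTION:
            return "hard_delete"
        return None
    expires_at = env.get("expires_at")
    if expires_at is not None and now > expires_at:
        return "terminate"
    return None
```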
Frontend Status Checking:
- Progressive backoff: polls at intervals of 30s, 30s, 30s, then 60s
- Reduces server load while maintaining responsiveness
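The backoff schedule reduces to a one-line rule. The function below is sketched in Python purely for illustration, since the actual frontend code is not shown in this document; the name `poll_interval_seconds` is hypothetical, and it assumes polling continues at 60 seconds after the first three attempts.

```python
def poll_interval_seconds(attempt: int) -> int:
    """Delay before the given poll (0-based): three 30s waits, then 60s."""
    return 30 if attempt < 3 else 60


# Delays for the first five polls
schedule = [poll_interval_seconds(i) for i in range(5)]
```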
3. Provisioning Timeline
The sequence diagram shows a typical environment provisioning flow:
Phase 1: Grace Period (0-5 minutes)
- No detailed health checks
- Frontend checks every 30 seconds
- Environment protected from premature failure
Phase 2: Active Monitoring (5+ minutes)
- ArgoCD sync/health checks return WARNINGs for normal states
- URL checks are age-aware: a failing URL check on an environment younger than 10 minutes yields a WARNING
- Environment stays in PROVISIONING with warnings
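An age-aware URL check could be modeled as below. This is a sketch under stated assumptions: the function name and the severity strings are hypothetical, and the document does not say what a failing check returns after 10 minutes, so treating it as an ERROR is a guess.

```python
from datetime import timedelta

URL_WARNING_WINDOW = timedelta(minutes=10)


def url_check_severity(reachable: bool, env_age: timedelta) -> str:
    """Classify a URL check result, taking the environment's age into account."""
    if reachable:
        return "PASS"
    # Young environments get the benefit of the doubt.
    if env_age < URL_WARNING_WINDOW:
        return "WARNING"
    # Assumption: past the window, an unreachable URL counts as an error.
    return "ERROR"
```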
Phase 3: Ready State
- All checks pass
- Status updated to READY
- Frontend reloads to show success
4. Cleanup and Cost Tracking
The final diagram details background cleanup and cost tracking:
Cleanup CronJob Tasks:
- Stuck Provisioning: Find and fail environments stuck > 2 hours
- Expired Cleanup: Terminate environments past their TTL
- Hard Deletion: Remove old terminated environments from storage
- Audit Cleanup: Maintain 90-day audit log retention
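Two details of the CronJob lend themselves to a short sketch: capping how many expired environments are handled per run, and computing the audit-log cutoff. The interpretation of `expiredEnvironmentLimit` as a per-run batch cap, the oldest-first ordering, and the dictionary field names are assumptions, not confirmed behavior.

```python
from datetime import datetime, timedelta, timezone

EXPIRED_ENVIRONMENT_LIMIT = 50        # expiredEnvironmentLimit (assumed per-run cap)
AUDIT_RETENTION = timedelta(days=90)  # auditRetentionDays


def select_expired(environments, now, limit=EXPIRED_ENVIRONMENT_LIMIT):
    """Pick at most `limit` expired, not-yet-terminated environments, oldest first."""
    expired = [
        env for env in environments
        if env["status"] != "TERMINATED" and env["expires_at"] <= now
    ]
    expired.sort(key=lambda env: env["expires_at"])
    return expired[:limit]


def audit_cutoff(now):
    """Audit entries older than this timestamp fall outside the retention window."""
    return now - AUDIT_RETENTION
```

A cap of this kind keeps a single run bounded; anything left over would be picked up by the next scheduled run.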
Cost Tracking Flow:
- Starts when environment is created
- Tracks resource usage every 30 minutes
- Calculates final cost on termination
- Stores metrics for billing/reporting
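The cost flow can be illustrated with a small accumulator that records one usage sample per 30-minute interval and sums them on termination. The `CostTracker` class, the sampled resources, and both hourly rates are hypothetical; the document does not specify the pricing model.

```python
SAMPLE_INTERVAL_HOURS = 0.5      # one sample every 30 minutes
CPU_RATE_PER_CORE_HOUR = 0.04    # hypothetical rate
MEMORY_RATE_PER_GB_HOUR = 0.005  # hypothetical rate


class CostTracker:
    """Accumulates usage samples for one environment."""

    def __init__(self):
        self.samples = []  # list of (cpu_cores, memory_gb)

    def record_sample(self, cpu_cores, memory_gb):
        self.samples.append((cpu_cores, memory_gb))

    def final_cost(self):
        """Total cost, charging each sample for one full sampling interval."""
        total = 0.0
        for cpu_cores, memory_gb in self.samples:
            hourly = (cpu_cores * CPU_RATE_PER_CORE_HOUR
                      + memory_gb * MEMORY_RATE_PER_GB_HOUR)
            total += hourly * SAMPLE_INTERVAL_HOURS
        return round(total, 4)
```

Charging each sample for a full interval is the simplest possible model; a real implementation would likely prorate the last partial interval.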
Performance Metrics:
- Provisioning time percentiles (P50, P95, P99)
- Success/failure rates
- Resource utilization
- Cached in Redis for fast access
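Provisioning-time percentiles can be computed with the nearest-rank method, as sketched below. The choice of nearest-rank, the function names, and the shape of the summary dictionary are assumptions; the document does not say how Blueberry IDP calculates them. The Redis caching step is omitted so that the example stays within the standard library.

```python
def percentile(sorted_values, p):
    """Nearest-rank percentile of a pre-sorted list (p is an integer 1..100)."""
    if not sorted_values:
        raise ValueError("no data")
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = (p * len(sorted_values) + 99) // 100
    return sorted_values[max(rank, 1) - 1]


def provisioning_percentiles(durations_seconds):
    """Summarize provisioning durations as P50, P95, and P99."""
    ordered = sorted(durations_seconds)
    return {
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
    }
```

The resulting dictionary is small enough to serialize and cache as a single value.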
Configuration
Key settings that control these systems:
# Monitoring Configuration
environmentMonitoringEnabled: true
environmentMonitoringIntervalSeconds: 30
environmentMonitoringGracePeriodSeconds: 300 # 5 minutes
environmentProvisioningTimeoutHours: 2.0
# Cleanup Configuration
cleanup:
  schedule: "0 */6 * * *" # Every 6 hours
  expiredEnvironmentLimit: 50
  terminatedEnvironmentDays: 7
  provisioningTimeoutHours: 2.0
  auditRetentionDays: 90
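To see which times of day the cleanup schedule `0 */6 * * *` corresponds to, the minute and hour fields can be expanded by hand. The helper below is a deliberately minimal sketch: it understands only `*`, `*/n`, and a single number, ignores the day, month, and weekday fields, and is not a general cron parser. Both function names are hypothetical.

```python
def expand_cron_field(field: str, lo: int, hi: int) -> list[int]:
    """Expand one cron field; supports '*', '*/n', and a single integer."""
    if field == "*":
        return list(range(lo, hi + 1))
    if field.startswith("*/"):
        step = int(field[2:])
        return list(range(lo, hi + 1, step))
    return [int(field)]


def daily_run_times(schedule: str) -> list[str]:
    """Return the HH:MM times at which the schedule fires on a given day."""
    minute_field, hour_field, *_ = schedule.split()
    return [
        f"{hour:02d}:{minute:02d}"
        for hour in expand_cron_field(hour_field, 0, 23)
        for minute in expand_cron_field(minute_field, 0, 59)
    ]


run_times = daily_run_times("0 */6 * * *")
```

For the schedule above this yields four runs per day, at midnight and every six hours after it.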
Best Practices
- Monitor Grace Period Effectiveness: Adjust the grace period to match the provisioning times you observe on GKE
- Review Failure Patterns: Look for common non-recoverable errors
- Cost Optimization: Use metrics to identify expensive environments
- Cleanup Tuning: Adjust limits based on system load
Troubleshooting
Environments Stuck in PROVISIONING
- Check whether the grace period (environmentMonitoringGracePeriodSeconds) is long enough
- Verify ArgoCD applications are created
- Look for resource quota issues
High Failure Rate
- Review non-recoverable error patterns
- Check for permission issues
- Verify cluster capacity
Missing Metrics
- Ensure Redis is accessible
- Check environment monitor logs
- Verify Firestore connectivity
Document ID: guides/monitoring/system-architecture-diagrams