Blueberry IDP System Architecture - Monitoring & Background Tasks

Overview

This document explains the monitoring, readiness checks, and background task systems in Blueberry IDP through architectural diagrams.

1. High-Level System Architecture

The first diagram shows how readiness checks, monitoring, and background tasks interact:

Key Components:

  • Readiness Check System: Validates environment health once a 5-minute grace period has elapsed
  • Background Tasks: Automated processes running on schedules
  • Monitoring & Metrics: Performance tracking and observability
  • Status Decision Logic: Maps check results (pass, warning, non-recoverable error) to an environment status

Important Flows:

  1. Grace Period: New environments get 5 minutes before health checks start
  2. Progressive Status: Warnings do not fail an environment; only non-recoverable errors do
  3. Continuous Monitoring: Every 30 seconds for provisioning environments
  4. Cleanup Automation: Every 6 hours for expired/stuck environments
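
The progressive status rule can be sketched as a small decision function. This is a minimal illustration, not the actual Blueberry IDP implementation: the `CheckResult` enum and `decide_status` function are hypothetical names, and keeping an environment in PROVISIONING when no checks have run yet is an assumption.

```python
from enum import Enum


class CheckResult(Enum):
    """Hypothetical outcome of a single readiness check."""
    PASS = "pass"
    WARNING = "warning"
    ERROR = "error"  # non-recoverable


def decide_status(results):
    """Map a list of check results to an environment status string."""
    if not results:
        # Assumption: with no results yet, the environment keeps provisioning.
        return "PROVISIONING"
    if any(r is CheckResult.ERROR for r in results):
        # Only non-recoverable errors fail an environment.
        return "FAILED"
    if any(r is CheckResult.WARNING for r in results):
        # Warnings keep the environment in PROVISIONING.
        return "PROVISIONING"
    return "READY"


if __name__ == "__main__":
    print(decide_status([CheckResult.PASS, CheckResult.WARNING]))
```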

2. Monitoring Flow Details

The second diagram breaks down the monitoring and cleanup task flows:

Environment Monitoring (Every 30s):

  • Finds all PROVISIONING environments
  • Checks age to determine if in grace period
  • Runs readiness checks after grace period
  • Decides the status from the check results: warnings keep the environment in PROVISIONING, non-recoverable errors fail it
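
As a rough sketch of one monitoring pass, the selection step might look like the following. The `Environment` dataclass and `due_for_checks` function are assumptions made for illustration; only the PROVISIONING filter and the 300-second grace period come from this document.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Matches environmentMonitoringGracePeriodSeconds: 300
GRACE_PERIOD = timedelta(seconds=300)


@dataclass
class Environment:
    """Hypothetical minimal view of an environment record."""
    name: str
    status: str
    created_at: datetime


def due_for_checks(envs, now):
    """Return PROVISIONING environments whose grace period has elapsed."""
    return [
        env for env in envs
        if env.status == "PROVISIONING" and now - env.created_at >= GRACE_PERIOD
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    envs = [
        Environment("young", "PROVISIONING", now - timedelta(minutes=2)),
        Environment("older", "PROVISIONING", now - timedelta(minutes=6)),
    ]
    print([env.name for env in due_for_checks(envs, now)])
```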

Cleanup Tasks (Every 6 hours):

  • Stuck Provisioning: Environments in PROVISIONING for more than 2 hours → FAILED
  • Expired Environments: Environments past their TTL → TERMINATED
  • Old Terminated: Environments terminated more than 7 days ago → hard deleted
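
The three cleanup rules above could be expressed as a single classification function. This is a sketch under assumptions: the function name, its parameters, the action labels, and the choice to apply the TTL rule to every environment that is not yet terminated are all hypothetical; the 2-hour and 7-day thresholds come from the configuration in this document.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

PROVISIONING_TIMEOUT = timedelta(hours=2)  # provisioningTimeoutHours: 2.0
TERMINATED_RETENTION = timedelta(days=7)   # terminatedEnvironmentDays: 7


def cleanup_action(
    status: str,
    created_at: datetime,
    expires_at: Optional[datetime],
    terminated_at: Optional[datetime],
    now: datetime,
) -> Optional[str]:
    """Return the cleanup action for one environment, or None if none applies."""
    if status == "PROVISIONING" and now - created_at > PROVISIONING_TIMEOUT:
        return "FAIL"  # stuck provisioning, mark FAILED
    if status == "TERMINATED":
        if terminated_at is not None and now - terminated_at > TERMINATED_RETENTION:
            return "HARD_DELETE"  # old terminated record, remove from storage
        return None
    if expires_at is not None and now >= expires_at:
        return "TERMINATE"  # past TTL, mark TERMINATED
    return None


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(cleanup_action("PROVISIONING", now - timedelta(hours=3), None, None, now))
```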

Frontend Status Checking:

  • Progressive backoff: polling intervals of 30s → 30s → 30s, then 60s
  • Reduces server load while maintaining responsiveness
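
The backoff schedule can be modeled as a generator of wait times. The frontend is presumably not written in Python, so this is only an illustration of the schedule; `poll_intervals` is a hypothetical name, and staying at 60 seconds after the fourth poll is an assumption, since the document does not say what follows.

```python
from itertools import chain, islice, repeat


def poll_intervals(initial=30, steady=60, initial_count=3):
    """Yield wait times in seconds: `initial` for the first polls, then `steady`."""
    # Assumption: the interval stays at `steady` indefinitely after the ramp.
    return chain(repeat(initial, initial_count), repeat(steady))


if __name__ == "__main__":
    # First five wait times of the schedule.
    print(list(islice(poll_intervals(), 5)))
```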

3. Provisioning Timeline

The sequence diagram shows a typical environment provisioning flow:

Phase 1: Grace Period (0-5 minutes)

  • No detailed health checks
  • Frontend checks every 30 seconds
  • Environment protected from premature failure

Phase 2: Active Monitoring (5+ minutes)

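  • ArgoCD sync and health checks return a WARNING, not an error, for states that are normal while provisioning is still in progress
  • URL checks are age-aware: a check on an environment younger than 10 minutes yields a WARNING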
  • Environment stays in PROVISIONING with warnings
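
An age-aware URL check might be sketched as follows. The function name and result labels are hypothetical. The document only states that environments younger than 10 minutes get a WARNING; treating an unreachable URL on an older environment as an error is an assumption.

```python
from datetime import timedelta

URL_WARNING_WINDOW = timedelta(minutes=10)


def url_check_result(reachable: bool, age: timedelta) -> str:
    """Classify a URL check using the environment's age."""
    if reachable:
        return "PASS"
    if age < URL_WARNING_WINDOW:
        # Young environment: the URL may simply not be serving yet.
        return "WARNING"
    # Assumption: after the window, an unreachable URL counts as an error.
    return "ERROR"


if __name__ == "__main__":
    print(url_check_result(False, timedelta(minutes=7)))
```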

Phase 3: Ready State

  • All checks pass
  • Status updated to READY
  • Frontend reloads to show success

4. Cleanup and Cost Tracking

The final diagram details background cleanup and cost tracking:

Cleanup CronJob Tasks:

  1. Stuck Provisioning: Find and fail environments stuck > 2 hours
  2. Expired Cleanup: Terminate environments past their TTL
  3. Hard Deletion: Remove old terminated environments from storage
  4. Audit Cleanup: Maintain 90-day audit log retention

Cost Tracking Flow:

  • Starts when environment is created
  • Tracks resource usage every 30 minutes
  • Calculates final cost on termination
  • Stores metrics for billing/reporting
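
To make the cost tracking flow concrete, here is a minimal sketch that accumulates 30-minute usage samples and totals them on termination. The `CostTracker` class, the CPU-only pricing model, and the rate used are illustrative assumptions; the document does not describe how costs are actually calculated.

```python
from dataclasses import dataclass, field
from typing import List

SAMPLE_INTERVAL_HOURS = 0.5  # resource usage is tracked every 30 minutes


@dataclass
class CostTracker:
    """Hypothetical tracker: one CPU-core sample per 30-minute interval."""
    hourly_rate_per_core: float  # assumed pricing input, not from the document
    samples: List[float] = field(default_factory=list)

    def record(self, cpu_cores: float) -> None:
        """Store the usage observed for one sampling interval."""
        self.samples.append(cpu_cores)

    def final_cost(self) -> float:
        """Total cost over all recorded intervals, computed on termination."""
        return sum(
            cores * SAMPLE_INTERVAL_HOURS * self.hourly_rate_per_core
            for cores in self.samples
        )


if __name__ == "__main__":
    tracker = CostTracker(hourly_rate_per_core=0.25)
    tracker.record(2.0)
    tracker.record(4.0)
    print(tracker.final_cost())
```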

Performance Metrics:

  • Provisioning time percentiles (P50, P95, P99)
  • Success/failure rates
  • Resource utilization
  • Cached in Redis for fast access
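
Provisioning time percentiles could be computed with the nearest-rank method, as sketched below. The choice of method and the `percentile` helper are assumptions; the document does not say which percentile definition is used, and the sample durations are invented for the example.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("percentile of empty data is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Illustrative provisioning durations in seconds.
    durations = [120, 300, 180, 600, 240]
    for p in (50, 95, 99):
        print(f"P{p}: {percentile(durations, p)}s")
```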

Configuration

Key settings that control these systems:

# Monitoring Configuration
environmentMonitoringEnabled: true
environmentMonitoringIntervalSeconds: 30
environmentMonitoringGracePeriodSeconds: 300  # 5 minutes
environmentProvisioningTimeoutHours: 2.0

# Cleanup Configuration
cleanup:
  schedule: "0 */6 * * *"  # Every 6 hours
  expiredEnvironmentLimit: 50
  terminatedEnvironmentDays: 7
  provisioningTimeoutHours: 2.0
  auditRetentionDays: 90
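
As a quick sanity check on the cleanup schedule, the hour field of the cron expression can be expanded to the hours at which the job fires. This helper is a simplified sketch that only understands `*` and `*/N`; it is not a full cron parser, and `cron_hours` is a hypothetical name.

```python
from typing import List


def cron_hours(hour_field: str) -> List[int]:
    """Expand a cron hour field of the form '*' or '*/N' into firing hours."""
    if hour_field == "*":
        return list(range(24))
    if hour_field.startswith("*/"):
        step = int(hour_field[2:])
        return list(range(0, 24, step))
    raise ValueError(f"unsupported hour field: {hour_field!r}")


if __name__ == "__main__":
    schedule = "0 */6 * * *"
    minute, hour, *_ = schedule.split()
    # Hours of the day (at minute 0) when the cleanup job runs.
    print(cron_hours(hour))
```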

Best Practices

  1. Monitor Grace Period Effectiveness: Adjust based on your GKE provisioning times
  2. Review Failure Patterns: Look for common non-recoverable errors
  3. Cost Optimization: Use metrics to identify expensive environments
  4. Cleanup Tuning: Adjust limits based on system load

Troubleshooting

Environments Stuck in PROVISIONING

  • Check if grace period is sufficient
  • Verify ArgoCD applications are created
  • Look for resource quota issues

High Failure Rate

  • Review non-recoverable error patterns
  • Check for permission issues
  • Verify cluster capacity

Missing Metrics

  • Ensure Redis is accessible
  • Check environment monitor logs
  • Verify Firestore connectivity