End-to-End Provisioning Time Tracking

This document describes the provisioning time tracking system implemented in Blueberry IDP.

Overview

The provisioning tracking system captures detailed timing information throughout an environment's lifecycle:

  • Creation Time: When the environment was first requested
  • Provisioning Start: When ArgoCD application creation begins
  • Ready Time: When all readiness checks pass
  • Termination Start: When deletion is initiated
  • Termination Complete: When all resources are cleaned up

Implementation Details

1. Lifecycle Event Tracking

Events are tracked using the EnvironmentMonitor service:

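from blueberry.services.environment_monitor import EnvironmentMonitor

# Call this from inside a coroutine; environment_id, repo_name, branch,
# and commit are supplied by the caller.
monitor = EnvironmentMonitor()
await monitor.track_environment_lifecycle(
    environment_id,
    "provisioning_started",  # lifecycle event name
    {
        "repository": repo_name,
        "branch": branch,
        "commit_sha": commit[:8],  # abbreviated SHA (first 8 characters)
    },
)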

2. Automatic Status Updates

The EnvironmentMonitorService periodically checks environments that are still provisioning:

  • Runs readiness checks (ArgoCD sync, health, URL availability)
  • Updates status when environment becomes ready
  • Records ready_at timestamp
  • Tracks provisioning duration
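
The status update described above can be sketched roughly as follows. This is an illustration only, not the service's actual code: the functions failed_checks and mark_ready_if_checks_pass, the dictionary-based environment record, and the provisioning_started_at key are all hypothetical, and the three checks are stubbed with constants.

```python
from datetime import datetime, timezone
from typing import Callable


def failed_checks(checks: dict[str, Callable[[], bool]]) -> list[str]:
    """Run every readiness check and return the names of those that failed."""
    return [name for name, check in checks.items() if not check()]


def mark_ready_if_checks_pass(env: dict, checks: dict, now: datetime) -> list[str]:
    """Set status, ready_at, and duration on env once all checks pass."""
    failed = failed_checks(checks)
    if not failed:
        env["status"] = "ready"
        env["ready_at"] = now
        started = env["provisioning_started_at"]  # hypothetical field name
        env["provisioning_duration_seconds"] = (now - started).total_seconds()
    return failed


env = {
    "status": "provisioning",
    "provisioning_started_at": datetime(2025, 1, 7, 10, 0, 0, tzinfo=timezone.utc),
}
# Stubbed checks; a real monitor would query ArgoCD and the environment URL.
checks = {
    "argocd_sync": lambda: True,
    "argocd_health": lambda: True,
    "url_available": lambda: True,
}
failed = mark_ready_if_checks_pass(
    env, checks, now=datetime(2025, 1, 7, 10, 3, 45, tzinfo=timezone.utc)
)
```

Passing the current time in as an argument keeps the sketch deterministic. A periodic job would call it with datetime.now(timezone.utc) on each pass.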

3. Data Storage

Environment Model

  • created_at: When environment was created
  • ready_at: When environment became fully ready (NEW)
  • terminated_at: When environment was terminated

Redis Cache

  • Temporary storage of lifecycle events
  • 24-hour retention for event timeline
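
One way the 24-hour retention could work is sketched below. The key layout, the cache_event helper, and the choice of a Redis list are assumptions. Only rpush and expire are real redis-py client methods, and RecordingClient is a stand-in so the example runs without a Redis server.

```python
import json
from datetime import datetime, timezone

EVENT_TTL_SECONDS = 24 * 60 * 60  # 24-hour retention


def timeline_key(environment_id: str) -> str:
    # Hypothetical key layout; the real key names are not documented here.
    return f"blueberry:env:{environment_id}:timeline"


def cache_event(client, environment_id: str, event: str, metadata: dict, now: datetime) -> dict:
    """Append one lifecycle event to the cached timeline and refresh its expiry."""
    entry = {"event": event, "timestamp": now.isoformat(), "metadata": metadata}
    key = timeline_key(environment_id)
    client.rpush(key, json.dumps(entry))   # keep events in arrival order
    client.expire(key, EVENT_TTL_SECONDS)  # list expires 24 hours after the last write
    return entry


class RecordingClient:
    """Stand-in for a redis-py client so the sketch runs without a server."""

    def __init__(self):
        self.lists = {}
        self.ttls = {}

    def rpush(self, key, value):
        self.lists.setdefault(key, []).append(value)

    def expire(self, key, seconds):
        self.ttls[key] = seconds


client = RecordingClient()
cache_event(
    client, "abc123", "provisioning_started", {"branch": "main"},
    now=datetime(2025, 1, 7, 10, 0, 0, tzinfo=timezone.utc),
)
```

With this layout the expiry applies to the whole list and is reset on every write, so an environment's timeline disappears 24 hours after its most recent event rather than event by event.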

Audit Logs

  • Permanent record of all state transitions
  • Includes timing metadata

API Endpoints

Get Environment Timeline

GET /api/environments/{environment_id}/timeline

Returns a chronological list of all lifecycle events:

{
  "environment_id": "abc123",
  "environment_name": "test-env",
  "timeline": [
    {
      "event": "provisioning_started",
      "timestamp": "2025-01-07T10:00:00Z",
      "metadata": {
        "repository": "backend-api",
        "branch": "main"
      }
    },
    {
      "event": "ready",
      "timestamp": "2025-01-07T10:03:45Z",
      "metadata": {
        "provisioning_duration_seconds": 225,
        "readiness_summary": "All checks passed"
      }
    }
  ],
  "summary": {
    "provisioning_duration_seconds": 225,
    "total_lifetime_hours": 24.5
  }
}
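
In the sample response, provisioning_duration_seconds equals the gap between the provisioning_started and ready timestamps. A minimal client-side sketch that recomputes it from the timeline, assuming event names and timestamp format exactly as in the sample (the helper names are hypothetical):

```python
from datetime import datetime
from typing import Optional


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp such as 2025-01-07T10:00:00Z."""
    # Swap the trailing Z for an explicit offset so older Pythons accept it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def provisioning_duration_seconds(timeline: list) -> Optional[float]:
    """Return seconds between provisioning_started and ready, or None."""
    times = {e["event"]: parse_timestamp(e["timestamp"]) for e in timeline}
    if "provisioning_started" not in times or "ready" not in times:
        return None  # environment never started or never became ready
    return (times["ready"] - times["provisioning_started"]).total_seconds()


timeline = [
    {"event": "provisioning_started", "timestamp": "2025-01-07T10:00:00Z"},
    {"event": "ready", "timestamp": "2025-01-07T10:03:45Z"},
]
print(provisioning_duration_seconds(timeline))  # 225.0
```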

Get Performance Metrics

GET /api/insights/performance-detailed

Returns aggregated performance metrics:

{
  "environment_performance": {
    "provisioning": {
      "average_seconds": 180,
      "p50_seconds": 150,
      "p95_seconds": 300,
      "p99_seconds": 450,
      "success_rate": 95.5,
      "sample_size": 100
    }
  }
}
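
To make the fields concrete, the sketch below derives the same set of metrics from a list of provisioning durations. It assumes nearest-rank percentiles and assumes that sample_size counts successful provisions only. The endpoint's actual aggregation method is not specified here, and the function names are hypothetical.

```python
import math


def percentile(sorted_values: list, pct: int) -> float:
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(durations: list, attempts: int) -> dict:
    """Aggregate successful provisioning durations into summary metrics."""
    ordered = sorted(durations)
    return {
        "average_seconds": sum(ordered) / len(ordered),
        "p50_seconds": percentile(ordered, 50),
        "p95_seconds": percentile(ordered, 95),
        "p99_seconds": percentile(ordered, 99),
        # Successful provisions as a percentage of all attempts.
        "success_rate": round(100 * len(ordered) / attempts, 1),
        "sample_size": len(ordered),
    }


# 20 successful provisions (10, 20, ..., 200 seconds) out of 25 attempts.
durations = [float(s) for s in range(10, 210, 10)]
metrics = summarize(durations, attempts=25)
```

For this input the sketch yields an average of 105 seconds, a P50 of 100, a P95 of 190, a P99 of 200, and a success rate of 80.0.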

Performance Targets

Based on the tracked metrics, these are the recommended targets:

Metric                   Target         Alert Threshold
P50 Provisioning Time    < 2 minutes    > 3 minutes
P95 Provisioning Time    < 5 minutes    > 10 minutes
Success Rate             > 98%          < 95%
Cache Hit Rate           > 80%          < 60%
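
The targets and alert thresholds can be applied to the metrics response along these lines. This is a sketch: the three-level ok/warning/alert scheme, the function names, and the cache_hit_rate key are assumptions, since the sample response does not show a cache hit rate field.

```python
def classify(value: float, target: float, alert: float, higher_is_better: bool) -> str:
    """Return "ok", "warning", or "alert" for one metric."""
    if not higher_is_better:
        # Negate everything so that larger always means better below.
        value, target, alert = -value, -target, -alert
    if value > target:
        return "ok"
    if value < alert:
        return "alert"
    return "warning"  # target missed, alert threshold not yet crossed


# (target, alert threshold, higher_is_better); times in seconds, rates in percent.
THRESHOLDS = {
    "p50_seconds": (120, 180, False),
    "p95_seconds": (300, 600, False),
    "success_rate": (98.0, 95.0, True),
    "cache_hit_rate": (80.0, 60.0, True),  # hypothetical key name
}


def evaluate(metrics: dict) -> dict:
    """Classify every metric that has a threshold and is present in metrics."""
    return {
        name: classify(metrics[name], *limits)
        for name, limits in THRESHOLDS.items()
        if name in metrics
    }


sample = {"p50_seconds": 150, "p95_seconds": 300, "success_rate": 95.5}
result = evaluate(sample)
```

With the sample values from the metrics response, all three metrics land in the warning band: each misses its target without crossing its alert threshold.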

Using the Test Script

Test the lifecycle tracking with the provided script:

# Set your API token
export BLUEBERRY_API_TOKEN=<your-token>

# Run the test script
python scripts/test-environment-lifecycle.py

The script will:
1. Display current performance metrics
2. Allow you to check a specific environment's timeline
3. Show recommendations based on metrics
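
For readers who want to query the endpoints directly instead, here is a minimal standard-library sketch. The base URL and the Bearer authorization scheme are assumptions, so check how your deployment expects the token to be sent. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://blueberry.example.com"  # hypothetical host


def build_request(path: str, token: str) -> urllib.request.Request:
    """Build an authenticated GET request for an API path."""
    return urllib.request.Request(
        BASE_URL + path,
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Accept": "application/json",
        },
    )


def fetch_json(path: str, token: str) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(path, token), timeout=10) as response:
        return json.load(response)


# Building a request performs no network I/O, so this line is safe to run anywhere.
request = build_request("/api/insights/performance-detailed", "example-token")
```

Against a real deployment, calling fetch_json with the token from the BLUEBERRY_API_TOKEN environment variable and the path /api/environments/{environment_id}/timeline would return the timeline shown earlier.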

Integration with Monitoring

Prometheus Metrics (Future)

Prometheus export is planned but not yet implemented. The intended metric definitions are:

from prometheus_client import Counter, Histogram

# Metric definitions
environment_provisioning_duration = Histogram(
    'blueberry_environment_provisioning_duration_seconds',
    'Time to provision environment',
    buckets=[30, 60, 120, 300, 600]  # 30s, 1m, 2m, 5m, 10m
)

environment_provisioning_total = Counter(
    'blueberry_environment_provisioning_total',
    'Total number of provisioning attempts',
    ['status']  # success/failed
)

Grafana Dashboards (Future)

Recommended dashboard panels:
- Provisioning time percentiles (line chart)
- Success rate over time (gauge)
- Current provisioning queue (stat)
- Failed provisions by reason (pie chart)

Troubleshooting

No Provisioning Times Showing

If metrics show 0 seconds for provisioning:
- Environments created before this feature was deployed have no ready_at value
- Only environments created afterwards have timing data
- Check that readiness checks are running (check-status endpoint)

Missing Timeline Events

If the timeline is incomplete:
- Check Redis connectivity for event caching, and note that cached events expire after 24 hours
- Verify that audit logs are being written
- Ensure that lifecycle tracking calls succeed

Performance Recommendations

The system provides automatic recommendations:
- Slow provisioning: Check image sizes, ArgoCD sync waves
- Low success rate: Review error logs, check resource quotas
- High resource usage: Consider autoscaling, resource optimization

Best Practices

  1. Monitor Regularly: Check performance metrics weekly
  2. Set Alerts: Configure alerts for degraded performance
  3. Optimize Images: Keep container images small for faster pulls
  4. Parallel Provisioning: Place independent resources in the same ArgoCD sync wave so they sync together
  5. Cache Warming: Pre-pull images on nodes if possible

Future Enhancements

  1. Webhook Integration: Send provisioning events to external systems
  2. SLA Tracking: Define and monitor SLAs per environment type
  3. Cost Attribution: Correlate provisioning time with cost
  4. Predictive Scaling: Pre-scale based on provisioning patterns
  5. Automated Remediation: Auto-retry failed provisions

Document ID: guides/monitoring/provisioning-tracking