End-to-End Provisioning Time Tracking
This document describes the provisioning time tracking system implemented in Blueberry IDP.
Overview
The provisioning tracking system captures detailed timing information throughout an environment's lifecycle:
- Creation Time: When the environment was first requested
- Provisioning Start: When ArgoCD application creation begins
- Ready Time: When all readiness checks pass
- Termination Start: When deletion is initiated
- Termination Complete: When all resources are cleaned up
Implementation Details
1. Lifecycle Event Tracking
Events are tracked using the EnvironmentMonitor service:
from blueberry.services.environment_monitor import EnvironmentMonitor

monitor = EnvironmentMonitor()

# Must be awaited from inside a coroutine (async function).
await monitor.track_environment_lifecycle(
    environment_id,
    "provisioning_started",
    {
        "repository": repo_name,
        "branch": branch,
        "commit_sha": commit[:8],  # short SHA
    }
)
2. Automatic Status Updates
The EnvironmentMonitorService periodically checks provisioning environments:
- Runs readiness checks (ArgoCD sync, health, URL availability)
- Updates status when environment becomes ready
- Records ready_at timestamp
- Tracks provisioning duration
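The polling behaviour described above could look roughly like the following. This is a minimal sketch, not the actual EnvironmentMonitorService code: the `Environment` dataclass, its `provisioning_started_at` field, the `check_ready` callback, the status strings, and the 30-second interval are all assumptions made for illustration. Only `ready_at` and the kinds of readiness checks come from this document.

```python
import asyncio
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Awaitable, Callable, Optional


@dataclass
class Environment:
    # Hypothetical stand-in for the real Environment model.
    id: str
    status: str
    provisioning_started_at: datetime
    ready_at: Optional[datetime] = None
    provisioning_duration_seconds: Optional[float] = None


async def poll_once(
    environments: list[Environment],
    check_ready: Callable[[Environment], Awaitable[bool]],
    now: Optional[datetime] = None,
) -> list[Environment]:
    """Run one polling pass and return the environments that became ready."""
    now = now or datetime.now(timezone.utc)
    newly_ready = []
    for env in environments:
        if env.status != "provisioning":
            continue  # only environments still provisioning are checked
        if await check_ready(env):  # e.g. ArgoCD sync, health, URL availability
            env.status = "ready"
            env.ready_at = now
            env.provisioning_duration_seconds = (
                now - env.provisioning_started_at
            ).total_seconds()
            newly_ready.append(env)
    return newly_ready


async def run_monitor(environments, check_ready, interval_seconds: float = 30.0):
    """Repeat the polling pass on a fixed interval (the interval is an assumed value)."""
    while True:
        await poll_once(environments, check_ready)
        await asyncio.sleep(interval_seconds)
```

Passing `now` explicitly keeps a single pass deterministic, which makes the duration calculation easy to test without a running cluster.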
3. Data Storage
Environment Model
- created_at: When environment was created
- ready_at: When environment became fully ready (NEW)
- terminated_at: When environment was terminated
Redis Cache
- Temporary storage of lifecycle events
- 24-hour retention for event timeline
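One way to get the 24-hour retention is to keep each environment's events in a Redis list and set an expiry on the key. The sketch below assumes a client that exposes redis-py style `rpush` and `expire` methods. The key layout, the `record_event` helper, and the choice to refresh the expiry on every event are hypothetical and may differ from how Blueberry actually stores events.

```python
import json
from datetime import datetime, timezone
from typing import Any, Optional

EVENT_TTL_SECONDS = 24 * 60 * 60  # 24-hour retention for the event timeline


def record_event(
    client: Any,
    environment_id: str,
    event: str,
    metadata: Optional[dict] = None,
    timestamp: Optional[datetime] = None,
) -> str:
    """Append one lifecycle event to a per-environment list and refresh its TTL.

    `client` is anything exposing redis-py style `rpush` and `expire` methods.
    """
    timestamp = timestamp or datetime.now(timezone.utc)
    key = f"env:{environment_id}:timeline"  # hypothetical key name
    payload = json.dumps({
        "event": event,
        "timestamp": timestamp.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "metadata": metadata or {},
    })
    client.rpush(key, payload)             # keep events in arrival order
    client.expire(key, EVENT_TTL_SECONDS)  # timeline expires 24h after the last event
    return key
```

Because the cache is temporary, anything that must outlive the expiry still belongs in the audit logs.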
Audit Logs
- Permanent record of all state transitions
- Includes timing metadata
API Endpoints
Get Environment Timeline
GET /api/environments/{environment_id}/timeline
Returns chronological list of all lifecycle events:
{
  "environment_id": "abc123",
  "environment_name": "test-env",
  "timeline": [
    {
      "event": "provisioning_started",
      "timestamp": "2025-01-07T10:00:00Z",
      "metadata": {
        "repository": "backend-api",
        "branch": "main"
      }
    },
    {
      "event": "ready",
      "timestamp": "2025-01-07T10:03:45Z",
      "metadata": {
        "provisioning_duration_seconds": 225,
        "readiness_summary": "All checks passed"
      }
    }
  ],
  "summary": {
    "provisioning_duration_seconds": 225,
    "total_lifetime_hours": 24.5
  }
}
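In the sample response, `provisioning_duration_seconds` matches the gap between the `provisioning_started` and `ready` events (3 minutes 45 seconds). The sketch below recomputes that value from a timeline on the client side. The helper names are hypothetical, and it assumes the duration is always measured between those two events, which the example suggests but this document does not state outright.

```python
from datetime import datetime
from typing import Optional


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' (UTC) suffix."""
    # Replace 'Z' explicitly so this also works on Python versions before 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def provisioning_duration_seconds(timeline: list[dict]) -> Optional[float]:
    """Return seconds between 'provisioning_started' and 'ready', if both exist."""
    times = {entry["event"]: parse_timestamp(entry["timestamp"]) for entry in timeline}
    if "provisioning_started" not in times or "ready" not in times:
        return None  # not ready yet, or one of the events is missing
    return (times["ready"] - times["provisioning_started"]).total_seconds()


timeline = [
    {"event": "provisioning_started", "timestamp": "2025-01-07T10:00:00Z"},
    {"event": "ready", "timestamp": "2025-01-07T10:03:45Z"},
]
print(provisioning_duration_seconds(timeline))  # 225.0
```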
Get Performance Metrics
GET /api/insights/performance-detailed
Returns aggregated performance metrics:
{
  "environment_performance": {
    "provisioning": {
      "average_seconds": 180,
      "p50_seconds": 150,
      "p95_seconds": 300,
      "p99_seconds": 450,
      "success_rate": 95.5,
      "sample_size": 100
    }
  }
}
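To make the aggregate fields concrete, here is one way such a summary could be derived from individual provisioning durations. It is a sketch under assumptions: it uses the nearest-rank percentile method, skips environments with no recorded duration (for example, those without a `ready_at` value), and counts `sample_size` as successful provisions only. The endpoint's real aggregation rules are not documented here and may differ.

```python
import math
from statistics import mean
from typing import Optional


def percentile(sorted_values: list[float], pct: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(durations: list[Optional[float]], failed: int = 0) -> dict:
    """Aggregate provisioning durations; None means no duration was recorded."""
    ok = sorted(d for d in durations if d is not None)
    if not ok:
        return {"sample_size": 0, "success_rate": 0.0}
    attempts = len(ok) + failed
    return {
        "average_seconds": round(mean(ok), 1),
        "p50_seconds": percentile(ok, 50),
        "p95_seconds": percentile(ok, 95),
        "p99_seconds": percentile(ok, 99),
        "success_rate": round(100 * len(ok) / attempts, 1),
        "sample_size": len(ok),
    }
```

Skipping missing durations, rather than treating them as zero, keeps environments without timing data from dragging the averages down.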
Performance Targets
Based on the tracked metrics, these are the recommended targets:
Metric | Target | Alert Threshold |
---|---|---|
P50 Provisioning Time | < 2 minutes | > 3 minutes |
P95 Provisioning Time | < 5 minutes | > 10 minutes |
Success Rate | > 98% | < 95% |
Cache Hit Rate | > 80% | < 60% |
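The alert thresholds in the table can be expressed as a small set of rules. The sketch below is illustrative only: the metric keys (including `cache_hit_rate`, which the performance endpoint example does not show) and the `breached_alerts` helper are hypothetical. Durations are in seconds and rates in percent.

```python
# Alert thresholds from the table above: (metric key, direction, limit).
ALERT_RULES = [
    ("p50_seconds", "above", 3 * 60),    # alert when P50 > 3 minutes
    ("p95_seconds", "above", 10 * 60),   # alert when P95 > 10 minutes
    ("success_rate", "below", 95.0),     # alert when success rate < 95%
    ("cache_hit_rate", "below", 60.0),   # alert when cache hit rate < 60%
]


def breached_alerts(metrics: dict) -> list[str]:
    """Return the keys of metrics that cross their alert threshold."""
    breached = []
    for key, direction, limit in ALERT_RULES:
        value = metrics.get(key)
        if value is None:
            continue  # metric not reported; nothing to evaluate
        if (direction == "above" and value > limit) or (
            direction == "below" and value < limit
        ):
            breached.append(key)
    return breached


print(breached_alerts({"p50_seconds": 150, "p95_seconds": 300, "success_rate": 95.5}))
# []
print(breached_alerts({"p50_seconds": 170, "p95_seconds": 650, "success_rate": 93.0}))
# ['p95_seconds', 'success_rate']
```

A metric can sit between its target and its alert threshold, as the 95.5% success rate does in the first call: it misses the target without triggering an alert.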
Using the Test Script
Test the lifecycle tracking with the provided script:
# Set your API token
export BLUEBERRY_API_TOKEN=<your-token>
# Run the test script
python scripts/test-environment-lifecycle.py
The script will:
1. Display current performance metrics
2. Allow you to check a specific environment's timeline
3. Show recommendations based on metrics
Integration with Monitoring
Prometheus Metrics (Future)
The system is designed to export metrics to Prometheus:
from prometheus_client import Counter, Histogram

# Metric definitions
environment_provisioning_duration = Histogram(
    'blueberry_environment_provisioning_duration_seconds',
    'Time to provision environment',
    buckets=[30, 60, 120, 300, 600],  # 30s, 1m, 2m, 5m, 10m
)
environment_provisioning_total = Counter(
    'blueberry_environment_provisioning_total',
    'Total number of provisioning attempts',
    ['status'],  # success/failed
)
Grafana Dashboards (Future)
Recommended dashboard panels:
- Provisioning time percentiles (line chart)
- Success rate over time (gauge)
- Current provisioning queue (stat)
- Failed provisions by reason (pie chart)
Troubleshooting
No Provisioning Times Showing
If metrics show 0 seconds for provisioning:
- Historical environments don't have a ready_at field
- Only new environments created after this feature will have timing data
- Check that readiness checks are running (check-status endpoint)
Missing Timeline Events
If the timeline is incomplete:
- Check Redis connectivity; lifecycle events are cached there and expire after 24 hours
- Verify audit logs are being written
- Ensure lifecycle tracking calls are successful
Performance Recommendations
The system provides automatic recommendations:
- Slow provisioning: Check image sizes, ArgoCD sync waves
- Low success rate: Review error logs, check resource quotas
- High resource usage: Consider autoscaling, resource optimization
Best Practices
- Monitor Regularly: Check performance metrics weekly
- Set Alerts: Configure alerts for degraded performance
- Optimize Images: Keep container images small for faster pulls
- Parallel Provisioning: Use ArgoCD sync waves effectively
- Cache Warming: Pre-pull images on nodes if possible
Future Enhancements
- Webhook Integration: Send provisioning events to external systems
- SLA Tracking: Define and monitor SLAs per environment type
- Cost Attribution: Correlate provisioning time with cost
- Predictive Scaling: Pre-scale based on provisioning patterns
- Automated Remediation: Auto-retry failed provisions