Monitoring & Observability
Comprehensive guide to monitoring the Blueberry IDP infrastructure and applications.
Overview
Our monitoring stack provides visibility into:
- Application performance and errors
- Infrastructure health and resource usage
- User activity and business metrics
- Security events and compliance
Monitoring Stack
Components
- Metrics: Google Cloud Monitoring (formerly Stackdriver)
- Logs: Cloud Logging + Blueberry structured logs
- Traces: OpenTelemetry → Cloud Trace
- Alerts: Cloud Monitoring → PagerDuty/Slack
- Dashboards: Cloud Monitoring + Custom dashboards
Key Dashboards
🎯 Blueberry Overview
Main operational dashboard showing:
- API request rate and latency
- Error rates by endpoint
- Active environments count
- Resource utilization
Access: GCP Console
📊 Performance Metrics
Detailed performance analysis:
- P50/P95/P99 latencies
- Database query times
- Cache hit rates
- Slow endpoints
🔍 Environment Health
Environment-specific monitoring:
- Environment creation success rate
- Ready state timing
- Resource usage per environment
- Stuck/failed environments
💰 Cost Tracking
Cloud spending visibility:
- Daily/weekly/monthly costs
- Cost per environment
- Resource type breakdown
- Optimization opportunities
Key Metrics & SLIs
API Availability
- SLI: Successful requests / Total requests
- Target: 99.9% availability
- Alert: < 99.5% over 5 minutes
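As a rough illustration of the availability SLI above, the ratio and the alert condition can be sketched in a few lines of Python. The function names and the choice to treat a window with no traffic as fully available are assumptions made for this example; the actual alert is evaluated by Cloud Monitoring, not by application code.

```python
TARGET = 0.999           # 99.9% availability target
ALERT_THRESHOLD = 0.995  # alert when the 5-minute window drops below 99.5%


def availability(successful: int, total: int) -> float:
    """Return successful / total as a fraction in [0, 1]."""
    if total == 0:
        # Assumption: a window with no requests counts as fully available.
        return 1.0
    return successful / total


def meets_target(successful: int, total: int) -> bool:
    """True when the window satisfies the 99.9% target."""
    return availability(successful, total) >= TARGET


def should_alert(successful: int, total: int) -> bool:
    """True when the window is below the 99.5% alert threshold."""
    return availability(successful, total) < ALERT_THRESHOLD
```

A window can miss the target without alerting: 997 successes out of 1000 is below 99.9% but still above the 99.5% alert threshold.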
API Latency
- SLI: Requests served in under 500ms / Total requests
- Target: 95% of requests
- Alert: P95 > 1s for 5 minutes
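The latency SLI can be read as the share of requests that finish in under 500ms, with the alert keyed to the P95 value. The sketch below uses the nearest-rank percentile definition, which is an assumption; Cloud Monitoring computes percentiles from distribution buckets and may report slightly different values. All function names are illustrative.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    # Multiply before dividing so whole-number ranks stay exact.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def fraction_under(latencies_ms, threshold_ms=500):
    """Share of requests that completed in under threshold_ms."""
    if not latencies_ms:
        return 1.0  # assumption: no traffic counts as meeting the SLI
    return sum(1 for ms in latencies_ms if ms < threshold_ms) / len(latencies_ms)


def latency_alert(latencies_ms, limit_ms=1000):
    """True when P95 over the window exceeds the 1s alert limit."""
    return percentile(latencies_ms, 95) > limit_ms
```

With 95 requests at 100ms and 5 at 2000ms the target is met exactly and no alert fires; one more slow request pushes P95 to 2000ms and trips the alert.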
Environment Creation Success
- SLI: Successful creations / Total attempts
- Target: 99% success rate
- Alert: < 95% over 1 hour
Time to Ready
- SLI: Environments ready within 5 minutes
- Target: 90% of environments
- Alert: P90 > 10 minutes
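To make the time-to-ready SLI concrete, the following sketch derives it from creation and ready timestamps. The ISO 8601 timestamp format and the function names are assumptions for illustration; the source does not say how these timestamps are recorded.

```python
from datetime import datetime

READY_LIMIT_SECONDS = 5 * 60  # "ready within 5 minutes"
TARGET = 0.90                 # 90% of environments


def seconds_to_ready(created_at: str, ready_at: str) -> float:
    """Elapsed seconds between two ISO 8601 timestamps (hypothetical record fields)."""
    created = datetime.fromisoformat(created_at)
    ready = datetime.fromisoformat(ready_at)
    return (ready - created).total_seconds()


def ready_within_limit(durations, limit=READY_LIMIT_SECONDS) -> float:
    """Share of environments that became ready within the limit."""
    if not durations:
        return 1.0  # assumption: no environments counts as meeting the SLI
    return sum(1 for d in durations if d <= limit) / len(durations)


def meets_target(durations) -> bool:
    return ready_within_limit(durations) >= TARGET
```

For example, four environments that took 270, 200, 900 and 100 seconds give a ratio of 0.75, which is below the 90% target.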
Alerts Configuration
Critical Alerts (P1) → PagerDuty
- API Down: Health check failing
- Cluster Unhealthy: Nodes NotReady
- Database Unreachable: Firestore errors
- High Error Rate: >10% errors
Warning Alerts (P2) → Slack
- High Latency: P95 > 2s
- Resource Pressure: >80% CPU/Memory
- Stuck Environments: Not ready after 10min
- Cost Spike: Daily spend >$500
Info Alerts (P3) → Email
- Certificate Expiry: <30 days
- Disk Usage: >70%
- Stale Environments: >7 days old
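The priority-to-channel routing above can be sketched as a small rule table. The thresholds come from the lists above; the metric keys (`error_rate`, `p95_latency_s`, and so on) are hypothetical names chosen for this example, and the real policies live in Cloud Monitoring rather than in code like this.

```python
import operator

CHANNELS = {"P1": "PagerDuty", "P2": "Slack", "P3": "Email"}

# (alert name, priority, metric key, comparison, threshold)
# Metric keys are illustrative; ratios are expressed as fractions.
RULES = [
    ("High Error Rate", "P1", "error_rate", operator.gt, 0.10),
    ("High Latency", "P2", "p95_latency_s", operator.gt, 2.0),
    ("Resource Pressure", "P2", "cpu_utilization", operator.gt, 0.80),
    ("Cost Spike", "P2", "daily_spend_usd", operator.gt, 500),
    ("Certificate Expiry", "P3", "cert_days_left", operator.lt, 30),
    ("Disk Usage", "P3", "disk_utilization", operator.gt, 0.70),
]


def evaluate(metrics):
    """Return (alert name, channel) for every rule whose threshold is crossed."""
    fired = []
    for name, priority, key, compare, threshold in RULES:
        value = metrics.get(key)
        # Metrics that are not reported are skipped rather than treated as zero.
        if value is not None and compare(value, threshold):
            fired.append((name, CHANNELS[priority]))
    return fired
```

Keeping the table as data makes it easy to review thresholds side by side during the weekly alert-threshold review.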
Log Queries
Common Searches
API Errors:
resource.type="k8s_container"
resource.labels.namespace_name="blueberry"
severity>=ERROR
Slow Requests:
resource.type="k8s_container"
jsonPayload.duration_ms>1000
jsonPayload.path!="/health"
Environment Creation Issues:
resource.labels.namespace_name="blueberry"
jsonPayload.event="environment_creation_failed"
Authentication Failures:
jsonPayload.event="auth_failed"
OR jsonPayload.status_code=401
Useful Filters
- By User:
jsonPayload.user_id="..."
- By Environment:
jsonPayload.environment_id="..."
- By Time Range:
timestamp>="2024-01-01"
- By HTTP Method:
jsonPayload.method="POST"
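When these filters are combined programmatically, it helps to quote values consistently and to parenthesize `OR` groups so they do not interact with the surrounding clauses. The helpers below are a minimal sketch with hypothetical names; they only build the filter string and assume JSON-style quoting is acceptable for the values involved.

```python
import json


def comparison(field, value, op="="):
    """Build one comparison; strings are double-quoted, numbers stay bare."""
    return f"{field}{op}{json.dumps(value)}"


def all_of(*clauses):
    """Join clauses with an explicit AND."""
    return " AND ".join(clauses)


def any_of(*clauses):
    """Join clauses with OR, wrapped in parentheses for safe nesting."""
    return "(" + " OR ".join(clauses) + ")"


# Authentication failures for one user, restricted to the blueberry namespace.
auth_failures = all_of(
    comparison("resource.labels.namespace_name", "blueberry"),
    any_of(
        comparison("jsonPayload.event", "auth_failed"),
        comparison("jsonPayload.status_code", 401),
    ),
    comparison("jsonPayload.user_id", "example-user"),  # placeholder user id
)
```

The resulting string can be pasted into the Logs Explorer query box or passed to whichever logging client is in use.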
Custom Metrics
Application Metrics
# Environments created
environments_created_total{status="success|failed"}
# API request duration
http_request_duration_seconds{method, path, status}
# Active environments
active_environments_gauge
# Config overrides
config_overrides_total{environment_id}
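In the listing above, `status="success|failed"` is shorthand for two separate series, one per label value. The sketch below renders samples in the Prometheus text exposition format to show what a scrape of such metrics looks like. It is a dependency-free illustration, not the service's actual exporter; the sample values are made up.

```python
def escape_label_value(value):
    """Escape backslash, double quote and newline as the text format requires."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name, value, labels=None):
    """Render one sample line, e.g. name{label="value"} 42."""
    if labels:
        pairs = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        return f"{name}{{{pairs}}} {value}"
    return f"{name} {value}"


def render_metric(name, metric_type, help_text, samples):
    """Render HELP and TYPE headers followed by each (labels, value) sample."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    lines.extend(render_sample(name, value, labels) for labels, value in samples)
    return "\n".join(lines) + "\n"


# Illustrative values only.
text = render_metric(
    "environments_created_total",
    "counter",
    "Environments created",
    [({"status": "success"}, 42), ({"status": "failed"}, 3)],
)
```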
Business Metrics
# User activity
active_users_daily
api_calls_per_user
# Environment usage
environment_lifetime_hours
environments_per_repository
Monitoring Procedures
Daily Checks
- Review overnight alerts
- Check dashboard for anomalies
- Verify backup completion
- Review cost trends
Weekly Reviews
- Analyze performance trends
- Review error patterns
- Check capacity planning
- Update alert thresholds
Monthly Analysis
- SLI/SLO review
- Cost optimization
- Capacity planning
- Alert effectiveness
Troubleshooting Monitoring
Missing Metrics
# Check metrics agent (upstream and GKE manifests label the pods k8s-app=metrics-server)
kubectl logs -n kube-system -l k8s-app=metrics-server
# Verify metric export
kubectl exec -n blueberry deployment/blueberry-api -- curl localhost:8080/metrics
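Once the `/metrics` output has been captured, a quick way to spot missing metrics is to compare the exported names against the ones this guide expects. The sketch below assumes the output is in Prometheus text format and that histograms appear with the usual `_bucket`, `_sum` and `_count` suffixes; the sample text is invented for the example.

```python
import re

EXPECTED = [
    "environments_created_total",
    "http_request_duration_seconds",
    "active_environments_gauge",
    "config_overrides_total",
]

NAME_PATTERN = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def exported_names(text):
    """Collect metric names from sample lines, ignoring comments and blanks."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = NAME_PATTERN.match(line)
        if match:
            name = match.group(0)
            names.add(name)
            # Histogram series also count toward their base metric name.
            for suffix in ("_bucket", "_sum", "_count"):
                if name.endswith(suffix):
                    names.add(name[: -len(suffix)])
    return names


def missing_metrics(text, expected=EXPECTED):
    """Return the expected metric names that the scrape output does not contain."""
    return sorted(set(expected) - exported_names(text))


SAMPLE = """\
# TYPE environments_created_total counter
environments_created_total{status="success"} 3
http_request_duration_seconds_bucket{method="GET",le="0.5"} 10
"""
```

Here `missing_metrics(SAMPLE)` reports the two metrics absent from the sample, which narrows the search to the code paths that should be emitting them.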
Alert Not Firing
- Check alert policy configuration
- Verify notification channels
- Test by writing a manual metric value that crosses the threshold
- Check PagerDuty integration
Dashboard Loading Issues
- Check time range selection
- Verify metric availability
- Check IAM permissions
- Try incognito mode
Creating New Monitoring
Adding Metrics
- Instrument code with Prometheus client
- Expose on /metrics endpoint
- Configure scraping
- Create dashboard
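The instrumentation step usually amounts to incrementing a labeled counter at the point where an operation succeeds or fails. The sketch below shows that pattern with a small stdlib-only stand-in for a Prometheus client counter, so it runs without dependencies; in practice the client library's own counter type would replace `LabeledCounter`. The `create_environment` function and its behavior are hypothetical.

```python
import threading
from collections import defaultdict
from functools import wraps


class LabeledCounter:
    """Minimal thread-safe counter keyed by label values (stand-in for a client library)."""

    def __init__(self, name):
        self.name = name
        self._values = defaultdict(int)
        self._lock = threading.Lock()

    def inc(self, **labels):
        key = tuple(sorted(labels.items()))
        with self._lock:
            self._values[key] += 1

    def value(self, **labels):
        key = tuple(sorted(labels.items()))
        with self._lock:
            return self._values.get(key, 0)


ENVIRONMENTS_CREATED = LabeledCounter("environments_created_total")


def count_outcome(counter):
    """Decorator that records status="success" or status="failed" per call."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
            except Exception:
                counter.inc(status="failed")
                raise  # the failure is counted, not swallowed
            counter.inc(status="success")
            return result
        return wrapper
    return decorator


@count_outcome(ENVIRONMENTS_CREATED)
def create_environment(repo: str) -> str:
    # Hypothetical placeholder for the real creation logic.
    if not repo:
        raise ValueError("repo is required")
    return f"env-for-{repo}"
```

Counting in a decorator keeps the business logic free of metric calls and ensures failures are recorded even when the exception propagates to the caller.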
Adding Alerts
- Define SLI and threshold
- Create alert policy in console
- Configure notification channel
- Document in runbooks
Dashboard Best Practices
- Group related metrics
- Use consistent time ranges
- Add descriptions
- Include troubleshooting links
- Set up auto-refresh
External Monitoring
Synthetic Monitoring
- Uptime checks from multiple regions
- API endpoint monitoring
- SSL certificate monitoring
Real User Monitoring
- Frontend performance metrics
- JavaScript error tracking
- User journey analytics