Monitoring & Observability

Comprehensive guide to monitoring the Blueberry IDP infrastructure and applications.

Overview

Our monitoring stack provides visibility into:
- Application performance and errors
- Infrastructure health and resource usage
- User activity and business metrics
- Security events and compliance

Monitoring Stack

Components

  • Metrics: Google Cloud Monitoring (formerly Stackdriver)
  • Logs: Cloud Logging + Blueberry structured logs
  • Traces: OpenTelemetry → Cloud Trace
  • Alerts: Cloud Monitoring → PagerDuty/Slack
  • Dashboards: Cloud Monitoring + custom dashboards

Key Dashboards

🎯 Blueberry Overview

Main operational dashboard showing:
- API request rate and latency
- Error rates by endpoint
- Active environments count
- Resource utilization

Access: GCP Console

📊 Performance Metrics

Detailed performance analysis:
- P50/P95/P99 latencies
- Database query times
- Cache hit rates
- Slow endpoints

🔍 Environment Health

Environment-specific monitoring:
- Environment creation success rate
- Ready state timing
- Resource usage per environment
- Stuck/failed environments

💰 Cost Tracking

Cloud spending visibility:
- Daily/weekly/monthly costs
- Cost per environment
- Resource type breakdown
- Optimization opportunities

Key Metrics & SLIs

API Availability

  • SLI: Successful requests / Total requests
  • Target: 99.9% availability
  • Alert: < 99.5% over 5 minutes
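
The SLI and alert threshold above reduce to a small calculation. The sketch below is illustrative only: the function names are hypothetical, and in practice the request counts would come from Cloud Monitoring rather than being passed in by hand.

```python
from typing import Optional

SLO_TARGET = 0.999       # 99.9% availability target from this guide
ALERT_THRESHOLD = 0.995  # alert when availability drops below 99.5%


def availability(successful: int, total: int) -> Optional[float]:
    """Return successful / total, or None when the window had no traffic."""
    if total == 0:
        return None  # no requests: the SLI is undefined, not 0% or 100%
    return successful / total


def meets_target(successful: int, total: int) -> bool:
    """True when the window meets the 99.9% target (empty windows pass)."""
    sli = availability(successful, total)
    return sli is None or sli >= SLO_TARGET


def should_alert(successful: int, total: int) -> bool:
    """True when availability over the 5-minute window is below 99.5%."""
    sli = availability(successful, total)
    return sli is not None and sli < ALERT_THRESHOLD
```

Treating an empty window as "no alert" is a design choice in this sketch; a separate "API Down" health check covers the case where no traffic arrives at all.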

API Latency

  • SLI: Requests served in under 500ms / Total requests
  • Target: 95% of requests
  • Alert: P95 > 1s for 5 minutes
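
To make the latency numbers concrete, here is a minimal sketch that computes a nearest-rank percentile and the share of requests under 500ms from raw samples. This is an assumption-laden illustration: Cloud Monitoring derives percentiles from distribution buckets rather than raw samples, and the function names are hypothetical.

```python
import math
from typing import Sequence

SLI_THRESHOLD_MS = 500   # a request counts as "fast" below this
ALERT_P95_MS = 1000      # alert when P95 exceeds 1s


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the value at 1-based rank ceil(pct% * n)."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def fraction_under(samples: Sequence[float], threshold_ms: float) -> float:
    """Share of samples strictly below the threshold."""
    if not samples:
        raise ValueError("no samples")
    return sum(1 for s in samples if s < threshold_ms) / len(samples)


def latency_alert(samples: Sequence[float]) -> bool:
    """True when the P95 of the window exceeds the 1s alert threshold."""
    return percentile(samples, 95) > ALERT_P95_MS
```

With the samples `[100, 200, 300, 400, 1200]`, four of five requests are under 500ms, so the SLI for that window is 80% and falls short of the 95% target.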

Environment Creation Success

  • SLI: Successful creations / Total attempts
  • Target: 99% success rate
  • Alert: < 95% over 1 hour

Time to Ready

  • SLI: Environments ready within 5 minutes / Total environments created
  • Target: 90% of environments
  • Alert: P90 > 10 minutes

Alerts Configuration

Critical Alerts (P1) → PagerDuty

- API Down: Health check failing
- Cluster Unhealthy: Nodes NotReady
- Database Unreachable: Firestore errors
- High Error Rate: >10% errors

Warning Alerts (P2) → Slack

- High Latency: P95 > 2s
- Resource Pressure: >80% CPU/Memory
- Stuck Environments: Not ready after 10min
- Cost Spike: Daily spend >$500

Info Alerts (P3) → Email

- Certificate Expiry: <30 days
- Disk Usage: >70%
- Stale Environments: >7 days old
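
The severity-to-channel routing above can be sketched as a small rule table. This is only an illustration of the mapping: the metric keys (`error_rate_pct`, `p95_latency_s`, and so on) are hypothetical names, and real alert policies live in Cloud Monitoring, not in application code.

```python
from __future__ import annotations

from dataclasses import dataclass

# Severity -> notification channel, as listed in this guide.
ROUTES = {"P1": "pagerduty", "P2": "slack", "P3": "email"}


@dataclass(frozen=True)
class Rule:
    name: str
    severity: str
    metric: str          # hypothetical key into the observations dict
    threshold: float
    comparison: str = ">"  # fires when value is above (">") or below ("<")


RULES = [
    Rule("High Error Rate", "P1", "error_rate_pct", 10),
    Rule("High Latency", "P2", "p95_latency_s", 2),
    Rule("Resource Pressure", "P2", "cpu_or_memory_pct", 80),
    Rule("Cost Spike", "P2", "daily_spend_usd", 500),
    Rule("Certificate Expiry", "P3", "cert_days_remaining", 30, "<"),
    Rule("Disk Usage", "P3", "disk_usage_pct", 70),
]


def evaluate(observations: dict[str, float]) -> list[tuple[str, str]]:
    """Return (alert name, channel) pairs for every rule that fires."""
    fired = []
    for rule in RULES:
        value = observations.get(rule.metric)
        if value is None:
            continue  # metric not reported; nothing to evaluate
        if rule.comparison == ">":
            breached = value > rule.threshold
        else:
            breached = value < rule.threshold
        if breached:
            fired.append((rule.name, ROUTES[rule.severity]))
    return fired
```

Keeping the thresholds in one table makes the weekly "update alert thresholds" review a single-file change in this sketch.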

Log Queries

Common Searches

API Errors:

resource.type="k8s_container"
resource.labels.namespace_name="blueberry"
severity>=ERROR

Slow Requests:

resource.type="k8s_container"
jsonPayload.duration_ms>1000
jsonPayload.path!="/health"

Environment Creation Issues:

resource.labels.namespace_name="blueberry"
jsonPayload.event="environment_creation_failed"

Authentication Failures:

jsonPayload.event="auth_failed" OR jsonPayload.status_code=401
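
The queries above rely on the application writing one JSON object per log line, which Cloud Logging on GKE parses into `jsonPayload`. A minimal sketch of such a logger using only the standard library follows; the `fields` convention for passing extra keys is an assumption of this example, not a documented Blueberry interface.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            # Python level names (INFO, WARNING, ERROR, ...) are also
            # valid Cloud Logging severity values.
            "severity": record.levelname,
            "message": record.getMessage(),
        }
        # Keys passed via extra={"fields": {...}} become top-level keys,
        # so they are queryable as jsonPayload.<key>.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str = "blueberry", stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout (or `stream`)."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(sys.stdout if stream is None else stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate plain-text lines
    return logger
```

A call such as `build_logger().error("creation failed", extra={"fields": {"event": "environment_creation_failed", "environment_id": "..."}})` would then match the "Environment Creation Issues" query.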

Useful Filters

  • By User: jsonPayload.user_id="..."
  • By Environment: jsonPayload.environment_id="..."
  • By Time Range: timestamp>="2024-01-01"
  • By HTTP Method: jsonPayload.method="POST"
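
These filters compose with the common searches. As a rough sketch, the helpers below assemble a filter string from individual clauses; the function names are hypothetical, and JSON string quoting is used as an approximation of the query language's escaping rules.

```python
import json

ALLOWED_OPS = {"=", "!=", ">", ">=", "<", "<="}


def field_clause(field: str, value, op: str = "=") -> str:
    """Render one comparison, quoting strings and leaving numbers bare."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {op}")
    rendered = json.dumps(value) if isinstance(value, str) else str(value)
    return f"{field}{op}{rendered}"


def any_of(*clauses: str) -> str:
    """Group alternatives in parentheses so they combine safely with AND."""
    return "(" + " OR ".join(clauses) + ")"


def build_filter(*clauses: str) -> str:
    """Join clauses with an explicit AND."""
    return " AND ".join(clauses)


# Example: slow POST requests for one user (the user id is a placeholder).
slow_posts = build_filter(
    field_clause("resource.type", "k8s_container"),
    field_clause("jsonPayload.duration_ms", 1000, ">"),
    field_clause("jsonPayload.method", "POST"),
    field_clause("jsonPayload.user_id", "..."),
)
```

The resulting string can be pasted into the Logs Explorer or passed to whatever client you use to read logs.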

Custom Metrics

Application Metrics

# Environments created (the status label is either "success" or "failed")
environments_created_total{status="success|failed"}

# API request duration
http_request_duration_seconds{method, path, status}

# Active environments
active_environments_gauge

# Config overrides
config_overrides_total{environment_id}

Business Metrics

# User activity
active_users_daily
api_calls_per_user

# Environment usage
environment_lifetime_hours
environments_per_repository
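
As an illustration of how the environment usage metrics might be derived, the sketch below computes them from a list of environment records. The `Environment` record shape and its field names are assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional

STALE_AFTER_HOURS = 7 * 24  # "Stale Environments: >7 days old"


@dataclass
class Environment:
    environment_id: str
    repository: str
    created_at: datetime
    deleted_at: Optional[datetime] = None  # None while still active


def lifetime_hours(env: Environment, now: datetime) -> float:
    """Hours from creation to deletion, or to `now` if still active."""
    end = now if env.deleted_at is None else env.deleted_at
    return (end - env.created_at).total_seconds() / 3600


def environments_per_repository(envs: Iterable[Environment]) -> Counter:
    """Count environments grouped by repository."""
    return Counter(env.repository for env in envs)


def is_stale(env: Environment, now: datetime) -> bool:
    """True for an active environment older than seven days."""
    return env.deleted_at is None and lifetime_hours(env, now) > STALE_AFTER_HOURS
```

Passing `now` explicitly, for example `datetime.now(timezone.utc)`, keeps the functions deterministic and easy to test.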

Monitoring Procedures

Daily Checks

  1. Review overnight alerts
  2. Check dashboard for anomalies
  3. Verify backup completion
  4. Review cost trends

Weekly Reviews

  1. Analyze performance trends
  2. Review error patterns
  3. Check capacity headroom against current usage
  4. Update alert thresholds

Monthly Analysis

  1. SLI/SLO review
  2. Cost optimization
  3. Capacity planning
  4. Alert effectiveness

Troubleshooting Monitoring

Missing Metrics

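# Check metrics agent (upstream manifests and GKE label it k8s-app=metrics-server;
# adjust the selector if your installation uses different labels)
kubectl logs -n kube-system -l k8s-app=metrics-server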

# Verify metric export
kubectl exec -n blueberry deployment/blueberry-api -- curl localhost:8080/metrics
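
Once the `/metrics` output is in hand, a short script can confirm which expected metrics are actually exported. The sketch below parses the Prometheus text exposition format by hand; it assumes `http_request_duration_seconds` may be a histogram, in which case it appears only with `_bucket`, `_sum`, and `_count` suffixes.

```python
from __future__ import annotations

# Metric names taken from the "Application Metrics" list in this guide.
EXPECTED = {
    "environments_created_total",
    "http_request_duration_seconds",
    "active_environments_gauge",
    "config_overrides_total",
}

HISTOGRAM_SUFFIXES = ("_bucket", "_sum", "_count")


def metric_names(exposition: str) -> set[str]:
    """Return the sample names found in text exposition output."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The name ends at the first "{" (labels) or space (value).
        end = len(line)
        for sep in ("{", " "):
            idx = line.find(sep)
            if idx != -1:
                end = min(end, idx)
        names.add(line[:end])
    return names


def missing_metrics(exposition: str, expected: set[str] = EXPECTED) -> set[str]:
    """Return expected metrics with no sample, allowing histogram suffixes."""
    present = metric_names(exposition)
    missing = set()
    for name in expected:
        variants = {name, *(name + suffix for suffix in HISTOGRAM_SUFFIXES)}
        if not variants & present:
            missing.add(name)
    return missing
```

One way to use it is to save the `kubectl exec ... curl` output to a file and pass its contents to `missing_metrics`; an empty set means every expected metric is being exported.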

Alert Not Firing

  1. Check alert policy configuration
  2. Verify notification channels
  3. Test by writing a manual metric value that crosses the threshold
  4. Check PagerDuty integration

Dashboard Loading Issues

  1. Check time range selection
  2. Verify metric availability
  3. Check IAM permissions
  4. Try an incognito window to rule out cached sessions or browser extensions

Creating New Monitoring

Adding Metrics

  1. Instrument code with Prometheus client
  2. Expose on /metrics endpoint
  3. Configure scraping
  4. Create dashboard
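
To show what steps 1 and 2 amount to, here is a stdlib-only sketch of what a client library does: keep counters in memory and serve them in the Prometheus text exposition format. It is not the official Prometheus client and omits label escaping, HELP/TYPE lines, gauges, and histograms; in a real service the official client library for your language would replace it. The class and function names are hypothetical.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class CounterRegistry:
    """Thread-safe in-memory counters keyed by metric name and labels."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._values = {}

    def inc(self, name: str, amount: float = 1, **labels: str) -> None:
        """Add `amount` to the counter identified by name and labels."""
        key = (name, tuple(sorted(labels.items())))
        with self._lock:
            self._values[key] = self._values.get(key, 0) + amount

    def render(self) -> str:
        """Render all counters, one sample per line, sorted by name."""
        with self._lock:
            items = sorted(self._values.items())
        lines = []
        for (name, labels), value in items:
            label_text = ",".join(f'{k}="{v}"' for k, v in labels)
            suffix = f"{{{label_text}}}" if label_text else ""
            lines.append(f"{name}{suffix} {value}")
        return "\n".join(lines) + "\n"


REGISTRY = CounterRegistry()


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = REGISTRY.render().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Blocking call; run it in a background thread beside the app."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Application code would call something like `REGISTRY.inc("environments_created_total", status="success")` at the point where an environment is created, and start `serve` in a daemon thread at startup.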

Adding Alerts

  1. Define SLI and threshold
  2. Create alert policy in console
  3. Configure notification channel
  4. Document in runbooks

Dashboard Best Practices

  • Group related metrics
  • Use consistent time ranges
  • Add descriptions
  • Include troubleshooting links
  • Set up auto-refresh

External Monitoring

Synthetic Monitoring

  • Uptime checks from multiple regions
  • API endpoint monitoring
  • SSL certificate monitoring
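
SSL certificate monitoring can be approximated with the standard library alone. The sketch below reads a certificate's expiry date and compares it with the 30-day "Certificate Expiry" threshold from the alerts section; the function names are hypothetical, and a managed uptime check would normally do this for you.

```python
import socket
import ssl
import time
from typing import Optional

WARN_DAYS = 30  # matches the P3 "Certificate Expiry" threshold


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the peer certificate's notAfter."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days between `now` (epoch seconds) and the notAfter timestamp."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def needs_renewal(not_after: str, now: Optional[float] = None) -> bool:
    """True when the certificate expires in fewer than 30 days."""
    return days_until_expiry(not_after, now) < WARN_DAYS
```

For a live check, `needs_renewal(fetch_not_after(host))` with your API hostname would report whether the certificate is inside the warning window.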

Real User Monitoring

  • Frontend performance metrics
  • JavaScript error tracking
  • User journey analytics