Cost Tracking Verification Guide

This guide provides operational procedures to verify that cost tracking is working correctly and storing accurate data for the Blueberry IDP.

Overview

The cost tracking system automatically calculates and stores cost data at multiple points in the environment lifecycle:
- Creation: Initial cost estimation based on resource configuration
- Updates: Cost recalculation when the environment configuration changes
- Termination: Final cost calculation before cleanup
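At each of these points, the stored total is, in essence, the hourly rate multiplied by elapsed runtime. A minimal sketch of that arithmetic (the function below is illustrative; the actual logic lives in blueberry.services.cost_tracker):

```python
from datetime import datetime, timezone
from typing import Optional

def estimate_total_cost(cost_per_hour: float, created_at: datetime,
                        now: Optional[datetime] = None) -> float:
    """Accrued cost = hourly rate x elapsed runtime, rounded to cents."""
    now = now or datetime.now(timezone.utc)
    hours = (now - created_at).total_seconds() / 3600
    return round(cost_per_hour * hours, 2)

created = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)
later = created.replace(day=16)  # 24 hours later
print(estimate_total_cost(0.52, created, later))  # 0.52/h over 24h -> 12.48
```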

Data Storage Architecture

1. Environment Model Fields

Cost data is stored in each environment document:

{
  "id": "env-12345",
  "name": "pr-123",
  "total_cost": 12.45,
  "resource_usage": {
    "cost_breakdown": {
      "gke_autopilot": {
        "cpu": {"cores": 0.5, "cost": 5.34},
        "memory": {"gb": 1.0, "cost": 1.18},
        "storage": {"gb": 15.0, "cost": 0.23},
        "total": 6.75
      },
      "services": {
        "breakdown": {
          "firestore": 2.10,
          "artifact_registry": 0.50,
          "gcs": 0.30,
          "load_balancer": 2.50,
          "secret_manager": 0.30
        },
        "total": 5.70
      },
      "networking": 1.01
    },
    "cost_per_hour": 0.52,
    "estimated_at": "2024-01-15T10:30:00Z"
  }
}
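The sub-totals in a stored document can be sanity-checked mechanically: each group's component costs should sum to its stated total. A quick sketch (the helper name is illustrative, not part of the codebase), using the sample breakdown above:

```python
def breakdown_is_consistent(cb: dict, tol: float = 0.01) -> bool:
    """Check that component costs sum to their stated sub-totals."""
    gke = cb["gke_autopilot"]
    gke_sum = gke["cpu"]["cost"] + gke["memory"]["cost"] + gke["storage"]["cost"]
    svc = cb["services"]
    svc_sum = sum(svc["breakdown"].values())
    return abs(gke_sum - gke["total"]) <= tol and abs(svc_sum - svc["total"]) <= tol

# The cost_breakdown from the sample document above
sample = {
    "gke_autopilot": {
        "cpu": {"cores": 0.5, "cost": 5.34},
        "memory": {"gb": 1.0, "cost": 1.18},
        "storage": {"gb": 15.0, "cost": 0.23},
        "total": 6.75,
    },
    "services": {
        "breakdown": {"firestore": 2.10, "artifact_registry": 0.50, "gcs": 0.30,
                      "load_balancer": 2.50, "secret_manager": 0.30},
        "total": 5.70,
    },
    "networking": 1.01,
}
print(breakdown_is_consistent(sample))  # True
```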

2. Telemetry Metrics

Cloud Monitoring metrics track:
- custom.googleapis.com/environment/estimated_cost_usd
- Tags: cpu_cores, memory_gb

3. Structured Logs

Cost events are logged with metadata:
- Updated environment cost
- Updated PR environment cost
- Updated final cost for environment before termination
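These events can be extracted programmatically from exported JSON logs. A small sketch that mirrors the jq filter used in the monitoring commands below (it assumes the message lives under a "msg" key, matching the .msg field those commands filter on):

```python
import json

COST_EVENTS = ("Updated environment cost",
               "Updated PR environment cost",
               "Updated final cost")

def cost_events(log_lines):
    """Yield parsed JSON log records whose message mentions a cost update."""
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        msg = record.get("msg", "")
        if any(evt in msg for evt in COST_EVENTS):
            yield record

lines = [
    '{"msg": "Updated environment cost", "env_id": "env-12345", "total_cost": 12.45}',
    '{"msg": "Healthcheck OK"}',
    'not json',
]
print([r["env_id"] for r in cost_events(lines)])  # ['env-12345']
```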

Verification Procedures

1. Real-time Log Monitoring

Monitor cost calculation events in real-time:

# Watch application logs for cost updates
kubectl logs -f deployment/blueberry -n blueberry | grep -E "(cost|Updated environment cost|total_cost)"

# Check cleanup job logs
kubectl logs -l app.kubernetes.io/component=cleanup -n blueberry --tail=100

# Filter for cost-specific events
kubectl logs deployment/blueberry -n blueberry --since=1h | jq 'select(.msg | contains("cost"))'

2. API Verification

Use the cost tracking APIs to verify data:

# Get overall cost summary
curl -H "Authorization: Bearer $TOKEN" \
  https://blueberry.florenciacomuzzi.com/api/observability/costs/summary?days=30

# Check specific environment cost
ENV_ID="pr-123"
curl -H "Authorization: Bearer $TOKEN" \
  https://blueberry.florenciacomuzzi.com/api/observability/costs/environment/$ENV_ID

# Force cost recalculation
curl -X POST -H "Authorization: Bearer $TOKEN" \
  https://blueberry.florenciacomuzzi.com/api/observability/costs/environment/$ENV_ID/update
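The same endpoints can be driven from Python, e.g. with requests. The helper below only assembles the (method, URL, headers) tuple for the three calls shown above; the helper name itself is hypothetical:

```python
from typing import Dict, Optional, Tuple

BASE = "https://blueberry.florenciacomuzzi.com/api/observability/costs"

def cost_request(token: str, env_id: Optional[str] = None,
                 days: int = 30, update: bool = False) -> Tuple[str, str, Dict[str, str]]:
    """Build (method, url, headers) for the cost endpoints above."""
    headers = {"Authorization": f"Bearer {token}"}
    if env_id is None:
        return "GET", f"{BASE}/summary?days={days}", headers
    if update:
        return "POST", f"{BASE}/environment/{env_id}/update", headers
    return "GET", f"{BASE}/environment/{env_id}", headers

method, url, headers = cost_request("my-token", "pr-123", update=True)
print(method, url)
```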

3. Database Verification

Via Firebase Console

  1. Navigate to Firebase Console → Firestore
  2. Select the environments collection
  3. For each environment document, verify:
     • total_cost field exists and is > 0
     • resource_usage contains a complete breakdown
     • resource_usage.estimated_at is recent

Via gcloud CLI

# Query environments missing cost data
gcloud firestore documents list \
  --collection-path=environments \
  --filter="total_cost=null" \
  --project=development-454916

4. Telemetry Verification

Query Cloud Monitoring for cost metrics:

# Get recent cost metrics
gcloud monitoring time-series list \
  --filter='metric.type="custom.googleapis.com/environment/estimated_cost_usd"' \
  --interval-end=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --interval-start=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ) \
  --project=development-454916

# Get average cost per environment
gcloud monitoring time-series list \
  --filter='metric.type="custom.googleapis.com/environment/estimated_cost_usd"' \
  --interval-end=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --interval-start=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ) \
  --aggregation='{"alignmentPeriod":"3600s","perSeriesAligner":"ALIGN_MEAN"}' \
  --project=development-454916

5. Automated Validation Script

Create and run this validation script:

#!/usr/bin/env python3
# scripts/validate-cost-tracking.py

import asyncio
import sys
from datetime import datetime, timezone, timedelta
sys.path.append('.')

from blueberry.stores.environment import EnvironmentStore
from blueberry.services.cost_tracker import cost_tracker

async def validate_cost_tracking():
    """Validate cost tracking data integrity."""
    store = EnvironmentStore()

    print("šŸ” Cost Tracking Validation Report")
    print("=" * 50)

    # Get all active environments
    active_envs = await store.list_active_environments()
    print(f"\nāœ“ Found {len(active_envs)} active environments")

    # Check for missing cost data
    missing_costs = []
    zero_costs = []
    stale_costs = []

    for env in active_envs:
        # Check if cost data exists
        if env.total_cost is None:
            missing_costs.append(env)
        elif env.total_cost == 0:
            zero_costs.append(env)

        # Check if cost data is stale (>24 hours old)
        if env.resource_usage and 'estimated_at' in env.resource_usage:
            estimated_at = datetime.fromisoformat(env.resource_usage['estimated_at'].replace('Z', '+00:00'))
            age = datetime.now(timezone.utc) - estimated_at
            if age > timedelta(hours=24):
                stale_costs.append((env, age))

    # Report findings
    print(f"\nšŸ“Š Cost Data Status:")
    print(f"  - Missing costs: {len(missing_costs)} environments")
    print(f"  - Zero costs: {len(zero_costs)} environments")
    print(f"  - Stale costs (>24h): {len(stale_costs)} environments")

    # List problematic environments
    if missing_costs:
        print(f"\nāŒ Environments missing cost data:")
        for env in missing_costs[:5]:  # Show first 5
            print(f"  - {env.id} ({env.name})")

    if stale_costs:
        print(f"\nāš ļø  Environments with stale cost data:")
        for env, age in stale_costs[:5]:  # Show first 5
            print(f"  - {env.id}: {age.days}d {age.seconds//3600}h old")

    # Verify cost calculation
    if active_envs:
        print(f"\n🧮 Testing cost calculation on {active_envs[0].id}...")
        try:
            cost_data = cost_tracker.estimate_environment_cost(active_envs[0])
            print(f"  āœ“ Cost calculation successful: ${cost_data['total_estimated_cost']:.2f}")
        except Exception as e:
            print(f"  āŒ Cost calculation failed: {e}")

    # Summary
    total_cost = sum(env.total_cost or 0 for env in active_envs)
    avg_cost = total_cost / len(active_envs) if active_envs else 0

    print(f"\nšŸ’° Cost Summary:")
    print(f"  - Total estimated cost: ${total_cost:.2f}")
    print(f"  - Average per environment: ${avg_cost:.2f}")

    # Return status
    issues = len(missing_costs) + len(zero_costs) + len(stale_costs)
    if issues == 0:
        print(f"\nāœ… All cost tracking data is valid!")
        return 0
    else:
        print(f"\nāš ļø  Found {issues} issues that need attention")
        return 1

if __name__ == "__main__":
    exit_code = asyncio.run(validate_cost_tracking())
    sys.exit(exit_code)

Run the validation:

python scripts/validate-cost-tracking.py

6. CronJob Monitoring

Verify the cleanup job is running and updating costs:

# Check CronJob status
kubectl get cronjobs -n blueberry

# View last execution
kubectl get jobs -n blueberry | grep cleanup

# Check for successful cost updates in cleanup logs
kubectl logs -l app.kubernetes.io/component=cleanup -n blueberry | grep "Updated final cost"

7. Dashboard Health Check

  1. Access the cost dashboard: https://blueberry.florenciacomuzzi.com/api/observability/costs/dashboard
  2. Verify:
     • Total cost is non-zero
     • Active environments count matches reality
     • Cost trends show data for recent days
     • Service breakdown pie chart has data
     • Environment list shows cost values
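These checks can also be scripted against the dashboard's JSON payload. The field names below (total_cost, active_environments, environments) are assumptions about the response shape, not a confirmed schema:

```python
def dashboard_issues(payload: dict) -> list:
    """Return a list of failed dashboard checks (field names are assumed)."""
    problems = []
    if payload.get("total_cost", 0) <= 0:
        problems.append("total cost is zero")
    if payload.get("active_environments", 0) < 1:
        problems.append("no active environments reported")
    for env in payload.get("environments", []):
        if not env.get("total_cost"):
            problems.append(f"environment {env.get('id', '?')} missing cost value")
    return problems

payload = {"total_cost": 42.0, "active_environments": 3,
           "environments": [{"id": "pr-123", "total_cost": 12.45}]}
print(dashboard_issues(payload))  # []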

8. Firestore Index Performance

Check if cost queries are using indexes efficiently:

# Deploy indexes if not already done
firebase deploy --only firestore:indexes --project blueberry-e6167

# Monitor slow queries in logs
gcloud logging read 'resource.type="datastore_index" severity>=WARNING' \
  --project=development-454916 \
  --limit=50

Troubleshooting

Missing Cost Data

If environments are missing cost data:

  1. Check if the environment was created before cost tracking was implemented
  2. Manually trigger cost calculation:

     curl -X POST -H "Authorization: Bearer $TOKEN" \
       https://blueberry.florenciacomuzzi.com/api/observability/costs/environment/{env_id}/update

  3. Check logs for calculation errors

Zero Cost Values

Zero costs typically indicate:
- Environment just created (race condition)
- Calculation error (check logs)
- Missing resource configuration

Stale Cost Data

For environments with outdated cost estimates:
1. The cleanup CronJob should update costs every 6 hours
2. Manually update if needed via API
3. Check if CronJob is running properly

Performance Issues

If cost queries are slow:
1. Ensure Firestore indexes are deployed
2. Check for missing indexes in logs
3. Consider adding pagination to queries

Monitoring Alerts

Consider setting up alerts for:

  1. Missing Cost Data
     alert: EnvironmentMissingCostData
     expr: count(environments{total_cost="null"}) > 5
     for: 30m

  2. Stale Cost Data
     alert: EnvironmentStaleCostData
     expr: time() - environment_cost_updated_timestamp > 86400
     for: 1h

  3. Cleanup Job Failures
     alert: CleanupJobFailed
     expr: kube_job_status_failed{job_name=~".*cleanup.*"} > 0
     for: 10m

Regular Maintenance

Daily

  • Check dashboard for anomalies
  • Review any cost-related alerts

Weekly

  • Run validation script
  • Review cost trends for outliers
  • Check for environments with unusually high costs

Monthly

  • Analyze cost optimization recommendations
  • Review and update pricing model if needed
  • Audit long-running environments

Useful Queries

Find Most Expensive Environments

SELECT id, name, total_cost, created_at
FROM environments
WHERE status IN ('READY', 'PROVISIONING')
ORDER BY total_cost DESC
LIMIT 10

Calculate Daily Spend Rate

SELECT
  DATE(created_at) as day,
  SUM(total_cost) as daily_cost,
  COUNT(*) as env_count
FROM environments
WHERE created_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY day
ORDER BY day DESC
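
If environment documents are exported as a list of dicts, the same daily rollup can be computed in Python (field names mirror the query above):

```python
from collections import defaultdict
from datetime import datetime

def daily_spend(envs):
    """Group environment costs by creation day, most recent day first."""
    days = defaultdict(lambda: {"daily_cost": 0.0, "env_count": 0})
    for env in envs:
        day = datetime.fromisoformat(env["created_at"]).date().isoformat()
        days[day]["daily_cost"] += env.get("total_cost") or 0.0
        days[day]["env_count"] += 1
    return dict(sorted(days.items(), reverse=True))

envs = [
    {"created_at": "2024-01-15T10:30:00", "total_cost": 12.5},
    {"created_at": "2024-01-15T14:00:00", "total_cost": 3.5},
    {"created_at": "2024-01-14T09:00:00", "total_cost": 5.0},
]
print(daily_spend(envs))
```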

Identify Cost Optimization Candidates

SELECT id, name, total_cost, ttl_hours
FROM environments
WHERE status = 'READY'
  AND ttl_hours > 72
  AND total_cost > 50
ORDER BY total_cost DESC
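
The same filter is easy to apply client-side once environments are loaded into memory, a sketch using the thresholds from the query above (72h TTL, $50 cost):

```python
def optimization_candidates(envs, max_ttl_hours=72, cost_threshold=50.0):
    """Return READY environments exceeding both thresholds, costliest first."""
    hits = [e for e in envs
            if e.get("status") == "READY"
            and e.get("ttl_hours", 0) > max_ttl_hours
            and (e.get("total_cost") or 0) > cost_threshold]
    return sorted(hits, key=lambda e: e["total_cost"], reverse=True)

envs = [
    {"id": "env-1", "name": "pr-101", "status": "READY", "ttl_hours": 168, "total_cost": 80.0},
    {"id": "env-2", "name": "pr-102", "status": "READY", "ttl_hours": 24, "total_cost": 90.0},
    {"id": "env-3", "name": "pr-103", "status": "TERMINATED", "ttl_hours": 168, "total_cost": 75.0},
]
print([e["id"] for e in optimization_candidates(envs)])  # ['env-1']
```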

References

Document ID: guides/monitoring/cost-tracking-verification