BizFirst Observe
Cost Optimization
The three most impactful cost levers for BizFirst Observe are: log volume reduction (fewer log lines per execution), metric cardinality control (fewer unique label combinations), and trace sampling rate (fewer spans stored). Applying all three can reduce observability storage costs by 60-80% without sacrificing meaningful coverage.
Log Volume Reduction
# 1. Set minimum log level to Information (not Debug) in production:
# appsettings.Production.json
{
  "OpenTelemetry": {
    "Logging": {
      "MinimumLevel": "Information"  // Never emit Debug logs to OTel in production
    }
  }
}
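# 2. Drop or sample high-volume, low-value log lines in the OTel Collector:
# otel-collector-config.yaml
processors:
  filter/drop-healthchecks:
    logs:
      exclude:
        match_type: regexp
        bodies:
          - ".*health.*check.*"  # Drop health check log lines
  # Keep 1% of the log records that pass through this processor.
  # It samples every record in its pipeline regardless of severity, so attach it
  # only to a pipeline that carries the DEBUG-equivalent logs:
  probabilistic_sampler:
    sampling_percentage: 1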
# 3. Use Loki's structured metadata to reduce per-line content:
# Instead of: "Processing execution exec-abc123 for tenant tenant-xyz in environment production"
# Use a short message plus structured fields: keep bounded values such as environment as labels,
# and attach high-cardinality values such as executionId as structured metadata, not as labels.
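To see what the health-check pattern above would actually drop, the following Python sketch applies the same regular expression to a few sample log bodies. The sample lines are invented for illustration, and Python's `re` module stands in for the Collector's regexp engine, which is assumed to behave the same way for a pattern this simple.

```python
import re

# Same pattern as in the filter processor example; case-sensitive by default.
HEALTH_CHECK = re.compile(r".*health.*check.*")

# Hypothetical log bodies, made up for this illustration.
samples = [
    "GET /health check passed",
    "healthcheck ok in 3ms",
    "Health Check passed",                        # capitalized: not matched
    "unhealthy node detected, check disk usage",  # matched, probably unintentionally
    "Processing execution exec-abc123",
]


def would_drop(body: str) -> bool:
    """Return True if the exclude rule would drop this log body."""
    return HEALTH_CHECK.search(body) is not None


dropped = [s for s in samples if would_drop(s)]
kept = [s for s in samples if not would_drop(s)]

for line in dropped:
    print("DROP:", line)
for line in kept:
    print("KEEP:", line)
```

The pattern is broad: any line containing "health" followed later by "check" is dropped, including the "unhealthy node" line. It is worth running a check like this against a sample of real log lines before deploying the filter.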
Metric Cardinality Control
# High cardinality is the primary cost driver for Prometheus storage.
# Rule: Never add a label with unbounded values (user ID, URL, email).
# BAD — executionId has millions of unique values:
bizfirst_workflow_executions_total{executionId="exec-abc123", status="success"}
# GOOD — use bounded labels only:
bizfirst_workflow_executions_total{workflow_type="approval", status="success", tenant_id="t123"}
# Audit your metric cardinality regularly:
# In Prometheus: http://localhost:9090/tsdb-status
# Shows the metric names and label-value pairs with the highest series counts
# Drop high-cardinality metrics at the OTel Collector before they reach Prometheus:
processors:
  filter/drop-high-cardinality:
    metrics:
      exclude:
        match_type: strict
        metric_names:
          - "http_server_requests_seconds"  # Drop if URL is a label
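The effect of an unbounded label can be estimated with simple multiplication: the worst-case number of series for a metric is the product of the distinct values of each label. The sketch below compares the bounded and unbounded label sets from the examples above. The per-label value counts are hypothetical, chosen only to show the order of magnitude.

```python
from math import prod


def series_count(label_values: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical distinct-value counts, for illustration only.
bounded = {"workflow_type": 10, "status": 3, "tenant_id": 200}
unbounded = {"executionId": 1_000_000, "status": 3}

bounded_series = series_count(bounded)      # 10 * 3 * 200
unbounded_series = series_count(unbounded)  # 1,000,000 * 3

print(f"bounded labels:   up to {bounded_series:,} series")
print(f"with executionId: up to {unbounded_series:,} series")
```

This is an upper bound, since not every label combination occurs in practice, but it shows why a single unbounded label dominates the series count.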
Trace Sampling Optimization
# Optimal tail sampling configuration for production:
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000  # In-memory buffer size
    policies:
      # Keep 100% of error traces (most valuable)
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Keep 100% of slow traces (> 10s)
      - name: slow
        type: latency
        latency: {threshold_ms: 10000}
      # Keep 5% of everything else (random sample)
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
# At 5% sampling with 1000 executions/hour (assuming about 20 spans per execution):
# Without sampling: 1000 × 20 spans = 20,000 spans/hour = 480,000 spans/day
# With 5% sampling: 20,000 × 0.05 = 1,000 spans/hour = 24,000 spans/day
# Storage reduction: up to 20x (somewhat less once the always-kept traces are counted)
# (Error and slow traces are always kept, so important events are captured at 100%)
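The arithmetic above can be reproduced, and extended to account for the always-kept error and slow traces, with a short calculation. This is a sketch under stated assumptions: the function name and the 2% always-kept fraction are hypothetical, and error and slow traces are assumed to have the same average span count as other traces.

```python
def spans_per_day(executions_per_hour, spans_per_execution,
                  baseline_rate, always_kept_fraction=0.0):
    """Return (unsampled, sampled) span counts per day.

    always_kept_fraction is the share of traces matching the error or
    latency policies; those are kept in full, the rest at baseline_rate.
    """
    total = executions_per_hour * spans_per_execution * 24
    kept_fraction = always_kept_fraction + (1 - always_kept_fraction) * baseline_rate
    return total, total * kept_fraction


# Figures from the text: 1000 executions/hour, 20 spans each, 5% baseline.
total, kept = spans_per_day(1000, 20, 0.05)
print(f"no error/slow traces: {kept:,.0f} of {total:,} spans/day "
      f"({total / kept:.1f}x reduction)")

# Hypothetical: 2% of traces are errors or slow and are always kept.
total, kept = spans_per_day(1000, 20, 0.05, always_kept_fraction=0.02)
print(f"2% always kept:       {kept:,.0f} of {total:,} spans/day "
      f"({total / kept:.1f}x reduction)")
```

With no error or slow traces the result matches the 20x figure. Any share of always-kept traces lowers the reduction, so the error and slow-trace rates are worth measuring before estimating storage.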
Cost Summary — Before vs. After Optimization
| Signal | Before Optimization | After Optimization | Savings |
|---|---|---|---|
| Logs (30 days) | 500 GB/month | 100 GB/month (Info-only + drop health checks) | 80% |
| Metrics (90 days) | 50 GB (high cardinality) | 10 GB (bounded labels) | 80% |
| Traces (7 days) | 150 GB (100% sampling) | 7.5 GB (5% sampling) | 95% |
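The savings column can be checked, and an overall figure derived, with a few lines of Python. The volumes are copied from the table. Adding the rows together is a simplification, since each signal uses a different retention window.

```python
# (before_gb, after_gb) per signal, copied from the table above.
volumes = {
    "logs": (500, 100),
    "metrics": (50, 10),
    "traces": (150, 7.5),
}


def savings_pct(before: float, after: float) -> float:
    """Percentage reduction from before to after."""
    return (before - after) / before * 100


per_signal = {name: savings_pct(b, a) for name, (b, a) in volumes.items()}
total_before = sum(b for b, _ in volumes.values())
total_after = sum(a for _, a in volumes.values())
overall = savings_pct(total_before, total_after)

for name, pct in per_signal.items():
    print(f"{name}: {pct:.0f}% saved")
print(f"overall: {total_before} GB -> {total_after} GB ({overall:.1f}% saved)")
```

Substituting measured volumes for the table's figures gives a quick estimate of where the largest absolute savings lie. In this example, logs account for most of the stored bytes.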
Measure Before Optimizing
Before applying cost optimizations, measure your actual current volumes. In Prometheus, query sum(rate(loki_distributor_bytes_received_total[1h])) by (tenant) to see the log ingestion rate in bytes per second for each tenant, and query prometheus_tsdb_head_series to see the number of active series (your current metric cardinality). Base optimization decisions on these measurements, not on assumptions.
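As a rough aid for these measurements, the sketch below builds request URLs for Prometheus's instant-query HTTP endpoint (`/api/v1/query`) and converts a bytes-per-second rate into an approximate monthly volume. The base URL and helper names are assumptions for illustration. The code only constructs URLs and does not contact a server.

```python
from urllib.parse import urlencode

# Assumption: a local Prometheus instance; adjust for your deployment.
PROMETHEUS_URL = "http://localhost:9090"

LOG_RATE_QUERY = "sum(rate(loki_distributor_bytes_received_total[1h])) by (tenant)"
SERIES_QUERY = "prometheus_tsdb_head_series"


def instant_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def bytes_per_second_to_gb_per_month(rate: float, days: int = 30) -> float:
    """Convert a byte rate, as returned by rate(), to decimal GB over `days` days."""
    return rate * 86_400 * days / 1e9


print(instant_query_url(LOG_RATE_QUERY))
print(instant_query_url(SERIES_QUERY))
# Example: a sustained 1 MB/s ingestion rate over 30 days.
print(f"{bytes_per_second_to_gb_per_month(1_000_000):,.0f} GB/month")
```

The conversion helps relate a measured rate to storage figures: a sustained rate of roughly 193 kB/s corresponds to about 500 GB over 30 days.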