
Log Volume Reduction

# 1. Set minimum log level to Information (not Debug) in production:
# appsettings.Production.json
{
  "OpenTelemetry": {
    "Logging": {
      "MinimumLevel": "Information"   // Never emit Debug logs to OTel in production
    }
  }
}

# 2. Drop or sample high-volume, low-value log lines in the OTel Collector:
# otel-collector-config.yaml
processors:
  filter/drop-healthchecks:
    logs:
      exclude:
        match_type: regexp
        bodies:
          - ".*health.*check.*"    # Drop health check log lines

  # Keep 1% of log records. The sampler applies to every record in the
  # pipeline it is attached to, so route only low-value (DEBUG-equivalent)
  # logs through that pipeline:
  probabilistic_sampler:
    sampling_percentage: 1
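
Before deploying a regexp drop rule, it can help to check which log bodies it would actually match. The sketch below is an illustration only: it uses Python's `re` module rather than the Collector's own regex engine (the two should agree for a pattern this simple), and the sample log lines are hypothetical.

```python
import re

# Same pattern as the filter/drop-healthchecks processor above.
HEALTHCHECK = re.compile(r".*health.*check.*")


def would_drop(body: str) -> bool:
    """True if the drop rule's pattern matches this log body."""
    return HEALTHCHECK.search(body) is not None


# Hypothetical log bodies, for illustration only.
samples = [
    "GET /health check passed in 3 ms",
    "healthcheck ok",
    "Health check passed",                           # capital H: not matched
    "Unhealthy replica detected, check disk usage",  # matched by accident
    "Processing execution for tenant",
]

for body in samples:
    print(f"{'DROP' if would_drop(body) else 'KEEP'}  {body}")
```

The pattern is case-sensitive and loose: it misses "Health check" and also drops an unrelated line that happens to contain "health" followed by "check". A tighter pattern, or matching on a dedicated attribute, avoids discarding useful logs.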

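# 3. Use Loki's structured metadata to reduce per-line content:
# Instead of: "Processing execution exec-abc123 for tenant tenant-xyz in environment production"
# Use a short message plus structured fields. Keep only bounded fields (environment)
# as labels; attach high-cardinality fields (executionId, and tenant_id if there
# are many tenants) as structured metadata, not labels.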

Metric Cardinality Control

# High cardinality is the primary cost driver for Prometheus storage.
# Rule: Never add a label with unbounded values (user ID, URL, email).

# BAD — executionId has millions of unique values:
bizfirst_workflow_executions_total{executionId="exec-abc123", status="success"}

# GOOD — use bounded labels only:
bizfirst_workflow_executions_total{workflow_type="approval", status="success", tenant_id="t123"}
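
A rough way to compare the two label sets above: the worst-case number of series for one metric is the product of the distinct values of each label. The sketch below uses made-up cardinalities (one million execution IDs, 20 workflow types, 200 tenants, 3 statuses); substitute your own counts.

```python
from math import prod


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series count for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical cardinalities, for illustration only.
bad = max_series({"executionId": 1_000_000, "status": 3})
good = max_series({"workflow_type": 20, "status": 3, "tenant_id": 200})

print(f"BAD label set:  up to {bad:,} series")
print(f"GOOD label set: up to {good:,} series")
```

The bounded label set stays in the thousands under these assumptions, while a single unbounded label pushes the same metric into the millions.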

# Audit your metric cardinality regularly:
# In Prometheus: http://localhost:9090/tsdb-status
# Shows the metric names with the most series and the label names with the most distinct values

# Drop high-cardinality metrics at the OTel Collector before they reach Prometheus:
processors:
  filter/drop-high-cardinality:
    metrics:
      exclude:
        match_type: strict
        metric_names:
          - "http_server_requests_seconds"  # Drop if URL is a label

Trace Sampling Optimization

# Optimal tail sampling configuration for production:
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000    # In-memory buffer size
    policies:
      # Keep 100% of error traces (most valuable)
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Keep 100% of slow traces (> 10s)
      - name: slow
        type: latency
        latency: {threshold_ms: 10000}
      # Keep 5% of everything else (random sample)
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

# Example: 5% baseline sampling at 1000 executions/hour, 20 spans per execution:
# Without sampling: 1000 × 20 spans = 20,000 spans/hour = 480,000 spans/day
# With 5% sampling: 1,000 spans/hour = 24,000 spans/day
# Storage reduction: up to 20x (slightly less in practice, because error and
# slow traces are always kept at 100% on top of the 5% baseline)
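
The arithmetic above can be reproduced, and extended to account for the always-kept error and slow traces, with a small calculation. This is a sketch under stated assumptions: a trace is kept if any policy selects it, and `always_kept_fraction` (the share of traces that are errors or slow) is a hypothetical input you would measure yourself.

```python
def spans_per_day(executions_per_hour: int, spans_per_execution: int,
                  baseline_pct: float,
                  always_kept_fraction: float = 0.0) -> tuple[float, float]:
    """Return (unsampled, sampled) spans per day.

    Traces in always_kept_fraction are kept at 100%; the rest are kept
    at baseline_pct percent.
    """
    total = executions_per_hour * spans_per_execution * 24
    kept_fraction = (always_kept_fraction
                     + (1 - always_kept_fraction) * baseline_pct / 100)
    return total, total * kept_fraction


# The document's numbers: no error or slow traces counted.
total, kept = spans_per_day(1000, 20, baseline_pct=5)
print(f"{total:,} -> {kept:,.0f} spans/day ({total / kept:.1f}x reduction)")

# Hypothetical: 2% of traces are errors or slow and are always kept.
total, kept = spans_per_day(1000, 20, baseline_pct=5, always_kept_fraction=0.02)
print(f"{total:,} -> {kept:,.0f} spans/day ({total / kept:.1f}x reduction)")
```

Even a small always-kept share lowers the effective reduction noticeably, so the 20x figure is best read as an upper bound.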

Cost Summary — Before vs. After Optimization

Signal            | Before Optimization     | After Optimization                            | Savings
------------------|-------------------------|-----------------------------------------------|--------
Logs (30 days)    | 500 GB/month            | 100 GB/month (Info-only + drop health checks) | 80%
Metrics (90 days) | 50 GB (high cardinality) | 10 GB (bounded labels)                       | 80%
Traces (7 days)   | 150 GB (100% sampling)  | 7.5 GB (5% sampling)                          | 95%

Measure Before Optimizing

Before applying cost optimizations, measure your actual current volumes. In Prometheus, sum(rate(loki_distributor_bytes_received_total[1h])) by (tenant) shows the log ingest rate in bytes per second for each tenant, and prometheus_tsdb_head_series shows the number of active series (current metric cardinality). Base optimization decisions on these measurements, not on assumptions.
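
The rate query returns bytes per second, while the cost summary is expressed in GB per month, so a quick conversion helps when comparing the two. A minimal sketch, assuming decimal gigabytes (1 GB = 10^9 bytes) and a 30-day month; the 200,000 bytes/second input is a hypothetical reading, not a measured value.

```python
SECONDS_PER_DAY = 86_400


def gb_per_month(bytes_per_second: float, days: int = 30) -> float:
    """Convert a sustained ingest rate to decimal gigabytes over `days` days."""
    return bytes_per_second * SECONDS_PER_DAY * days / 1e9


# Hypothetical reading from the rate(...) query above.
rate = 200_000  # bytes per second
print(f"{rate:,} B/s is about {gb_per_month(rate):.1f} GB per 30 days")
```

Run the same conversion after each optimization step to confirm the change had the expected effect on ingest volume.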