Why a Collector?

Without a Collector, every service would need its own configuration for each backend destination (Loki, Prometheus, Tempo). The Collector decouples services from backends: services always send to the Collector via OTLP, and the Collector handles routing, enrichment, and protocol translation. This also means you can add a new backend (e.g., Datadog) by changing only the Collector configuration — no code changes in services.
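
From the service side, this decoupling can come down to a single endpoint setting. The following is a minimal sketch, assuming the services run under Docker Compose and the Collector is reachable as `otel-collector`. The service name, image, and Collector hostname are hypothetical. The `OTEL_*` variables are the standard OpenTelemetry SDK environment variables.

```yaml
# docker-compose.yaml (sketch) — service and image names are hypothetical
services:
  orders-api:
    image: bizfirstgo/orders-api:latest   # assumed image name
    environment:
      OTEL_SERVICE_NAME: orders-api
      # The only telemetry destination the service knows about is the Collector
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
      OTEL_EXPORTER_OTLP_PROTOCOL: grpc
```

Nothing in the service refers to Loki, Prometheus, or Tempo, so swapping or adding a backend never touches this file.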

Collector Pipeline Architecture

The OTel Collector processes telemetry through a three-stage pipeline: Receivers → Processors → Exporters. Each signal type (logs, metrics, traces) has its own named pipeline.

# otel-collector-config.yaml — BizFirstGO reference configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317      # Receives from all BizFirstGO services
      http:
        endpoint: 0.0.0.0:4318      # HTTP fallback (used by browser telemetry)

  # Host and container metrics scraped from Node Exporter and cAdvisor
  prometheus:
    config:
      scrape_configs:
        - job_name: 'node-exporter'
          static_configs:
            - targets: ['node-exporter:9100']
        - job_name: 'cadvisor'
          static_configs:
            - targets: ['cadvisor:8080']

processors:
  # Batch telemetry for efficient export
  batch:
    timeout: 5s
    send_batch_size: 1024

  # Add deployment metadata to all telemetry
  # ${env:VAR} reads the value from the Collector's own environment
  resource:
    attributes:
      - key: deployment.environment
        value: "${env:DEPLOYMENT_ENV}"
        action: upsert
      - key: cluster.name
        value: "${env:CLUSTER_NAME}"
        action: upsert

  # Memory limiter — prevent OOM under high load
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

  # PII redaction (see Guide9 for full config)
  transform/redact:
    log_statements:
      - context: log
        statements:
          - replace_pattern(attributes["http.request.body"], "password=[^&]+", "password=[REDACTED]")
          - replace_pattern(attributes["http.request.body"], "token=[^&]+", "token=[REDACTED]")

exporters:
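  # Logs → Loki
  # Note: newer opentelemetry-collector-contrib releases deprecate this exporter;
  # Loki 3.x can also ingest OTLP directly through an otlphttp exporter.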
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    default_labels_enabled:
      exporter: false
      job: true
      instance: true
      level: true

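  # Metrics → Prometheus (remote write)
  # Prometheus must be started with --web.enable-remote-write-receiver
  # to accept writes on this endpoint
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write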

  # Traces → Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, transform/redact, batch]
      exporters: [loki]

    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]

    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/tempo]
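
As a sketch of the "add a new backend" case described earlier, the fragment below fans traces out to Datadog alongside Tempo. It assumes the Collector is a contrib build that includes the `datadog` exporter and that an API key is supplied through a `DD_API_KEY` environment variable. Both are assumptions, not part of the reference configuration.

```yaml
# Sketch: add Datadog as a second trace destination (assumed setup)
exporters:
  datadog:
    api:
      site: datadoghq.com          # assumed Datadog site
      key: ${env:DD_API_KEY}       # assumed environment variable

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/tempo, datadog]   # Tempo stays; Datadog is added
```

Only the Collector configuration changes. Services keep sending OTLP to the same endpoint.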

Processor Reference

| Processor | Purpose | Applied To |
|---|---|---|
| memory_limiter | Refuses incoming telemetry when memory use exceeds the limit, preventing OOM crashes | All pipelines |
| batch | Groups telemetry into batches for efficient export | All pipelines |
| resource | Adds or modifies resource attributes (environment, cluster name) | All pipelines |
| transform/redact | Applies regex-based PII masking to log and span attributes | Logs + Traces |
| probabilistic_sampler | Drops a configurable percentage of traces (production cost control) | Traces |
| tail_sampling | Keeps 100% of error traces and N% of successful traces | Traces |
| filter | Drops telemetry matching specific conditions (e.g., health check noise) | All pipelines |
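
The two sampling processors in the table do not appear in the reference configuration. The following is a minimal sketch of how they might be configured, assuming a contrib build of the Collector. The policy names, percentages, and buffer sizes are illustrative values, not recommendations from this guide.

```yaml
processors:
  # Head-style sampling: keep a fixed share of traces
  probabilistic_sampler:
    sampling_percentage: 25        # illustrative value

  # Tail sampling: decide after a trace's spans have arrived
  tail_sampling:
    decision_wait: 10s             # how long to buffer spans before deciding
    num_traces: 50000              # traces held in memory while waiting
    policies:
      - name: keep-errors          # hypothetical policy name
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-success       # hypothetical policy name
        type: probabilistic
        probabilistic:
          sampling_percentage: 10  # the "N%" for non-error traces

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, tail_sampling, batch]
      exporters: [otlp/tempo]
```

A trace is kept if any policy matches, so error traces are always retained and the rest are sampled at 10%. Use one sampler or the other in a given pipeline. With several Collector replicas, tail sampling also needs all spans of a trace to reach the same replica, which usually means adding a trace-aware load-balancing tier in front.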

Health Check Noise Filtering

Health check endpoints (/health, /ready, /live) are called every few seconds by load balancers and Kubernetes probes. Without filtering, they generate thousands of low-value trace spans per hour. Filter them out in the Collector:

processors:
  filter/drop_health:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/health"'
        - 'attributes["http.route"] == "/ready"'
        - 'attributes["http.route"] == "/live"'
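
Defining the processor is not enough on its own: it only takes effect once it is listed in a pipeline. A sketch of the traces pipeline from the reference configuration with the filter added might look like this. Placing it right after memory_limiter is a design choice assumed here, so that dropped spans skip the later processors.

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      # filter/drop_health runs early so health-check spans are discarded
      # before enrichment and batching
      processors: [memory_limiter, filter/drop_health, resource, batch]
      exporters: [otlp/tempo]
```

The conditions match on `http.route`, so they only apply to spans whose instrumentation sets that attribute.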

Collector Sizing

| Deployment Scale | Services | Collector CPU | Collector RAM | Replicas |
|---|---|---|---|---|
| Development | 1–5 | 0.5 vCPU | 256 MB | 1 |
| Small production | 5–20 | 1 vCPU | 512 MB | 1–2 |
| Medium production | 20–100 | 2–4 vCPU | 1–2 GB | 2–4 |
| Large production | 100+ | 4–8 vCPU | 2–4 GB | 4–8 (with load balancer) |

Always Configure memory_limiter First

Place memory_limiter as the first processor in every pipeline. If the Collector runs out of memory and crashes, telemetry from every service is lost until it restarts. Under memory pressure the limiter refuses incoming data instead, so the Collector sheds load temporarily and stays up, which is a much better outcome.
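
The reference configuration uses a fixed `limit_mib: 512`, which would equal the entire RAM allocation of the "Small production" tier in the sizing table. One way to keep the limit below the container's memory at every tier is to express it as a percentage. The following is a minimal sketch using the memory limiter's percentage options. The 80 and 25 values are illustrative assumptions, not tuned numbers.

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80         # hard limit as a share of available memory (assumed value)
    spike_limit_percentage: 25   # headroom reserved for short spikes (assumed value)
```

With these settings the limiter starts refusing data at roughly 55% of available memory (limit minus spike headroom), leaving room for bursts before the hard limit is reached.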