Collection Layer — Physical Architecture

Why a Collector?

Without a Collector, every service would need its own configuration for each backend destination (Loki, Prometheus, Tempo). The Collector decouples services from backends: services always send to the Collector via OTLP, and the Collector handles routing, enrichment, and protocol translation. This also means you can add a new backend (e.g., Datadog) by changing only the Collector configuration — no code changes in services.
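As a rough illustration of that decoupling, the only telemetry setting a service needs is the Collector's address. The sketch below uses the standard OpenTelemetry SDK environment variables in a Kubernetes-style container spec. The hostname `otel-collector` and the service name `orders-api` are assumptions for illustration, not values taken from the BizFirst configuration.

```yaml
# Fragment of a container spec for a hypothetical service.
# The service knows only the Collector endpoint, never Loki, Prometheus, or Tempo.
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector:4317"   # assumed Collector hostname, gRPC port
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "grpc"
  - name: OTEL_SERVICE_NAME
    value: "orders-api"                   # hypothetical service name
```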

Collector Pipeline Architecture

The OTel Collector processes telemetry through a three-stage pipeline: Receivers → Processors → Exporters. Each signal type (logs, metrics, traces) has its own named pipeline.

# otel-collector-config.yaml — BizFirst reference configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317      # Receives from all BizFirst services
      http:
        endpoint: 0.0.0.0:4318      # HTTP fallback (used by browser telemetry)

  # Host and container metrics from Node Exporter and cAdvisor (scrape-based)
  prometheus:
    config:
      scrape_configs:
        - job_name: 'node-exporter'
          static_configs:
            - targets: ['node-exporter:9100']
        - job_name: 'cadvisor'
          static_configs:
            - targets: ['cadvisor:8080']

processors:
  # Batch telemetry for efficient export
  batch:
    timeout: 5s
    send_batch_size: 1024

  # Add deployment metadata to all telemetry
  resource:
    attributes:
      - key: deployment.environment
        value: "${DEPLOYMENT_ENV}"
        action: upsert
      - key: cluster.name
        value: "${CLUSTER_NAME}"
        action: upsert

  # Memory limiter — prevent OOM under high load
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

  # PII redaction (see Guide9 for full config)
  transform/redact:
    log_statements:
      - context: log
        statements:
          - replace_pattern(attributes["http.request.body"], "password=[^&]+", "password=[REDACTED]")
          - replace_pattern(attributes["http.request.body"], "token=[^&]+", "token=[REDACTED]")

exporters:
  # Logs → Loki
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    default_labels_enabled:
      exporter: false
      job: true
      instance: true
      level: true

  # Metrics → Prometheus (remote write)
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write

  # Traces → Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, transform/redact, batch]
      exporters: [loki]

    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]

    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/tempo]
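The claim under "Why a Collector?" that a new backend needs only a Collector change can be sketched against this file. The fragment below assumes the Collector build includes the contrib `datadog` exporter and that an API key is supplied through a `DD_API_KEY` environment variable. Both are assumptions for illustration, and only the changed keys are shown.

```yaml
exporters:
  datadog:
    api:
      site: datadoghq.com        # assumed Datadog site
      key: ${env:DD_API_KEY}     # assumed environment variable holding the API key

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/tempo, datadog]   # traces now fan out to both backends
```

Services keep sending OTLP to the same endpoint; nothing on their side changes.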

Processor Reference

Processor	Purpose	Applied To
`memory_limiter`	Refuses incoming telemetry when memory use exceeds the configured limit, preventing OOM crashes	All pipelines
`batch`	Groups telemetry into batches for efficient export	All pipelines
`resource`	Adds/modifies resource attributes (environment, cluster name)	All pipelines
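`transform/redact`	Applies regex-based PII masking to log attributes; masking span attributes additionally requires a `trace_statements` block and adding the processor to the traces pipeline	Logs (as configured above)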
`probabilistic_sampler`	Drops a configurable percentage of traces (production cost control)	Traces
`tail_sampling`	Decides after a trace completes: keeps 100% of error traces and N% of successful traces	Traces
`filter`	Drops telemetry matching specific conditions (e.g., health check noise)	All pipelines
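
The two sampling processors in the table are not part of the reference configuration above. A minimal sketch of how they might be configured follows. The 10% rates and the 10s decision window are placeholder values, not BizFirst settings, and in practice a pipeline would normally use one of the two processors rather than both.

```yaml
processors:
  # Head sampling: keeps roughly 10% of traces, decided per trace ID.
  probabilistic_sampler:
    sampling_percentage: 10

  # Tail sampling: waits for a trace to complete, then applies the policies.
  # A trace is kept if any policy matches.
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

Tail sampling needs every span of a trace to reach the same Collector instance, which matters once more than one replica is running.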

Health Check Noise Filtering

Health check endpoints (/health, /ready, /live) are called every few seconds by load balancers and Kubernetes probes. Without filtering, they generate thousands of low-value trace spans per hour. Filter them out in the Collector:

processors:
  filter/drop_health:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/health"'
        - 'attributes["http.route"] == "/ready"'
        - 'attributes["http.route"] == "/live"'
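
Defining the processor has no effect until a pipeline references it. One possible wiring, assuming the traces pipeline from the reference configuration, places the filter right after `memory_limiter` so that dropped spans skip the remaining processors:

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter/drop_health, resource, batch]
      exporters: [otlp/tempo]
```

A span is dropped when any one of the listed conditions is true, so the three routes act as an OR.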

Collector Sizing

Deployment Scale	Services	Collector CPU	Collector RAM	Replicas
Development	1–5	0.5 vCPU	256 MB	1
Small production	5–20	1 vCPU	512 MB	1–2
Medium production	20–100	2–4 vCPU	1–2 GB	2–4
Large production	100+	4–8 vCPU	2–4 GB	4–8 (with load balancer)
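
For teams running the Collector on Kubernetes, the "Medium production" row could translate into resource settings along these lines. This is a sketch only: the Deployment name, labels, and the use of the contrib image are assumptions, and the requests and limits simply take the low and high ends of the table's ranges.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector            # hypothetical name
spec:
  replicas: 2                     # low end of the 2–4 replica range
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib   # pin a specific tag in practice
          resources:
            requests:
              cpu: "2"
              memory: 1Gi
            limits:
              cpu: "4"
              memory: 2Gi
```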

Always Configure memory_limiter First

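Place memory_limiter as the first processor in every pipeline, as the reference configuration does. If the Collector runs out of memory and crashes, telemetry from every service is lost until it restarts. Under memory pressure the limiter instead refuses incoming data, so the Collector stays up and only part of the telemetry is lost. Keep limit_mib below the memory actually available to the Collector process; if the two are equal, the process can be killed before the limiter reacts.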
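
One way to keep the limit in step with the available memory is to express it as a percentage instead of a fixed size. The sketch below uses the processor's `limit_percentage` and `spike_limit_percentage` settings. The 80 and 25 values are illustrative starting points, not tuned BizFirst numbers.

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80         # hard limit as a share of total available memory
    spike_limit_percentage: 25   # headroom reserved for bursts between checks
```

With this form, the same configuration can be reused across the deployment scales in the sizing table without editing limit_mib for each one.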
