Collection Layer
The OpenTelemetry Collector is the nerve center of BizFirst Observe. It receives telemetry from every BizFirstGO service, processes and enriches it, and fans it out to the appropriate storage backends.
Why a Collector?
Without a Collector, every service would need its own configuration for each backend destination (Loki, Prometheus, Tempo). The Collector decouples services from backends: services always send to the Collector via OTLP, and the Collector handles routing, enrichment, and protocol translation. This also means you can add a new backend (e.g., Datadog) by changing only the Collector configuration — no code changes in services.
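As a minimal sketch of the service side of this decoupling, a service only needs to know where the Collector is. The fragment below assumes a Docker Compose deployment, a hypothetical service called orders-api, and a Collector reachable at the hostname otel-collector; none of these names come from the BizFirstGO reference setup. The OTEL_* variables are the standard OpenTelemetry SDK environment variables, though how fully they are honored depends on the SDK language and version in use.

```yaml
# docker-compose.yml fragment (illustrative only)
services:
  orders-api:                                  # hypothetical service name
    image: bizfirstgo/orders-api:latest        # hypothetical image
    environment:
      OTEL_SERVICE_NAME: orders-api
      # The only telemetry destination the service knows about is the Collector.
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317   # assumed hostname
      OTEL_EXPORTER_OTLP_PROTOCOL: grpc
```

Nothing in this fragment mentions Loki, Prometheus, or Tempo, which is the point: backend choices live in the Collector configuration alone.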
Collector Pipeline Architecture
The OTel Collector processes telemetry through a three-stage pipeline: Receivers → Processors → Exporters. Each signal type (logs, metrics, traces) has its own named pipeline.
# otel-collector-config.yaml: BizFirstGO reference configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # Receives from all BizFirstGO services
      http:
        endpoint: 0.0.0.0:4318   # HTTP fallback (used by browser telemetry)
  # Host metrics from Node Exporter (scrape-based)
  prometheus:
    config:
      scrape_configs:
        - job_name: 'node-exporter'
          static_configs:
            - targets: ['node-exporter:9100']
        - job_name: 'cadvisor'
          static_configs:
            - targets: ['cadvisor:8080']
processors:
  # Batch telemetry for efficient export
  batch:
    timeout: 5s
    send_batch_size: 1024
  # Add deployment metadata to all telemetry
  resource:
    attributes:
      - key: deployment.environment
        value: "${DEPLOYMENT_ENV}"
        action: upsert
      - key: cluster.name
        value: "${CLUSTER_NAME}"
        action: upsert
  # Memory limiter: prevent OOM under high load
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  # PII redaction (see Guide9 for full config)
  transform/redact:
    log_statements:
      - context: log
        statements:
          - replace_pattern(attributes["http.request.body"], "password=[^&]+", "password=[REDACTED]")
          - replace_pattern(attributes["http.request.body"], "token=[^&]+", "token=[REDACTED]")
exporters:
  # Logs → Loki
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    default_labels_enabled:
      exporter: false
      job: true
      instance: true
      level: true
  # Metrics → Prometheus (remote write)
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  # Traces → Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, transform/redact, batch]
      exporters: [loki]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/tempo]
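To illustrate the earlier claim that a new backend needs only a Collector change, the sketch below adds Datadog as a second trace destination. It assumes the Collector build includes the datadog exporter from the contrib distribution and that the API key is supplied through an environment variable named DD_API_KEY; both are assumptions for illustration, not part of the reference configuration.

```yaml
exporters:
  # Existing exporters (loki, prometheusremotewrite, otlp/tempo) stay unchanged.
  datadog:
    api:
      site: datadoghq.com
      key: "${DD_API_KEY}"        # assumed environment variable

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/tempo, datadog]   # fan out to both backends
```

Services keep sending OTLP to the same endpoint; only the exporter list of the traces pipeline grows.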
Processor Reference
| Processor | Purpose | Applied To |
|---|---|---|
| memory_limiter | Refuses incoming telemetry once memory use crosses the configured limit, preventing OOM crashes | All pipelines |
| batch | Groups telemetry into batches for efficient export | All pipelines |
| resource | Adds or modifies resource attributes (environment, cluster name) | All pipelines |
| transform/redact | Applies regex-based PII masking to log attributes. The reference configuration defines only log_statements and wires it into the logs pipeline; masking span attributes requires adding trace_statements and listing it in the traces pipeline | Logs (Traces once extended) |
| probabilistic_sampler | Drops a configurable percentage of traces (production cost control) | Traces |
| tail_sampling | Keeps 100% of error traces and N% of successful traces, deciding after a trace's spans have arrived | Traces |
| filter | Drops telemetry matching specific conditions (e.g., health check noise) | All pipelines |
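The tail_sampling row can be sketched as follows. The policy names, the 10s decision wait, and the 10% rate are illustrative assumptions rather than BizFirstGO settings. A trace is kept if any policy matches, so every trace containing an error span is retained and the remainder is sampled probabilistically.

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # how long to buffer spans before deciding (assumed value)
    policies:
      - name: keep-errors         # hypothetical policy name
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-the-rest     # hypothetical policy name
        type: probabilistic
        probabilistic:
          sampling_percentage: 10 # the "N%" from the table, chosen for illustration
```

Tail sampling only works if all spans of a trace reach the same Collector instance, so a multi-replica deployment would need trace-aware routing in front of it.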
Health Check Noise Filtering
Health check endpoints (/health, /ready, /live) are called every few seconds by load balancers and Kubernetes probes. Without filtering, they generate thousands of low-value trace spans per hour. Filter them out in the Collector:
processors:
  filter/drop_health:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/health"'
        - 'attributes["http.route"] == "/ready"'
        - 'attributes["http.route"] == "/live"'
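Defining filter/drop_health has no effect until a pipeline lists it. A minimal sketch of the wiring, assuming the traces pipeline from the reference configuration is otherwise unchanged:

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      # Drop health check spans early, before enrichment and batching.
      processors: [memory_limiter, filter/drop_health, resource, batch]
      exporters: [otlp/tempo]
```

The conditions match on the http.route span attribute, so this only works if the services' instrumentation sets that attribute; instrumentation that records the path under a different attribute would need the conditions adjusted.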
Collector Sizing
| Deployment Scale | Services | Collector CPU | Collector RAM | Replicas |
|---|---|---|---|---|
| Development | 1–5 | 0.5 vCPU | 256 MB | 1 |
| Small production | 5–20 | 1 vCPU | 512 MB | 1–2 |
| Medium production | 20–100 | 2–4 vCPU | 1–2 GB | 2–4 |
| Large production | 100+ | 4–8 vCPU | 2–4 GB | 4–8 (with load balancer) |
Place memory_limiter as the first processor in every pipeline. If the Collector runs out of memory and crashes, telemetry from every service is lost until it restarts. Under memory pressure the memory limiter instead refuses incoming data, so the process stays up and senders that retry can redeliver once pressure eases. Losing some telemetry during a spike is a much better outcome than losing all of it during a crash.
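The fixed limit_mib: 512 in the reference configuration is larger than the 256 MB development allocation in the sizing table, so the limit would never trigger before the container itself is killed. One way to keep the limiter in step with each tier is to express it as a percentage of available memory. The values below are illustrative assumptions, not tuned recommendations.

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80          # hard limit as a share of total available memory (assumed value)
    spike_limit_percentage: 25    # headroom reserved for short bursts (assumed value)
```

With percentages, the same Collector configuration can be reused across the deployment scales in the table, and only the container's memory allocation changes.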