Tempo HA
Tempo supports a distributed deployment for high availability. With S3 as the backend, multiple Tempo ingesters accept spans simultaneously, so if one ingester fails the others keep accepting writes and, with a replication factor of 3, still hold copies of its recent spans. The S3 backend provides durability for flushed blocks; ingester replicas provide write availability.
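
To make the "one ingester can fail" claim concrete, here is a minimal sketch of the arithmetic. It assumes writes succeed once a simple majority of replicas acknowledge them, which is how ring-based replication is commonly described; the function names `write_quorum` and `tolerated_failures` are illustrative and not part of Tempo.

```python
def write_quorum(replication_factor: int) -> int:
    """Minimum number of successful replica writes for a simple majority."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be >= 1")
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be unavailable while writes still reach quorum."""
    return replication_factor - write_quorum(replication_factor)


if __name__ == "__main__":
    # Compare a few replication factors side by side.
    for rf in (1, 2, 3, 5):
        print(
            f"RF={rf}: quorum={write_quorum(rf)}, "
            f"tolerated failures={tolerated_failures(rf)}"
        )
```

Under that assumption, a replication factor of 3 needs 2 acknowledgements and tolerates 1 unavailable ingester, while a replication factor of 2 tolerates none, which is why 3 is the usual starting point.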
Tempo Distributed Components
| Component | Role | Replicas for HA |
|---|---|---|
| Distributor | Receives OTLP traces; routes to ingesters via consistent hash ring | 2+ |
| Ingester | Buffers spans in WAL; flushes blocks to S3 | 3 (replication factor 3) |
| Querier | Executes TraceQL queries against blocks in S3 and recent data still held by ingesters | 2+ |
| Query Frontend | Caches and shards queries | 2 |
| Compactor | Merges blocks; enforces retention | 1 (not on the write or read path; add replicas only if compaction falls behind) |
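
The distributor's routing can be illustrated with a simplified consistent hash ring. This is a sketch under stated assumptions, not Tempo's implementation: it uses one SHA-256-derived token per ingester (real rings typically register many tokens per instance and use a different hash), and the `Ring` class, `replicas_for` method, and ingester names are hypothetical.

```python
import bisect
import hashlib


def _token(key: str) -> int:
    """Map a string to a 32-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")


class Ring:
    """Toy hash ring: one token per ingester, replicas chosen clockwise."""

    def __init__(self, ingesters, replication_factor=3):
        if replication_factor > len(ingesters):
            raise ValueError("replication_factor exceeds ingester count")
        self.replication_factor = replication_factor
        # Sort ingesters by token so a binary search finds the successor.
        self._entries = sorted((_token(name), name) for name in ingesters)
        self._tokens = [token for token, _ in self._entries]

    def replicas_for(self, trace_id: str):
        """Walk clockwise from the trace's token, collecting distinct ingesters."""
        start = bisect.bisect_right(self._tokens, _token(trace_id))
        chosen = []
        for offset in range(len(self._entries)):
            name = self._entries[(start + offset) % len(self._entries)][1]
            if name not in chosen:
                chosen.append(name)
            if len(chosen) == self.replication_factor:
                break
        return chosen


if __name__ == "__main__":
    ring = Ring(["ingester-0", "ingester-1", "ingester-2", "ingester-3"])
    # The same trace ID always maps to the same three ingesters.
    print(ring.replicas_for("4bf92f3577b34da6a3ce929d0e0e4736"))
```

Because every span of a trace hashes to the same position, all of its spans land on the same set of ingesters, and adding or removing an ingester only remaps the traces adjacent to its token.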
Tempo HA Helm Configuration
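# tempo-distributed-values.yaml
# Keys are top-level because the chart is installed directly below;
# nest them under "tempo-distributed:" only when it is a subchart dependency.
ingester:
  replicas: 3
  config:
    replication_factor: 3  # Write each span to 3 ingesters
  persistence:
    enabled: true
    size: 10Gi
distributor:
  replicas: 2
querier:
  replicas: 2
compactor:
  replicas: 1
storage:
  trace:
    backend: s3
    s3:
      bucket: bizfirst-tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1
      # Helm does not expand ${...}. Tempo substitutes these at startup only when
      # run with -config.expand-env=true and the variables are set in the pods.
      access_key: ${AWS_ACCESS_KEY}
      secret_key: ${AWS_SECRET_KEY}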
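# Install (add the Grafana chart repository first if it is not already configured):
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install tempo grafana/tempo-distributed \
  --namespace observe \
  --values tempo-distributed-values.yaml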
Is Tempo HA Worth the Complexity?
Tempo HA adds significant operational complexity. Consider the trade-offs before moving off a single-node deployment:
| Factor | Single-Node Tempo | Tempo HA |
|---|---|---|
| Write HA | No — single point of failure | Yes — 3 replicas, replication factor 3 |
| Data durability | S3 backend: 11-nines durability | Same — S3 backend |
| Trace loss on ingester failure | Spans still in memory and not yet written to the WAL are lost; WAL contents survive only if the local volume does | Other ingesters hold copies of the data (replication factor 3) |
| Operational complexity | Low — single process | High — 5 component types, ring coordination |
| Resource cost | ~4 CPU, ~8 GB RAM | ~16 CPU, ~32 GB RAM |
For most BizFirstGO deployments, losing a few minutes of traces during a Tempo restart is acceptable — traces are a debugging aid, not a system of record. The WAL ensures that buffered traces are persisted locally before an orderly shutdown. Consider Tempo HA only if you have a hard requirement for zero trace loss and cannot tolerate brief ingestion gaps during Kubernetes rolling updates.