BizFirst Observe
Tempo Trace Retention
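Tempo manages trace retention through its compactor, which deletes blocks from the object storage backend once they are older than block_retention. The compactor setting is the global default for all tenants. Tempo's overrides configuration also includes a per-tenant block_retention setting, so confirm it is available in the version you run before relying on it.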
Tempo Retention Configuration
# tempo-config.yaml: retention settings
compactor:
  compaction:
    block_retention: 168h              # Delete blocks older than 7 days (168 hours)
    compacted_block_retention: 1h      # How long to keep compacted blocks before deletion
    compaction_window: 1h              # Blocks within this window are compacted together
    max_block_bytes: 107374182400      # Max block size: 100 GiB
    max_compaction_objects: 6000000
storage:
  trace:
    backend: s3
    s3:
      bucket: bizfirst-tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1
    wal:
      path: /tmp/tempo/wal             # WAL is on local disk
      encoding: snappy
    local:
      path: /tmp/tempo/blocks
Tempo Block Lifecycle
| Phase | Location | Duration | Description |
|---|---|---|---|
| WAL (Write-Ahead Log) | Local disk | Seconds to minutes | Incoming spans buffered before flushing to S3 |
| Active block | S3 | ~1 hour | Recent spans, actively written |
| Compacted block | S3 | Up to block_retention (7 days) | Merged blocks for efficient query |
| Deleted | - | After block_retention expires | Permanently removed from S3 |
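The last two rows of the table boil down to a date comparison. The following is a minimal sketch of that comparison, not Tempo's actual implementation: the helper names `expires_at` and `is_expired` are hypothetical, and measuring a block's age from its end time is an assumption made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Mirrors block_retention: 168h from the configuration above.
BLOCK_RETENTION = timedelta(hours=168)


def expires_at(block_end: datetime, retention: timedelta = BLOCK_RETENTION) -> datetime:
    """Earliest time at which a block becomes eligible for deletion (hypothetical helper)."""
    return block_end + retention


def is_expired(block_end: datetime, now: datetime,
               retention: timedelta = BLOCK_RETENTION) -> bool:
    """True once the block is at least `retention` old, measured from its end time."""
    return now >= expires_at(block_end, retention)


if __name__ == "__main__":
    block_end = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print("eligible for deletion from:", expires_at(block_end).isoformat())
    print("expired after 3 days:", is_expired(block_end, block_end + timedelta(days=3)))
```

Deletion happens when the compactor next runs its retention pass, so a block can remain in S3 somewhat longer than the computed time.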
Storage Volume Calculation
# Tempo storage calculation for BizFirstGO:
# Assumptions: 100 executions/hour, 20 spans/execution, 10% sampling
executions_per_hour = 100
spans_per_execution = 20
sampling_rate = 0.10
bytes_per_span = 500 # Average span size (compressed)
spans_per_hour = executions_per_hour * spans_per_execution * sampling_rate
# = 100 * 20 * 0.10 = 200 spans/hour
bytes_per_hour = spans_per_hour * bytes_per_span
# = 200 * 500 = 100 KB/hour
bytes_per_day = bytes_per_hour * 24
# = 2.4 MB/day
bytes_for_7_days = bytes_per_day * 7
# = 16.8 MB for 7-day retention
# At 10x load (1000 executions/hour): ~168 MB for 7 days
# At 100x load (10,000 executions/hour): ~1.68 GB for 7 days
# Conclusion: Tempo storage is typically the smallest of the three signals
# Sampling is the primary cost control lever for Tempo
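The same arithmetic can be wrapped in a function so that other loads, sampling rates, or retention periods are easy to try. This is a rough sketch using the assumptions stated above; the function name `tempo_storage_bytes` and its defaults are illustrative, and the 500-byte span size is an estimate, not a measured value.

```python
def tempo_storage_bytes(executions_per_hour: float,
                        spans_per_execution: int = 20,
                        sampling_rate: float = 0.10,
                        bytes_per_span: int = 500,
                        retention_days: int = 7) -> float:
    """Estimate the bytes retained in Tempo over the whole retention period."""
    spans_per_hour = executions_per_hour * spans_per_execution * sampling_rate
    bytes_per_hour = spans_per_hour * bytes_per_span
    return bytes_per_hour * 24 * retention_days


if __name__ == "__main__":
    # Baseline, 10x, and 100x load from the calculation above.
    for load in (100, 1_000, 10_000):
        megabytes = tempo_storage_bytes(load) / 1e6
        print(f"{load:>6} executions/hour: {megabytes:,.1f} MB over 7 days")
```

Because storage scales linearly with `sampling_rate`, halving the sampling rate halves the estimate at any load.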
Adjusting Trace Sampling to Control Volume
# In otel-collector-config.yaml, tail_sampling processor:
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always keep error traces
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Keep 5% of success traces (adjust this number)
      - name: probabilistic-success
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
      # Always keep slow traces (> 5 seconds)
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 5000}
# Lower sampling_percentage to reduce Tempo storage cost.
# 5% is a reasonable production starting point: error and slow traces are always kept,
# while the volume of routine successful traces drops roughly 20x.
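The three policies combine so that a trace is kept if any one of them matches. The sketch below models that decision in Python to make the effect of `sampling_percentage` easier to reason about. It is a simplified illustration, not the collector's implementation: `keep_trace` and its parameters are hypothetical names, a random draw stands in for however the processor makes its probabilistic choice, and the strict `>` comparison at the latency threshold is an assumption.

```python
import random


def keep_trace(has_error: bool, duration_ms: float, roll=None,
               sampling_percentage: float = 5.0,
               latency_threshold_ms: float = 5000.0) -> bool:
    """Return True if any of the three modeled policies would keep the trace.

    `roll` is a number in [0, 100); pass it explicitly for deterministic results.
    """
    if has_error:                           # "errors" policy
        return True
    if duration_ms > latency_threshold_ms:  # "slow-traces" policy
        return True
    if roll is None:
        roll = random.random() * 100
    return roll < sampling_percentage       # "probabilistic-success" policy


if __name__ == "__main__":
    rng = random.Random(42)
    # Simulate fast, successful traces only: the kept share approaches 5%.
    kept = sum(keep_trace(False, 120.0, roll=rng.random() * 100) for _ in range(100_000))
    print(f"kept {kept / 1000:.2f}% of fast, successful traces")
```

Error and slow traces bypass the percentage entirely, so lowering `sampling_percentage` only reduces the volume of routine traces.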
7 Days Is Usually Sufficient
Most trace-based debugging happens within minutes to hours of an incident, and traces older than 7 days are rarely needed. If you need to demonstrate a performance pattern over weeks, use metrics (Prometheus), which are far more storage-efficient for trend analysis than raw traces.