Tempo Trace Retention — Data Retention & Archive

Tempo Retention Configuration

# tempo-config.yaml — retention settings
compactor:
  compaction:
    block_retention: 168h          # Delete blocks older than 7 days (168 hours)
    compacted_block_retention: 1h  # Grace period before source blocks already merged by compaction are deleted
    compaction_window: 1h          # Blocks within this window are compacted together
    max_block_bytes: 107374182400  # Max block size: 100 GiB (100 * 1024^3 bytes)
    max_compaction_objects: 6000000

storage:
  trace:
    backend: s3
    s3:
      bucket: bizfirst-tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1
    wal:
      path: /tmp/tempo/wal         # WAL is on local disk
      encoding: snappy
    local:
      path: /tmp/tempo/blocks
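
The duration values above use Go-style strings such as 168h. As a rough sketch, they could be sanity-checked before rollout with a small script like the one below. The helper names parse_duration and check_retention are hypothetical, only the h, m, s and ms units are handled, and the rule that compacted_block_retention should not exceed block_retention is an assumption made for illustration, not something Tempo is known to enforce.

```python
import re
from datetime import timedelta

# Subset of Go duration units, enough for the values used in this config.
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")


def parse_duration(text: str) -> timedelta:
    """Parse a Go-style duration such as '168h' or '1h30m' (hypothetical helper)."""
    pos = 0
    total_seconds = 0.0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            # A gap between tokens means the string contains unsupported text.
            raise ValueError(f"unparseable duration: {text!r}")
        total_seconds += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"unparseable duration: {text!r}")
    return timedelta(seconds=total_seconds)


def check_retention(block_retention: str, compacted_block_retention: str) -> timedelta:
    """Return the parsed block retention after a basic consistency check."""
    keep = parse_duration(block_retention)
    grace = parse_duration(compacted_block_retention)
    # Assumed sanity rule for this sketch: the grace period should not
    # outlive the overall retention window.
    if grace > keep:
        raise ValueError("compacted_block_retention exceeds block_retention")
    return keep


if __name__ == "__main__":
    print(check_retention("168h", "1h"))  # the values from the config above
```

Running a check like this in CI is one way to catch a mistyped unit (for example 168m instead of 168h) before it silently shortens retention.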

Tempo Block Lifecycle

Phase	Location	Duration	Description
WAL (Write-Ahead Log)	Local disk	Seconds to minutes	Incoming spans buffered before flushing to S3
Flushed block	S3	~1 hour (one compaction_window)	Recently flushed blocks awaiting compaction
Compacted block	S3	Up to block_retention (7 days)	Merged blocks for efficient query
Deleted	-	After block_retention expires	Permanently removed from S3
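
The final row of the table can be made concrete with a little date arithmetic. The sketch below estimates when a block becomes eligible for deletion. It assumes retention is measured from the end time of the block, and the function names expires_at and is_expired are hypothetical; it models the rule described above rather than Tempo's actual implementation.

```python
from datetime import datetime, timedelta, timezone


def expires_at(block_end: datetime, block_retention: timedelta) -> datetime:
    """Earliest moment at which a block is past its retention window."""
    return block_end + block_retention


def is_expired(block_end: datetime, now: datetime, block_retention: timedelta) -> bool:
    """True once the block is older than block_retention (assumed rule)."""
    return now >= expires_at(block_end, block_retention)


if __name__ == "__main__":
    retention = timedelta(hours=168)  # block_retention: 168h
    block_end = datetime(2024, 1, 1, tzinfo=timezone.utc)  # example timestamp
    print(expires_at(block_end, retention))
    print(is_expired(block_end, datetime.now(timezone.utc), retention))
```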

Storage Volume Calculation

# Tempo storage calculation for BizFirst:
# Assumptions: 100 executions/hour, 20 spans/execution, 10% sampling

executions_per_hour = 100
spans_per_execution = 20
sampling_rate = 0.10
bytes_per_span = 500        # Average span size (compressed)

spans_per_hour = executions_per_hour * spans_per_execution * sampling_rate
             # = 100 * 20 * 0.10 = 200 spans/hour
bytes_per_hour = spans_per_hour * bytes_per_span
             # = 200 * 500 = 100 KB/hour
bytes_per_day = bytes_per_hour * 24
             # = 2.4 MB/day
bytes_for_7_days = bytes_per_day * 7
             # = 16.8 MB for 7-day retention

# At 10x load (1000 executions/hour): ~168 MB for 7 days
# At 100x load (10,000 executions/hour): ~1.68 GB for 7 days

# Conclusion: Tempo storage is typically the smallest of the three signals
# Sampling is the primary cost control lever for Tempo
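
The same arithmetic can be wrapped in a function so that different load and sampling scenarios are easy to compare. This is only a sketch: the function names are hypothetical, the 500-byte average span size is the assumption stated above, and decimal units (1 MB = 1,000,000 bytes) are used to match the figures in the calculation.

```python
def estimate_trace_storage_bytes(
    executions_per_hour: float,
    spans_per_execution: float,
    sampling_rate: float,
    bytes_per_span: float = 500,  # assumed average compressed span size
    retention_days: int = 7,
) -> float:
    """Estimate bytes held in Tempo once the retention window is full."""
    spans_per_hour = executions_per_hour * spans_per_execution * sampling_rate
    return spans_per_hour * bytes_per_span * 24 * retention_days


def format_bytes(num_bytes: float) -> str:
    """Render a byte count using decimal units (KB = 1000 B)."""
    value = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if value < 1000:
            return f"{value:.2f} {unit}"
        value /= 1000
    return f"{value:.2f} TB"


if __name__ == "__main__":
    for load in (100, 1_000, 10_000):
        size = estimate_trace_storage_bytes(load, 20, 0.10)
        print(f"{load} executions/hour: {format_bytes(size)}")
```

Halving sampling_rate halves the estimate, which is why sampling is the first setting to revisit when trace storage grows.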

Adjusting Trace Sampling to Control Volume

# In otel-collector-config.yaml — tail_sampling processor:
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always keep error traces
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Keep a random 5% of all traces (policies are OR-ed, so errors and slow traces are kept regardless; adjust this number)
      - name: probabilistic-success
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
      # Always keep slow traces (> 5 seconds)
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 5000}

# Lower sampling_percentage to reduce Tempo storage cost.
# 5% is a reasonable production starting point: error and slow traces are still
# kept in full, while the volume of ordinary traces drops roughly 20x.
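
Because a trace is kept when any one of the policies matches, the overall kept fraction is somewhat higher than the probabilistic percentage alone. The sketch below models the three policies above in plain Python. It is an illustration rather than the collector's implementation: the function names are hypothetical, the random draw is passed in explicitly, and the exact handling of a trace that takes precisely 5000 ms is not modeled.

```python
def should_keep(
    is_error: bool,
    duration_ms: float,
    random_draw: float,  # uniform value in [0, 1), supplied by the caller
    sampling_percentage: float = 5.0,
    threshold_ms: float = 5000.0,
) -> bool:
    """Keep a trace if any of the three policies matches (OR semantics)."""
    if is_error:
        return True  # "errors" policy
    if duration_ms > threshold_ms:
        return True  # "slow-traces" policy
    return random_draw < sampling_percentage / 100.0  # probabilistic policy


def expected_kept_fraction(
    always_kept_fraction: float, sampling_percentage: float = 5.0
) -> float:
    """Expected share of traces stored, given the share that is error or slow."""
    p = sampling_percentage / 100.0
    return always_kept_fraction + (1.0 - always_kept_fraction) * p


if __name__ == "__main__":
    # Hypothetical workload: 2% of traces are errors or slower than 5 s.
    print(f"{expected_kept_fraction(0.02):.3f}")
```

With a hypothetical 2% of traces being errors or slow, roughly 6.9% of traces are stored rather than 5%, so the error and latency rates of the workload matter when estimating storage.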

7 Days Is Usually Sufficient

The vast majority of trace-based debugging happens within minutes to hours of an incident. It is rare to need traces older than 7 days. If you need to demonstrate a performance pattern over weeks, use metrics (Prometheus) — they are far more storage-efficient for trend analysis than raw traces.
