Data Retention Overview
Logs, metrics, and traces have different value curves over time — and different costs. A trace is useful for debugging an incident within hours; a metric is valuable for capacity planning months later; a log may be required for compliance for years. Configuring the right retention for each signal keeps storage costs manageable without sacrificing observability coverage.
Recommended Default Retention
| Signal | Storage Backend | Hot (Local) Retention | Cold (Object Store) Retention | Primary Use After Hot Period |
|---|---|---|---|---|
| Logs | Loki | 30 days | 1 year (S3 Glacier) | Compliance audit, post-incident investigation |
| Metrics | Prometheus / Thanos | 90 days | Indefinite (Thanos Object Store) | Capacity planning, trend analysis, SLA reporting |
| Traces | Tempo | 7 days | Not recommended (high volume) | Traces older than 7 days rarely needed — sample and discard |
Why Different Retention Periods?
Traces: Short Retention
Traces are used for active incident debugging — usually within hours of an issue. Storing all spans for more than 7 days generates enormous storage costs with very little value. Use tail sampling to keep 100% of error traces and 5-10% of success traces.
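The sampling rule above can be sketched as a small decision function. This is a minimal illustration, not Tempo's or any collector's actual implementation: the `keep_trace` name, the `status` field on each span, and the 10% default are assumptions made for the example.

```python
import hashlib

# Assumed default: the upper end of the 5-10% range for successful traces.
SUCCESS_SAMPLE_RATE = 0.10


def keep_trace(trace_id: str, spans: list, success_rate: float = SUCCESS_SAMPLE_RATE) -> bool:
    """Decide, once all spans of a trace have arrived, whether to store it.

    Each span is a dict with a hypothetical "status" key ("ok" or "error").
    """
    # Tail sampling sees the whole trace, so any error span keeps the trace.
    if any(span.get("status") == "error" for span in spans):
        return True

    # For successful traces, hash the trace ID into [0, 1) so the decision
    # is deterministic: the same trace ID always gets the same answer.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < success_rate
```

Hashing the trace ID rather than drawing a random number means every component that evaluates the same trace reaches the same decision, which matters when more than one sampler instance is running.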
Logs: Medium Retention
Logs are the primary audit trail for what happened in a workflow. 30 days covers most post-incident investigations. Cold storage (S3 Glacier) for 1 year covers compliance requirements without hot storage costs.
Metrics: Long Retention
Metrics are compact (a few KB per time series per day). Keeping metrics for months or years enables capacity planning — "at current growth rate, when will we need more servers?" This is not possible with short metric retention.
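To make the capacity-planning question concrete, the sketch below fits a straight line to daily usage samples and projects when it crosses a capacity limit. It assumes linear growth and one sample per day; the function name and inputs are hypothetical, and real planning would usually query the long-term metric store rather than a Python list.

```python
from typing import Optional, Sequence


def days_until_capacity(daily_usage: Sequence[float], capacity: float) -> Optional[float]:
    """Estimate days from the last sample until usage reaches capacity.

    Uses an ordinary least-squares line through (day index, usage).
    Returns None when usage is flat or shrinking.
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two daily samples to fit a trend")

    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage))

    slope = sxy / sxx  # growth per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x

    # Day index where the fitted line hits capacity, relative to the last sample.
    crossing_day = (capacity - intercept) / slope
    return max(0.0, crossing_day - (n - 1))


# Example: 90 days of history growing by 1 unit per day from 50, limit of 200.
history = [50 + day for day in range(90)]
print(days_until_capacity(history, capacity=200))
```

The fit is only as good as the history behind it: with a few days of data the slope is dominated by noise, which is the practical reason for keeping metrics for months.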
Compliance Drives Minimums
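Audit requirements may mandate minimum retention periods. For financial workflows, SOX is commonly cited as requiring 7 years of retention for audit records, which is longer than the 1-year cold log retention recommended above, so extend cold retention for any logs that fall in scope. GDPR pulls in the opposite direction: personal data must be erasable on request, with a response due within one month. Configure retention to satisfy both requirements simultaneously: keep audit logs for the mandated period, and make sure personal data inside them can be deleted or anonymized within the deadline.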
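One way to reason about combining these constraints is sketched below. The function names and the dictionary of minimums are hypothetical, and the figures are the ones quoted in this section used as illustrative inputs, not a statement of what any regulation requires for a given deployment.

```python
def effective_retention_days(default_days: int, mandated_minimums: dict) -> int:
    """Retention is the longest of the default and every mandated minimum."""
    return max([default_days, *mandated_minimums.values()])


def can_meet_erasure_deadline(deletion_job_interval_days: int, deadline_days: int) -> bool:
    """A deletion request is only met in time if the (hypothetical) purge job
    runs at least once within the deadline."""
    return deletion_job_interval_days <= deadline_days


# Illustrative inputs taken from the text: 1-year default, 7-year audit minimum.
log_retention = effective_retention_days(365, {"SOX": 7 * 365})
print(log_retention)                         # days of audit log retention
print(can_meet_erasure_deadline(7, 30))      # weekly purge job vs. 30-day deadline
```

The two checks are independent on purpose: a long retention period sets how long data may live, while the erasure deadline constrains how quickly specific records can be removed from it.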
Storage Cost Estimates
| Signal | Data Volume (10 tenants, moderate load) | Monthly S3 Cost (us-east-1) |
|---|---|---|
| Logs (30-day hot, Loki) | ~50 GB/month | ~$1.15/month (S3 Standard) |
| Logs (cold, S3 Glacier) | ~600 GB/year | ~$2.40/month once a full year is stored (S3 Glacier, ~$0.004/GB-month) |
| Metrics (Prometheus TSDB, 90 days) | ~10 GB | Negligible (local disk) |
| Metrics (Thanos, 2-year history) | ~80 GB | ~$1.84/month (S3 Standard; ~$1.00/month on S3 Standard-IA) |
| Traces (Tempo, 7-day, 10% sampled) | ~15 GB | ~$0.35/month (S3 Standard) |
The estimates above assume ~100 workflow executions/hour across 10 tenants. High-volume deployments (10,000+ executions/hour) generate roughly 100x more telemetry data, and storage cost scales with it. Always measure your actual log byte rate before setting retention periods. The Loki query `sum(rate(loki_distributor_bytes_received_total[1h]))` returns the ingested rate in bytes per second, averaged over the last hour.
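A measured byte rate can be turned into a storage and cost estimate with simple arithmetic. The sketch below assumes a price of $0.023 per GB-month, which is the rate implied by the S3 Standard rows in the table, and a `compression_ratio` parameter that is a placeholder: the distributor metric counts bytes received, while stored chunks are typically smaller after compression.

```python
SECONDS_PER_DAY = 86_400
BYTES_PER_GB = 10**9


def stored_gb(bytes_per_second: float, retention_days: int, compression_ratio: float = 1.0) -> float:
    """Approximate data held at steady state for a given retention window.

    compression_ratio is an assumption (ingested bytes / stored bytes);
    1.0 means no compression and gives an upper bound.
    """
    ingested = bytes_per_second * SECONDS_PER_DAY * retention_days
    return ingested / compression_ratio / BYTES_PER_GB


def monthly_cost_usd(gb: float, price_per_gb_month: float = 0.023) -> float:
    """Storage-only cost; request and retrieval charges are not included."""
    return gb * price_per_gb_month


# Example: a rate that fills ~50 GB over a 30-day window.
rate = 50 * BYTES_PER_GB / (30 * SECONDS_PER_DAY)  # bytes per second
volume = stored_gb(rate, retention_days=30)
print(round(volume, 2), round(monthly_cost_usd(volume), 2))
```

Running the same calculation with the measured rate from a real deployment, rather than the table's assumptions, shows quickly whether a 30-day hot window is affordable or needs to be shortened.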