Sizing Guidelines
Storage, CPU, and memory estimates for BizFirst Observe deployments. Use these as starting points — actual consumption depends on workflow complexity, log verbosity level, trace sampling rate, and metric cardinality.
Log Volume Estimation (Loki)
Log volume is the dominant storage cost in most BizFirstGO deployments. The primary driver is the number of node executions per day, since each node produces structured log entries at start, end, and on any exception.
| Scenario | Executions/Day | Avg Nodes/Workflow | Log Volume/Day | 30-Day Storage |
|---|---|---|---|---|
| Development | 100 | 5 | ~500 MB | ~15 GB |
| Small production | 1,000 | 8 | ~4 GB | ~120 GB |
| Medium production | 10,000 | 10 | ~40 GB | ~1.2 TB |
| Large production | 100,000 | 12 | ~400 GB | ~12 TB |
Loki compresses log chunks with Snappy. JSON-structured logs from BizFirstGO typically achieve 5–8x compression. The storage figures above are post-compression, so the raw log volume is approximately 5–8x higher.
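The relationship between the daily figure, the retention window, and the compression ratio can be sketched in a few lines. This is a rough illustration only: the function names are hypothetical, and the 5–8x range is simply the one quoted above, so measure your own ratio before relying on it.

```python
# Compression range quoted above for JSON-structured logs (an estimate, not a guarantee).
COMPRESSION_MIN = 5.0
COMPRESSION_MAX = 8.0


def retained_storage_gb(stored_gb_per_day: float, retention_days: int) -> float:
    """Compressed storage held at steady state for a given retention window."""
    return stored_gb_per_day * retention_days


def raw_volume_range_gb(stored_gb_per_day: float) -> tuple[float, float]:
    """Approximate uncompressed daily volume implied by a compressed daily figure."""
    return (stored_gb_per_day * COMPRESSION_MIN, stored_gb_per_day * COMPRESSION_MAX)


# "Medium production" row: ~40 GB/day after compression, 30-day retention.
print(retained_storage_gb(40.0, 30))  # 1200.0 GB, i.e. ~1.2 TB
print(raw_volume_range_gb(40.0))      # (200.0, 320.0) GB/day before compression
```

The raw range is mainly useful for sizing network ingest and the OTel Collector, which see logs before Loki compresses them.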
Metric Storage Estimation (Prometheus)
Prometheus storage is driven by metric cardinality — the total number of unique time series (metric name + label combinations). BizFirstGO's default metric set has moderate cardinality.
| Metric Family | Series Count (estimate) | Note |
|---|---|---|
| bizfirst_workflow_executions_total | N_tenants × 3 statuses | 3 per tenant |
| bizfirst_node_execution_duration_seconds | N_tenants × N_node_types × 12 histogram buckets | ~120 per tenant with 10 node types |
| bizfirst_hil_pending_count | N_tenants | 1 per tenant |
| Infrastructure (Node Exporter) | ~800 per host | Fixed overhead |
| Infrastructure (cAdvisor) | ~200 per container | Scales with container count |
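As a rough aid, the per-family formulas in the table can be combined into a single series estimate. The sketch below is an assumption-laden illustration: `estimate_series` and its parameters are hypothetical names, and the example deployment (50 tenants, 3 hosts, 12 containers) is invented for the arithmetic.

```python
def estimate_series(tenants: int, node_types: int, hosts: int, containers: int,
                    histogram_buckets: int = 12) -> dict[str, int]:
    """Estimate active time series per metric family using the table's formulas."""
    series = {
        "bizfirst_workflow_executions_total": tenants * 3,  # 3 statuses per tenant
        "bizfirst_node_execution_duration_seconds": tenants * node_types * histogram_buckets,
        "bizfirst_hil_pending_count": tenants,
        "node_exporter": hosts * 800,    # ~800 series per host
        "cadvisor": containers * 200,    # ~200 series per container
    }
    series["total"] = sum(series.values())
    return series


# Hypothetical deployment: 50 tenants, 10 node types, 3 hosts, 12 containers.
estimate = estimate_series(tenants=50, node_types=10, hosts=3, containers=12)
print(estimate["total"])  # 11000
```

In this example the node duration histogram accounts for 6,000 of the 11,000 series, which is why the number of node types and tenants matters more than the infrastructure overhead as a deployment grows.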
Prometheus requires approximately 2 bytes per sample on disk. At a 15-second scrape interval, each series generates 5,760 samples/day (86,400 seconds ÷ 15). For 10,000 series and 90-day retention:
10,000 series × 5,760 samples/day × 90 days × 2 bytes = ~10 GB
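The same calculation can be parameterized to try other series counts, scrape intervals, or retention windows. This is a minimal sketch; the function name is hypothetical and the 2 bytes per sample default is the approximation used above, which varies with how well the data compresses.

```python
SECONDS_PER_DAY = 86_400


def prometheus_disk_bytes(series: int, retention_days: int,
                          scrape_interval_s: float = 15.0,
                          bytes_per_sample: float = 2.0) -> float:
    """Approximate on-disk size: series x samples/day x days x bytes/sample."""
    samples_per_day = SECONDS_PER_DAY / scrape_interval_s  # 5,760 at 15 s
    return series * samples_per_day * retention_days * bytes_per_sample


# The worked example above: 10,000 series, 90-day retention.
size_gb = prometheus_disk_bytes(10_000, 90) / 1e9
print(f"{size_gb:.1f} GB")  # 10.4 GB
```

Doubling the scrape interval to 30 seconds halves the estimate, which is often a cheaper lever than shortening retention.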
Trace Storage Estimation (Tempo)
Trace storage is driven by trace volume and average span size. Tempo compresses trace data efficiently using Parquet format.
| Scenario | Traces/Day | Avg Spans/Trace | Trace Storage/Day | 7-Day Storage |
|---|---|---|---|---|
| Development | 1,000 | 10 | ~50 MB | ~350 MB |
| Small production | 10,000 | 15 | ~750 MB | ~5 GB |
| Medium production | 100,000 | 20 | ~10 GB | ~70 GB |
| Large production | 1,000,000 | 20 | ~100 GB | ~700 GB |
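Every row in the table works out to roughly 5 KB of stored data per span (for example, 50 MB ÷ 10,000 spans). That per-span figure is derived from the table rather than stated by it, so treat it as an assumption in the sketch below; the function name is also hypothetical.

```python
# Assumption: ~5 KB stored per span, back-derived from the table rows above.
BYTES_PER_SPAN = 5_000


def trace_storage_gb(traces_per_day: int, avg_spans_per_trace: int,
                     retention_days: int = 7,
                     bytes_per_span: int = BYTES_PER_SPAN) -> tuple[float, float]:
    """Return (daily GB, retained GB) for a given trace volume."""
    daily_gb = traces_per_day * avg_spans_per_trace * bytes_per_span / 1e9
    return daily_gb, daily_gb * retention_days


# "Medium production" row: 100,000 traces/day, 20 spans/trace.
daily, retained = trace_storage_gb(100_000, 20)
print(daily, retained)  # 10.0 70.0
```

Spans with many attributes or large events will exceed 5 KB, so it is worth measuring the actual average span size once real traffic is flowing.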
Component Resource Requirements
| Component | Dev CPU | Dev RAM | Prod CPU | Prod RAM | Prod Disk |
|---|---|---|---|---|---|
| OTel Collector | 0.5 vCPU | 256 MB | 2–4 vCPU | 1–2 GB | None (stateless) |
| Loki | 0.5 vCPU | 512 MB | 4–8 vCPU | 8–16 GB | Local WAL only; data in S3 |
| Prometheus | 0.5 vCPU | 512 MB | 2–4 vCPU | 8–16 GB | 50–500 GB SSD |
| Tempo | 0.5 vCPU | 512 MB | 4–8 vCPU | 8–16 GB | Local WAL only; data in S3 |
| Grafana | 0.25 vCPU | 256 MB | 1–2 vCPU | 1–2 GB | 10 GB (SQLite/PostgreSQL) |
| Alertmanager | 0.1 vCPU | 64 MB | 0.5 vCPU | 256 MB | 1 GB |
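For a development machine, the useful number is usually the total across all components. The sketch below simply sums the Dev CPU and Dev RAM columns from the table; the dictionary layout is illustrative only.

```python
# (vCPU, RAM in MB) per component, copied from the Dev columns of the table above.
DEV_REQUIREMENTS = {
    "OTel Collector": (0.5, 256),
    "Loki": (0.5, 512),
    "Prometheus": (0.5, 512),
    "Tempo": (0.5, 512),
    "Grafana": (0.25, 256),
    "Alertmanager": (0.1, 64),
}

total_cpu = sum(cpu for cpu, _ in DEV_REQUIREMENTS.values())
total_ram_mb = sum(ram for _, ram in DEV_REQUIREMENTS.values())

print(f"{total_cpu:.2f} vCPU, {total_ram_mb} MB RAM")  # 2.35 vCPU, 2112 MB RAM
```

These totals cover the observability stack only; the application being observed needs its own headroom on top.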
Cost Reduction Strategies
- Log level filtering: set the production log level to Information or Warning to reduce Debug noise. A single log-level change from Debug to Information typically reduces log volume by 60–80%.
- Trace sampling: in production, sample 100% of error traces and 5–10% of success traces. This reduces trace storage by ~90% with minimal loss of diagnostic value.
- Metric cardinality control: avoid per-user or per-execution labels in metrics. Keep tenant count reasonable (thousands, not millions).
- Object storage lifecycle rules: configure S3 lifecycle rules to transition Loki chunks to S3 Intelligent-Tiering after 30 days and to Glacier after 90 days. This reduces the cost of archival data by 60–80%.
- Prometheus remote write to Thanos: once local Prometheus retention exceeds 90 days, use Thanos for long-term storage at object storage cost rather than SSD cost.
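The trace sampling policy in the list above can be illustrated with a small sketch showing the keep/drop decision and the fraction of traces it retains. The function names and the 2% error rate are hypothetical. Because a trace's error status is generally only known once it completes, a real deployment would apply this kind of rule after the fact (tail-based sampling) rather than in application code.

```python
import random


def keep_trace(is_error: bool, success_sample_rate: float, rng: random.Random) -> bool:
    """Keep every error trace; keep success traces with the given probability."""
    if is_error:
        return True
    return rng.random() < success_sample_rate


def expected_kept_fraction(error_rate: float, success_sample_rate: float) -> float:
    """Expected share of all traces retained under the policy above."""
    return error_rate + (1.0 - error_rate) * success_sample_rate


# Hypothetical workload: 2% of traces contain an error, 8% success sampling.
kept = expected_kept_fraction(error_rate=0.02, success_sample_rate=0.08)
print(f"kept {kept:.1%}, reduction {1 - kept:.1%}")  # kept 9.8%, reduction 90.2%
```

The reduction depends on the error rate: a workload with a high share of failing traces will retain proportionally more data than the ~90% figure suggests.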