Sizing Guidelines — Physical Architecture

Log Volume Estimation (Loki)

Log volume is the dominant storage cost in most BizFirst deployments. The primary driver is the number of node executions per day, since each node produces structured log entries at start, end, and on any exception.

Scenario	Executions/Day	Avg Nodes/Workflow	Log Volume/Day (compressed)	30-Day Storage (compressed)
Development	100	5	~500 MB	~15 GB
Small production	1,000	8	~4 GB	~120 GB
Medium production	10,000	10	~40 GB	~1.2 TB
Large production	100,000	12	~400 GB	~12 TB
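
The table rows can be cross-checked with a short calculation. The sketch below is illustrative only: the function names, and the idea of a single average log size per node execution, are assumptions rather than anything BizFirst provides. It backs out the per-node-execution log size implied by one table row and projects storage for a retention window.

```python
GB = 1000 ** 3  # decimal gigabyte, matching the rounded figures in the table


def implied_bytes_per_node_execution(executions_per_day, avg_nodes, log_bytes_per_day):
    """Average stored log bytes per node execution implied by one table row."""
    return log_bytes_per_day / (executions_per_day * avg_nodes)


def retention_storage_bytes(log_bytes_per_day, retention_days=30):
    """Storage needed to keep `retention_days` of logs at a constant daily volume."""
    return log_bytes_per_day * retention_days


# Medium production row: 10,000 executions/day, 10 nodes/workflow, ~40 GB/day.
per_node = implied_bytes_per_node_execution(10_000, 10, 40 * GB)
storage = retention_storage_bytes(40 * GB)

print(f"implied size per node execution: {per_node / 1000:.0f} KB")
print(f"30-day storage: {storage / 1000 ** 4:.1f} TB")
```

The implied size differs between rows (about 1 MB per node execution in Development, about 333 KB in Large production), so the table is best read as rough planning figures rather than a strict formula.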

Loki Compression Factor

Loki compresses log chunks with Snappy. JSON-structured logs from BizFirst typically achieve 5–8x compression. The storage figures in the table above are post-compression; raw log volume before compression is approximately 5–8x higher.
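
Raw volume matters when sizing ingestion bandwidth, so it can help to convert a compressed figure back to a raw range. This is a minimal sketch; `raw_volume_range` is a hypothetical helper, and the 5–8x ratio is simply the range quoted above.

```python
def raw_volume_range(compressed_gb, min_ratio=5.0, max_ratio=8.0):
    """Return (low, high) raw log volume in GB for a compressed volume."""
    return compressed_gb * min_ratio, compressed_gb * max_ratio


# Medium production: ~40 GB/day after compression.
low, high = raw_volume_range(40)
print(f"raw volume: {low:.0f}-{high:.0f} GB/day")
```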

Metric Storage Estimation (Prometheus)

Prometheus storage is driven by metric cardinality: the total number of unique time series, where each distinct combination of metric name and label values counts as one series. BizFirst's default metric set has moderate cardinality.

Metric Family	Series Count (estimate)	Note
`bizfirst_workflow_executions_total`	N_tenants × 3 statuses	3 per tenant
`bizfirst_node_execution_duration_seconds`	N_tenants × N_node_types × 12 histogram buckets	~120 per tenant with 10 node types
`bizfirst_hil_pending_count`	N_tenants	1 per tenant
Infrastructure (Node Exporter)	~800 per host	Fixed overhead
Infrastructure (cAdvisor)	~200 per container	Scales with container count
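
The per-family formulas in the table can be combined into a single estimate. The sketch below assumes the table's constants (3 statuses, 12 histogram buckets, ~800 series per host, ~200 per container); the function name and the example deployment are hypothetical.

```python
def estimate_series(tenants, node_types, hosts, containers, histogram_buckets=12):
    """Estimate active time series using the per-family formulas from the table."""
    workflow = tenants * 3                                     # bizfirst_workflow_executions_total
    node_duration = tenants * node_types * histogram_buckets   # bizfirst_node_execution_duration_seconds
    hil_pending = tenants                                      # bizfirst_hil_pending_count
    node_exporter = 800 * hosts                                # fixed overhead per host
    cadvisor = 200 * containers                                # scales with container count
    return workflow + node_duration + hil_pending + node_exporter + cadvisor


# Hypothetical deployment: 50 tenants, 10 node types, 5 hosts, 40 containers.
total = estimate_series(tenants=50, node_types=10, hosts=5, containers=40)
print(f"estimated series: {total:,}")
```

In this example the infrastructure exporters account for most of the series, which is common when the tenant count is small.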

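Prometheus requires approximately 2 bytes per sample on disk (the Prometheus documentation cites an average of 1–2 bytes, so 2 is a conservative figure). At a 15-second scrape interval, each series generates 86,400 / 15 = 5,760 samples/day. For 10,000 series and 90-day retention: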

10,000 series × 5,760 samples/day × 90 days × 2 bytes = ~10 GB
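
The same arithmetic as a reusable function, for trying other series counts, scrape intervals, or retention periods. This is a sketch: the function name is hypothetical, and the result covers sample data only, without headroom for the WAL, indexes, or compaction.

```python
SECONDS_PER_DAY = 86_400


def prometheus_disk_bytes(series, retention_days, scrape_interval_s=15, bytes_per_sample=2):
    """Approximate on-disk size of sample data for a given series count and retention."""
    samples_per_day = SECONDS_PER_DAY / scrape_interval_s
    return series * samples_per_day * retention_days * bytes_per_sample


size = prometheus_disk_bytes(series=10_000, retention_days=90)
print(f"{size / 1e9:.1f} GB")
```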

Trace Storage Estimation (Tempo)

Trace storage is driven by trace volume, the number of spans per trace, and average span size. Tempo stores trace blocks in the columnar Apache Parquet format, which compresses trace data efficiently.

Scenario	Traces/Day	Avg Spans/Trace	Trace Storage/Day	7-Day Storage
Development	1,000	10	~50 MB	~350 MB
Small production	10,000	15	~750 MB	~5 GB
Medium production	100,000	20	~10 GB	~70 GB
Large production	1,000,000	20	~100 GB	~700 GB
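
Every row in the table works out to roughly 5 KB of stored data per span (for example, 10 GB/day ÷ (100,000 traces × 20 spans)). The sketch below uses that back-calculated figure as an assumption; the function name and constant are hypothetical, and real span sizes depend on attribute count and payload.

```python
BYTES_PER_SPAN = 5_000  # assumption: back-calculated from the table, not a Tempo constant


def trace_storage_bytes(traces_per_day, avg_spans, retention_days=7,
                        bytes_per_span=BYTES_PER_SPAN):
    """Approximate stored trace bytes over a retention window."""
    return traces_per_day * avg_spans * bytes_per_span * retention_days


# Medium production row: 100,000 traces/day, 20 spans/trace, 7-day retention.
medium = trace_storage_bytes(100_000, 20)
print(f"7-day storage: {medium / 1e9:.0f} GB")
```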

Component Resource Requirements

Component	Dev CPU	Dev RAM	Prod CPU	Prod RAM	Prod Disk
OTel Collector	0.5 vCPU	256 MB	2–4 vCPU	1–2 GB	None (stateless)
Loki	0.5 vCPU	512 MB	4–8 vCPU	8–16 GB	Local WAL only; data in S3
Prometheus	0.5 vCPU	512 MB	2–4 vCPU	8–16 GB	50–500 GB SSD
Tempo	0.5 vCPU	512 MB	4–8 vCPU	8–16 GB	Local WAL only; data in S3
Grafana	0.25 vCPU	256 MB	1–2 vCPU	1–2 GB	10 GB (SQLite/PostgreSQL)
Alertmanager	0.1 vCPU	64 MB	0.5 vCPU	256 MB	1 GB
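
For a first capacity estimate, the production ranges can be summed across components. The sketch below assumes one instance of each component and treats 256 MB as 0.25 GB; the dictionary layout and function name are illustrative only.

```python
# (min_cpu, max_cpu, min_ram_gb, max_ram_gb), copied from the Prod columns above.
PROD = {
    "OTel Collector": (2, 4, 1, 2),
    "Loki": (4, 8, 8, 16),
    "Prometheus": (2, 4, 8, 16),
    "Tempo": (4, 8, 8, 16),
    "Grafana": (1, 2, 1, 2),
    "Alertmanager": (0.5, 0.5, 0.25, 0.25),
}


def totals(components):
    """Sum each column across all components."""
    return tuple(sum(column) for column in zip(*components.values()))


min_cpu, max_cpu, min_ram, max_ram = totals(PROD)
print(f"CPU: {min_cpu}-{max_cpu} vCPU, RAM: {min_ram}-{max_ram} GB")
```

Running replicas for high availability multiplies the relevant rows, so these totals are a lower bound for such setups.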

Cost Reduction Strategies

Log level filtering: set the production log level to Information or Warning to cut Debug noise. Moving from Debug to Information alone typically reduces log volume by 60–80%.
Trace sampling: in production, sample 100% of error traces and 5–10% of success traces. This reduces trace storage by ~90% with minimal loss of diagnostic value.
Metric cardinality control: avoid per-user or per-execution labels in metrics. Keep tenant count reasonable (thousands, not millions).
Object storage lifecycle rules: configure S3 lifecycle rules to transition Loki chunks to S3 Intelligent-Tiering after 30 days and to Glacier after 90 days. This cuts the cost of archival data by 60–80%.
Prometheus remote write to Thanos: once the required metric retention exceeds 90 days, use Thanos for long-term storage so that older data is kept at object storage cost rather than SSD cost.
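
The ~90% figure for trace sampling can be checked with a short calculation. The sketch below assumes error and success traces are similar in size and uses a hypothetical 2% error rate; the function name is illustrative. Keeping every error trace generally means the sampling decision is made after the trace completes (tail-based sampling).

```python
def retained_trace_fraction(error_rate, success_sample_rate, error_sample_rate=1.0):
    """Fraction of all traces kept when errors and successes are sampled separately."""
    kept_errors = error_rate * error_sample_rate
    kept_successes = (1 - error_rate) * success_sample_rate
    return kept_errors + kept_successes


# Hypothetical workload: 2% of traces contain an error, 8% of successes are kept.
kept = retained_trace_fraction(error_rate=0.02, success_sample_rate=0.08)
print(f"kept {kept:.1%} of traces, reduction {1 - kept:.1%}")
```

With a higher error rate the saving shrinks, since all error traces are retained.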
