Enterprise Options Overview

When to Upgrade from the Default Stack

Signal	Threshold	Recommended Upgrade
Log ingestion rate	> 10 GB/day	Loki distributed mode (microservices)
Metric retention needed	> 90 days	Thanos for long-term storage
Prometheus HA requirement	Zero downtime during scraping	Thanos + multiple Prometheus replicas
Grafana SSO required	Corporate directory integration needed	Grafana Enterprise (SAML/OIDC)
Tenant count	> 50 tenants	Loki microservices + Kubernetes deployment
Availability SLA	99.9% uptime required	Full Kubernetes HA deployment
Multi-region	Teams in > 1 AWS region	Cross-region Prometheus federation
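
The table above can be read as a simple checklist. The sketch below encodes it as a function, purely for illustration: the `StackSignals` fields and the `recommend_upgrades` name are hypothetical, and treating "99.9% uptime required" as `>= 99.9` is an assumption rather than something the table states.

```python
from dataclasses import dataclass


@dataclass
class StackSignals:
    """Observed requirements for one deployment (hypothetical field names)."""
    log_gb_per_day: float = 0.0
    metric_retention_days: int = 0
    prometheus_ha_required: bool = False
    grafana_sso_required: bool = False
    tenant_count: int = 0
    availability_sla_percent: float = 0.0
    aws_region_count: int = 1


def recommend_upgrades(s: StackSignals) -> list[str]:
    """Return the upgrades from the table whose threshold is crossed."""
    recs: list[str] = []
    if s.log_gb_per_day > 10:
        recs.append("Loki distributed mode (microservices)")
    if s.metric_retention_days > 90:
        recs.append("Thanos for long-term storage")
    if s.prometheus_ha_required:
        recs.append("Thanos + multiple Prometheus replicas")
    if s.grafana_sso_required:
        recs.append("Grafana Enterprise (SAML/OIDC)")
    if s.tenant_count > 50:
        recs.append("Loki microservices + Kubernetes deployment")
    # Assumption: an SLA of exactly 99.9% already counts as crossing the threshold.
    if s.availability_sla_percent >= 99.9:
        recs.append("Full Kubernetes HA deployment")
    if s.aws_region_count > 1:
        recs.append("Cross-region Prometheus federation")
    return recs
```

An empty result means the default stack is still sufficient. Each recommendation is independent, so a deployment that crosses only the log ingestion threshold upgrades Loki and nothing else.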

Enterprise Stack Decision Tree

Loki Distributed

Split Loki into separate querier, ingester, distributor, and compactor components so that each scales independently. Use this when log volume exceeds single-node capacity or when you need high availability on the write path.

Thanos

Add a Thanos Sidecar to each Prometheus instance to upload TSDB blocks to S3. Add Thanos Querier to run deduplicated queries across multiple Prometheus replicas. Required for metric retention beyond 90 days or for HA metrics.
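
To make "deduplicated" concrete: two Prometheus replicas scraping the same targets produce the same series twice, differing only in a replica label. The sketch below shows the idea in a deliberately simplified form. It is not Thanos's actual algorithm; it assumes both replicas report samples at identical timestamps (real replicas rarely do), and the `deduplicate` function and the `replica` label name are hypothetical.

```python
from collections import defaultdict

Labels = dict[str, str]
Sample = tuple[int, float]  # (timestamp in ms, value)


def deduplicate(series: list[tuple[Labels, list[Sample]]],
                replica_label: str = "replica") -> dict[tuple, list[Sample]]:
    """Merge series that differ only in the replica label.

    Simplification: samples are matched on exact timestamp, and the first
    replica seen wins when both have a sample for the same timestamp.
    """
    merged: dict[tuple, dict[int, float]] = defaultdict(dict)
    for labels, samples in series:
        # Identity of a series is its label set without the replica label.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        for ts, value in samples:
            merged[key].setdefault(ts, value)
    # Return samples ordered by timestamp for each merged series.
    return {key: sorted(by_ts.items()) for key, by_ts in merged.items()}
```

The useful property is that a gap in one replica (for example during a restart) is filled by the other, which is why HA metrics need more than one Prometheus replica behind the querier.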

Tempo HA

Run multiple Tempo ingesters backed by object storage. Required when you cannot afford trace ingestion downtime. This adds complexity, so justify it with a real SLA requirement.

Grafana Enterprise

Adds SSO (SAML, OIDC), query audit logging, data source caching, and reporting. Required for corporate identity integration and for a compliance audit log of who queried what.

Kubernetes Native

Deploy all components via Helm charts on Kubernetes. This enables horizontal pod autoscaling (HPA), pod disruption budgets (PDB), rolling updates, and GitOps. Required for production deployments with SLA commitments.
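
An SLA commitment is easier to weigh once it is converted into a downtime budget. The helper below is a minimal sketch of that arithmetic; the function name and the 30-day default period are assumptions chosen for illustration.

```python
def downtime_budget_minutes(sla_percent: float, period_days: float = 30.0) -> float:
    """Minutes of allowed downtime in a period for a given availability SLA."""
    if not 0.0 <= sla_percent <= 100.0:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - sla_percent / 100.0)
```

For a 99.9% SLA this works out to roughly 43 minutes per 30 days, or about 8.8 hours per year. If a single-node restart or upgrade regularly takes longer than that budget allows, the HA deployment is justified.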

Multi-Region

Use Prometheus federation for cross-region dashboards and Loki multi-cluster queries for global log search. Required when BizFirst is deployed in multiple AWS/Azure regions.

Start Simple, Scale When Needed

The default single-node stack handles most BizFirst deployments well. Upgrading to enterprise options adds operational complexity: more components to monitor and more configuration to maintain. Upgrade a specific component only when you have a concrete requirement (e.g., actual Prometheus downtime incidents) rather than preemptively scaling everything.
