When to Upgrade from the Default Stack

Signal | Threshold | Recommended Upgrade
Log ingestion rate | > 10 GB/day | Loki distributed mode (microservices)
Metric retention needed | > 90 days | Thanos for long-term storage
Prometheus HA requirement | Zero downtime during scraping | Thanos + multiple Prometheus replicas
Grafana SSO required | Corporate directory integration needed | Grafana Enterprise (SAML/OIDC)
Tenant count | > 50 tenants | Loki microservices + Kubernetes deployment
Availability SLA | 99.9% uptime required | Full Kubernetes HA deployment
Multi-region | Teams in > 1 AWS region | Cross-region Prometheus federation
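
The thresholds in the table above can be read as a simple checklist. The following is a minimal Python sketch of that checklist, not part of BizFirstGO itself: the `StackSignals` class, its field names, and the `recommended_upgrades` function are hypothetical names chosen for illustration, and the yes/no signals (HA, SSO, SLA) are modeled as booleans you set yourself.

```python
from dataclasses import dataclass


@dataclass
class StackSignals:
    """Observed signals for one deployment (hypothetical field names)."""
    log_gb_per_day: float = 0.0
    metric_retention_days: int = 0
    needs_prometheus_ha: bool = False
    needs_grafana_sso: bool = False
    tenant_count: int = 0
    needs_999_uptime: bool = False
    aws_region_count: int = 1


def recommended_upgrades(s: StackSignals) -> list[str]:
    """Return the upgrades from the table whose thresholds are exceeded."""
    upgrades = []
    if s.log_gb_per_day > 10:
        upgrades.append("Loki distributed mode (microservices)")
    if s.metric_retention_days > 90:
        upgrades.append("Thanos for long-term storage")
    if s.needs_prometheus_ha:
        upgrades.append("Thanos + multiple Prometheus replicas")
    if s.needs_grafana_sso:
        upgrades.append("Grafana Enterprise (SAML/OIDC)")
    if s.tenant_count > 50:
        upgrades.append("Loki microservices + Kubernetes deployment")
    if s.needs_999_uptime:
        upgrades.append("Full Kubernetes HA deployment")
    if s.aws_region_count > 1:
        upgrades.append("Cross-region Prometheus federation")
    return upgrades


# Example: 25 GB/day of logs but only 12 tenants triggers a single upgrade.
print(recommended_upgrades(StackSignals(log_gb_per_day=25, tenant_count=12)))
```

Each check is independent, which mirrors the guidance to upgrade individual components rather than the whole stack at once.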

Enterprise Stack Decision Tree

Loki Distributed

Split Loki into querier, ingester, distributor, and compactor components, each of which scales independently. Use when log volume exceeds single-node capacity or when you need HA on the write path.

Thanos

Add a Thanos Sidecar to each Prometheus instance to upload metric blocks to S3. Add Thanos Querier to run deduplicated queries across multiple Prometheus replicas. Required for metric retention beyond 90 days or for HA metrics.
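
To make "deduplicated" concrete: HA Prometheus replicas scrape the same targets, so every series arrives once per replica and differs only in a replica-identifying label. The sketch below is a simplified Python model of that idea and is not Thanos's actual algorithm; the label name `replica`, the `deduplicate` function, and the sample data are assumptions for illustration.

```python
def deduplicate(series, replica_label="replica"):
    """Collapse series that differ only in the replica label.

    `series` is a list of (labels, samples) pairs, where `labels` is a dict
    and `samples` is a list of (timestamp, value) tuples.
    Simplification: if several replicas report the same timestamp,
    the first value seen is kept.
    """
    merged = {}
    for labels, samples in series:
        # Identity of a series is its label set without the replica label.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        by_timestamp = merged.setdefault(key, {})
        for timestamp, value in samples:
            by_timestamp.setdefault(timestamp, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


# Replica "a" missed the scrape at t=30; replica "b" missed the one at t=0.
series = [
    ({"__name__": "up", "job": "api", "replica": "a"}, [(0, 1.0), (15, 1.0)]),
    ({"__name__": "up", "job": "api", "replica": "b"}, [(15, 1.0), (30, 1.0)]),
]
result = deduplicate(series)
```

In this example the two replica series merge into one series with samples at 0, 15, and 30, so a gap in either replica does not show up in the query result.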

Tempo HA

Run multiple Tempo ingesters with an object storage backend. Required when you cannot afford trace ingestion downtime. This adds complexity, so justify it with a real SLA requirement.

Grafana Enterprise

SSO (SAML, OIDC), query audit logging, data source caching, reporting. Required for corporate identity integration and for compliance audit logs of who queried what.

Kubernetes Native

Deploy all components via Helm charts on Kubernetes. Enables HPA, PDB, rolling updates, and GitOps. Required for production deployments with SLA commitments.

Multi-Region

Prometheus federation for cross-region dashboards. Loki multi-cluster query for global log search. Required when BizFirstGO is deployed in multiple AWS/Azure regions.
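
For context on the federation piece: a Prometheus server exposes a `/federate` endpoint that returns the current value of every series matching one or more `match[]` selectors, and a central Prometheus scrapes that endpoint in each region. The Python sketch below only builds such a URL so the request shape is visible; the `federate_url` helper, the hostname, and the selectors are hypothetical.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build a /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical regional Prometheus; pull only the series the global dashboards need.
url = federate_url(
    "http://prometheus.us-east-1.example.internal:9090/",
    ['{job="api"}', "up"],
)
print(url)
```

Keeping the selector list narrow matters in practice, since federating every series from every region tends to overload the central server.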

Start Simple, Scale When Needed

The default single-node stack handles most BizFirstGO deployments well. Upgrading to enterprise options adds operational complexity: more components to monitor and more configuration to maintain. Upgrade a specific component only when you have a concrete requirement (e.g., actual Prometheus downtime incidents) rather than preemptively scaling everything.