BizFirst Observe
Prometheus
Prometheus is the metrics collection and storage component of the default stack. It uses a pull model — scraping /metrics endpoints every 15 seconds — and stores time series in its embedded TSDB. PromQL enables rich aggregation and alerting queries.
The Pull Model
Unlike Loki and Tempo, which receive pushed telemetry from the OTel Collector, Prometheus actively pulls metrics from its targets. Each BizFirstGO service exposes a /metrics endpoint in Prometheus text format (or OpenMetrics format), and Prometheus scrapes it on a configurable interval.
This pull model has important properties:
- Prometheus controls the scrape rate — no service can overload it with metric pushes
- If a service goes down, Prometheus detects it at the next scrape: the scrape fails and the target's built-in up metric drops to 0
- Service discovery is flexible — Prometheus can discover targets from Kubernetes, Consul, DNS, or static config
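To make the pull model concrete, here is a minimal sketch of the text a /metrics endpoint might return for one counter. It uses only the Python standard library. The function names (`render_counter`, `escape_label_value`) and the help string are illustrative assumptions, not part of BizFirstGO or any Prometheus client library; a real service would normally rely on an official client library to produce this output.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline, as the text format requires."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs. Hypothetical helper,
    written for illustration only.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        # Sort labels so the output is stable from one scrape to the next.
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


body = render_counter(
    "bizfirst_workflow_executions_total",
    "Workflow executions.",  # assumed help text
    [
        ({"tenant_id": "t123", "status": "success"}, 42),
        ({"tenant_id": "t123", "status": "failed"}, 3),
    ],
)
print(body)
```

The service only has to answer with this text when asked. Prometheus decides when to ask, which is why a busy service cannot flood it.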
BizFirstGO Metric Types
| Type | Description | BizFirstGO Example | Use Case |
|---|---|---|---|
| Counter | Monotonically increasing; resets to zero on restart | bizfirst_workflow_executions_total | Rate calculations: rate(bizfirst_workflow_executions_total[5m]) |
| Gauge | Can go up or down | bizfirst_hil_pending_count | Current state: backlogs, connection counts |
| Histogram | Distribution with configurable buckets | bizfirst_node_execution_duration_seconds | Latency percentiles: histogram_quantile(0.99, ...) |
| Summary | Quantiles pre-calculated on the client; cannot be aggregated across instances | Rarely used in BizFirstGO | When accurate per-instance percentiles are needed without choosing buckets |
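The "resets on restart" property of counters is why they are queried through rate() rather than read directly. The sketch below is a simplified illustration of reset-aware rate calculation over raw samples. The function name `simple_rate` is an assumption for this example, and the real PromQL rate() additionally extrapolates to the edges of the range window, which is omitted here.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a list of samples.

    samples is a list of (timestamp_seconds, counter_value), oldest first.
    Simplified sketch: unlike PromQL's rate(), no extrapolation is applied.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples

    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # The counter went backwards, so the process restarted and the
            # counter began again from zero: the whole value is new increase.
            increase += value
        else:
            increase += value - previous
        previous = value

    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Four scrapes 15 seconds apart, with a restart between the second and third.
scrapes = [(0, 100), (15, 130), (30, 10), (45, 40)]
print(simple_rate(scrapes))
```

In this example the increase is 30 + 10 + 30 = 70 over 45 seconds. Subtracting the last value from the first would instead give a negative number, which is the failure mode that reset handling avoids.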
Key BizFirstGO Metrics
# Workflow executions (counter, labeled by tenant and outcome)
bizfirst_workflow_executions_total{tenant_id="t123", status="success"}
bizfirst_workflow_executions_total{tenant_id="t123", status="failed"}
bizfirst_workflow_executions_total{tenant_id="t123", status="timeout"}
# Node execution latency histogram
bizfirst_node_execution_duration_seconds_bucket{node_type="ApprovalNode", tenant_id="t123", le="0.1"}
bizfirst_node_execution_duration_seconds_bucket{node_type="ApprovalNode", tenant_id="t123", le="1.0"}
bizfirst_node_execution_duration_seconds_bucket{node_type="ApprovalNode", tenant_id="t123", le="5.0"}
bizfirst_node_execution_duration_seconds_count{node_type="ApprovalNode", tenant_id="t123"}
bizfirst_node_execution_duration_seconds_sum{node_type="ApprovalNode", tenant_id="t123"}
# HIL metrics
bizfirst_hil_pending_count{tenant_id="t123"} # Current backlog
bizfirst_hil_suspension_duration_seconds_bucket{...} # Wait time distribution
# EdgeStream throughput
bizfirst_edgestream_messages_total{topic="workflow.events", tenant_id="t123"}
# Active connections
bizfirst_active_connections{service="flow-studio-signalr"}
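The _bucket, _count, and _sum series listed for the latency histogram are easier to read once it is clear that buckets are cumulative: each le bucket counts every observation less than or equal to its bound. The following is a minimal sketch of that bookkeeping, assuming the bucket bounds 0.1, 1.0, and 5.0 from the listing. The `Histogram` class here is a hypothetical illustration, not the client library BizFirstGO uses.

```python
import math


class Histogram:
    """Toy histogram that mirrors the _bucket / _count / _sum series."""

    def __init__(self, upper_bounds):
        # Prometheus histograms always end with an implicit +Inf bucket.
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.per_bucket = [0] * len(self.bounds)  # non-cumulative counts
        self.count = 0
        self.sum = 0.0

    def observe(self, value):
        # Record the value in the first bucket whose bound it does not exceed.
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.per_bucket[i] += 1
                break
        self.count += 1
        self.sum += value

    def cumulative_buckets(self):
        """Return (le, cumulative_count) pairs, as exposed on /metrics."""
        running = 0
        result = []
        for bound, n in zip(self.bounds, self.per_bucket):
            running += n
            result.append((bound, running))
        return result


h = Histogram([0.1, 1.0, 5.0])
for seconds in (0.05, 0.4, 0.7, 3.0, 12.0):
    h.observe(seconds)

for le, n in h.cumulative_buckets():
    print(f'le="{le}" {n}')
print("count", h.count, "sum", h.sum)
```

The +Inf bucket always equals _count, so the 12-second observation is counted even though it exceeds every configured bound.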
PromQL Quick Reference
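# Error ratio over 5 minutes (failed executions divided by all executions)
# sum() is needed on both sides: without it the division matches series
# label-for-label, so only failed/failed pairs remain and the result is always 1
sum(rate(bizfirst_workflow_executions_total{status="failed"}[5m]))
/
sum(rate(bizfirst_workflow_executions_total[5m]))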
# P99 node execution latency by node type
histogram_quantile(0.99,
sum(rate(bizfirst_node_execution_duration_seconds_bucket[5m])) by (node_type, le)
)
# HIL backlog across all tenants
sum(bizfirst_hil_pending_count)
# EdgeStream throughput (messages per second)
rate(bizfirst_edgestream_messages_total[1m])
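The P99 query above relies on histogram_quantile, which estimates a quantile by linear interpolation inside the bucket that contains the target rank. The sketch below reproduces that idea in plain Python for classic histograms. It is an approximation written for illustration, the bucket counts are made-up example data, and edge cases that the real function handles (such as non-monotonic buckets) are left out.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets is a list of (upper_bound, cumulative_count), sorted by bound and
    ending with the +Inf bucket. Simplified sketch of the PromQL function.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only supports 0 < q <= 1")
    if not buckets or buckets[-1][0] != math.inf:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, cum) in enumerate(buckets) if cum >= rank)
    upper, cum = buckets[index]

    if upper == math.inf:
        # The quantile lies beyond the highest finite bound, so the best
        # available answer is that bound itself.
        return buckets[-2][0] if len(buckets) > 1 else math.nan

    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (cum - below)


# Made-up example data: 100 observations across the three configured bounds.
example = [(0.1, 10), (1.0, 60), (5.0, 95), (math.inf, 100)]
print(histogram_quantile(0.5, example))
print(histogram_quantile(0.99, example))
```

With this example data the P99 falls in the +Inf bucket, so the estimate is capped at the highest finite bound of 5.0. That is a practical reason to choose bucket bounds that cover the latencies the service is expected to produce.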
Deep Dive Available
For the complete Prometheus reference — scrape configuration, recording rules, Alertmanager setup, and the full BizFirstGO metrics catalog — see Guide4: Prometheus.