Prometheus — Default Architecture

The Pull Model

Unlike Loki and Tempo, which receive telemetry pushed from the OTel Collector, Prometheus pulls metrics from its targets. Each BizFirst service exposes a /metrics endpoint in Prometheus text format (or OpenMetrics format), and Prometheus scrapes it on a configurable interval.

This pull model has important properties:

Prometheus controls the scrape rate — no service can overload it with metric pushes
If a service goes down, the next scrape fails and Prometheus records `up` as 0 for that target, so the outage is visible within one scrape interval
Service discovery is flexible — Prometheus can discover targets from Kubernetes, Consul, DNS, or static config
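
The scrape cycle described above can be sketched in a few lines. This is a minimal illustration, not how Prometheus itself is implemented: the `parse_exposition` and `scrape` helpers are hypothetical names, the parser handles only simple `series value` lines (no timestamps, no escaping), and the HTTP request is replaced by an injected `fetch` callable so the sketch runs without a network.

```python
from typing import Callable, Dict


def parse_exposition(text: str) -> Dict[str, float]:
    """Parse a tiny subset of the Prometheus text format: `series value` lines."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and `# HELP` / `# TYPE` comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def scrape(fetch: Callable[[], str]) -> Dict[str, float]:
    """Run one scrape. `fetch` stands in for an HTTP GET of /metrics (assumption)."""
    try:
        samples = parse_exposition(fetch())
        samples["up"] = 1.0  # the scrape succeeded
    except Exception:
        # Any failure (refused connection, timeout, bad payload) marks the target down.
        samples = {"up": 0.0}
    return samples
```

Because the scraper decides when to call `fetch`, the target cannot raise the ingestion rate, and a failed call is itself the health signal.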

BizFirst Metric Types

Type	Description	BizFirst Example	Use Case
Counter	Only increases; resets to zero on restart	`bizfirst_workflow_executions_total`	Rate calculations: `rate(bizfirst_workflow_executions_total[5m])`
Gauge	Can go up or down	`bizfirst_hil_pending_count`	Current state: backlogs, connection counts
Histogram	Distribution with configurable buckets	`bizfirst_node_execution_duration_seconds`	Latency percentiles: `histogram_quantile(0.99, ...)`
Summary	Quantiles pre-calculated client-side; cannot be aggregated across instances	Rarely used in BizFirst	When accurate per-instance quantiles matter more than aggregation
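
To see why a counter reset on restart does not break rate calculations, the following sketch computes a per-second rate from raw counter samples while compensating for resets. It is a simplification under stated assumptions: the function names are hypothetical, and the real `rate()` function additionally extrapolates to the edges of the range window, which is omitted here.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def increase_with_resets(samples: List[Sample]) -> float:
    """Total increase across the samples, treating any drop as a counter reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur < prev:
            # The counter restarted from zero, so `cur` is the increase since the reset.
            total += cur
        else:
            total += cur - prev
    return total


def simple_rate(samples: List[Sample]) -> float:
    """Per-second rate between the first and last sample (no extrapolation)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    elapsed = samples[-1][0] - samples[0][0]
    return increase_with_resets(samples) / elapsed
```

For example, the samples `[(0, 100), (15, 130), (30, 5), (45, 35)]` contain a restart between the second and third scrape; the sketch counts increases of 30, 5, and 30 rather than a negative jump.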

Key BizFirst Metrics

# Workflow executions (counter, labeled by tenant and outcome)
bizfirst_workflow_executions_total{tenant_id="t123", status="success"}
bizfirst_workflow_executions_total{tenant_id="t123", status="failed"}
bizfirst_workflow_executions_total{tenant_id="t123", status="timeout"}

# Node execution latency histogram
bizfirst_node_execution_duration_seconds_bucket{node_type="ApprovalNode", tenant_id="t123", le="0.1"}
bizfirst_node_execution_duration_seconds_bucket{node_type="ApprovalNode", tenant_id="t123", le="1.0"}
bizfirst_node_execution_duration_seconds_bucket{node_type="ApprovalNode", tenant_id="t123", le="5.0"}
bizfirst_node_execution_duration_seconds_count{node_type="ApprovalNode", tenant_id="t123"}
bizfirst_node_execution_duration_seconds_sum{node_type="ApprovalNode", tenant_id="t123"}

# HIL metrics
bizfirst_hil_pending_count{tenant_id="t123"}          # Current backlog
bizfirst_hil_suspension_duration_seconds_bucket{...}  # Wait time distribution

# EdgeStream throughput
bizfirst_edgestream_messages_total{topic="workflow.events", tenant_id="t123"}

# Active connections
bizfirst_active_connections{service="flow-studio-signalr"}
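
The `_bucket`, `_count`, and `_sum` series listed above all come from one histogram. The sketch below shows how observations turn into cumulative `le` buckets. It is an illustrative stand-in for a client library, not the BizFirst instrumentation: the `Histogram` class is hypothetical, and the `node_type` and `tenant_id` labels are left out to keep it short.

```python
from bisect import bisect_left
from typing import List


class Histogram:
    """Minimal histogram with cumulative `le` buckets (illustrative only)."""

    def __init__(self, name: str, bounds: List[float]) -> None:
        self.name = name
        self.bounds = sorted(bounds)
        # One slot per finite bound plus a final slot for the +Inf bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0

    def observe(self, value: float) -> None:
        # bisect_left returns the first bound >= value, i.e. the smallest
        # bucket whose upper limit (le) still includes the observation.
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += value

    def render(self) -> List[str]:
        lines = []
        running = 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            running += count  # buckets are cumulative: each includes all lower ones
            le = "+Inf" if bound == float("inf") else str(bound)
            lines.append(f'{self.name}_bucket{{le="{le}"}} {running}')
        lines.append(f"{self.name}_count {running}")
        lines.append(f"{self.name}_sum {self.total}")
        return lines
```

An observation equal to a bound lands in that bound's bucket, because `le` means "less than or equal to". The `+Inf` bucket always equals `_count`.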

PromQL Quick Reference

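# Error ratio over 5 minutes (failed executions / all executions)
# Both sides are aggregated with sum(); without it the division matches series
# label-by-label and only ever compares failed series against themselves.
sum(rate(bizfirst_workflow_executions_total{status="failed"}[5m]))
  /
sum(rate(bizfirst_workflow_executions_total[5m]))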

# P99 node execution latency by node type
histogram_quantile(0.99,
  sum(rate(bizfirst_node_execution_duration_seconds_bucket[5m])) by (node_type, le)
)

# HIL backlog across all tenants
sum(bizfirst_hil_pending_count)

# EdgeStream throughput (messages per second)
rate(bizfirst_edgestream_messages_total[1m])
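
The P99 query above estimates a percentile from bucket counts rather than from raw observations. The following sketch mirrors the documented behavior of `histogram_quantile` for classic histograms: linear interpolation inside the bucket that contains the target rank, a lower bound of 0 for the first bucket, and the highest finite bound when the rank falls in the `+Inf` bucket. The function is a simplified re-implementation for illustration, restricted to `0 < q <= 1`, and the bucket counts in the usage note are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by bound, contain at least one finite bucket,
    and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch supports 0 < q <= 1 only")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank lies beyond the last finite bucket, so the best
                # available answer is that bucket's upper bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-2][0]
```

With hypothetical cumulative counts `[(0.1, 10), (1.0, 60), (5.0, 98), (math.inf, 100)]`, the median interpolates inside the `le="1.0"` bucket, while the 0.99 quantile falls in the `+Inf` bucket and is reported as 5.0. Bucket boundaries therefore limit how precisely high percentiles can be read.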

Deep Dive Available

For the complete Prometheus reference — scrape configuration, recording rules, Alertmanager setup, and the full BizFirst metrics catalog — see Guide4: Prometheus.
