Recording Rules — Prometheus Metrics

Why Recording Rules?

Consider a Grafana dashboard showing P99 workflow latency over the past 7 days at 1-minute resolution. Without a recording rule, each dashboard refresh evaluates histogram_quantile(0.99, rate(...[5m])) at every one of the 10,080 steps in that range, across every bucket series, which can mean reading millions of samples per refresh. With a recording rule, Prometheus pre-computes the expression every 15 seconds and stores the result as an ordinary time series, so the dashboard only has to read back stored values and typically loads far faster.

BizFirst Recording Rules

# recording-rules.yml
groups:
  - name: bizfirst.workflow.rules
    interval: 15s
    rules:

      # Pre-compute workflow error rate per tenant
      - record: bizfirst:workflow_error_rate:5m
        expr: |
          sum(rate(bizfirst_workflow_executions_total{status="failed"}[5m])) by (tenant_id)
          /
          sum(rate(bizfirst_workflow_executions_total[5m])) by (tenant_id)

      # Pre-compute workflow throughput per tenant (executions per second)
      - record: bizfirst:workflow_throughput:5m
        expr: |
          sum(rate(bizfirst_workflow_executions_total[5m])) by (tenant_id)

      # Pre-compute P99 node latency by node type
      - record: bizfirst:node_p99_latency:5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(bizfirst_node_execution_duration_seconds_bucket[5m])) by (node_type, le)
          )

      # Pre-compute P50 node latency by node type
      - record: bizfirst:node_p50_latency:5m
        expr: |
          histogram_quantile(0.50,
            sum(rate(bizfirst_node_execution_duration_seconds_bucket[5m])) by (node_type, le)
          )

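      # HIL SLA: fraction (0 to 1) of tasks resolved within 24 hours.
      # Requires a histogram bucket boundary at exactly 86400 seconds; the le value
      # must match the exposed label string (some clients expose "86400.0").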
      - record: bizfirst:hil_sla_compliance:24h
        expr: |
          sum(rate(bizfirst_hil_suspension_duration_seconds_bucket{le="86400"}[24h])) by (tenant_id)
          /
          sum(rate(bizfirst_hil_suspension_duration_seconds_count[24h])) by (tenant_id)

      # EdgeStream message throughput per topic
      - record: bizfirst:edgestream_throughput:1m
        expr: |
          sum(rate(bizfirst_edgestream_messages_total[1m])) by (topic)

      # Octopus token consumption in tokens per minute, by tenant, model, and type
      - record: bizfirst:octopus_tokens_per_minute:5m
        expr: |
          sum(rate(bizfirst_octopus_tokens_total[5m])) by (tenant_id, model, type) * 60
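
The rules above can be checked before deployment with a promtool unit test. The following is a minimal sketch rather than part of the BizFirst configuration: the test file name, the tenant_id value "acme", and the status value "success" are illustrative assumptions, and the test file is assumed to sit in the same directory as recording-rules.yml. Failed and successful executions grow at the same rate, so the recorded error rate should be 0.5.

```yaml
# recording-rules-test.yml (hypothetical file name)
rule_files:
  - recording-rules.yml

evaluation_interval: 15s

tests:
  - interval: 15s
    input_series:
      # Both counters increase by 1 every 15s, so half of all executions fail.
      - series: 'bizfirst_workflow_executions_total{status="failed", tenant_id="acme"}'
        values: '0+1x40'
      - series: 'bizfirst_workflow_executions_total{status="success", tenant_id="acme"}'
        values: '0+1x40'
    promql_expr_test:
      # Query the recorded series, not the raw expression.
      - expr: bizfirst:workflow_error_rate:5m
        eval_time: 10m
        exp_samples:
          - labels: 'bizfirst:workflow_error_rate:5m{tenant_id="acme"}'
            value: 0.5
```

Run it with: promtool test rules recording-rules-test.yml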

Using Recording Rules in Dashboards

Replace expensive raw PromQL with the pre-computed recording rule name in Grafana panels:

# Instead of this (slow, runs on every dashboard refresh):
histogram_quantile(0.99,
  sum(rate(bizfirst_node_execution_duration_seconds_bucket[5m])) by (node_type, le)
)

# Use this (fast, pre-computed every 15s):
bizfirst:node_p99_latency:5m

# Filter by node type using the pre-computed series:
bizfirst:node_p99_latency:5m{node_type="DataFetchNode"}

Recording Rule Naming Convention

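BizFirst names recording rules with a colon-separated scheme adapted from the Prometheus convention of level:metric:operations. The namespace comes first and the rate window comes last: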

# Format: <namespace>:<metric>:<aggregation_period>
# Examples:
bizfirst:workflow_error_rate:5m      # 5-minute window
bizfirst:node_p99_latency:5m         # P99 over a 5-minute rate window
bizfirst:hil_sla_compliance:24h      # 24-hour SLA metric
bizfirst:edgestream_throughput:1m    # 1-minute throughput

Record Before Alerting on Expensive Expressions

Any alert rule that uses histogram_quantile() or multi-level aggregations should be backed by a recording rule. Alert rule evaluation and dashboard queries share the same Prometheus query engine, so an expensive alert expression that is re-evaluated every 15 seconds consumes resources that would otherwise serve dashboard queries. Pre-compute the expression once and alert on the recording rule result instead.
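
As an illustration, an alert backed by the bizfirst:node_p99_latency:5m recording rule might look like the sketch below. The file name, group name, alert name, 2-second threshold, 10-minute hold period, and severity label are all assumptions chosen for the example, not BizFirst defaults.

```yaml
# alert-rules.yml (hypothetical file name)
groups:
  - name: bizfirst.workflow.alerts          # hypothetical group name
    rules:
      - alert: BizFirstNodeP99LatencyHigh   # hypothetical alert name
        # Reads the pre-computed series instead of re-running histogram_quantile().
        # The 2-second threshold is illustrative only.
        expr: bizfirst:node_p99_latency:5m > 2
        # Fire only if the condition holds continuously for 10 minutes.
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency for {{ $labels.node_type }} is above 2s"
          description: "Current P99: {{ $value | printf \"%.2f\" }}s"
```

Because the recorded series keeps the node_type label, the alert fires once per node type and the annotations can name the affected one.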
