Recording Rules
Recording rules pre-compute expensive PromQL expressions and store the results as new time series. A dashboard panel that queries a recording rule reads a stored series instead of re-evaluating the expression against raw data on every refresh, so it loads much faster. Recording rules are also the foundation for multi-day SLA calculations.
Why Recording Rules?
Consider a Grafana dashboard showing the P99 workflow latency for the past 7 days at 1-minute resolution. Without a recording rule, each dashboard refresh evaluates histogram_quantile(0.99, rate(...[5m])) at roughly 10,080 steps (7 × 24 × 60) across every histogram bucket series, which can mean reading millions of samples. With a recording rule, Prometheus evaluates the expression once per rule interval (every 15 seconds in the configuration below) and stores the result as an ordinary time series, so the dashboard only has to read pre-computed points.
BizFirstGO Recording Rules
# recording-rules.yml
groups:
  - name: bizfirst.workflow.rules
    interval: 15s
    rules:
      # Pre-compute workflow error rate per tenant
      - record: bizfirst:workflow_error_rate:5m
        expr: |
          sum(rate(bizfirst_workflow_executions_total{status="failed"}[5m])) by (tenant_id)
          /
          sum(rate(bizfirst_workflow_executions_total[5m])) by (tenant_id)

      # Pre-compute total workflow throughput per tenant
      - record: bizfirst:workflow_throughput:5m
        expr: |
          sum(rate(bizfirst_workflow_executions_total[5m])) by (tenant_id)

      # Pre-compute P99 node latency by node type
      - record: bizfirst:node_p99_latency:5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(bizfirst_node_execution_duration_seconds_bucket[5m])) by (node_type, le)
          )

      # Pre-compute P50 node latency by node type
      - record: bizfirst:node_p50_latency:5m
        expr: |
          histogram_quantile(0.50,
            sum(rate(bizfirst_node_execution_duration_seconds_bucket[5m])) by (node_type, le)
          )

      # HIL SLA: fraction (0 to 1) of tasks resolved within 24 hours
      - record: bizfirst:hil_sla_compliance:24h
        expr: |
          sum(rate(bizfirst_hil_suspension_duration_seconds_bucket{le="86400"}[24h])) by (tenant_id)
          /
          sum(rate(bizfirst_hil_suspension_duration_seconds_count[24h])) by (tenant_id)

      # EdgeStream message throughput per topic
      - record: bizfirst:edgestream_throughput:1m
        expr: |
          sum(rate(bizfirst_edgestream_messages_total[1m])) by (topic)

      # Octopus token consumption rate (tokens per minute)
      - record: bizfirst:octopus_tokens_per_minute:5m
        expr: |
          sum(rate(bizfirst_octopus_tokens_total[5m])) by (tenant_id, model, type) * 60
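For Prometheus to evaluate these rules, the file has to be listed under rule_files in the main configuration. The fragment below is a minimal sketch. The rules/ directory and file location are assumptions for illustration, not taken from the BizFirstGO deployment.

```yaml
# prometheus.yml (fragment)
global:
  # Default interval for rule groups that do not set their own `interval`.
  evaluation_interval: 15s

rule_files:
  # Hypothetical path; adjust to wherever recording-rules.yml is deployed.
  - "rules/recording-rules.yml"
```

Running promtool check rules against the file before reloading Prometheus catches YAML and PromQL syntax errors early.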
Using Recording Rules in Dashboards
Replace expensive raw PromQL with the pre-computed recording rule name in Grafana panels:
# Instead of this (slow, runs on every dashboard refresh):
histogram_quantile(0.99,
  sum(rate(bizfirst_node_execution_duration_seconds_bucket[5m])) by (node_type, le)
)

# Use this (fast, pre-computed every 15s):
bizfirst:node_p99_latency:5m

# Filter by node type using the pre-computed series:
bizfirst:node_p99_latency:5m{node_type="DataFetchNode"}
Recording Rule Naming Convention
BizFirstGO recording rules use a three-part, colon-separated name in the style of the Prometheus level:metric:operations naming convention:
# Format: <namespace>:<metric>:<aggregation_period>
# Examples:
bizfirst:workflow_error_rate:5m # 5-minute window
bizfirst:node_p99_latency:5m # Pre-computed P99
bizfirst:hil_sla_compliance:24h # 24-hour SLA metric
bizfirst:edgestream_throughput:1m # 1-minute throughput
Any alert rule that uses histogram_quantile() or multi-level aggregations should be backed by a recording rule. Alert rule evaluation and dashboard queries share the same Prometheus query engine, so an alert rule that runs an expensive query every 15 seconds consumes resources that would otherwise serve dashboard queries. Pre-compute the expression once and alert on the recording rule result instead.
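As an illustration, alert rules backed by the recording rules defined earlier might look like the sketch below. The file name, group name, alert names, thresholds, `for` durations, and severity labels are all assumptions chosen for the example. Only the recorded series names and their tenant_id and node_type labels come from the rules above.

```yaml
# alert-rules.yml (hypothetical file name)
groups:
  - name: bizfirst.workflow.alerts          # hypothetical group name
    rules:
      # Alert on the pre-computed per-tenant error rate.
      - alert: WorkflowErrorRateHigh        # hypothetical alert name
        expr: bizfirst:workflow_error_rate:5m > 0.05   # assumed threshold: 5%
        for: 10m                            # assumed hold duration
        labels:
          severity: warning
        annotations:
          summary: "Workflow error rate above 5% for tenant {{ $labels.tenant_id }}"

      # Alert on the pre-computed P99 latency instead of histogram_quantile().
      - alert: NodeP99LatencyHigh           # hypothetical alert name
        expr: bizfirst:node_p99_latency:5m > 2         # assumed threshold: 2 seconds
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2s for node type {{ $labels.node_type }}"
```

Each alert expression is a plain comparison on a stored series, so the expensive aggregation runs once in the recording rule rather than again in every alert evaluation.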