Prometheus Alert Rules — Prometheus Metrics

BizFirst Alert Rules

# alert-rules.yml
groups:
  - name: bizfirst.workflow.alerts
    rules:

      # Workflow error rate > 1% for 5 minutes
      - alert: BizFirstWorkflowErrorRate
        expr: |
          (
            sum(rate(bizfirst_workflow_executions_total{status="failed"}[5m])) by (tenant_id)
            /
            sum(rate(bizfirst_workflow_executions_total[5m])) by (tenant_id)
          ) > 0.01
        for: 5m
        labels:
          severity: warning
          product: flow-studio
        annotations:
          summary: "Workflow error rate high for tenant {{ $labels.tenant_id }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
          runbook: "https://docs.bizfirstai.com/runbooks/workflow-error-rate"

      # P99 node latency > 5s for 5 minutes
      - alert: BizFirstNodeHighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(bizfirst_node_execution_duration_seconds_bucket[5m])) by (node_type, le)
          ) > 5
        for: 5m
        labels:
          severity: warning
          product: flow-studio
        annotations:
          summary: "High P99 latency for node type {{ $labels.node_type }}"
          description: "P99 latency is {{ printf \"%.1f\" $value }}s (threshold: 5s)"

      # HIL backlog > 100 tasks
      - alert: BizFirstHILBacklogHigh
        expr: sum(bizfirst_hil_pending_count) by (tenant_id) > 100
        for: 10m
        labels:
          severity: warning
          team: operations
        annotations:
          summary: "HIL backlog high for tenant {{ $labels.tenant_id }}"
          description: "{{ $value }} tasks awaiting human action"

      # HIL overdue tasks (SLA breach)
      - alert: BizFirstHILSLABreach
        expr: sum(bizfirst_hil_overdue_count) by (tenant_id) > 0
        for: 0m
        labels:
          severity: critical
          team: operations
        annotations:
          summary: "HIL SLA breached for tenant {{ $labels.tenant_id }}"

      # Service down
      - alert: BizFirstServiceDown
        expr: up{job=~"processengine|edgestream|octopus|api-gateway"} == 0
        for: 1m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "{{ $labels.job }} is DOWN"

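      # Disk running low on Prometheus
      # Assumes node_exporter filesystem metrics and a /prometheus data mount;
      # adjust the mountpoint selector to match your deployment.
      - alert: PrometheusStorageLow
        expr: |
          (
            predict_linear(node_filesystem_avail_bytes{mountpoint="/prometheus"}[1h], 4*3600)
            /
            node_filesystem_size_bytes{mountpoint="/prometheus"}
          ) < 0.1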
        for: 30m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Prometheus disk usage projected to exceed 90% in 4 hours"
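
For these rules to take effect, Prometheus has to load the rule file and know where to send firing alerts. The fragment below is a minimal sketch of that wiring. The Alertmanager address is a hypothetical placeholder, and the file is assumed to sit next to prometheus.yml.

```yaml
# prometheus.yml (fragment)
rule_files:
  - alert-rules.yml                      # relative paths resolve against the directory of prometheus.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093'] # hypothetical host:port, replace with your Alertmanager
```

Running `promtool check rules alert-rules.yml` before reloading Prometheus catches syntax and template errors in the rule file.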

Alert Severity Levels

Severity	Meaning	Notification Channel	Response SLA
info	Informational; no immediate action required	Slack #platform-info	Next business day
warning	Degraded performance or approaching thresholds	Slack #platform-alerts	Within 4 hours
critical	Service impacted, SLA at risk	PagerDuty + Slack #incidents	Within 30 minutes
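
The channel mapping in the table is enforced in Alertmanager, not in the rule file. The following is a sketch of one way to route on the severity label. The receiver names, the Slack webhook URL, and the PagerDuty key are placeholders, and grouping by tenant_id is an assumption rather than a documented BizFirst setting.

```yaml
# alertmanager.yml (sketch; all receiver names and credentials are placeholders)
global:
  slack_api_url: 'https://hooks.slack.com/services/REPLACE/ME'  # placeholder webhook

route:
  receiver: slack-platform-info          # fallback; also handles severity=info
  group_by: ['alertname', 'tenant_id']
  routes:
    - matchers:
        - 'severity="critical"'
      receiver: pagerduty-and-incidents
    - matchers:
        - 'severity="warning"'
      receiver: slack-platform-alerts

receivers:
  - name: slack-platform-info
    slack_configs:
      - channel: '#platform-info'
  - name: slack-platform-alerts
    slack_configs:
      - channel: '#platform-alerts'
  - name: pagerduty-and-incidents
    pagerduty_configs:
      - routing_key: 'REPLACE_WITH_PAGERDUTY_KEY'  # placeholder integration key
    slack_configs:
      - channel: '#incidents'
```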

The for Duration Prevents Flapping

The for: 5m clause means the alert condition must be continuously true for 5 minutes before the alert fires; until then the alert stays in the pending state and sends no notification. Without this, a momentary spike fires an alert that immediately resolves, generating unnecessary notifications ("flapping"). For alerts on noisy signals such as error rates and latency, use at least for: 3m. For critical conditions where every minute counts (service down, SLA breach), use for: 1m or even for: 0m for immediate notification.
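
The pending-then-firing behavior can be checked with a promtool rule unit test. The sketch below exercises BizFirstServiceDown (for: 1m). The test file name and the instance label value are hypothetical, and the test assumes alert-rules.yml loads without errors.

```yaml
# alert-rules.test.yml (hypothetical name); run with: promtool test rules alert-rules.test.yml
rule_files:
  - alert-rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # up is 1 at t=0, then 0 from t=1m onward
      - series: 'up{job="processengine", instance="pe-1:9090"}'
        values: '1 0 0 0 0'
    alert_rule_test:
      # At 1m the condition has only just become true: pending, not yet firing
      - eval_time: 1m
        alertname: BizFirstServiceDown
        exp_alerts: []
      # At 2m the condition has held for the full 1m, so the alert fires
      - eval_time: 2m
        alertname: BizFirstServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              team: platform
              job: processengine
              instance: pe-1:9090
            exp_annotations:
              summary: "processengine is DOWN"
```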
