BizFirst Observe
Prometheus Alert Rules
Prometheus evaluates alert rules against its TSDB at each evaluation interval, every 15 seconds in this setup. When a rule's expression stays true for longer than the rule's for duration, Prometheus fires an alert to Alertmanager, which routes it to the appropriate notification channel.
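As a minimal sketch of how this wiring might look in the Prometheus server configuration: the 15-second interval and the rules file name come from this page, while the Alertmanager host and port are assumed placeholders.

```yaml
# prometheus.yml (sketch; the Alertmanager address below is a hypothetical placeholder)
global:
  evaluation_interval: 15s      # how often alert rule expressions are evaluated

rule_files:
  - alert-rules.yml             # the rules file shown on this page

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "alertmanager:9093"   # assumed host:port of the Alertmanager instance
```

A rule group can also set its own interval field, which overrides the global value for that group.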
BizFirstGO Alert Rules
# alert-rules.yml
groups:
  - name: bizfirst.workflow.alerts
    rules:
      # Workflow error rate > 1% for 5 minutes
      - alert: BizFirstWorkflowErrorRate
        expr: |
          (
            sum(rate(bizfirst_workflow_executions_total{status="failed"}[5m])) by (tenant_id)
            /
            sum(rate(bizfirst_workflow_executions_total[5m])) by (tenant_id)
          ) > 0.01
        for: 5m
        labels:
          severity: warning
          product: flow-studio
        annotations:
          summary: "Workflow error rate high for tenant {{ $labels.tenant_id }}"
          # humanizePercentage renders the ratio as a percentage; templates have no "*" operator
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
          runbook: "https://docs.bizfirstai.com/runbooks/workflow-error-rate"

      # P99 node latency > 5s for 5 minutes
      - alert: BizFirstNodeHighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(bizfirst_node_execution_duration_seconds_bucket[5m])) by (node_type, le)
          ) > 5
        for: 5m
        labels:
          severity: warning
          product: flow-studio
        annotations:
          summary: "High P99 latency for node type {{ $labels.node_type }}"
          description: "P99 latency is {{ printf \"%.1f\" $value }}s (threshold: 5s)"

      # HIL backlog > 100 tasks
      - alert: BizFirstHILBacklogHigh
        expr: sum(bizfirst_hil_pending_count) by (tenant_id) > 100
        for: 10m
        labels:
          severity: warning
          team: operations
        annotations:
          summary: "HIL backlog high for tenant {{ $labels.tenant_id }}"
          description: "{{ $value }} tasks awaiting human action"

      # HIL overdue tasks (SLA breach)
      - alert: BizFirstHILSLABreach
        expr: sum(bizfirst_hil_overdue_count) by (tenant_id) > 0
        for: 0m
        labels:
          severity: critical
          team: operations
        annotations:
          summary: "HIL SLA breached for tenant {{ $labels.tenant_id }}"

      # Service down
      - alert: BizFirstServiceDown
        expr: up{job=~"processengine|edgestream|octopus|api-gateway"} == 0
        for: 1m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "{{ $labels.job }} is DOWN"

      # Disk running low on Prometheus: free space projected below 10% within 4 hours.
      # Assumption: node_exporter filesystem metrics are scraped and the TSDB volume
      # is mounted at /prometheus; adjust the mountpoint selector to your deployment.
      - alert: PrometheusStorageLow
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/prometheus"}[1h], 4*3600)
            / node_filesystem_size_bytes{mountpoint="/prometheus"} < 0.1
        for: 30m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Prometheus disk usage projected to exceed 90% in 4 hours"
Alert Severity Levels
| Severity | Meaning | Notification Channel | Response SLA |
|---|---|---|---|
| info | Informational; no immediate action required | Slack #platform-info | Next business day |
| warning | Degraded performance or approaching thresholds | Slack #platform-alerts | Within 4 hours |
| critical | Service impacted, SLA at risk | PagerDuty + Slack #incidents | Within 30 minutes |
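The table above could map onto an Alertmanager routing tree roughly as follows. This is a sketch under assumptions: the receiver names are made up for illustration, and the Slack webhook URLs and PagerDuty key are placeholders to be replaced with real values. Only the severity values and channel names come from the table.

```yaml
# alertmanager.yml (sketch; receiver names are hypothetical, URLs and keys are placeholders)
route:
  receiver: slack-platform-info          # fallback when no child route matches
  routes:
    - matchers:
        - 'severity="critical"'
      receiver: pagerduty-and-incidents
    - matchers:
        - 'severity="warning"'
      receiver: slack-platform-alerts
    - matchers:
        - 'severity="info"'
      receiver: slack-platform-info

receivers:
  - name: slack-platform-info
    slack_configs:
      - api_url: "https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK"
        channel: "#platform-info"
  - name: slack-platform-alerts
    slack_configs:
      - api_url: "https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK"
        channel: "#platform-alerts"
  # critical alerts page on-call and are also posted to the incident channel
  - name: pagerduty-and-incidents
    pagerduty_configs:
      - routing_key: "REPLACE_WITH_PAGERDUTY_INTEGRATION_KEY"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK"
        channel: "#incidents"
```

Routing on the severity label keeps the notification policy in one place, so individual rules only need to set the label.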
The for Duration Prevents Flapping
The for: 5m clause means the alert condition must hold at every evaluation for 5 minutes before the alert fires; until then the alert is only pending. Without it, a momentary spike triggers an alert that resolves almost immediately, generating unnecessary notifications ("flapping"). For most production alerts, use at least for: 3m. For critical conditions, use a shorter duration: for: 1m as in BizFirstServiceDown, or for: 0m for immediate notification as in BizFirstHILSLABreach.
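One way to check this behaviour is a promtool rule unit test. The sketch below assumes the rules on this page are saved as alert-rules.yml with valid indentation; the test file name and the instance label value are made up for illustration, and the test uses a 1-minute interval rather than 15 seconds to keep the sample series short. It exercises BizFirstServiceDown (for: 1m) with a single failed scrape and with a sustained outage.

```yaml
# alert-rules.test.yml (hypothetical name); run with: promtool test rules alert-rules.test.yml
rule_files:
  - alert-rules.yml
evaluation_interval: 1m

tests:
  # Single failed scrape: the alert becomes pending at 1m and clears at 2m, so it never fires.
  - interval: 1m
    input_series:
      - series: 'up{job="processengine", instance="pe-1:9090"}'
        values: "1 0 1 1"
    alert_rule_test:
      - eval_time: 3m
        alertname: BizFirstServiceDown
        exp_alerts: []

  # Sustained outage: pending at 1m, firing once the condition has held for the full for: 1m.
  - interval: 1m
    input_series:
      - series: 'up{job="processengine", instance="pe-1:9090"}'
        values: "1 0 0 0"
    alert_rule_test:
      - eval_time: 1m
        alertname: BizFirstServiceDown
        exp_alerts: []              # still pending, not yet firing
      - eval_time: 2m
        alertname: BizFirstServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              team: platform
              job: processengine
              instance: pe-1:9090
            exp_annotations:
              summary: "processengine is DOWN"
```

The same pattern can be applied to the other rules by supplying input series for their metrics and choosing evaluation times on either side of the for duration.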