EdgeInteract
Alerting
EdgeInteract alert rules are configured in BizFirst Observe (Grafana AlertManager) using Prometheus expressions on the metrics emitted by MetricsInteractionHook. This page defines the recommended alert rules, their thresholds, and alert routing configuration.
Recommended Alert Rules
HighInteractionTimeoutRate
expr: |
  100 * (
    sum(rate(interaction_timeout_total[5m]))
    /
    sum(rate(interaction_published_total[5m]))
  ) > 15
annotations:
  summary: "Interaction timeout rate critically high ({{ $value | printf \"%.1f\" }}%)"
  description: "More than 15% of interactions are timing out. This indicates a delivery failure or systemic user absence."
ElevatedInteractionTimeoutRate
expr: |
  100 * (
    sum(rate(interaction_timeout_total[5m]))
    /
    sum(rate(interaction_published_total[5m]))
  ) > 5
annotations:
  summary: "Interaction timeout rate elevated ({{ $value | printf \"%.1f\" }}%)"
  description: "More than 5% of interactions are timing out. Monitor for further increase."
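Ratio-based rules like the two above can be noisy when traffic is very low, because a single timeout among a handful of published interactions already produces a large percentage. A minimal sketch of a guarded variant follows; the floor of 0.1 published interactions per second is an assumed value for illustration, not a recommendation from this page, and should be tuned to real traffic.

```yaml
# Sketch: ElevatedInteractionTimeoutRate with a minimum-traffic guard.
# The 0.1/s floor is a hypothetical value; adjust it to your own volume.
expr: |
  100 * (
    sum(rate(interaction_timeout_total[5m]))
    /
    sum(rate(interaction_published_total[5m]))
  ) > 5
  and
  sum(rate(interaction_published_total[5m])) > 0.1
```

Both sides of `and` are aggregated to a single series with no labels, so they match each other without extra vector-matching modifiers.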
SlowApprovalResponseTime
expr: |
  histogram_quantile(0.95,
    sum(rate(interaction_response_time_ms_bucket{type="approval"}[15m]))
    by (le)
  ) > 14400000 # 4 hours in ms
annotations:
  summary: "Approval response P95 exceeds 4 hours"
  description: "At least 5% of approvals are taking longer than 4 hours. Review pending approvals and escalation policy."
GrowingInFlightInteractions
expr: |
  sum(interaction_in_flight) > 100
annotations:
  summary: "More than 100 interactions are currently in-flight"
  description: "A large number of interactions are pending without response. Check InteractionMonitor and delivery pipeline."
HighBlockedInteractionRate
expr: |
  100 * (
    sum(rate(interaction_blocked_total[5m]))
    /
    (sum(rate(interaction_published_total[5m])) + sum(rate(interaction_blocked_total[5m])))
  ) > 10
labels:
  team: platform
annotations:
  summary: "More than 10% of interactions are being blocked by pre-send hooks"
  description: "Review rate limit or pre-send hook configuration. Group interaction_blocked_total by the blocked_by label to identify the blocking hook."
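Because `sum(...)` without a `by` clause drops every label, the HighBlockedInteractionRate expression above produces a single series that carries no `blocked_by` label. If the alert should name the blocking hook, the expression has to keep that label. The following is a sketch of a per-hook variant, assuming `interaction_blocked_total` is emitted with a `blocked_by` label; note that it changes the meaning of the threshold to "any single hook blocks more than 10% of all attempts".

```yaml
# Sketch: per-hook block rate, so {{ $labels.blocked_by }} is populated.
# Assumes interaction_blocked_total carries a blocked_by label.
expr: |
  100 * (
    sum by (blocked_by) (rate(interaction_blocked_total[5m]))
    / on() group_left()
    (sum(rate(interaction_published_total[5m])) + sum(rate(interaction_blocked_total[5m])))
  ) > 10
annotations:
  summary: "Hook {{ $labels.blocked_by }} is blocking more than 10% of interactions"
```

The `on() group_left()` modifier lets the many per-hook series on the left divide by the single unlabeled total on the right.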
Alert Rules YAML — Full File
groups:
  - name: edge-interact
    rules:
      - alert: HighInteractionTimeoutRate
        expr: |
          100 * (sum(rate(interaction_timeout_total[5m])) / sum(rate(interaction_published_total[5m]))) > 15
        for: 5m
        labels: { severity: critical, team: platform }
        annotations:
          summary: "Interaction timeout rate critically high"
      - alert: ElevatedInteractionTimeoutRate
        expr: |
          100 * (sum(rate(interaction_timeout_total[5m])) / sum(rate(interaction_published_total[5m]))) > 5
        for: 5m
        labels: { severity: warning, team: platform }
        annotations:
          summary: "Interaction timeout rate elevated"
      - alert: SlowApprovalResponseTime
        expr: |
          histogram_quantile(0.95, sum(rate(interaction_response_time_ms_bucket{type="approval"}[15m])) by (le)) > 14400000
        for: 15m
        labels: { severity: warning, team: platform }
        annotations:
          summary: "Approval P95 response time exceeds 4h"
      - alert: GrowingInFlightInteractions
        expr: sum(interaction_in_flight) > 100
        for: 10m
        labels: { severity: warning, team: platform }
        annotations:
          summary: "More than 100 in-flight interactions"
      - alert: HighBlockedInteractionRate
        expr: |
          100 * (sum(rate(interaction_blocked_total[5m])) / (sum(rate(interaction_published_total[5m])) + sum(rate(interaction_blocked_total[5m])))) > 10
        for: 5m
        labels: { severity: warning, team: platform }
        annotations:
          summary: "High interaction block rate"
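Rules like these can be exercised before deployment with Prometheus rule unit tests. The sketch below assumes the rules file above is saved as `edge-interact-rules.yml` (a hypothetical file name) next to the test file. It feeds synthetic counters in which 20 of every 100 published interactions time out, then checks that HighInteractionTimeoutRate is firing after the `for: 5m` period has elapsed.

```yaml
# edge-interact-test.yml (hypothetical file name)
rule_files:
  - edge-interact-rules.yml   # assumed name of the rules file above
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Counters grow by 20 and 100 per minute: a steady 20% timeout rate.
      - series: 'interaction_timeout_total'
        values: '0+20x30'
      - series: 'interaction_published_total'
        values: '0+100x30'
    alert_rule_test:
      - eval_time: 15m
        alertname: HighInteractionTimeoutRate
        exp_alerts:
          - exp_labels:
              severity: critical
              team: platform
            exp_annotations:
              summary: "Interaction timeout rate critically high"
```

Run it with `promtool test rules edge-interact-test.yml`. The expected labels and annotations must match the rule exactly, so update the test whenever the rule's annotations change.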
Alert Routing
In AlertManager, route critical EdgeInteract alerts to PagerDuty and all other platform team alerts to the platform team Slack channel:
# alertmanager.yml
route:
  receiver: slack-platform   # the root route requires a default receiver
  routes:
    - match:
        team: platform
        severity: critical
      receiver: pagerduty-platform
    - match:
        team: platform
      receiver: slack-platform
receivers:
  - name: pagerduty-platform
    pagerduty_configs:
      - routing_key: "<PAGERDUTY_KEY>"
        description: "{{ .CommonAnnotations.summary }}"
  - name: slack-platform
    slack_configs:
      - api_url: "<SLACK_WEBHOOK_URL>"
        channel: "#platform-alerts"
        text: "{{ .CommonAnnotations.summary }}\n{{ .CommonAnnotations.description }}"
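With the routes above, a critical alert stops at the first matching route and is delivered to PagerDuty only. If critical alerts should also appear in Slack, one option is Alertmanager's `continue` flag, which keeps evaluating sibling routes after a match. The following is a sketch of that variant rather than a required configuration; the default receiver on the root route is an assumption.

```yaml
# alertmanager.yml (variant sketch: critical alerts go to PagerDuty and Slack)
route:
  receiver: slack-platform        # assumed default receiver
  routes:
    - match:
        team: platform
        severity: critical
      receiver: pagerduty-platform
      continue: true              # keep matching the routes below
    - match:
        team: platform
      receiver: slack-platform
```

The receivers section stays unchanged from the configuration above.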
Alerts Require MetricsInteractionHook
None of these alerts will fire if MetricsInteractionHook is not registered, because the interaction_* series they query will not exist. Verify with curl -s <service-url>/metrics | grep interaction_ before configuring alert rules in production.
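Since missing metrics make every rule on this page silently inactive, it can help to alert on the absence itself. The sketch below uses PromQL's `absent()` function; the alert name, group name, and `for` duration are hypothetical choices. It also assumes `interaction_published_total` is exported as soon as the hook is registered. If the counter only appears after the first published interaction, this rule would also fire on an idle system.

```yaml
groups:
  - name: edge-interact-meta              # hypothetical group name
    rules:
      - alert: EdgeInteractMetricsMissing # hypothetical alert name
        expr: absent(interaction_published_total)
        for: 10m
        labels: { severity: warning, team: platform }
        annotations:
          summary: "EdgeInteract metrics are missing"
          description: "No interaction_published_total series found. Check that MetricsInteractionHook is registered."
```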
Alert Threshold Reference
| Alert | Threshold | `for` duration | Severity |
|---|---|---|---|
| Timeout rate | > 5% | 5m | Warning |
| Timeout rate | > 15% | 5m | Critical |
| Approval P95 | > 4h | 15m | Warning |
| In-flight count | > 100 | 10m | Warning |
| Blocked rate | > 10% | 5m | Warning |