
Recommended Alert Rules

HighInteractionTimeoutRate

Severity: critical For: 5m
expr: |
  100 * (
    sum(rate(interaction_timeout_total[5m]))
    /
    sum(rate(interaction_published_total[5m]))
  ) > 15
annotations:
  summary: "Interaction timeout rate critically high ({{ $value | printf \"%.1f\" }}%)"
  description: "More than 15% of interactions are timing out. This indicates a delivery failure or systemic user absence."

ElevatedInteractionTimeoutRate

Severity: warning For: 5m
expr: |
  100 * (
    sum(rate(interaction_timeout_total[5m]))
    /
    sum(rate(interaction_published_total[5m]))
  ) > 5
annotations:
  summary: "Interaction timeout rate elevated ({{ $value | printf \"%.1f\" }}%)"
  description: "More than 5% of interactions are timing out. Monitor for further increase."
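
As a worked illustration of the two timeout rules above, the following Python sketch mirrors the arithmetic of the expression and maps the result onto the 5% and 15% thresholds. The function names are hypothetical, and the inputs stand in for the per-second values that sum(rate(...[5m])) would return; none of this is part of EdgeInteract itself.

```python
import math


def timeout_rate_percent(timeout_per_sec: float, published_per_sec: float) -> float:
    """Mirror 100 * (timeout rate / published rate) from the alert expression."""
    if published_per_sec == 0:
        # Float division in PromQL gives NaN for 0/0 and +Inf for x/0.
        return math.nan if timeout_per_sec == 0 else math.inf
    return 100.0 * timeout_per_sec / published_per_sec


def timeout_severity(rate_percent: float) -> str:
    """Return the highest severity whose threshold is exceeded.

    Above 15% both rules match; only the more severe one is reported here.
    NaN compares false against every threshold, so it maps to "ok".
    """
    if rate_percent > 15:
        return "critical"
    if rate_percent > 5:
        return "warning"
    return "ok"
```

For instance, 2 timeouts per second against 25 published per second is 8%, which crosses the warning threshold but not the critical one. In Prometheus the condition must also hold for the full 5m "for" duration before the alert fires, which this sketch does not model.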

SlowApprovalResponseTime

Severity: warning For: 15m
expr: |
  histogram_quantile(0.95,
    sum(rate(interaction_response_time_ms_bucket{type="approval"}[15m]))
    by (le)
  ) > 14400000  # 4 hours in ms
annotations:
  summary: "Approval response P95 exceeds 4 hours"
  description: "The 95th-percentile approval response time exceeds 4 hours, meaning about 5% or more of approvals take longer than that. Review pending approvals and escalation policy."
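
To make the P95 threshold concrete, here is a simplified Python sketch of how a quantile is estimated from cumulative histogram buckets by linear interpolation, which is the general approach histogram_quantile takes. The bucket boundaries and counts are invented for illustration, and the function is a simplification that assumes at least one finite bucket and non-negative bounds, not a reimplementation of Prometheus.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    The list must end with a (math.inf, total) bucket, as "le" buckets do.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket:
                # report that bucket's upper bound.
                return buckets[-2][0]
            if count == prev_count:
                return bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical buckets at 1h, 4h, 8h and +Inf (bounds in milliseconds).
example = [(3_600_000, 60), (14_400_000, 90), (28_800_000, 98), (math.inf, 100)]
p95_ms = histogram_quantile(0.95, example)  # about 23,400,000 ms (6.5 hours)
fires = p95_ms > 14_400_000  # same threshold as the alert: 4 hours in ms
```

One consequence worth checking: when the quantile falls in the +Inf bucket, the estimate is capped at the highest finite bound. If the histogram's largest finite bucket is 4 hours or less, the comparison against 14400000 can never be true, so the metric would need a bucket boundary above 4 hours for this rule to fire.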

GrowingInFlightInteractions

Severity: warning For: 10m
expr: |
  sum(interaction_in_flight) > 100
annotations:
  summary: "More than 100 interactions are currently in-flight"
  description: "A large number of interactions are pending without response. Check InteractionMonitor and delivery pipeline."

HighBlockedInteractionRate

Severity: warning For: 5m
expr: |
  100 * (
    sum(rate(interaction_blocked_total[5m]))
    /
    (sum(rate(interaction_published_total[5m])) + sum(rate(interaction_blocked_total[5m])))
  ) > 10
labels:
  team: platform
annotations:
  summary: "More than 10% of interactions are being blocked by pre-send hooks"
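  description: "Review rate limit or pre-send hook configuration. The expression aggregates away per-hook labels; to see which hook is blocking, query sum by (blocked_by) (rate(interaction_blocked_total[5m]))."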
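
The blocked-rate expression divides by published plus blocked rather than by published alone. A short Python sketch of that arithmetic follows; it assumes, as the expression suggests, that a blocked interaction is not also counted in interaction_published_total, and the function name is hypothetical.

```python
import math


def blocked_rate_percent(blocked_per_sec: float, published_per_sec: float) -> float:
    """Share of attempted interactions that were blocked, in percent."""
    # Attempted = delivered to users + stopped by a pre-send hook.
    attempted = published_per_sec + blocked_per_sec
    if attempted == 0:
        return math.nan  # no traffic: nothing to compare against the threshold
    return 100.0 * blocked_per_sec / attempted
```

With 2 blocked and 8 published per second the result is 20%, above the 10% threshold. Using published alone as the denominator would report 25% for the same traffic and could exceed 100% when most interactions are blocked.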

Alert Rules YAML — Full File

groups:
  - name: edge-interact
    rules:
      - alert: HighInteractionTimeoutRate
        expr: |
          100 * (sum(rate(interaction_timeout_total[5m])) / sum(rate(interaction_published_total[5m]))) > 15
        for: 5m
        labels: { severity: critical, team: platform }
        annotations:
          summary: "Interaction timeout rate critically high"

      - alert: ElevatedInteractionTimeoutRate
        expr: |
          100 * (sum(rate(interaction_timeout_total[5m])) / sum(rate(interaction_published_total[5m]))) > 5
        for: 5m
        labels: { severity: warning, team: platform }
        annotations:
          summary: "Interaction timeout rate elevated"

      - alert: SlowApprovalResponseTime
        expr: |
          histogram_quantile(0.95, sum(rate(interaction_response_time_ms_bucket{type="approval"}[15m])) by (le)) > 14400000
        for: 15m
        labels: { severity: warning, team: platform }
        annotations:
          summary: "Approval P95 response time exceeds 4h"

      - alert: GrowingInFlightInteractions
        expr: sum(interaction_in_flight) > 100
        for: 10m
        labels: { severity: warning, team: platform }
        annotations:
          summary: "More than 100 in-flight interactions"

      - alert: HighBlockedInteractionRate
        expr: |
          100 * (sum(rate(interaction_blocked_total[5m])) / (sum(rate(interaction_published_total[5m])) + sum(rate(interaction_blocked_total[5m])))) > 10
        for: 5m
        labels: { severity: warning, team: platform }
        annotations:
          summary: "High interaction block rate"

Alert Routing

In Alertmanager, send critical-severity EdgeInteract alerts to PagerDuty and all other platform-team alerts to the platform team's Slack channel:

# alertmanager.yml
route:
  receiver: slack-platform  # the root route must name a default receiver
  routes:
    - match:
        team: platform
        severity: critical
      receiver: pagerduty-platform
    - match:
        team: platform
      receiver: slack-platform

receivers:
  - name: pagerduty-platform
    pagerduty_configs:
      - routing_key: "<PAGERDUTY_KEY>"
        description: "{{ .CommonAnnotations.summary }}"

  - name: slack-platform
    slack_configs:
      - api_url: "<SLACK_WEBHOOK_URL>"
        channel: "#platform-alerts"
        text: "{{ .CommonAnnotations.summary }}\n{{ .CommonAnnotations.description }}"
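
Alertmanager checks sibling routes in order and stops at the first one that matches unless that route sets continue: true. The Python sketch below simulates that first-match behavior for the two routes in the configuration; the function and the ROUTES structure are illustrative stand-ins, not Alertmanager code.

```python
# (matchers, receiver) pairs in the same order as the routes in the config.
ROUTES = [
    ({"team": "platform", "severity": "critical"}, "pagerduty-platform"),
    ({"team": "platform"}, "slack-platform"),
]


def pick_receiver(labels: dict[str, str], routes=ROUTES, default=None):
    """Return the receiver of the first route whose matchers all equal the labels."""
    for matchers, receiver in routes:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver  # first match wins because no route sets continue: true
    return default  # stands in for the root route's receiver
```

Under this model a critical alert reaches only pagerduty-platform, and a warning reaches slack-platform. If critical alerts should also appear in Slack, setting continue: true on the first route lets evaluation proceed to the second.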

Alert Requires MetricsInteractionHook

None of these alerts will fire if MetricsInteractionHook is not registered. Verify with curl /metrics | grep interaction_ before configuring alert rules in production.
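
The same check can be scripted so that it reports exactly which metrics the rules depend on are absent. The sketch below is a minimal Python example that takes the text of a /metrics response and lists missing names; the helper names are hypothetical, and the metric list is taken from the expressions in the rules above.

```python
import re

# Metric names referenced by the alert expressions in this section.
REQUIRED_METRICS = (
    "interaction_timeout_total",
    "interaction_published_total",
    "interaction_response_time_ms_bucket",
    "interaction_in_flight",
    "interaction_blocked_total",
)

_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def exposed_metric_names(metrics_text: str) -> set[str]:
    """Collect sample names from Prometheus text exposition output."""
    names = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _NAME.match(line)
        if match:
            names.add(match.group(0))
    return names


def missing_metrics(metrics_text: str, required=REQUIRED_METRICS) -> list[str]:
    """Return the required metric names that have no sample in the output."""
    present = exposed_metric_names(metrics_text)
    return [name for name in required if name not in present]
```

Passing the saved output of the curl command to missing_metrics returns an empty list when every metric the rules depend on is being exported.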

Alert Threshold Reference

Alert            Threshold  Window  Severity
Timeout rate     > 5%       5m      Warning
Timeout rate     > 15%      5m      Critical
Approval P95     > 4h       15m     Warning
In-flight count  > 100      10m     Warning
Blocked rate     > 10%      5m      Warning