Responding to Alerts — Using the Tools

Anatomy of a BizFirst Alert Notification

# Example Slack alert message format:
ALERT: WorkflowErrorRateHigh

Summary: Workflow error rate has exceeded 5% for 5 minutes
Severity: critical
Environment: production
Current Value: 8.3%
Time: 2025-05-25 14:35:00 UTC

[View in Grafana] → link to Flow Studio Overview dashboard
[View Alert]     → link to Grafana Alert detail page
[Silence]        → link to create a silence (use carefully)

Labels:
  alertname: WorkflowErrorRateHigh
  environment: production
  severity: critical

First 5 Minutes: Alert Response Procedure

Click "View in Grafana" and acknowledge the alert

The link opens the Flow Studio Overview dashboard scoped to the production environment. Immediately assess: Is the error rate still rising? When did it start (look at the Error Rate time series)? Is it all tenants or one?

Set the $tenant variable to "All" and look for the spike

On the Flow Studio Overview dashboard, the Error Rate panel by tenant shows which tenants are affected. If the spike is confined to one tenant, treat it as a scoped incident. If all tenants are affected, suspect a systemic issue with the engine or infrastructure.
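The scoping rule above can be sketched in a few lines of Python. This is an illustration only: the function name, the input shape (a mapping of tenant to error rate as a fraction), and the handling of "some but not all tenants" are assumptions. The 5% threshold is taken from the alert summary.

```python
def classify_incident_scope(error_rate_by_tenant, threshold=0.05):
    """Classify an incident from per-tenant error rates (fractions, e.g. 0.083).

    Returns (scope, affected_tenants), where scope is one of:
      "none"     - no tenant is above the threshold
      "systemic" - every tenant is above the threshold
      "scoped"   - only a subset of tenants is above the threshold
    """
    affected = sorted(
        tenant for tenant, rate in error_rate_by_tenant.items() if rate > threshold
    )
    if not affected:
        return "none", affected
    # Assumption: "all tenants" only means something with more than one tenant.
    if len(error_rate_by_tenant) > 1 and len(affected) == len(error_rate_by_tenant):
        return "systemic", affected
    return "scoped", affected
```

A partial spread (several tenants, but not all) is reported as "scoped" here; in practice it may deserve a closer look at what those tenants have in common.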

Find the error pattern in Loki

Open Grafana Explore with the Loki data source and run `{job="processengine", level="error"} | json` over the last 15 minutes. Read the error messages to see what is failing: is it a specific node type, a specific external service, or a configuration error?
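If you prefer to pull the same logs from a script rather than from Explore, the query can be sent to Loki's HTTP `query_range` endpoint. The sketch below only builds the request URL. The base URL is a placeholder, the helper name is hypothetical, and any authentication or tenant headers your deployment requires are not shown.

```python
import time
from urllib.parse import urlencode

# Same LogQL query as in the step above.
ERROR_QUERY = '{job="processengine", level="error"} | json'


def build_loki_query_url(base_url, logql, minutes=15, limit=200, now_ns=None):
    """Build a Loki query_range URL covering the last `minutes` minutes.

    Loki expects `start` and `end` as Unix timestamps in nanoseconds.
    `now_ns` can be passed explicitly to make the result reproducible.
    """
    if now_ns is None:
        now_ns = time.time_ns()
    start_ns = now_ns - minutes * 60 * 1_000_000_000
    params = {
        "query": logql,
        "start": str(start_ns),
        "end": str(now_ns),
        "limit": str(limit),
        "direction": "backward",  # newest entries first
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"
```

For example, `build_loki_query_url("http://loki.example:3100", ERROR_QUERY)` returns a URL that can be fetched with any HTTP client; the host name here is made up.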

Check the Node Performance dashboard for the bottleneck

Navigate to the Node Performance dashboard. The "Error Rate by Node Type" bar chart shows which node type is failing. The "Slowest Nodes (Top 10)" table shows whether there is also a latency issue.

Decide: mitigate or investigate further

If the root cause is clear (e.g., "external API is down"), start mitigation (circuit breaker, fallback). If the root cause is unclear, continue investigating with the Error Analysis workflow. Either way, post a status update in the incident channel before 10 minutes have elapsed.

Alert-Specific Response Guides

Alert	First Query to Run	Likely Causes
WorkflowErrorRateHigh	Loki: `{job="processengine", level="error"} | json | nodeType != ""`	External API down, invalid workflow config, database connection failure
WorkflowP99LatencyHigh	PromQL: `topk(5, histogram_quantile(0.99, ...))` by node_type	Slow external dependency, database query slowdown, OTel Collector overloaded
HILSLABreached	Loki: `{job="processengine"} | json | hilStatus="overdue"`	Approvers not responding; notify the business team, not engineering
ProcessEngineDown	Prometheus targets: check if processengine is "up"	Service crash, OOM kill, deployment failure
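For ProcessEngineDown, the same check shown on the Prometheus targets page can be made with the instant query `up{job="processengine"}` against the Prometheus HTTP API (`/api/v1/query`). The sketch below interprets the JSON such a query returns. It assumes the Prometheus job label is `processengine`, matching the Loki label used above, and the helper name is hypothetical.

```python
def summarize_up_response(response):
    """Summarize a Prometheus instant-query response for an `up{...}` query.

    Returns {"scraped": <number of series>, "down": [<instance>, ...]}.
    A result with zero series means Prometheus is not scraping the job at all,
    which is worth treating as an outage in its own right.
    """
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    series = response["data"]["result"]
    down = sorted(
        s["metric"].get("instance", "unknown")
        for s in series
        # Each sample is [timestamp, "<value as string>"]; up == 0 means the scrape failed.
        if float(s["value"][1]) == 0.0
    )
    return {"scraped": len(series), "down": down}


# Hand-written sample response, for illustration only (not real data).
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "processengine", "instance": "pe-1:8080"}, "value": [1748183700, "1"]},
            {"metric": {"job": "processengine", "instance": "pe-2:8080"}, "value": [1748183700, "0"]},
        ],
    },
}
summary = summarize_up_response(sample)
```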

Create a Silence During Planned Maintenance

If you are performing planned maintenance that will cause alerts to fire, create a Grafana silence before starting. Navigate to Alerting → Silences → Add Silence. Set the matchers to target the relevant alerts (e.g., environment = staging) and set the duration to cover the maintenance window plus a 30-minute buffer. Always add a comment explaining why the silence was created.
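If silences are created from a script instead of the UI, the same rules (matchers, window plus buffer, mandatory reason) can be encoded when building the request body. The sketch below uses the field names of the Alertmanager v2 silence format; whether your Grafana version accepts this payload unchanged, and which endpoint to post it to, should be verified against its documentation. The helper name and arguments are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def build_silence(matchers, window_start, window_end, created_by, comment,
                  buffer_minutes=30):
    """Build a silence payload covering the maintenance window plus a buffer.

    `matchers` maps label names to exact values, e.g. {"environment": "staging"}.
    `window_start` and `window_end` must be timezone-aware datetimes.
    """
    if not comment.strip():
        raise ValueError("a reason comment is required for every silence")
    if window_start.tzinfo is None or window_end.tzinfo is None:
        raise ValueError("window_start and window_end must be timezone-aware")
    ends_at = window_end + timedelta(minutes=buffer_minutes)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": window_start.astimezone(timezone.utc).isoformat(),
        "endsAt": ends_at.astimezone(timezone.utc).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


# Example: a one-hour staging maintenance window (illustrative values).
payload = build_silence(
    {"environment": "staging"},
    datetime(2025, 5, 25, 14, 0, tzinfo=timezone.utc),
    datetime(2025, 5, 25, 15, 0, tzinfo=timezone.utc),
    created_by="oncall",
    comment="Planned staging maintenance",
)
```

Rejecting an empty comment in code keeps the "always add a reason" rule from being skipped under time pressure.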
