Responding to Alerts
When a Grafana alert fires, you receive a Slack message or PagerDuty notification with a direct link into Grafana. This page describes how to read the alert context, navigate to the relevant dashboard, and begin investigation — the first 5 minutes of alert response.
Anatomy of a BizFirstGO Alert Notification
# Example Slack alert message format:
ALERT: WorkflowErrorRateHigh
Summary: Workflow error rate has exceeded 5% for 5 minutes
Severity: critical
Environment: production
Current Value: 8.3%
Time: 2025-05-25 14:35:00 UTC
[View in Grafana] → link to Flow Studio Overview dashboard
[View Alert] → link to Grafana Alert detail page
[Silence] → link to create a silence (use carefully)
Labels:
alertname: WorkflowErrorRateHigh
environment: production
severity: critical
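If notifications are forwarded to a script (for example, to pre-fill an incident ticket), the "Key: value" layout above is easy to turn into a dictionary. The following is a minimal sketch that assumes the plain-text format shown in the example; `parse_alert_message` is a hypothetical helper, not part of Grafana or Slack.

```python
def parse_alert_message(text: str) -> dict:
    """Parse 'Key: value' lines from an alert message into a flat dict.

    Assumes the plain-text layout shown in the example message.
    """
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comment lines, and link lines such as "[View Alert] → ...".
        if not line or line.startswith(("#", "[")):
            continue
        # Split on the first colon only, so timestamps like 14:35:00 stay intact.
        key, sep, value = line.partition(":")
        if not sep or not value.strip():
            continue  # e.g. the bare "Labels:" header
        fields[key.strip().lower().replace(" ", "_")] = value.strip()
    return fields


if __name__ == "__main__":
    sample = """ALERT: WorkflowErrorRateHigh
Severity: critical
Environment: production
Current Value: 8.3%
Time: 2025-05-25 14:35:00 UTC
[Silence] → link to create a silence
Labels:
alertname: WorkflowErrorRateHigh"""
    print(parse_alert_message(sample))
```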
First 5 Minutes: Alert Response Procedure
Click "View in Grafana" and acknowledge
The link opens the Flow Studio Overview dashboard scoped to the production environment. Immediately assess: Is the error rate still rising? When did it start (look at the Error Rate time series)? Is it all tenants or one?
Set the $tenant variable to "All" and look for the spike
On the Flow Studio Overview dashboard, the Error Rate panel by tenant shows which tenants are affected. If the spike is confined to one tenant, treat it as a scoped incident. If it appears across all tenants, treat it as a systemic issue with the engine or infrastructure.
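The scoped-versus-systemic decision can be sketched as a small function. This is an illustration only: the 5% threshold is borrowed from the WorkflowErrorRateHigh summary, the tenant names are made up, and the "partial" outcome (several but not all tenants affected) is an assumption added here because the procedure does not cover that case.

```python
def classify_scope(error_rate_by_tenant: dict[str, float], threshold: float = 0.05) -> str:
    """Classify an error-rate spike as scoped, systemic, partial, or none.

    Rates are fractions (0.083 means 8.3%). The threshold default mirrors
    the 5% alert condition and is an assumption for this sketch.
    """
    affected = [t for t, rate in error_rate_by_tenant.items() if rate > threshold]
    if not affected:
        return "none"
    if len(affected) == len(error_rate_by_tenant):
        # Every tenant is over the threshold: engine or infrastructure problem.
        # A deployment with a single tenant also lands here.
        return "systemic"
    if len(affected) == 1:
        return f"scoped:{affected[0]}"
    # Several but not all tenants: not covered by the procedure, needs judgment.
    return "partial"


if __name__ == "__main__":
    # Hypothetical tenant names and rates.
    print(classify_scope({"tenant-a": 0.12, "tenant-b": 0.01, "tenant-c": 0.0}))
    print(classify_scope({"tenant-a": 0.12, "tenant-b": 0.09}))
```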
Find the error pattern in Loki
Open Grafana Explore (Loki). Query: {job="processengine", level="error"} | json for the last 15 minutes. Look at the error messages — what is failing? Is it a specific node type, a specific external service, or a configuration error?
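For scripted triage, the same query can be sent to Loki's HTTP API instead of Grafana Explore. The sketch below only builds the request URL for the `query_range` endpoint with a 15-minute window; the base URL is a hypothetical placeholder, and authentication and the actual HTTP call are left out.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Hypothetical address; replace with your Loki endpoint.
LOKI_BASE_URL = "http://loki.example.internal:3100"

# The query from this step, unchanged.
ERROR_QUERY = '{job="processengine", level="error"} | json'


def build_error_query_url(now: datetime, window_minutes: int = 15, limit: int = 200) -> str:
    """Return a query_range URL covering the last `window_minutes` before `now`."""
    # Loki accepts start/end as Unix timestamps in nanoseconds.
    end_ns = int(now.timestamp()) * 10**9
    start_ns = end_ns - window_minutes * 60 * 10**9
    params = {
        "query": ERROR_QUERY,
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
        "direction": "backward",  # newest log lines first
    }
    return f"{LOKI_BASE_URL}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    print(build_error_query_url(datetime.now(timezone.utc)))
```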
Check the Node Performance dashboard for the bottleneck
Navigate to the Node Performance dashboard. The "Error Rate by Node Type" bar chart shows which node is failing. The "Slowest Nodes (Top 10)" table confirms if there is a latency issue as well.
Decide: mitigate or investigate further
If the root cause is clear (e.g., "external API is down"), initiate mitigation (circuit breaker, fallback). If the root cause is unclear, continue the investigation using the Error Analysis workflow. Post a status update in the incident channel before 10 minutes have elapsed.
Alert-Specific Response Guides
| Alert | First Query to Run | Likely Causes |
|---|---|---|
| WorkflowErrorRateHigh | Loki: {job="processengine", level="error"} \| json \| nodeType != "" | External API down, invalid workflow config, database connection failure |
| WorkflowP99LatencyHigh | PromQL: topk(5, histogram_quantile(0.99, ...)) by node_type | Slow external dependency, database query slowdown, OTel Collector overloaded |
| HILSLABreached | Loki: {job="processengine"} \| json \| hilStatus="overdue" | Approvers not responding; notify business team, not engineering |
| ProcessEngineDown | Prometheus targets: check if processengine is "up" | Service crash, OOM kill, deployment failure |
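For ProcessEngineDown, the "up" check can also be read from the Prometheus HTTP API (`/api/v1/query` with the built-in `up` metric). The sketch below parses an instant-query response and lists the instances reporting `0`. It assumes the Prometheus scrape job is named `processengine`, matching the Loki label used elsewhere on this page; the sample response and instance addresses are hypothetical.

```python
def down_instances(response: dict) -> list[str]:
    """Return instances whose `up` value is 0 in a Prometheus instant-query response."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    down = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string, e.g. "0"
        if float(value) == 0.0:
            down.append(series["metric"].get("instance", "<unknown>"))
    return down


if __name__ == "__main__":
    # Hypothetical response for the query: up{job="processengine"}
    sample = {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"job": "processengine", "instance": "10.0.0.5:8080"},
                 "value": [1748183700, "1"]},
                {"metric": {"job": "processengine", "instance": "10.0.0.6:8080"},
                 "value": [1748183700, "0"]},
            ],
        },
    }
    print(down_instances(sample))
```

An empty result is also a signal: it usually means the target is no longer being discovered at all, which this function reports as "nothing down".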
If you are performing planned maintenance that will cause alerts to fire, create a Grafana silence before starting. Navigate to Alerting → Silences → Add Silence. Set the matchers to the relevant alerts (e.g., environment = staging) and set the duration to cover the maintenance window plus a 30-minute buffer. Always add a comment explaining why the silence was created.
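If silences are created from a script rather than the UI, the same rules (matchers, window plus 30 minutes, mandatory reason) can be encoded when building the request body. This sketch assumes the Alertmanager v2 silence object shape (`matchers`, `startsAt`, `endsAt`, `createdBy`, `comment`); check which endpoint and payload your Grafana version accepts before relying on it. `build_silence` and the example values are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def build_silence(matchers: dict[str, str], start: datetime, end: datetime,
                  author: str, reason: str, buffer_minutes: int = 30) -> dict:
    """Build a silence payload covering the maintenance window plus a buffer."""
    if not reason.strip():
        # The runbook requires a reason on every silence.
        raise ValueError("a reason comment is required")
    if end <= start:
        raise ValueError("maintenance window must end after it starts")
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": start.isoformat(),
        "endsAt": (end + timedelta(minutes=buffer_minutes)).isoformat(),
        "createdBy": author,
        "comment": reason,
    }


if __name__ == "__main__":
    window_start = datetime(2025, 5, 25, 22, 0, tzinfo=timezone.utc)
    window_end = datetime(2025, 5, 25, 23, 0, tzinfo=timezone.utc)
    print(build_silence({"environment": "staging"}, window_start, window_end,
                        author="oncall@example.com",
                        reason="Planned staging database upgrade"))
```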