BizFirst Observe
Error Analysis Workflow
Error analysis answers four questions: what failed, how often, which tenants are affected, and what the root cause is. This workflow takes you from the error rate metric, to error log patterns, to a specific trace showing the exception. Together these form the complete chain of signals for a production incident.
Phase 1: Quantify the Error Rate
# In Grafana Explore (Prometheus data source), establish the scale of the problem:
# Current error rate across all tenants:
sum(rate(bizfirst_workflow_executions_total{status="failed"}[5m]))
/
sum(rate(bizfirst_workflow_executions_total[5m]))
* 100
# Error rate broken down by tenant — find which tenants are affected:
sum(rate(bizfirst_workflow_executions_total{status="failed"}[5m])) by (tenant_id)
/
sum(rate(bizfirst_workflow_executions_total[5m])) by (tenant_id)
* 100
# Node errors per second, broken down by node type (an absolute rate, not a percentage), to find which node is failing:
sum(rate(bizfirst_node_execution_duration_seconds_count{status="error"}[5m])) by (node_type)
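The per-tenant ratio above can also be pulled from a script during an incident. The following is a minimal sketch, assuming a Prometheus server at a hypothetical `PROM_URL` and the standard `/api/v1/query` instant-query endpoint. The helper names (`tenant_error_query`, `parse_vector`, `fetch_error_rates`) are illustrative and are not part of BizFirst Observe.

```python
import json
import math
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus.example.internal:9090"  # hypothetical address


def tenant_error_query(window: str = "5m") -> str:
    """Build the per-tenant error percentage query for a given rate window."""
    failed = (
        f'sum(rate(bizfirst_workflow_executions_total{{status="failed"}}[{window}]))'
        " by (tenant_id)"
    )
    total = f"sum(rate(bizfirst_workflow_executions_total[{window}])) by (tenant_id)"
    return f"{failed} / {total} * 100"


def parse_vector(response: dict, label: str = "tenant_id") -> list:
    """Turn an instant-query response into (label value, percent) pairs, worst first."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    rows = []
    for sample in response["data"]["result"]:
        _, value = sample["value"]  # [unix timestamp, value encoded as a string]
        percent = float(value)
        if math.isnan(percent):  # 0/0 for tenants with no traffic in the window
            continue
        rows.append((sample["metric"].get(label, "<none>"), percent))
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_error_rates(window: str = "5m") -> list:
    """Run the query against PROM_URL (requires network access to Prometheus)."""
    params = urllib.parse.urlencode({"query": tenant_error_query(window)})
    with urllib.request.urlopen(f"{PROM_URL}/api/v1/query?{params}", timeout=10) as resp:
        return parse_vector(json.load(resp))
```

Sorting worst first puts the most affected tenants at the top of the list, which is usually what you want to paste into an incident channel.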
Phase 2: Find Error Log Patterns
# In Grafana Explore (Loki data source):
# Find all error logs (set the Explore time picker to the last 15 minutes):
{job="processengine", environment="production", level="error"}
| json
| line_format "{{.timestamp}} [{{.nodeType}}] {{.message}}"
# Count error occurrences by message to find the most common error:
sum by (message) (
  count_over_time(
    {job="processengine", level="error"} | json [15m]
  )
)
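# Find the node type that produces the most errors:
sum by (nodeType) (
  count_over_time(
    {job="processengine", level="error"} | json [15m]
  )
)
# Note: unwrap only extracts numeric values for range aggregations. To group by a string field such as nodeType, parse it with | json and aggregate with sum by (nodeType)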
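The grouping these LogQL aggregations perform can be reproduced offline, for example on log lines exported from Explore. This is a minimal sketch, assuming each line is a JSON object carrying the `message` and `nodeType` fields used in the queries above. The function name `count_errors` is hypothetical.

```python
import json
from collections import Counter


def count_errors(lines, field):
    """Count exported error log lines by one JSON field, most frequent first."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        counts[record.get(field, "<missing>")] += 1
    return counts.most_common()


if __name__ == "__main__":
    exported = [
        '{"nodeType": "HttpRequestNode", "message": "Request timeout"}',
        '{"nodeType": "HttpRequestNode", "message": "Request timeout"}',
        '{"nodeType": "ScriptNode", "message": "Unhandled exception"}',
    ]
    for node_type, count in count_errors(exported, "nodeType"):
        print(f"{count:4d}  {node_type}")
```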
Phase 3: Find the Exception Detail
# Drill into a specific error to see the exception:
{job="processengine", level="error"} | json
| message =~ ".*timeout.*"
| line_format "executionId={{.executionId}} node={{.nodeType}} error={{.exceptionMessage}}"
# Example output:
# executionId=exec-abc123 node=HttpRequestNode error=Request timeout after 30000ms
# executionId=exec-def456 node=HttpRequestNode error=Request timeout after 30000ms
# executionId=exec-ghi789 node=HttpRequestNode error=Connection refused
# Pattern: every failure is in HttpRequestNode (timeouts, plus one refused connection), which points to an external dependency issue
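When there are hundreds of such lines, spotting the pattern by eye gets harder. The sketch below tallies the formatted output by node and error class, replacing digit runs with a placeholder so that messages differing only in a number (for example a timeout value) fall into one bucket. It assumes the exact `executionId=... node=... error=...` layout produced by the `line_format` above. The names `LINE_RE` and `summarize` are illustrative.

```python
import re
from collections import Counter

# Matches the line_format layout: executionId=<id> node=<type> error=<text>
LINE_RE = re.compile(
    r"^executionId=(?P<execution_id>\S+) node=(?P<node>\S+) error=(?P<error>.*)$"
)


def summarize(lines):
    """Count (node, normalized error) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # ignore lines that do not follow the expected layout
        error_class = re.sub(r"\d+", "<n>", match["error"])
        counts[(match["node"], error_class)] += 1
    return counts.most_common()


if __name__ == "__main__":
    output = [
        "executionId=exec-abc123 node=HttpRequestNode error=Request timeout after 30000ms",
        "executionId=exec-def456 node=HttpRequestNode error=Request timeout after 30000ms",
        "executionId=exec-ghi789 node=HttpRequestNode error=Connection refused",
    ]
    for (node, error), count in summarize(output):
        print(f"{count:4d}  {node}: {error}")
```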
Phase 4: Correlate Error to Trace
# Find a specific error log line and its traceId:
{job="processengine", level="error"} |= "exec-abc123" | json
# In the log line, find:
# traceId: "4bf92f3577b34da6a3ce929d0e0e4736"
# Search for this trace in Tempo:
# Grafana Explore → Tempo → TraceId search
# Paste: 4bf92f3577b34da6a3ce929d0e0e4736
# In the trace, look for the error span:
# - Spans with status = "error" are highlighted in red
# - Click the error span to see: exception.type, exception.message, exception.stacktrace
# - The stacktrace shows exactly which line of code threw the exception
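The log-to-trace hop can be scripted as well. This is a minimal sketch, assuming the log line is JSON with a `traceId` field as shown above, and that Tempo is reachable at a hypothetical `TEMPO_URL` exposing its trace-by-ID endpoint under `/api/traces/`. The validation follows the W3C Trace Context rule that a trace ID is 32 lowercase hex characters and is not all zeros.

```python
import json
import re

TEMPO_URL = "http://tempo.example.internal:3200"  # hypothetical address
TRACE_ID_RE = re.compile(r"[0-9a-f]{32}")


def extract_trace_id(log_line: str) -> str:
    """Pull a well-formed trace ID out of one JSON log line."""
    record = json.loads(log_line)
    trace_id = str(record.get("traceId", "")).lower()
    if not TRACE_ID_RE.fullmatch(trace_id) or set(trace_id) == {"0"}:
        raise ValueError(f"no valid traceId in log line: {trace_id!r}")
    return trace_id


def tempo_trace_url(trace_id: str) -> str:
    """Build the URL for fetching a single trace by ID from Tempo."""
    return f"{TEMPO_URL}/api/traces/{trace_id}"


if __name__ == "__main__":
    line = '{"executionId": "exec-abc123", "traceId": "4bf92f3577b34da6a3ce929d0e0e4736"}'
    print(tempo_trace_url(extract_trace_id(line)))
```

Rejecting malformed or all-zero IDs early avoids a confusing "trace not found" result when the log line was emitted outside an active span.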
Phase 5: Determine Blast Radius
# How many executions have failed in the last hour? (increase() extrapolates, so the result may not be a whole number)
sum(increase(bizfirst_workflow_executions_total{status="failed"}[1h]))
# Which tenants are affected?
sum by (tenant_id) (increase(bizfirst_workflow_executions_total{status="failed"}[1h]))
> 0
# Is this affecting a specific workflow type?
sum by (workflow_type) (
  rate(bizfirst_workflow_executions_total{status="failed"}[5m])
)
# How long has this been happening? (look at historical error rate)
# In Grafana: Flow Studio Overview dashboard → Error Rate panel → Zoom out to 6h view
# Find when the error rate started climbing
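Finding the onset can also be done on exported data rather than by eye. The sketch below scans `(timestamp, value)` samples, in the shape a Prometheus range query returns, for the first point where the error rate exceeds a threshold and stays above it for several consecutive samples. The threshold and the `sustain` count are assumptions to tune per incident, and `find_onset` is a hypothetical helper name.

```python
def find_onset(samples, threshold, sustain=3):
    """Return the timestamp that starts the first run of `sustain` consecutive
    samples above `threshold`, or None if no such run exists."""
    run_start = None
    run_length = 0
    for timestamp, value in samples:
        if float(value) > threshold:  # range-query values arrive as strings
            if run_length == 0:
                run_start = timestamp
            run_length += 1
            if run_length >= sustain:
                return run_start
        else:
            run_length = 0  # a dip below the threshold resets the run
    return None


if __name__ == "__main__":
    # Error rate in percent, sampled once a minute (illustrative values).
    series = [(0, "0.5"), (60, "6"), (120, "0.4"), (180, "7"), (240, "9"), (300, "12")]
    print(find_onset(series, threshold=5.0))
```

Requiring a sustained run keeps a single noisy spike from being reported as the start of the incident.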
Document Your Findings as You Go
During an active incident, paste your key queries and findings into a shared incident document as you discover them. Other engineers joining the incident can immediately see what you've found without repeating your steps. Include: the PromQL query that showed the error rate, the LogQL query that found the error pattern, the TraceId of the representative trace, and your current hypothesis for root cause.
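One way to keep those notes consistent is a small template that every responder fills in the same way. This is a minimal sketch; the `IncidentFindings` class and its field names are hypothetical and simply mirror the four items listed above.

```python
from dataclasses import dataclass


@dataclass
class IncidentFindings:
    promql: str      # query that showed the error rate
    logql: str       # query that found the error pattern
    trace_id: str    # representative trace
    hypothesis: str  # current root-cause hypothesis

    def render(self) -> str:
        """Format the findings as plain text for pasting into the incident document."""
        return "\n".join([
            "Error rate query (PromQL):",
            f"  {self.promql}",
            "Error pattern query (LogQL):",
            f"  {self.logql}",
            f"Representative trace: {self.trace_id}",
            f"Current hypothesis: {self.hypothesis}",
        ])


if __name__ == "__main__":
    notes = IncidentFindings(
        promql='sum(rate(bizfirst_workflow_executions_total{status="failed"}[5m])) by (tenant_id)',
        logql='{job="processengine", level="error"} | json | message =~ ".*timeout.*"',
        trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
        hypothesis="HttpRequestNode timeouts caused by an external dependency",
    )
    print(notes.render())
```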