Error Analysis Workflow — Using the Tools

Phase 1: Quantify the Error Rate

# In Grafana Explore — Prometheus — understand the scale of the problem:

# Current error rate across all tenants:
sum(rate(bizfirst_workflow_executions_total{status="failed"}[5m]))
  /
sum(rate(bizfirst_workflow_executions_total[5m]))
* 100

# Error rate broken down by tenant — find which tenants are affected:
sum(rate(bizfirst_workflow_executions_total{status="failed"}[5m])) by (tenant_id)
  /
sum(rate(bizfirst_workflow_executions_total[5m])) by (tenant_id)
* 100

# Node errors per second, broken down by node type, to find which node is failing
# (a raw per-second count, not a percentage like the two queries above):
sum(rate(bizfirst_node_execution_duration_seconds_count{status="error"}[5m])) by (node_type)
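
The percentage queries in this phase divide the rate of failed executions by the rate of all executions over the same window. As a rough illustration of that arithmetic, the Python sketch below computes the same percentage from two raw counter readings taken a window apart. The function name and sample numbers are hypothetical, and the real `rate()` function also handles counter resets and extrapolation, which this sketch ignores.

```python
def error_rate_percent(failed_start, failed_end, total_start, total_end):
    """Percentage of executions that failed between two counter readings.

    Hypothetical helper: both rates cover the same window, so the window
    length cancels out and only the counter deltas matter.
    """
    failed_delta = failed_end - failed_start
    total_delta = total_end - total_start
    if total_delta <= 0:
        # No executions in the window (or a counter reset): ratio is undefined.
        return None
    return 100 * failed_delta / total_delta


# Made-up readings: total goes 1000 -> 1200, failed goes 40 -> 52.
print(error_rate_percent(40, 52, 1000, 1200))  # 6.0
```

A `None` result corresponds to the empty or `NaN` result the PromQL ratio gives when no executions ran in the window.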

Phase 2: Find Error Log Patterns

# In Grafana Explore — Loki:

# Show all error logs (set the Explore time range to the last 15 minutes):
{job="processengine", environment="production", level="error"}
  | json
  | line_format "{{.timestamp}} [{{.nodeType}}] {{.message}}"

# Count error occurrences by message to find the most common error:
sum by (message) (
  count_over_time(
    {job="processengine", level="error"} | json [15m]
  )
)

# Find the node type that produces the most errors:
sum by (nodeType) (
  count_over_time(
    {job="processengine", level="error"} | json [15m]
  )
)
# Note: unwrap only works on numeric values, so group by the parsed nodeType label instead
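
To make the grouping concrete, here is a minimal Python sketch of what the count-by-field queries in this phase compute, applied to a handful of log lines held in memory. The sample lines (including the `ScriptNode` entry) and the function name are hypothetical; only the `level`, `nodeType` and `message` field names come from the queries in this section.

```python
import json
from collections import Counter

# Hypothetical sample lines; a real stream would come from Loki.
SAMPLE_LINES = [
    '{"level": "error", "nodeType": "HttpRequestNode", "message": "Request timeout"}',
    '{"level": "error", "nodeType": "HttpRequestNode", "message": "Request timeout"}',
    '{"level": "error", "nodeType": "ScriptNode", "message": "Unhandled exception"}',
    '{"level": "info", "nodeType": "HttpRequestNode", "message": "Request completed"}',
]


def count_errors_by(lines, field):
    """Count error-level log lines grouped by one parsed JSON field."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        if record.get("level") == "error":
            counts[record.get(field, "<missing>")] += 1
    return counts


by_node = count_errors_by(SAMPLE_LINES, "nodeType")
print(by_node.most_common(1))  # [('HttpRequestNode', 2)]
```

Calling `count_errors_by(SAMPLE_LINES, "message")` gives the per-message counts in the same way.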

Phase 3: Find the Exception Detail

# Drill into a specific error to see the exception:
{job="processengine", level="error"} | json
  | message =~ ".*timeout.*"
  | line_format "executionId={{.executionId}} node={{.nodeType}} error={{.exceptionMessage}}"

# Example output:
# executionId=exec-abc123 node=HttpRequestNode error=Request timeout after 30000ms
# executionId=exec-def456 node=HttpRequestNode error=Request timeout after 30000ms
# executionId=exec-ghi789 node=HttpRequestNode error=Connection refused

# Pattern: HttpRequestNode is failing with timeouts — external dependency issue

Phase 4: Correlate Error to Trace

# Find a specific error log line and its traceId:
{job="processengine", level="error"} |= "exec-abc123" | json

# In the log line, find:
# traceId: "4bf92f3577b34da6a3ce929d0e0e4736"

# Search for this trace in Tempo:
# Grafana Explore → Tempo → TraceId search
# Paste: 4bf92f3577b34da6a3ce929d0e0e4736

# In the trace, look for the error span:
# - Spans with status = "error" are highlighted in red
# - Click the error span to see: exception.type, exception.message, exception.stacktrace
# - The stacktrace shows exactly which line of code threw the exception
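
If the matching log lines have been exported or copied out of Grafana, the trace ID lookup can be scripted. The sketch below is one possible way to do it in Python, assuming each line is a JSON object with `executionId` and `traceId` fields as in the example above. The function name is hypothetical and the sample line is assembled from the example values in this section.

```python
import json

# Hypothetical log line built from the example values in this section.
LOG_LINE = (
    '{"level": "error", "executionId": "exec-abc123", '
    '"nodeType": "HttpRequestNode", '
    '"traceId": "4bf92f3577b34da6a3ce929d0e0e4736"}'
)


def trace_id_for(lines, execution_id):
    """Return the traceId of the first log line for the execution, or None."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if record.get("executionId") == execution_id and record.get("traceId"):
            return record["traceId"]
    return None


print(trace_id_for([LOG_LINE], "exec-abc123"))
```

The returned value is what you would paste into the Tempo TraceId search.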

Phase 5: Determine Blast Radius

# How many executions have failed in the last hour?
sum(increase(bizfirst_workflow_executions_total{status="failed"}[1h]))

# Which tenants are affected?
sum by (tenant_id) (increase(bizfirst_workflow_executions_total{status="failed"}[1h]))
  > 0

# Is this affecting a specific workflow type?
sum by (workflow_type) (
  rate(bizfirst_workflow_executions_total{status="failed"}[5m])
)

# How long has this been happening? (look at historical error rate)
# In Grafana: Flow Studio Overview dashboard → Error Rate panel → Zoom out to 6h view
# Find when the error rate started climbing
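
Finding the onset by eye works for most incidents. If the error-rate series has been exported, a small script can pinpoint the first sample above a chosen threshold. The following Python sketch assumes chronologically ordered `(timestamp, percent)` pairs; the function name, the threshold and the sample series are all made up for illustration.

```python
def first_breach(samples, threshold_percent):
    """Return the timestamp of the first sample above the threshold, or None.

    samples: (timestamp, error_rate_percent) pairs in chronological order.
    """
    for timestamp, value in samples:
        if value > threshold_percent:
            return timestamp
    return None


# Made-up series: the error rate sits below 1% and then climbs.
series = [
    ("09:00", 0.4),
    ("09:05", 0.6),
    ("09:10", 3.2),
    ("09:15", 7.9),
]
print(first_breach(series, 2.0))  # 09:10
```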

Document Your Findings as You Go

During an active incident, paste your key queries and findings into a shared incident document as you discover them, so that engineers joining later can see what you have found without repeating your steps. Include the PromQL query that showed the error rate, the LogQL query that found the error pattern, the trace ID of a representative trace, and your current root-cause hypothesis.
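
One way to keep these notes consistent is a small template. The Python sketch below renders the four items as a plain-text block that can be pasted into the incident document. The class and field names are hypothetical, and the example values are taken from earlier in this section.

```python
from dataclasses import dataclass


@dataclass
class IncidentFinding:
    """Hypothetical container for the four items worth recording."""
    promql: str
    logql: str
    trace_id: str
    hypothesis: str

    def render(self):
        """Format the finding as plain text for a shared document."""
        return "\n".join([
            f"PromQL (error rate): {self.promql}",
            f"LogQL (error pattern): {self.logql}",
            f"Representative trace ID: {self.trace_id}",
            f"Current hypothesis: {self.hypothesis}",
        ])


finding = IncidentFinding(
    promql='sum(rate(bizfirst_workflow_executions_total{status="failed"}[5m]))',
    logql='{job="processengine", level="error"} | json',
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    hypothesis="HttpRequestNode timeouts caused by an external dependency",
)
print(finding.render())
```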
