Trace a Slow Node — Using the Tools

Step 1: Identify the Slow Node Type in Prometheus

# In Grafana Explore, select the Prometheus data source
# (or open the Node Performance dashboard)

# Find the top 5 slowest node types by P99 latency:
topk(5,
  histogram_quantile(0.99,
    sum(rate(bizfirst_node_execution_duration_seconds_bucket[15m])) by (node_type, le)
  )
)

# Example output:
# HttpRequestNode: 28.5s (P99)
# OctopusNode: 12.3s (P99)
# DatabaseQueryNode: 0.8s (P99)

# Focus on the slowest: HttpRequestNode at 28.5s P99
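
The same ranking can be pulled outside Grafana through the Prometheus HTTP API, which is handy for scripts or scheduled reports. The following is a minimal sketch, not part of the BizFirst tooling: the function names and the base URL are hypothetical, and it assumes Prometheus is reachable over plain HTTP without authentication. It sends the Step 1 query to the instant-query endpoint (/api/v1/query) and sorts the result by P99.

```python
import json
import urllib.parse
import urllib.request

# The PromQL from Step 1, unchanged.
SLOWEST_NODES_QUERY = (
    "topk(5, histogram_quantile(0.99, "
    "sum(rate(bizfirst_node_execution_duration_seconds_bucket[15m])) "
    "by (node_type, le)))"
)


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_p99_by_node_type(payload: dict) -> list[tuple[str, float]]:
    """Return (node_type, p99_seconds) pairs, slowest first."""
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error')}")
    rows = []
    for sample in payload["data"]["result"]:
        node_type = sample["metric"].get("node_type", "<unknown>")
        _timestamp, value = sample["value"]  # the sample value arrives as a string
        p99 = float(value)
        if p99 == p99:  # skip NaN, which histogram_quantile returns for idle series
            rows.append((node_type, p99))
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_slowest_nodes(base_url: str) -> list[tuple[str, float]]:
    """Run the Step 1 query against a live Prometheus (performs a network call)."""
    url = build_query_url(base_url, SLOWEST_NODES_QUERY)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_p99_by_node_type(json.load(resp))
```

Calling fetch_slowest_nodes("http://localhost:9090") (an assumed address) would return a list such as the example output above, with the first entry being the node type to investigate.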

Step 2: Jump to a Slow Trace via Exemplars

# In the Node Performance dashboard:
# 1. Find the "P99 Latency by Node Type" time series panel
# 2. Look for diamond-shaped exemplar markers on the chart
#    (these appear when a data point has an associated trace)
# 3. Click an exemplar diamond on the HttpRequestNode series
# 4. A popup shows the TraceId — click "View trace" to open in Tempo

# If exemplars are not visible:
# 1. Check that Prometheus runs with exemplar storage enabled (requires Prometheus 2.26+).
#    This is a command-line feature flag, not a prometheus.yml setting:
#    prometheus --enable-feature=exemplar-storage
# 2. Check that the BizFirst service attaches exemplars (trace IDs) to its histogram observations
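
To tell whether the problem is in Grafana or upstream, it can help to ask Prometheus directly whether it holds any exemplars for the metric. The sketch below targets the Prometheus exemplar endpoint (/api/v1/query_exemplars) and extracts trace IDs from its response. The function names are hypothetical, and the exemplar label name that carries the trace ID is an assumption: common choices are trace_id and traceID, so check which one the BizFirst service actually emits.

```python
import urllib.parse


def build_exemplar_query_url(base_url: str, promql: str, start: float, end: float) -> str:
    """Build a URL for the Prometheus exemplar query endpoint.

    start and end are Unix timestamps in seconds.
    """
    params = urllib.parse.urlencode({"query": promql, "start": start, "end": end})
    return f"{base_url.rstrip('/')}/api/v1/query_exemplars?{params}"


def extract_trace_ids(
    payload: dict,
    trace_label_names: tuple = ("trace_id", "traceID", "TraceId"),  # assumed label names
) -> list:
    """Collect trace IDs from a query_exemplars response, in response order."""
    trace_ids = []
    for series in payload.get("data", []):
        for exemplar in series.get("exemplars", []):
            labels = exemplar.get("labels", {})
            for name in trace_label_names:
                if name in labels:
                    trace_ids.append(labels[name])
                    break  # one trace ID per exemplar is enough
    return trace_ids
```

An empty list for a query such as bizfirst_node_execution_duration_seconds_bucket{node_type="HttpRequestNode"} suggests that exemplars are not being stored or emitted at all. A non-empty list suggests the data is there and the dashboard panel or data source settings are worth checking instead.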

Step 3: Read the Trace Waterfall in Tempo

# In Grafana Explore — Tempo — after clicking the exemplar trace:
# The trace waterfall shows:
#
# workflow.execute (total: 32s)
#  ├── node.execute: MapperNode (12ms)
#  ├── node.execute: ValidationNode (8ms)
#  ├── node.execute: HttpRequestNode (29.1s)  ← SLOW
#  │    └── http.client.request (28.9s)       ← Waiting for external HTTP call
#  └── node.execute: WriterNode (15ms)
#
# The HttpRequestNode span shows:
# Attribute: http.url = "https://external-api.example.com/v1/data"
# Attribute: http.status_code = 200
# Duration: 29.1 seconds — external API responded slowly

# Root cause: The slow node is waiting on an external API that takes ~29 seconds.
# Action: Check whether the external API is meeting its SLA; consider setting a
# request timeout on the HttpRequestNode so that a slow dependency fails fast.
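
Reading the waterfall amounts to asking which span spends the most time in its own work rather than in its children. The sketch below computes that "self time" for the example trace above. The Span type and function names are hypothetical, and the calculation assumes child spans run one after another; with parallel children, subtracting their summed durations would understate the parent's self time.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_s: float


def self_times(spans: list) -> dict:
    """Map span_id to self time: duration minus the summed duration of direct children."""
    child_total = defaultdict(float)
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] += span.duration_s
    # Clamp at zero in case clock skew makes children appear longer than the parent.
    return {s.span_id: max(0.0, s.duration_s - child_total[s.span_id]) for s in spans}


def slowest_by_self_time(spans: list) -> Span:
    """Return the span where the most time is actually spent."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])


# The waterfall from Step 3, with durations in seconds.
WATERFALL = [
    Span("1", None, "workflow.execute", 32.0),
    Span("2", "1", "node.execute: MapperNode", 0.012),
    Span("3", "1", "node.execute: ValidationNode", 0.008),
    Span("4", "1", "node.execute: HttpRequestNode", 29.1),
    Span("5", "4", "http.client.request", 28.9),
    Span("6", "1", "node.execute: WriterNode", 0.015),
]

if __name__ == "__main__":
    print(slowest_by_self_time(WATERFALL).name)
```

For this trace the HttpRequestNode span itself accounts for only about 0.2 s; nearly all of its 29.1 s sits in the http.client.request child, which matches the conclusion that the node is waiting on the external API.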

Step 4: Search for All Slow Traces of This Type

# TraceQL query to find all executions where HttpRequestNode was slow:
# In Grafana Explore — Tempo — TraceQL mode:

{ span.node.type = "HttpRequestNode" && duration > 10s }

# This returns all traces containing a slow HttpRequestNode span (> 10s)
# Use the results to understand if this is isolated or systematic:
# - Isolated: one slow call → external API latency spike
# - Systematic: all calls slow → external API degradation, or configuration issue
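
The isolated-versus-systematic distinction can be made concrete by comparing the number of slow HttpRequestNode spans with the total number in the same time window. The following is an illustrative sketch only: the function name is hypothetical and the 50% cut-off is an arbitrary assumption, not a BizFirst recommendation. It expects the durations (in seconds) of all HttpRequestNode spans in the window, not just the slow ones.

```python
def classify_slowness(
    durations_s: list,
    slow_threshold_s: float = 10.0,   # same threshold as the TraceQL query above
    systematic_share: float = 0.5,    # assumed cut-off; tune for your workload
) -> str:
    """Classify a window of span durations as healthy, isolated, or systematic."""
    if not durations_s:
        return "no data"
    slow = sum(1 for d in durations_s if d > slow_threshold_s)
    if slow == 0:
        return "healthy"
    share = slow / len(durations_s)
    return "systematic" if share >= systematic_share else "isolated"
```

For example, one 29.1 s call among three sub-second calls is classified as isolated, while a window in which every call exceeds 10 s is classified as systematic.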

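# Narrow to workflow executions, lower the threshold to 5s, and show the tenant for each match.
# The query itself has no time filter: set the Explore time picker to "Last 1 hour".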
{ rootName = "workflow.execute" && span.node.type = "HttpRequestNode" && duration > 5s }
  | select(span.tenant.id, duration)
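
The same TraceQL search can be run against the Tempo HTTP search API, where the time window is passed explicitly as start and end timestamps rather than through the Grafana time picker. The sketch below builds such a request for the last hour and summarizes the response. The function names, the Tempo address, and the limit of 20 are assumptions, and the response fields used (traceID, durationMs) should be verified against your Tempo version.

```python
import time
import urllib.parse

TRACEQL = (
    '{ rootName = "workflow.execute" && span.node.type = "HttpRequestNode" '
    "&& duration > 5s }"
)


def build_tempo_search_url(
    base_url: str,
    traceql: str,
    now: float = None,
    lookback_s: int = 3600,  # one hour
    limit: int = 20,
) -> str:
    """Build a Tempo /api/search URL covering the last lookback_s seconds."""
    end = int(now if now is not None else time.time())
    start = end - lookback_s
    params = urllib.parse.urlencode(
        {"q": traceql, "start": start, "end": end, "limit": limit}
    )
    return f"{base_url.rstrip('/')}/api/search?{params}"


def summarize_traces(payload: dict) -> list:
    """Return (trace_id, duration_seconds) pairs from a Tempo search response."""
    return [
        (trace["traceID"], trace.get("durationMs", 0) / 1000.0)
        for trace in payload.get("traces", [])
    ]
```

A call such as build_tempo_search_url("http://tempo:3200", TRACEQL) yields a URL that can be fetched with any HTTP client; the host and port here are placeholders.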

Metrics Identify the Problem, Traces Show the Root Cause

Prometheus metrics (P99 latency by node_type) tell you which node type is slow. A distributed trace tells you why: a slow external HTTP call, a slow database query, or a slow downstream service. Complete root cause analysis needs both signals: metrics narrow the search to a node type, and traces explain an individual slow execution.
