BizFirst Observe
Trace a Slow Node
Users report that workflows are running slowly, and you need to identify which node type is the bottleneck. This walkthrough starts at the P99 latency metric in Prometheus, follows an exemplar to a specific slow trace in Tempo, and ends at the exact span that caused the delay.
Step 1: Identify the Slow Node Type in Prometheus
# In Grafana Explore, select the Prometheus data source
# (or open the Node Performance dashboard).
# Find the top 5 slowest node types by P99 latency:
topk(5,
  histogram_quantile(0.99,
    sum(rate(bizfirst_node_execution_duration_seconds_bucket[15m])) by (node_type, le)
  )
)
# Example output:
# HttpRequestNode: 28.5s (P99)
# OctopusNode: 12.3s (P99)
# DatabaseQueryNode: 0.8s (P99)
# Focus on the slowest: HttpRequestNode at 28.5s P99
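The P99 values above are estimates: histogram_quantile interpolates linearly inside the bucket that contains the 99th-percentile rank, so their precision depends on the histogram's bucket boundaries. The sketch below reproduces that interpolation in Python for illustration. The bucket bounds and counts are made-up sample data, not real BizFirst output, and the function is a simplified approximation of the PromQL behavior rather than its exact implementation.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound
    and ending with the +Inf bucket, like a Prometheus *_bucket series.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                # Empty bucket (only reachable when rank is 0): nothing to interpolate.
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical cumulative bucket counts for one node type (bounds in seconds).
sample_buckets = [
    (1.0, 900), (5.0, 950), (10.0, 970), (30.0, 995), (60.0, 1000), (math.inf, 1000),
]

p99 = histogram_quantile(0.99, sample_buckets)
print(f"estimated P99: {p99:.1f}s")
```

With these sample counts the 99th-percentile rank lands in the 10s to 30s bucket, so the estimate can only be as fine as that 20-second range. If a node type's P99 looks suspiciously round or jumps between bucket bounds, coarse buckets are a likely cause.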
Step 2: Jump to a Slow Trace via Exemplars
# In the Node Performance dashboard:
# 1. Find the "P99 Latency by Node Type" time series panel
# 2. Look for diamond-shaped exemplar markers on the chart
# (they appear when a data point has an associated trace)
# 3. Click an exemplar diamond on the HttpRequestNode series
# 4. A popup shows the TraceId — click "View trace" to open in Tempo
# If exemplars are not visible:
# 1. Check that Prometheus runs with exemplar storage enabled (requires Prometheus 2.26+).
#    This is a command-line flag, not a prometheus.yml setting:
#    --enable-feature=exemplar-storage
# 2. Check that the BizFirstGO service emits exemplars with its histogram observations
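When the diamonds do not show up, it can help to ask Prometheus directly whether it has stored any exemplars for the metric. The sketch below builds a request URL for the Prometheus `/api/v1/query_exemplars` HTTP endpoint and extracts trace IDs from a response body. The base URL, the sample response, and the exemplar label name `trace_id` are assumptions for illustration; use the label name that the service actually attaches.

```python
import json
from urllib.parse import urlencode

def exemplar_query_url(base_url, promql, start, end):
    """Build a URL for the Prometheus /api/v1/query_exemplars endpoint."""
    params = urlencode({"query": promql, "start": start, "end": end})
    return f"{base_url.rstrip('/')}/api/v1/query_exemplars?{params}"

def trace_ids(response_body, label="trace_id"):
    """Collect trace IDs from a query_exemplars JSON response.

    The label name is an assumption; exemplar labels are chosen by the
    instrumented service.
    """
    payload = json.loads(response_body)
    ids = []
    for series in payload.get("data", []):
        for exemplar in series.get("exemplars", []):
            value = exemplar.get("labels", {}).get(label)
            if value:
                ids.append(value)
    return ids

# Hypothetical Prometheus address and time window (Unix seconds).
url = exemplar_query_url(
    "http://prometheus.example.internal:9090/",
    "bizfirst_node_execution_duration_seconds_bucket",
    1700000000,
    1700003600,
)

# Hand-written sample in the shape the endpoint returns; not real data.
sample_response = json.dumps({
    "status": "success",
    "data": [{
        "seriesLabels": {"node_type": "HttpRequestNode", "le": "30"},
        "exemplars": [
            {"labels": {"trace_id": "4bf92f3577b34da6"}, "value": "28.9", "timestamp": 1700001234.5},
        ],
    }],
})

print(url)
print(trace_ids(sample_response))
```

An empty `data` list for a time window in which slow requests occurred suggests the problem is on the emitting or storage side rather than in the Grafana panel configuration.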
Step 3: Read the Trace Waterfall in Tempo
# In Grafana Explore — Tempo — after clicking the exemplar trace:
# The trace waterfall shows:
#
# workflow.execute (total: 32s)
# ├── node.execute: MapperNode (12ms)
# ├── node.execute: ValidationNode (8ms)
# ├── node.execute: HttpRequestNode (29.1s) ← SLOW
# │   └── http.client.request (28.9s) ← Waiting for external HTTP call
# └── node.execute: WriterNode (15ms)
#
# The HttpRequestNode span shows:
# Attribute: http.url = "https://external-api.example.com/v1/data"
# Attribute: http.status_code = 200
# Duration: 29.1 seconds — external API responded slowly
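# Root cause: The node spends almost all of its time (28.9s of 29.1s) waiting on an
# external API call that succeeds (HTTP 200) but takes ~29 seconds to respond.
# Action: Check whether the external API is meeting its SLA; consider adding a
# client-side timeout to the HTTP request.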
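Reading the waterfall amounts to finding the span with the largest self time, meaning its own duration minus the time covered by its children. The following sketch applies that idea to the example trace from this step. The `Span` class and helper names are hypothetical, and the subtraction assumes child spans run one after another; overlapping (parallel) children would need interval merging instead.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    duration_s: float
    children: list = field(default_factory=list)

def self_time(span):
    """Time spent in the span itself, excluding its direct children.

    Assumes sequential children; clamped at zero to tolerate rounding.
    """
    return max(0.0, span.duration_s - sum(c.duration_s for c in span.children))

def slowest_by_self_time(root):
    """Walk the span tree and return the span with the largest self time."""
    best = root
    stack = [root]
    while stack:
        span = stack.pop()
        if self_time(span) > self_time(best):
            best = span
        stack.extend(span.children)
    return best

# The example trace from the waterfall above.
trace = Span("workflow.execute", 32.0, [
    Span("node.execute: MapperNode", 0.012),
    Span("node.execute: ValidationNode", 0.008),
    Span("node.execute: HttpRequestNode", 29.1, [
        Span("http.client.request", 28.9),
    ]),
    Span("node.execute: WriterNode", 0.015),
])

culprit = slowest_by_self_time(trace)
print(culprit.name)
```

Ranking by self time rather than total duration keeps parent spans such as workflow.execute from masking the child that actually did the waiting.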
Step 4: Search for All Slow Traces of This Type
# TraceQL query to find all executions where HttpRequestNode was slow:
# In Grafana Explore — Tempo — TraceQL mode:
{ span.node.type = "HttpRequestNode" && duration > 10s }
# This returns all traces containing a slow HttpRequestNode span (> 10s)
# Use the results to understand if this is isolated or systematic:
# - Isolated: one slow call → external API latency spike
# - Systematic: all calls slow → external API degradation, or configuration issue
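# Narrow the search to workflow.execute traces and show tenant and duration per match.
# The query itself has no time filter: set the Grafana time range to "Last 1 hour"
# to limit results to the last hour.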
{ rootName = "workflow.execute" && span.node.type = "HttpRequestNode" && duration > 5s }
| select(span.tenant.id, duration)
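The isolated versus systematic distinction can be made concrete by comparing the number of slow HttpRequestNode spans against all HttpRequestNode spans in the same time range (for example, by running the query once with and once without the duration filter). The sketch below shows one way to express that check. The function name and both thresholds are illustrative assumptions, not BizFirst defaults.

```python
def classify_slowness(durations_s, slow_threshold_s=10.0, systematic_ratio=0.5):
    """Label a set of span durations as healthy, isolated, or systematic.

    durations_s: durations (seconds) of all HttpRequestNode spans in the
    time range, slow and fast alike. Both thresholds are assumptions.
    """
    if not durations_s:
        return "no data"
    slow = sum(1 for d in durations_s if d > slow_threshold_s)
    if slow == 0:
        return "healthy"
    ratio = slow / len(durations_s)
    return "systematic" if ratio >= systematic_ratio else "isolated"

# Hypothetical samples: one outlier among fast calls, then mostly slow calls.
print(classify_slowness([0.4, 0.5, 29.1, 0.3]))
print(classify_slowness([28.0, 29.1, 12.3, 0.6]))
```

A result of "isolated" points toward a transient latency spike in the external API, while "systematic" points toward sustained degradation or a configuration issue, matching the two cases listed above.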
Metrics Identify the Problem, Traces Show the Root Cause
Prometheus metrics (P99 latency by node_type) tell you which node type is slow. A distributed trace tells you why: a slow external HTTP call, a slow database query, or a slow downstream service. Complete root cause analysis needs both signals.