Node Metrics — Node Observability

Auto-Emitted Metrics

Metric Name	Type	Description
`node.execution.duration_ms`	Histogram	Time from start to completion (including retries)
`node.execution.retry_count`	Counter	Number of retries before success or final failure
`node.execution.result`	Counter	Incremented with the `result` label set to success, failed, or skipped

Metric Tags (Labels)

All auto-emitted metrics include these tags for filtering in Prometheus / Grafana:

// Tags on every auto-emitted metric:
{
  "nodeType"  : "ApprovalNode",
  "nodeId"    : "node-approval-1",
  "processId" : "proc-xyz",
  "tenantId"  : "tenant-001",
  "result"    : "success"  // on node.execution.result only
}
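
A helper along the following lines could assemble that tag set. This is a minimal sketch, not the actual BaseNodeExecutor code: the `NodeTagSource` record and the `NodeMetricTags` class are hypothetical stand-ins for whatever `NodeExecutionContext` really exposes, and only the call shape `GetTags(ctx, ("result", ...))` is taken from this page.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the fields NodeExecutionContext would need to expose.
public sealed record NodeTagSource(string NodeType, string NodeId, string ProcessId, string TenantId);

public static class NodeMetricTags
{
    // Builds the common tag set, then adds any metric-specific tags (e.g. "result").
    public static IReadOnlyDictionary<string, string> GetTags(
        NodeTagSource ctx, params (string Key, string Value)[] extra)
    {
        var tags = new Dictionary<string, string>
        {
            ["nodeType"] = ctx.NodeType,
            ["nodeId"] = ctx.NodeId,
            ["processId"] = ctx.ProcessId,
            ["tenantId"] = ctx.TenantId,
        };

        foreach (var (key, value) in extra)
        {
            tags[key] = value; // a metric-specific tag overrides a common tag of the same name
        }

        return tags;
    }
}
```

Called with no extra tuples, the helper returns the four common tags. Passing `("result", "success")` adds the fifth tag that appears only on `node.execution.result`.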

BaseNodeExecutor — Metric Emission Points

// BaseNodeExecutor.cs (simplified)
// Requires: using System.Diagnostics;  (Stopwatch)
public async Task<NodeExecutionResult> ExecuteAsync(NodeExecutionContext ctx, CancellationToken ct)
{
    var sw = Stopwatch.StartNew();
    NodeExecutionResult result;

    try
    {
        result = await RunWithRetryAsync(ctx, ct);  // runs the subclass's execution logic, with retries
    }
    catch (Exception ex)
    {
        result = NodeExecutionResult.Fail(ex);
    }

    // Metrics are emitted after the try/catch rather than in a finally block:
    // the catch handles every exception, and C# does not treat `result` as
    // definitely assigned inside finally, so reading it there would not compile.
    sw.Stop();

    // Auto-emit after execution:
    ctx.Observability.Metrics.RecordHistogram(
        "node.execution.duration_ms", sw.ElapsedMilliseconds,
        GetTags(ctx));

    ctx.Observability.Metrics.IncrementCounter(
        "node.execution.result",
        GetTags(ctx, ("result", result.IsSuccess ? "success" : "failed")));

    return result;
}
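
The simplified excerpt does not show where `node.execution.retry_count` is recorded. One plausible arrangement is for the retry loop to invoke a callback on every retry, and for that callback to increment the counter. This is only a sketch under that assumption: the `RetrySketch` class, the `onRetry` parameter, and the fixed-delay policy are all hypothetical, and the real `RunWithRetryAsync` signature may differ.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class RetrySketch
{
    // Hypothetical retry loop: runs `attempt` up to maxAttempts times and calls
    // `onRetry` once before each re-attempt, so a counter incremented there
    // reflects the retries whether the node finally succeeds or fails.
    public static async Task<T> RunWithRetryAsync<T>(
        Func<CancellationToken, Task<T>> attempt,
        int maxAttempts,
        TimeSpan delay,
        Action onRetry,
        CancellationToken ct)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (var attemptNo = 1; ; attemptNo++)
        {
            try
            {
                return await attempt(ct);
            }
            // Cancellation and the final failed attempt are not retried; they propagate.
            catch (Exception ex) when (ex is not OperationCanceledException && attemptNo < maxAttempts)
            {
                onRetry(); // e.g. increment node.execution.retry_count here
                await Task.Delay(delay, ct);
            }
        }
    }
}
```

Because the delay between attempts sits inside this loop, a stopwatch wrapped around the whole call captures it, whereas a timer inside a single attempt would not.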

Grafana Dashboard Queries

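# Average node execution duration by node type (Prometheus PromQL).
# Assumes the histogram is exposed with the standard Prometheus _sum and _count series.
  sum(rate(node_execution_duration_ms_sum{processId="proc-xyz"}[5m])) by (nodeType)
/
  sum(rate(node_execution_duration_ms_count{processId="proc-xyz"}[5m])) by (nodeType)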

# Failures per second by node type, averaged over the last 5 minutes
sum(rate(node_execution_result{result="failed"}[5m])) by (nodeType)

Executors do not need to time themselves. Do not add Stopwatch or DateTime.Now code inside your executor's ExecuteAsync method to measure duration: BaseNodeExecutor already measures the full execution time, including retries. Manual timing would be redundant and potentially inaccurate, because it would not include the delay between retries.
