Monitoring Timeouts — HIL Timeout

HILExpiredEvent

Regardless of the configured timeout behavior (Escalate, AutoApprove, AutoReject, or Fail), the dispatcher always publishes an HILExpiredEvent after handling the timeout:

public class HILExpiredEvent : IWorkflowEvent
{
    public string   ExecutionResId  { get; init; }
    public Guid     ExecutionId     { get; init; }
    public string   TenantId        { get; init; }
    public string   NodeId          { get; init; }
    public string   TimeoutBehavior { get; init; }  // Escalate | AutoApprove | AutoReject | Fail
    public DateTimeOffset ExpiredAt { get; init; }
    public string   OriginalActorId { get; init; }
    public string?  EscalationActorId { get; init; }  // set only for Escalate
}

// Published in HILTimeoutDispatcher after every behavior handler runs
await _eventBus.PublishAsync(new HILExpiredEvent { ... }, ct);

Observer Panel — Timeout Events

The flowObserverPanelStore subscribes to HIL events streamed via SignalR and renders them in the Observer Panel timeline:

// flowObserverPanelStore.ts
connection.on("HILExpired", (event: HILExpiredEvent) => {
    useFlowObserverPanelStore.getState().addEvent({
        type      : "hil-timeout",
        nodeId    : event.nodeId,
        timestamp : event.expiredAt,
        label     : `HIL timeout — ${event.timeoutBehavior}`,
        actor     : event.originalActorId,
        severity  : event.timeoutBehavior === "Fail" ? "error" : "warning"
    });
});
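SignalR delivers the payload as plain JSON, so the `HILExpiredEvent` annotation on the handler parameter is not enforced at runtime. A minimal sketch of a runtime guard that could run before `addEvent`, assuming the payload uses the camelCase property names shown in the handler above and that `expiredAt` is serialized as an ISO 8601 string. The names `HILExpiredPayload`, `isHILExpiredPayload`, and `severityFor` are hypothetical, not part of the store:

```typescript
// The four behaviors named in the HILExpiredEvent definition.
const TIMEOUT_BEHAVIORS = ["Escalate", "AutoApprove", "AutoReject", "Fail"] as const;
type TimeoutBehavior = (typeof TIMEOUT_BEHAVIORS)[number];

// Hypothetical subset of the event: only the fields the handler reads.
interface HILExpiredPayload {
  nodeId: string;
  expiredAt: string; // assumed ISO 8601, e.g. "2026-05-25T18:00:00Z"
  timeoutBehavior: TimeoutBehavior;
  originalActorId: string;
}

// Returns true only when every field the handler needs is present and well-formed.
function isHILExpiredPayload(value: unknown): value is HILExpiredPayload {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const nodeId = v["nodeId"];
  const actor = v["originalActorId"];
  const expiredAt = v["expiredAt"];
  const behavior = v["timeoutBehavior"];
  return (
    typeof nodeId === "string" &&
    typeof actor === "string" &&
    typeof expiredAt === "string" &&
    !isNaN(Date.parse(expiredAt)) &&
    typeof behavior === "string" &&
    (TIMEOUT_BEHAVIORS as readonly string[]).indexOf(behavior) !== -1
  );
}

// Same rule as the handler: only Fail is rendered as an error.
function severityFor(behavior: TimeoutBehavior): "error" | "warning" {
  return behavior === "Fail" ? "error" : "warning";
}
```

Keeping the guard and the severity rule as pure functions lets them be unit-tested without a live SignalR connection.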

Audit Log Entry Schema

Timeout events are written to Process_AuditLog alongside the suspension record:

// AuditLog entry written by HILTimeoutDispatcher
{
  "auditId"        : "7a3e2b1c-...",
  "tenantId"       : "tenant-001",
  "executionId"    : "exec-abc",
  "executionResId" : "wf-run-xyz",
  "nodeId"         : "node-approval-1",
  "eventType"      : "HILTimeout",
  "behavior"       : "Escalate",
  "originalActor"  : "user-john",
  "escalationActor": "user-manager",
  "expiredAt"      : "2026-05-25T18:00:00Z",
  "recordedAt"     : "2026-05-25T18:00:01Z"
}
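For tooling that reads these entries, the sample can be described with a type, and the gap between `expiredAt` and `recordedAt` gives a rough measure of how long after expiry the entry was written. This is a sketch under assumptions: the field names are taken from the sample above, `escalationActor` is assumed to be absent or null for non-Escalate behaviors, and `HILTimeoutAuditEntry` and `recordingLagMs` are hypothetical names:

```typescript
// Assumed shape, inferred from the sample audit entry above.
interface HILTimeoutAuditEntry {
  auditId: string;
  tenantId: string;
  executionId: string;
  executionResId: string;
  nodeId: string;
  eventType: "HILTimeout";
  behavior: "Escalate" | "AutoApprove" | "AutoReject" | "Fail";
  originalActor: string;
  escalationActor?: string | null; // assumption: present only for Escalate
  expiredAt: string;               // ISO 8601 UTC
  recordedAt: string;              // ISO 8601 UTC
}

// Milliseconds between expiry and the audit write.
function recordingLagMs(
  entry: Pick<HILTimeoutAuditEntry, "expiredAt" | "recordedAt">
): number {
  return Date.parse(entry.recordedAt) - Date.parse(entry.expiredAt);
}
```

For the sample entry above, the lag is 1000 ms.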

Suspended Executions — Monitoring Query

Operations teams can query the suspended executions table to surface at-risk tasks before the timeout fires, as well as overdue tasks that the timeout job has not yet processed:

-- Find tasks expiring within the next hour (at-risk)
SELECT ExecutionResId, SuspendedNodeId, ActorId, ExpiresAt,
       DATEDIFF(MINUTE, GETUTCDATE(), ExpiresAt) AS MinutesRemaining
FROM   Process_SuspendedExecutions
WHERE  Status   = 0   -- Pending
  AND  ExpiresAt BETWEEN GETUTCDATE() AND DATEADD(HOUR, 1, GETUTCDATE())
ORDER BY ExpiresAt ASC;

-- Find tasks that have already expired but have not yet been processed
SELECT ExecutionResId, SuspendedNodeId, ActorId, ExpiresAt,
       TimeoutBehavior
FROM   Process_SuspendedExecutions
WHERE  Status   = 0   -- still Pending
  AND  ExpiresAt < GETUTCDATE()
ORDER BY ExpiresAt ASC;
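The same two conditions can be applied in application code, for example in a dashboard that has already loaded the rows. A minimal sketch, assuming rows are exposed as JSON with camelCase names and ISO 8601 timestamps. The `SuspendedExecutionRow` type and `classifySuspension` function are hypothetical; only `Status = 0` meaning Pending comes from the queries above:

```typescript
// Hypothetical row shape mirroring the columns selected in the queries.
interface SuspendedExecutionRow {
  executionResId: string;
  suspendedNodeId: string;
  actorId: string;
  expiresAt: string; // ISO 8601 UTC
  status: number;    // 0 = Pending
}

type SuspensionRisk = "overdue" | "at-risk" | "ok";

const STATUS_PENDING = 0;
const ONE_HOUR_MS = 60 * 60 * 1000;

// Returns null for rows that are no longer pending, which neither query selects.
function classifySuspension(
  row: SuspendedExecutionRow,
  now: Date,
  windowMs: number = ONE_HOUR_MS
): SuspensionRisk | null {
  if (row.status !== STATUS_PENDING) return null;
  const remainingMs = Date.parse(row.expiresAt) - now.getTime();
  if (remainingMs < 0) return "overdue";         // ExpiresAt < now
  if (remainingMs <= windowMs) return "at-risk"; // now <= ExpiresAt <= now + window
  return "ok";
}
```

The boundaries follow the SQL: `BETWEEN` is inclusive at both ends, so a task expiring exactly one hour from now counts as at-risk.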

Timeout Metrics

The node observability layer emits timeout counters via INodeMetrics:

// Recorded in HILTimeoutDispatcher
_metrics.IncrementCounter(
    "hil.timeout.total",
    tags: new Dictionary<string, string>
    {
        ["behavior"] = suspension.TimeoutBehavior,
        ["nodeId"]   = suspension.SuspendedNodeId,
        ["tenantId"] = suspension.TenantId
    });

// Key metrics to alert on:
// hil.timeout.total{behavior="Fail"}     — terminal failures
// hil.timeout.total{behavior="Escalate"} — escalation volume
// hil.timeout.job.batch_size             — job processing throughput
// hil.timeout.job.duration_ms            — job execution latency

Alerting Recommendations

Metric / Condition	Alert Threshold	Action
hil.timeout.total{behavior="Fail"} rate	> 5/hour	Ops review — workflows terminating unexpectedly
Pending suspensions past expiry	> 0 for > 15 min	Check HILTimeoutJob is running (Hangfire)
hil.timeout.job.duration_ms	> 10,000 ms (10 s)	BatchSize may be too large; reduce it in config
HILExpiredEvent rate spike	3x baseline	Review actor availability and deadline settings
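The thresholds in the table can be expressed as a single check over a metrics snapshot. This is a sketch only: in practice these rules would normally live in the alerting system, and the `TimeoutMetricsSnapshot` shape and `evaluateTimeoutAlerts` function are hypothetical. The "3x baseline" rule is read here as "at least three times the baseline rate", which is an assumption:

```typescript
// Hypothetical snapshot of the values the alert table refers to.
interface TimeoutMetricsSnapshot {
  failTimeoutsLastHour: number;     // hil.timeout.total{behavior="Fail"} over 1 hour
  overduePendingCount: number;      // pending suspensions past expiry
  oldestOverdueMinutes: number;     // how long the oldest one has been overdue
  jobDurationMs: number;            // hil.timeout.job.duration_ms
  expiredEventsPerHour: number;     // current HILExpiredEvent rate
  baselineExpiredPerHour: number;   // normal HILExpiredEvent rate
}

// Returns one message per threshold that is exceeded.
function evaluateTimeoutAlerts(s: TimeoutMetricsSnapshot): string[] {
  const alerts: string[] = [];
  if (s.failTimeoutsLastHour > 5) {
    alerts.push("fail-rate: ops review, workflows terminating unexpectedly");
  }
  if (s.overduePendingCount > 0 && s.oldestOverdueMinutes > 15) {
    alerts.push("overdue-pending: check HILTimeoutJob is running");
  }
  if (s.jobDurationMs > 10000) {
    alerts.push("job-duration: BatchSize may be too large");
  }
  if (s.baselineExpiredPerHour > 0 &&
      s.expiredEventsPerHour >= 3 * s.baselineExpiredPerHour) {
    alerts.push("expired-spike: review actor availability and deadlines");
  }
  return alerts;
}
```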

Hangfire Job Health

The HILTimeoutJob runs on a schedule via Hangfire. Verify it is running from the Hangfire dashboard:

// Hangfire registration — typical recurrence
RecurringJob.AddOrUpdate<HILTimeoutJob>(
    "hil-timeout-scan",
    job => job.ExecuteAsync(CancellationToken.None),
    Cron.Minutely());

// If the job is missing from the Hangfire recurring jobs list,
// re-register by restarting the API or calling the admin endpoint:
// POST /admin/jobs/hil-timeout/register

Observer Panel shortcut: In Flow Studio, open the Observer Panel (keyboard shortcut Shift+O) and filter by event type hil-timeout to see all timeout events for the current execution. Each event shows the node, behavior taken, actor, and timestamp.

Cascading timeouts: If a workflow has multiple HIL nodes in series and all actors are unresponsive, you may see a cascade of timeout events. Set up a composite alert that fires when the same execution emits more than one HILExpiredEvent within a short window. Repeated timeouts on one execution signal a systemic actor availability problem rather than an isolated incident.
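The composite condition can be sketched as a grouping over recent events: flag any execution with two timeout events no more than a chosen window apart. The `ExpiredEventRef` type, the `findCascadingExecutions` function, and the window length are all assumptions for illustration, since the source does not define "short window":

```typescript
// Hypothetical minimal view of an HILExpiredEvent for cascade detection.
interface ExpiredEventRef {
  executionId: string;
  expiredAt: string; // ISO 8601 UTC
}

// Returns the IDs of executions that emitted at least two timeout events
// no more than windowMs apart, sorted alphabetically.
function findCascadingExecutions(
  events: ExpiredEventRef[],
  windowMs: number
): string[] {
  const timesByExecution: Record<string, number[]> = {};
  for (const e of events) {
    const t = Date.parse(e.expiredAt);
    if (isNaN(t)) continue; // skip unparseable timestamps
    const list = timesByExecution[e.executionId] || [];
    list.push(t);
    timesByExecution[e.executionId] = list;
  }

  const flagged: string[] = [];
  for (const id of Object.keys(timesByExecution)) {
    const times = timesByExecution[id].sort((a, b) => a - b);
    // After sorting, the closest pair is always adjacent.
    for (let i = 1; i < times.length; i++) {
      if (times[i] - times[i - 1] <= windowMs) {
        flagged.push(id);
        break;
      }
    }
  }
  return flagged.sort();
}
```

With a 10-minute window, an execution with events at 18:00 and 18:05 is flagged, while one with events at 18:00 and 20:00 is not.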
