Failure Modes — HIL Suspension & Resume

Failure Scenarios

Scenario	Symptom	Detection	Recovery
Double-resume	409 Conflict on second resume call	ResumedAt is already set	No action needed — first resume succeeded
Token expired	410 Gone on resume call	ExpiresAt is in the past	If timeout handling is configured, the timeout job resolves the suspension automatically. Otherwise, an admin must resolve it manually.
Orphaned suspension	Task never completed, no timeout	Scan for suspensions with no ExpiresAt and Status=Pending older than N days	Admin calls `POST /api/hil/admin/executions/{executionResId}/force-resume` with a manual decision
Suspension persisted, continuation failed	ResumedAt is set but workflow is not Running	Execution status shows Suspended even after ResumedAt is set	HILResumeService logs the continuation error — retry via admin API
Non-serialisable output in memory	Suspension fails with SerializationException	Engine logs the error before writing to DB	Fix the executor's output type — use serialisable DTOs
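
The double-resume and token-expired rows amount to a guard that runs before any continuation work. The following is a minimal sketch of that guard, not the engine's actual code. `Suspension`, `ResumeGuard`, and `Check` are hypothetical names. Only the `ResumedAt` and `ExpiresAt` fields and the 409/410 codes come from the table. The 200 success code and the order of the checks are assumptions.

```csharp
using System;

// Hypothetical shape: only the ResumedAt and ExpiresAt field names come from the table.
public sealed record Suspension(DateTimeOffset? ResumedAt, DateTimeOffset? ExpiresAt);

public static class ResumeGuard
{
    // Returns the HTTP status a resume call would map to under the table's rules.
    public static int Check(Suspension s, DateTimeOffset now)
    {
        // Double-resume: ResumedAt is already set, so the first resume succeeded.
        if (s.ResumedAt is not null)
            return 409;

        // Token expired: ExpiresAt exists and lies in the past.
        if (s.ExpiresAt is { } expiresAt && expiresAt < now)
            return 410;

        // Assumed success code; the source does not state what a successful resume returns.
        return 200;
    }
}
```

Checking `ResumedAt` first means a repeated call on a suspension that was resumed and has since expired still reports 409, which matches the "no action needed" recovery in the table.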

Admin Force-Resume API

// For use by tenant admins to unblock stuck suspensions
POST /api/hil/admin/executions/{executionResId}/force-resume
Authorization: Bearer {adminToken}

{
  "decision": "Approved",   // or the appropriate response data
  "adminNote": "Force-resolved by admin due to actor unavailability"
}
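
For callers that script this endpoint, the sketch below shows one way to build the request with `HttpClient` types. `ForceResumeRequest` and `Build` are hypothetical names, and the base address is whatever host serves the API. The `//` annotation in the sample body above is explanatory and is not valid JSON, so the sketch omits it from the serialized payload.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;

public static class ForceResumeRequest
{
    // Builds (but does not send) the admin force-resume request described above.
    public static HttpRequestMessage Build(
        Uri baseAddress, string executionResId, string adminToken,
        string decision, string adminNote)
    {
        // Escape the id so it cannot alter the path structure.
        var path = $"api/hil/admin/executions/{Uri.EscapeDataString(executionResId)}/force-resume";

        // Anonymous-type property names become the JSON keys "decision" and "adminNote".
        var body = JsonSerializer.Serialize(new { decision, adminNote });

        var request = new HttpRequestMessage(HttpMethod.Post, new Uri(baseAddress, path))
        {
            Content = new StringContent(body, Encoding.UTF8, "application/json")
        };
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", adminToken);
        return request;
    }
}
```

The request can then be sent with `await httpClient.SendAsync(request, ct)`. The response shape is not documented here, so the sketch stops at building the request.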

Cleanup Job for Orphaned Suspensions

// HILCleanupJob — runs on a configurable schedule
public async Task RunAsync(CancellationToken ct)
{
    var orphanThreshold = DateTimeOffset.UtcNow.AddDays(-_options.OrphanThresholdDays);
    var orphaned = await _repo.GetOrphanedAsync(orphanThreshold, ct);

    foreach (var suspension in orphaned)
    {
        await _alertService.NotifyAdminAsync(new OrphanedSuspensionAlert
        {
            ExecutionResId = suspension.ExecutionResId,
            SuspendedAt    = suspension.SuspendedAt,
            ProcessId      = suspension.ProcessId
        }, ct);
    }
}
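
`RunAsync` delegates selection to `_repo.GetOrphanedAsync`, whose implementation is not shown. The sketch below illustrates the predicate it would need to apply, following the detection rule in the failure table: no `ExpiresAt`, `Status=Pending`, and older than the threshold. It runs over an in-memory sequence for illustration. `SuspensionRow`, `SuspensionStatus`, and `OrphanFilter` are hypothetical, as is measuring age from `SuspendedAt`. A real repository would presumably push the filter into the database query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Pending comes from the failure table; Resumed is a placeholder for any other state.
public enum SuspensionStatus { Pending, Resumed }

// Hypothetical row shape using the field names that appear in this document.
public sealed record SuspensionRow(
    string ExecutionResId,
    DateTimeOffset SuspendedAt,
    DateTimeOffset? ExpiresAt,
    SuspensionStatus Status);

public static class OrphanFilter
{
    // Selects suspensions that are still pending, have no timeout,
    // and were suspended before the orphan threshold.
    public static IReadOnlyList<SuspensionRow> Select(
        IEnumerable<SuspensionRow> rows, DateTimeOffset orphanThreshold) =>
        rows.Where(r => r.Status == SuspensionStatus.Pending
                     && r.ExpiresAt is null
                     && r.SuspendedAt < orphanThreshold)
            .ToList();
}
```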

Design for resilience: Always set a timeout on HIL nodes in production workflows. A node without a timeout can become an orphaned suspension. HILCleanupJob only alerts admins; it does not resolve anything, and repeated alerts lead to alert fatigue. Rely on timeouts to auto-resolve suspensions rather than on alerts that require a manual fix.
