Failure Scenarios

| Scenario | Symptom | Detection | Recovery |
| --- | --- | --- | --- |
| Double-resume | 409 Conflict on the second resume call | ResumedAt is already set | No action needed; the first resume succeeded |
| Token expired | 410 Gone on the resume call | ExpiresAt is in the past | If a timeout job is configured, it handles the expiry automatically; otherwise an admin must resolve it manually |
| Orphaned suspension | Task never completed, no timeout | Scan for suspensions with no ExpiresAt and Status=Pending older than N days | Admin calls POST /api/hil/admin/executions/{executionResId}/force-resume with a manual decision |
| Suspension persisted, continuation failed | ResumedAt is set but the workflow is not Running | Execution status shows Suspended even after ResumedAt is set | HILResumeService logs the continuation error; retry via the admin API |
| Non-serialisable output in memory | Suspension fails with SerializationException | Engine logs the error before writing to the DB | Fix the executor's output type: use serialisable DTOs |
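
The first two rows can be read as a guard that runs before a resume is accepted. The following is a minimal C# sketch under stated assumptions: `SuspensionRecord`, `ResumeOutcome`, and `ResumeGuard` are hypothetical names, and only `ResumedAt`, `ExpiresAt`, and the 409/410 status codes come from the table. Checking "already resumed" before "expired" is also an assumption, not documented behaviour.

```csharp
using System;

// Hypothetical types for illustration; the real engine's model may differ.
public enum ResumeOutcome { Proceed, AlreadyResumed, Expired }

public sealed record SuspensionRecord(
    DateTimeOffset SuspendedAt,
    DateTimeOffset? ExpiresAt,
    DateTimeOffset? ResumedAt);

public static class ResumeGuard
{
    // Decides whether a resume call may proceed, based only on stored timestamps.
    public static ResumeOutcome Check(SuspensionRecord s, DateTimeOffset now)
    {
        // Double-resume: ResumedAt is already set, so the first resume succeeded.
        if (s.ResumedAt is not null) return ResumeOutcome.AlreadyResumed;

        // Token expired: ExpiresAt is set and lies in the past.
        if (s.ExpiresAt is { } expires && expires < now) return ResumeOutcome.Expired;

        return ResumeOutcome.Proceed;
    }

    // Maps a rejection to the HTTP status from the table; null means "not rejected".
    public static int? RejectionStatusCode(ResumeOutcome outcome) => outcome switch
    {
        ResumeOutcome.AlreadyResumed => 409, // Conflict
        ResumeOutcome.Expired        => 410, // Gone
        _                            => (int?)null
    };
}
```

Because the guard takes `now` as a parameter instead of reading the clock, each row of the table can be exercised with fixed timestamps in a unit test.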

Admin Force-Resume API

// For use by tenant admins to unblock stuck suspensions.
// "decision" carries the approval outcome, or whatever response data the HIL node expects.
POST /api/hil/admin/executions/{executionResId}/force-resume
Authorization: Bearer {adminToken}
Content-Type: application/json

{
  "decision": "Approved",
  "adminNote": "Force-resolved by admin due to actor unavailability"
}
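
For callers scripting this endpoint from .NET, the request above could be built roughly as follows. This is a sketch, not the portal's client library: `ForceResumeClient` and `BuildRequest` are hypothetical names, and the base address is an assumption supplied by the caller. Only the path, the bearer token header, and the two JSON fields come from the request shown above.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;

// Hypothetical helper; builds the request but leaves sending to the caller's HttpClient.
public static class ForceResumeClient
{
    public static HttpRequestMessage BuildRequest(
        Uri baseAddress,        // assumed to end with "/", e.g. https://portal.example/
        string executionResId,
        string adminToken,
        string decision,
        string adminNote)
    {
        // Escape the ID so it cannot alter the path structure.
        var path = $"api/hil/admin/executions/{Uri.EscapeDataString(executionResId)}/force-resume";

        // Anonymous-type property names are serialised as written: "decision", "adminNote".
        var body = JsonSerializer.Serialize(new { decision, adminNote });

        var request = new HttpRequestMessage(HttpMethod.Post, new Uri(baseAddress, path))
        {
            Content = new StringContent(body, Encoding.UTF8, "application/json")
        };
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", adminToken);
        return request;
    }
}
```

The returned message can be passed to `HttpClient.SendAsync`. Keeping construction separate from sending makes it possible to check the URL and headers without a live server.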

Cleanup Job for Orphaned Suspensions

// HILCleanupJob: runs on a configurable schedule and alerts admins about orphaned suspensions
public async Task RunAsync(CancellationToken ct)
{
    // Suspensions created before this cutoff are treated as orphaned.
    var orphanThreshold = DateTimeOffset.UtcNow.AddDays(-_options.OrphanThresholdDays);
    var orphaned = await _repo.GetOrphanedAsync(orphanThreshold, ct);

    foreach (var suspension in orphaned)
    {
        // The job only notifies; it does not resolve the suspension itself.
        await _alertService.NotifyAdminAsync(new OrphanedSuspensionAlert
        {
            ExecutionResId = suspension.ExecutionResId,
            SuspendedAt    = suspension.SuspendedAt,
            ProcessId      = suspension.ProcessId
        }, ct);
    }
}
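
The job above relies on `_repo.GetOrphanedAsync`, whose implementation is not shown. As an illustration of the selection rule described under Failure Scenarios (no ExpiresAt, Status=Pending, older than the threshold), here is an in-memory sketch. The `Suspension` record, the `SuspensionStatus` enum, and `OrphanFilter` are hypothetical; a real repository would most likely express the same predicate as a database query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model; field names follow those used in the alert above.
public enum SuspensionStatus { Pending, Resumed, Expired }

public sealed record Suspension(
    string ExecutionResId,
    string ProcessId,
    DateTimeOffset SuspendedAt,
    DateTimeOffset? ExpiresAt,
    SuspensionStatus Status);

public static class OrphanFilter
{
    // Returns pending suspensions with no expiry that were suspended before the cutoff,
    // oldest first.
    public static IReadOnlyList<Suspension> SelectOrphaned(
        IEnumerable<Suspension> suspensions, DateTimeOffset orphanThreshold) =>
        suspensions
            .Where(s => s.Status == SuspensionStatus.Pending
                     && s.ExpiresAt is null
                     && s.SuspendedAt < orphanThreshold)
            .OrderBy(s => s.SuspendedAt)
            .ToList();
}
```

A suspension that has an ExpiresAt is deliberately excluded even when it is old, since under this reading the timeout path, not the cleanup job, is responsible for it.
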

Design for resilience: Always set a timeout on HIL nodes in production workflows. A HIL node without a timeout can become an orphaned suspension. HILCleanupJob will alert on orphans, but every alert still needs a human to act on it, and alert fatigue is real. Prefer timeouts that auto-resolve over alerts that require a manual fix.