Flow Studio
Failure Modes
Common failure scenarios in the human-in-the-loop (HIL) suspension/resume cycle, their causes, and how to detect and recover from each.
Failure Scenarios
| Scenario | Symptom | Detection | Recovery |
|---|---|---|---|
| Double-resume | 409 Conflict on second resume call | ResumedAt is already set | No action needed — first resume succeeded |
| Token expired | 410 Gone on resume call | ExpiresAt is in the past | If a timeout is configured on the node, the timeout job resolves the suspension automatically. Otherwise, an admin must resolve it manually. |
| Orphaned suspension | Task never completed, no timeout | Scan for suspensions with no ExpiresAt and Status=Pending older than N days | Admin calls POST /api/hil/admin/executions/{executionResId}/force-resume with a manual decision |
| Suspension persisted, continuation failed | ResumedAt is set but workflow is not Running | Execution status still shows Suspended after ResumedAt is set | HILResumeService logs the continuation error; retry via the admin API |
| Non-serialisable output in memory | Suspension fails with SerializationException | Engine logs the error before writing to DB | Fix the executor's output type — use serialisable DTOs |
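The first two rows of the table describe responses a caller sees on the resume call itself. A minimal sketch of how a client might interpret them is shown below. The `ResumeOutcome` enum and `ResumeResponse.Interpret` helper are hypothetical names introduced for illustration; only the 409 and 410 meanings come from the table.

```csharp
using System;
using System.Net;

// Hypothetical outcome type; not part of the Flow Studio API.
public enum ResumeOutcome { Resumed, AlreadyResumed, Expired, Failed }

public static class ResumeResponse
{
    // Maps the HTTP status of a resume call to an outcome the caller can act on.
    public static ResumeOutcome Interpret(HttpStatusCode status)
    {
        int code = (int)status;
        if (code >= 200 && code < 300) return ResumeOutcome.Resumed;

        return status switch
        {
            // 409 Conflict: ResumedAt is already set, so an earlier resume succeeded.
            HttpStatusCode.Conflict => ResumeOutcome.AlreadyResumed,
            // 410 Gone: ExpiresAt is in the past; the timeout job or an admin must resolve it.
            HttpStatusCode.Gone => ResumeOutcome.Expired,
            // Anything else is treated as a failure for the caller to investigate.
            _ => ResumeOutcome.Failed
        };
    }
}
```

Treating `AlreadyResumed` as a success lets callers retry a resume call safely, which matches the "no action needed" recovery in the table.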
Admin Force-Resume API
// For use by tenant admins to unblock stuck suspensions.
// "decision" carries the resume payload: "Approved" or whatever response data the node expects.
POST /api/hil/admin/executions/{executionResId}/force-resume
Authorization: Bearer {adminToken}
Content-Type: application/json

{
  "decision": "Approved",
  "adminNote": "Force-resolved by admin due to actor unavailability"
}
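The request above could be built from C# roughly as follows. This is a sketch under stated assumptions: `ForceResumeClient` and `BuildRequest` are hypothetical names, the base address is a placeholder, and the response shape is not documented here, so the example stops at constructing the request.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;

// Hypothetical helper; not part of the Flow Studio SDK.
public static class ForceResumeClient
{
    public static HttpRequestMessage BuildRequest(
        Uri baseAddress, string executionResId, string adminToken,
        string decision, string adminNote)
    {
        // Escape the id so it is safe to embed in the path segment.
        var path = $"api/hil/admin/executions/{Uri.EscapeDataString(executionResId)}/force-resume";
        var request = new HttpRequestMessage(HttpMethod.Post, new Uri(baseAddress, path));

        // Admin bearer token, as in the Authorization header of the raw request.
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", adminToken);

        // The anonymous object serialises to {"decision":"...","adminNote":"..."}.
        var body = JsonSerializer.Serialize(new { decision, adminNote });
        request.Content = new StringContent(body, Encoding.UTF8, "application/json");
        return request;
    }
}
```

Send the result with `await httpClient.SendAsync(request, ct)` from an existing `HttpClient` instance. Keeping request construction separate from sending makes the path and payload easy to unit test.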
Cleanup Job for Orphaned Suspensions
// HILCleanupJob: runs on a configurable schedule
public async Task RunAsync(CancellationToken ct)
{
    // Suspensions older than this cutoff are considered orphaned.
    var orphanThreshold = DateTimeOffset.UtcNow.AddDays(-_options.OrphanThresholdDays);
    var orphaned = await _repo.GetOrphanedAsync(orphanThreshold, ct);

    // The job only alerts; it does not resume or cancel anything itself.
    foreach (var suspension in orphaned)
    {
        await _alertService.NotifyAdminAsync(new OrphanedSuspensionAlert
        {
            ExecutionResId = suspension.ExecutionResId,
            SuspendedAt = suspension.SuspendedAt,
            ProcessId = suspension.ProcessId
        }, ct);
    }
}
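The job relies on `_repo.GetOrphanedAsync`, whose implementation is not shown. An in-memory sketch of the filter it might apply, following the detection rule in the table (no ExpiresAt, Status=Pending, older than the threshold), could look like this. The `Suspension` record and `OrphanFilter` class are hypothetical, and a real repository would presumably run the equivalent query against the database.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record; field names follow those used elsewhere on this page.
public sealed record Suspension(
    string ExecutionResId,
    string Status,
    DateTimeOffset SuspendedAt,
    DateTimeOffset? ExpiresAt);

public static class OrphanFilter
{
    // Returns suspensions that have no timeout, are still pending,
    // and were suspended before the orphan threshold.
    public static IReadOnlyList<Suspension> FindOrphaned(
        IEnumerable<Suspension> all, DateTimeOffset orphanThreshold) =>
        all.Where(s => s.ExpiresAt is null
                    && s.Status == "Pending"
                    && s.SuspendedAt < orphanThreshold)
           .ToList();
}
```

Suspensions that do have an ExpiresAt are excluded on purpose: those are expected to be handled by the timeout job rather than reported as orphans.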
Design for resilience: Always set a timeout on HIL nodes in production workflows. An unset timeout is a potential orphan waiting to happen. The HILCleanupJob will alert — but alert fatigue is real. Use timeouts to auto-resolve, not alerts to manually fix.