Monitoring HIL Backlog
Human-in-the-Loop (HIL) workflows pause execution waiting for human approval. A growing backlog means approvers are not keeping up with incoming tasks — which can cascade into SLA breaches. The HIL Analytics dashboard provides real-time and historical visibility into backlog health.
The HIL Analytics Dashboard
In Grafana, go to Dashboards → BizFirstGO folder and open the HIL Analytics dashboard. Key panels:
| Panel | Healthy Value | Action If Unhealthy |
|---|---|---|
| Current Backlog (Gauge) | Green: < 50 tasks | Yellow (>50): alert approver managers; Red (>100): escalate |
| Overdue Tasks (Stat) | 0 (zero) | Any value > 0: immediate action — SLA breached |
| Approval Rate (Pie chart) | > 80% approved | High rejection rate may indicate process issues |
| Backlog by Tenant (Bar chart) | Even distribution | Single tenant with very high backlog: notify that tenant's admin |
| Suspension Duration (Histogram) | Most tasks < 4 hours | Long tail (> 24h): identify and escalate overdue tasks |
HIL Backlog Queries
# Current pending HIL task count:
sum(bizfirst_hil_pending_count)
# Pending tasks broken down by tenant:
sum by (tenant_id) (bizfirst_hil_pending_count)
# Overdue tasks (past SLA deadline):
sum(bizfirst_hil_overdue_count)
# Overdue tasks by tenant (to find which tenant has the worst SLA breach):
sum by (tenant_id) (bizfirst_hil_overdue_count) > 0
# HIL task approval rate over the last hour:
sum(rate(bizfirst_hil_completed_total{outcome="approved"}[1h]))
/
sum(rate(bizfirst_hil_completed_total[1h]))
* 100
# Median (50th percentile) time-to-completion for HIL tasks (last 24 hours):
histogram_quantile(0.50,
sum(rate(bizfirst_hil_suspension_duration_seconds_bucket[24h])) by (le)
)
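Because `histogram_quantile` works on cumulative buckets, the value it returns is an estimate interpolated inside one bucket, not an exact measurement. The sketch below illustrates that idea for classic histograms in simplified form. The bucket boundaries (1h, 4h, 24h) and counts are invented for illustration; the real bucket layout of `bizfirst_hil_suspension_duration_seconds` is not documented here.

```python
import math
from typing import Sequence, Tuple


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Simplified model: buckets are sorted by upper bound, bounds are
    positive, and the last bucket is +Inf.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus a +Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]

    if math.isinf(upper):
        # The quantile lies beyond the last finite bound, so that bound
        # is the best available answer.
        return buckets[-2][0]

    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0.0
    in_bucket = count - below
    if in_bucket == 0:
        return lower
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


# Hypothetical cumulative counts: 40 tasks <= 1h, 90 <= 4h, 98 <= 24h, 100 total.
example_buckets = [(3600.0, 40), (14400.0, 90), (86400.0, 98), (math.inf, 100)]

if __name__ == "__main__":
    print(histogram_quantile(0.5, example_buckets))   # median, in seconds
    print(histogram_quantile(0.99, example_buckets))  # falls in the +Inf bucket
```

With these invented counts the median lands in the 1h to 4h bucket, so its precision depends entirely on how wide that bucket is.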
Finding Specific Overdue Tasks
# Use Loki to find which specific tasks are overdue:
{job="processengine"} | json | hilStatus = "overdue"
| line_format "taskId={{.hilTaskId}} tenant={{.tenantId}} deadline={{.slaDeadline}}"
# Find HIL tasks for a specific workflow type that are overdue:
{job="processengine"} | json | hilStatus = "overdue" | workflowType = "expense-approval"
# Find tasks assigned to a specific role that are pending:
{job="processengine"} | json | hilStatus = "pending" | roleRequired = "FinanceManager"
| line_format "taskId={{.hilTaskId}} pending since {{.suspendedAt}}"
Setting Up HIL Backlog Alerts
# The pre-built alert rules include two HIL alerts:
# 1. HILBacklogHigh: fires when the backlog stays above 100 tasks for 15 minutes
- alert: HILBacklogHigh
  expr: sum(bizfirst_hil_pending_count) > 100
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "HIL backlog is high ({{ $value }} pending tasks)"
# 2. HILSLABreached: fires once any task has been overdue for 1 minute
- alert: HILSLABreached
  expr: sum(bizfirst_hil_overdue_count) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "{{ $value }} HIL tasks have breached their SLA deadline"
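The `for:` clause in both rules means the condition must hold continuously across evaluations before the alert fires; a single dip below the threshold resets the timer. The following is a simplified model of that behaviour, not Prometheus code. The 5-minute evaluation interval and the sample values are assumptions chosen for illustration.

```python
from typing import Iterable, List, Optional, Tuple


def alert_states(
    samples: Iterable[Tuple[int, float]],
    threshold: float,
    for_seconds: int,
) -> List[str]:
    """Return the alert state at each evaluation.

    samples holds (timestamp_seconds, value) pairs, one per rule evaluation.
    """
    states: List[str] = []
    active_since: Optional[int] = None
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp  # condition just became true
            held_long_enough = timestamp - active_since >= for_seconds
            states.append("firing" if held_long_enough else "pending")
        else:
            active_since = None  # any dip resets the hold timer
            states.append("inactive")
    return states


# Hypothetical backlog readings, evaluated every 5 minutes (300 s).
backlog_samples = [
    (0, 90), (300, 120), (600, 130), (900, 95),
    (1200, 110), (1500, 115), (1800, 125), (2100, 140),
]

if __name__ == "__main__":
    # Mirrors HILBacklogHigh: threshold 100, for: 15m (900 s).
    print(alert_states(backlog_samples, threshold=100, for_seconds=900))
```

In this example the backlog first exceeds 100 at 300 s, but the dip to 95 at 900 s resets the timer, so the alert only fires at 2100 s.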
A firing HILSLABreached alert cannot be resolved by the engineering team — it requires the business process owners to approve or escalate the overdue tasks. When this alert fires, the on-call engineer's job is to: (1) identify which tenant and which tasks are overdue (use the queries above), (2) notify the appropriate business team or process owner, and (3) document the breach for compliance reporting.
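Steps (1) and (3) can be partly scripted once the overdue-task log lines have been exported from Loki. The sketch below parses lines in the `taskId=... tenant=... deadline=...` shape produced by the first `line_format` expression and groups them by tenant. The task IDs, tenant names, and timestamps are invented, and the sketch assumes deadlines are ISO 8601 strings in a single time zone so that they sort correctly as text.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Matches the output of the overdue-task line_format expression.
LINE_RE = re.compile(
    r"taskId=(?P<task_id>\S+) tenant=(?P<tenant>\S+) deadline=(?P<deadline>\S+)"
)

TasksByTenant = Dict[str, List[Tuple[str, str]]]


def group_overdue_by_tenant(lines: Iterable[str]) -> TasksByTenant:
    """Group (task_id, deadline) pairs by tenant, skipping unparseable lines."""
    grouped: TasksByTenant = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not in the expected format
        grouped[match["tenant"]].append((match["task_id"], match["deadline"]))
    return dict(grouped)


def breach_summary(grouped: TasksByTenant) -> List[str]:
    """One line per tenant, worst-affected tenant first."""
    ordered = sorted(grouped.items(), key=lambda item: (-len(item[1]), item[0]))
    return [
        f"{tenant}: {len(tasks)} overdue, "
        f"earliest deadline {min(deadline for _, deadline in tasks)}"
        for tenant, tasks in ordered
    ]


# Hypothetical exported log lines.
sample_lines = [
    "taskId=t-101 tenant=acme deadline=2024-05-01T09:00:00Z",
    "taskId=t-102 tenant=globex deadline=2024-05-01T10:30:00Z",
    "taskId=t-103 tenant=acme deadline=2024-05-01T08:15:00Z",
    "unrelated log line",
]

if __name__ == "__main__":
    for row in breach_summary(group_overdue_by_tenant(sample_lines)):
        print(row)
```

The resulting per-tenant lines can be pasted into the notification to the process owner and kept as the starting point for the compliance record.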