Verify Ingestion — Setup Guide

Step 1: Trigger a Test Execution

# Trigger a minimal workflow execution via the ProcessEngine API:
curl -X POST http://processengine:8080/api/workflow/execute \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: test-tenant" \
  -d '{
    "workflowId": "wf-test-observability",
    "input": { "testMode": true }
  }'

# Note the executionId from the response:
# {"executionId": "exec-a1b2c3d4", "status": "started"}

# Save it for the queries below:
export EXEC_ID="exec-a1b2c3d4"
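
Copying the ID by hand is easy to get wrong when this step is repeated. As a rough sketch, the response can be piped through a small helper that pulls out `executionId`. The helper name `extract_exec_id` is hypothetical, and the response shape is assumed to match the sample shown above.

```shell
# extract_exec_id: read a JSON response on stdin and print its executionId.
# Hypothetical helper; assumes the response shape shown in this guide.
extract_exec_id() {
  if command -v jq >/dev/null 2>&1; then
    jq -r '.executionId // empty'
  else
    # Fallback without jq: only suitable for a flat response like the sample.
    sed -n 's/.*"executionId"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
  fi
}

# Usage (same request as above, with -s to silence curl's progress meter):
#   EXEC_ID=$(curl -s -X POST http://processengine:8080/api/workflow/execute \
#     -H "Content-Type: application/json" \
#     -H "X-Tenant-ID: test-tenant" \
#     -d '{"workflowId": "wf-test-observability", "input": {"testMode": true}}' \
#     | extract_exec_id)
#   export EXEC_ID
```

If the helper prints nothing, the request most likely failed and the raw response is worth inspecting before moving on.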

Step 2: Verify Logs in Loki

# Query Loki via API to verify log ingestion.
# Note: `date -d` is GNU date syntax; on macOS/BSD use `date -v-5M +%s` instead.
curl -sG "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={job="processengine"} |= "'"$EXEC_ID"'"' \
  --data-urlencode 'start='"$(date -d '5 minutes ago' +%s)"'000000000' \
  --data-urlencode 'end='"$(date +%s)"'000000000' \
  | jq '.data.result | length'

# Expected output: a number > 0 (the number of log streams with results)
# If the output is 0, no logs were found; see "Common Ingestion Problems and Fixes" below

# In Grafana Explore:
# 1. Select data source: Loki
# 2. Enter query: {job="processengine"} |= "exec-a1b2c3d4"
# 3. Time range: Last 5 minutes
# Expected: Log lines showing execution start, node executions, completion
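
The Loki query above builds its nanosecond timestamps with date flags that differ between platforms. A portable alternative, sketched here, does the subtraction in shell arithmetic and needs only `date +%s`. The helper names `epoch_minutes_ago` and `to_loki_ns` are hypothetical.

```shell
# epoch_minutes_ago MINUTES [NOW]: print the Unix time (seconds) MINUTES before NOW.
# NOW defaults to the current time; passing it explicitly makes the result repeatable.
epoch_minutes_ago() {
  now=${2:-$(date +%s)}
  echo $(( now - $1 * 60 ))
}

# to_loki_ns SECONDS: print SECONDS as the nanosecond timestamp Loki expects.
# Appending nine zeros as text avoids any overflow in shell arithmetic.
to_loki_ns() {
  printf '%s000000000\n' "$1"
}

# Usage:
#   curl -sG "http://localhost:3100/loki/api/v1/query_range" \
#     --data-urlencode 'query={job="processengine"} |= "'"$EXEC_ID"'"' \
#     --data-urlencode "start=$(to_loki_ns "$(epoch_minutes_ago 5)")" \
#     --data-urlencode "end=$(to_loki_ns "$(date +%s)")" \
#     | jq '.data.result | length'
```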

Step 3: Verify Metrics in Prometheus

# Query Prometheus via API to verify metrics scraping:
curl -sG "http://localhost:9090/api/v1/query" \
  --data-urlencode 'query=bizfirst_workflow_executions_total' \
  | jq '.data.result | length'

# Expected output: a number > 0 (metrics series exist)

# Check that the processengine scrape target is healthy:
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | select(.labels.job == "processengine") | .health'

# Expected output: "up"
# If "down": the processengine /metrics endpoint is not reachable from Prometheus

# In Grafana Explore:
# 1. Select data source: Prometheus
# 2. Enter query: rate(bizfirst_workflow_executions_total[5m])
# Expected: A non-zero value after the test execution
# (rate() needs at least two samples in the window, so allow two scrapes first)
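
A series count above zero only shows that the metric exists; the series may predate the test execution. One way to confirm that the test run itself was counted is to compare the counter before and after triggering it. This is a sketch: the helper names `prom_counter_sum` and `counter_increased` are hypothetical, and the Prometheus address is the one used above.

```shell
# prom_counter_sum METRIC: print the current sum of METRIC across all series ("0" if none).
prom_counter_sum() {
  curl -sG "http://localhost:9090/api/v1/query" \
    --data-urlencode "query=sum($1)" \
    | jq -r '.data.result[0].value[1] // "0"'
}

# counter_increased BEFORE AFTER: succeed only if AFTER is numerically greater than BEFORE.
# awk is used because Prometheus values are floats, which shell arithmetic cannot compare.
counter_increased() {
  awk -v before="$1" -v after="$2" 'BEGIN { exit !(after + 0 > before + 0) }'
}

# Usage:
#   before=$(prom_counter_sum bizfirst_workflow_executions_total)
#   ... trigger the test execution from Step 1, then wait for a scrape ...
#   after=$(prom_counter_sum bizfirst_workflow_executions_total)
#   counter_increased "$before" "$after" && echo "execution counted"
```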

Step 4: Verify Traces in Tempo

# Search for traces from the test execution in Tempo via API:
curl -sG "http://localhost:3200/api/search" \
  --data-urlencode 'tags=service.name=processengine' \
  --data-urlencode 'start='"$(date -d '10 minutes ago' +%s)" \
  --data-urlencode 'end='"$(date +%s)" \
  | jq '.traces | length'

# Expected output: a number > 0

# Fetch a specific trace by ID. The ID appears in the "traceId" field
# of the Loki log output from Step 2:
TRACE_ID="your-trace-id-from-logs"
curl -s "http://localhost:3200/api/traces/$TRACE_ID" | jq '.batches | length'

# Expected output: a number > 0 if the trace was found

# In Grafana Explore:
# 1. Select data source: Tempo
# 2. Query mode: Search
# 3. Service Name: processengine
# Expected: Recent traces listed with duration and span counts
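
The trace ID lookup can also be scripted rather than copied from the log output. The sketch below assumes the log lines are JSON objects with a `traceId` field holding 32 hex characters; if the log format differs, the pattern needs adjusting. The helper name `extract_trace_id` is hypothetical.

```shell
# extract_trace_id: read raw log lines on stdin and print the first traceId found.
# Assumes JSON log lines with a "traceId" field holding 32 hex characters.
extract_trace_id() {
  sed -n 's/.*"traceId"[[:space:]]*:[[:space:]]*"\([0-9a-fA-F]\{32\}\)".*/\1/p' | head -n 1
}

# Usage: unwrap the log lines from the Loki response first, then extract the ID.
#   TRACE_ID=$(curl -sG "http://localhost:3100/loki/api/v1/query_range" \
#     --data-urlencode 'query={job="processengine"} |= "'"$EXEC_ID"'"' \
#     | jq -r '.data.result[].values[][1]' | extract_trace_id)
#   curl -s "http://localhost:3200/api/traces/$TRACE_ID" | jq '.batches | length'
```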

Common Ingestion Problems and Fixes

Symptom	Likely Cause	Fix
No logs in Loki	`OTEL_EXPORTER_OTLP_ENDPOINT` unreachable	Check network connectivity from BizFirst container to otel-collector:4317. Check firewall rules.
Loki returns data but wrong job label	Inconsistent `OTEL_SERVICE_NAME`	Verify `OTEL_SERVICE_NAME` is set to the expected value (the queries in this guide assume `processengine`); restart the service after changing env vars.
Prometheus target shows "down"	/metrics endpoint blocked or wrong port	Check ProcessEngine exposes /metrics on the configured port. Verify Prometheus scrape_config job.
No traces in Tempo	Sampling too aggressive (0% rate)	Check `OTEL_TRACES_SAMPLER_ARG`: set it to `1.0` for testing, then reduce.
OTel Collector logs show "dropped" spans	Tempo write path overloaded	Check Tempo container resources. Increase memory limit.
TraceId missing from log lines	OTel logging SDK not bridged to Serilog	Verify `ObservabilityServiceExtensions.cs` registers the OTel log bridge.
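
Several rows in the table come down to environment variables that are missing or set to an unhelpful value. A minimal sketch of a check for them follows. The function name `check_otel_env` is hypothetical, and it only inspects the three variables named in the table; it must run in the same environment as the service (for example, inside its container).

```shell
# check_otel_env: report whether the OTel variables named in the table are usable.
# Returns non-zero if a required variable is missing or the sampling ratio is zero.
check_otel_env() {
  status=0
  for name in OTEL_EXPORTER_OTLP_ENDPOINT OTEL_SERVICE_NAME; do
    eval "value=\${$name:-}"
    if [ -n "$value" ]; then
      echo "OK      $name=$value"
    else
      echo "MISSING $name"
      status=1
    fi
  done
  # The sampler argument is optional; only an explicit ratio of zero is flagged.
  ratio=${OTEL_TRACES_SAMPLER_ARG:-}
  if [ -n "$ratio" ] && awk -v r="$ratio" 'BEGIN { exit !(r ~ /^[0-9.]+$/ && r + 0 == 0) }'; then
    echo "WARN    OTEL_TRACES_SAMPLER_ARG=$ratio is a sampling ratio of zero"
    status=1
  fi
  return "$status"
}
```

A non-zero exit status makes the check usable as a gate in a startup or CI script.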

Checking the OTel Collector Pipeline

# The OTel Collector exposes its own metrics on port 8888:
curl -s http://localhost:8888/metrics | grep otelcol_receiver

# Key metrics to check:
# otelcol_receiver_accepted_spans_total{receiver="otlp"} — spans received
# otelcol_receiver_accepted_metric_points_total             — metrics received
# otelcol_receiver_accepted_log_records_total               — log records received
# otelcol_exporter_sent_spans_total{exporter="otlp/tempo"}  — spans forwarded to Tempo
# otelcol_exporter_send_failed_spans_total                  — failed exports (check for errors)

# Check OTel Collector logs for pipeline errors:
docker compose logs --tail=50 otel-collector | grep -iE "error|warn|drop"
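
Comparing the number of spans the collector accepted with the number it forwarded is a quick way to spot a stalled pipeline. The sketch below sums a metric across all of its label sets from the text output of the /metrics endpoint. The helper name `metric_sum` is hypothetical, and the metric names are the ones listed above, which may differ between collector versions.

```shell
# metric_sum NAME: read Prometheus text-format metrics on stdin and print the
# summed value of every series named NAME (0 if the metric is absent).
# Assumes samples carry no trailing timestamp, so the value is the last field.
metric_sum() {
  awk -v name="$1" '
    /^#/ { next }                       # skip HELP and TYPE comments
    {
      metric = $1
      sub(/[{].*/, "", metric)          # drop the label set, keep the bare name
      if (metric == name) total += $NF
    }
    END { printf "%d\n", total }
  '
}

# Usage:
#   metrics=$(curl -s http://localhost:8888/metrics)
#   accepted=$(printf '%s\n' "$metrics" | metric_sum otelcol_receiver_accepted_spans_total)
#   sent=$(printf '%s\n' "$metrics" | metric_sum otelcol_exporter_sent_spans_total)
#   echo "accepted=$accepted sent=$sent"
```

A small gap between the two numbers is normal while a batch is in flight; a gap that keeps growing suggests the exporter side is failing.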

Allow 30-60 Seconds After Restart

After changing environment variables and restarting BizFirst services, wait 30-60 seconds before running the verification queries. The OTel SDK buffers spans and exports them in batches, so the first batch may take up to 10 seconds to flush. Prometheus scrapes on a 15-second interval, so the first metric data point appears within 15-30 seconds of service startup.
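
Instead of a fixed wait, a verification query can be polled until it returns a result. The following is a minimal sketch of such a loop. The function name `retry_until_positive` is hypothetical, and it assumes the wrapped command prints a single integer, as the `jq '... | length'` queries in this guide do.

```shell
# retry_until_positive ATTEMPTS DELAY CMD...: run CMD until it prints an integer > 0.
# Prints that integer and returns 0 on success; returns 1 once ATTEMPTS runs are used up.
retry_until_positive() {
  attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$attempts" ]; do
    count=$("$@" 2>/dev/null) || count=0
    case $count in
      ''|*[!0-9]*) count=0 ;;           # empty or non-numeric output means nothing yet
    esac
    if [ "$count" -gt 0 ]; then
      echo "$count"
      return 0
    fi
    if [ "$attempt" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Usage: poll every 5 seconds, 12 attempts (roughly one minute in total).
#   tempo_trace_count() {
#     curl -sG "http://localhost:3200/api/search" \
#       --data-urlencode 'tags=service.name=processengine' | jq '.traces | length'
#   }
#   retry_until_positive 12 5 tempo_trace_count || echo "no traces found"
```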
