
Step 1: Trigger a Test Execution

# Trigger a minimal workflow execution via the ProcessEngine API:
curl -X POST http://processengine:8080/api/workflow/execute \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: test-tenant" \
  -d '{
    "workflowId": "wf-test-observability",
    "input": { "testMode": true }
  }'

# Note the executionId from the response:
# {"executionId": "exec-a1b2c3d4", "status": "started"}

# Save it for the queries below:
export EXEC_ID="exec-a1b2c3d4"
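
Copying the ID by hand is easy to get wrong. The following is a minimal sketch that reads it from the response instead, assuming the response has the shape shown above. `extract_exec_id` is a hypothetical helper name, not part of ProcessEngine.

```shell
# Hypothetical helper: print the executionId field from a JSON response on stdin.
# Assumes the response shape shown above: {"executionId": "...", "status": "..."}
extract_exec_id() {
  if command -v jq >/dev/null 2>&1; then
    jq -r '.executionId // empty'
  else
    # Fallback without jq: only handles a simple, unescaped string value.
    sed -n 's/.*"executionId"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
  fi
}

# Usage (same request as above, with the ID captured directly):
# EXEC_ID="$(curl -s -X POST http://processengine:8080/api/workflow/execute \
#   -H "Content-Type: application/json" \
#   -H "X-Tenant-ID: test-tenant" \
#   -d '{"workflowId": "wf-test-observability", "input": {"testMode": true}}' \
#   | extract_exec_id)"
# export EXEC_ID
```

If the helper prints nothing, the request probably failed or returned a different shape, so inspect the raw response before continuing.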

Step 2: Verify Logs in Loki

# Query Loki via API to verify log ingestion:
curl -G "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={job="processengine"} |= "'"$EXEC_ID"'"' \
  --data-urlencode 'start='"$(date -d '5 minutes ago' +%s)"'000000000' \
  --data-urlencode 'end='"$(date +%s)"'000000000' \
  | jq '.data.result | length'

# Expected output: a number > 0 (the number of log streams with results)
# If output is 0: no logs found — see troubleshooting below
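
The `date -d '5 minutes ago'` form used above is GNU date syntax and fails with BSD date (for example on macOS). A portable sketch computes the same nanosecond timestamps with shell arithmetic on `date +%s`, which both GNU and BSD date support. The helper names `ns_from_epoch` and `ns_ago` are hypothetical.

```shell
# Hypothetical helpers for Loki's nanosecond start/end parameters.

# Convert Unix seconds to a nanosecond timestamp string by appending nine zeros,
# the same trick the query above uses.
ns_from_epoch() {
  printf '%s000000000\n' "$1"
}

# Nanosecond timestamp for N seconds before now, using shell arithmetic
# instead of the GNU-only `date -d`.
ns_ago() {
  ns_from_epoch "$(( $(date +%s) - $1 ))"
}

# Usage:
# curl -G "http://localhost:3100/loki/api/v1/query_range" \
#   --data-urlencode 'query={job="processengine"} |= "'"$EXEC_ID"'"' \
#   --data-urlencode "start=$(ns_ago 300)" \
#   --data-urlencode "end=$(ns_ago 0)" \
#   | jq '.data.result | length'
```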

# In Grafana Explore:
# 1. Select data source: Loki
# 2. Enter query: {job="processengine"} |= "exec-a1b2c3d4"
# 3. Time range: Last 5 minutes
# Expected: Log lines showing execution start, node executions, completion

Step 3: Verify Metrics in Prometheus

# Query Prometheus via API to verify metrics scraping:
curl -G "http://localhost:9090/api/v1/query" \
  --data-urlencode 'query=bizfirst_workflow_executions_total' \
  | jq '.data.result | length'

# Expected output: a number > 0 (metrics series exist)

# Check that the processengine scrape target is healthy:
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | select(.labels.job == "processengine") | .health'

# Expected output: "up"
# If "down": the processengine /metrics endpoint is not reachable from Prometheus

# In Grafana Explore:
# 1. Select data source: Prometheus
# 2. Enter query: rate(bizfirst_workflow_executions_total[5m])
# Expected: A non-zero value after the test execution

Step 4: Verify Traces in Tempo

# Search for traces from the test execution in Tempo via API:
curl -G "http://localhost:3200/api/search" \
  --data-urlencode 'tags=service.name=processengine' \
  --data-urlencode 'start='"$(date -d '10 minutes ago' +%s)" \
  --data-urlencode 'end='"$(date +%s)" \
  | jq '.traces | length'

# Expected output: a number > 0

# Get a specific trace by ID. The trace ID appears in the "traceId" field of the Loki log output.
# Replace the placeholder below with a real ID before running the query:
TRACE_ID="your-trace-id-from-logs"
curl -s "http://localhost:3200/api/traces/$TRACE_ID" | jq '.batches | length'

# In Grafana Explore:
# 1. Select data source: Tempo
# 2. Query mode: Search
# 3. Service Name: processengine
# Expected: Recent traces listed with duration and span counts
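
Running the trace lookup with the placeholder value still in `TRACE_ID` produces a confusing error rather than a trace. The sketch below checks that the value looks like a W3C Trace Context trace ID (32 lowercase hex characters, not all zeros) before querying. It assumes the services log IDs in that format; `is_trace_id` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if $1 is 32 lowercase hex characters and
# not all zeros (the W3C Trace Context trace-id format).
is_trace_id() {
  [ "${#1}" -eq 32 ] || return 1
  case "$1" in
    *[!0123456789abcdef]*) return 1 ;;             # non-hex character present
    00000000000000000000000000000000) return 1 ;;  # all-zero ID is invalid
  esac
  return 0
}

# Usage:
# if is_trace_id "$TRACE_ID"; then
#   curl -s "http://localhost:3200/api/traces/$TRACE_ID" | jq '.batches | length'
# else
#   echo "TRACE_ID does not look like a trace ID: $TRACE_ID" >&2
# fi
```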

Common Ingestion Problems and Fixes

| Symptom | Likely Cause | Fix |
|---|---|---|
| No logs in Loki | OTEL_EXPORTER_OTLP_ENDPOINT unreachable | Check network connectivity from BizFirstGO container to otel-collector:4317. Check firewall rules. |
| Loki returns data but wrong job label | Inconsistent OTEL_SERVICE_NAME | Verify env var is set; restart service after changing env vars. |
| Prometheus target shows "down" | /metrics endpoint blocked or wrong port | Check ProcessEngine exposes /metrics on the configured port. Verify Prometheus scrape_config job. |
| No traces in Tempo | Sampling too aggressive (0% rate) | Check OTEL_TRACES_SAMPLER_ARG: set to 1.0 for testing, then reduce. |
| OTel Collector logs show "dropped" spans | Tempo write path overloaded | Check Tempo container resources. Increase memory limit. |
| TraceId missing from log lines | OTel logging SDK not bridged to Serilog | Verify ObservabilityServiceExtensions.cs registers the OTel log bridge. |
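
For the "No traces in Tempo" case, a quick sanity check of the sampling ratio can rule out a zero or malformed value. This sketch assumes a ratio-based sampler, where `OTEL_TRACES_SAMPLER_ARG` is a number between 0 and 1. `check_sampler_arg` is a hypothetical helper name.

```shell
# Hypothetical helper: classify a sampling ratio passed as $1.
# Prints "unset", "invalid", "zero", or "ok".
check_sampler_arg() {
  if [ -z "$1" ]; then
    echo "unset"
    return
  fi
  # awk does the floating-point comparison; the regex rejects non-numeric input.
  printf '%s\n' "$1" | awk '
    !/^[0-9]*\.?[0-9]+$/ { print "invalid"; exit }
    $1 > 1               { print "invalid"; exit }
    $1 == 0              { print "zero"; exit }
                         { print "ok" }'
}

# Usage (run in the same environment as the service being checked):
# check_sampler_arg "$OTEL_TRACES_SAMPLER_ARG"
```

A result of "zero" matches the 0% sampling cause in the table; "unset" means the SDK falls back to whatever default the service configures.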

Checking the OTel Collector Pipeline

# The OTel Collector exposes its own metrics on port 8888:
curl -s http://localhost:8888/metrics | grep otelcol_receiver

# Key metrics to check:
# otelcol_receiver_accepted_spans_total{receiver="otlp"} — spans received
# otelcol_receiver_accepted_metric_points_total             — metrics received
# otelcol_receiver_accepted_log_records_total               — log records received
# otelcol_exporter_sent_spans_total{exporter="otlp/tempo"}  — spans forwarded to Tempo
# otelcol_exporter_send_failed_spans_total                  — failed exports (check for errors)

# Check OTel Collector logs for pipeline errors (case-insensitive, so messages such as "Dropping data" also match):
docker compose logs --tail=50 otel-collector | grep -iE "error|warn|drop"
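
Comparing accepted and sent counts is easier with totals than with raw series. The sketch below sums every series of one metric from the Collector's Prometheus text output. `sum_metric` is a hypothetical helper name, and the metric names are the ones listed above, which may vary by Collector version.

```shell
# Hypothetical helper: sum the values of every series of metric $1 in
# Prometheus text exposition format read from stdin.
# Assumes sample lines carry no trailing timestamp, so the last field is the value.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                       # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)          # drop the {label="..."} part
      if (metric == name) total += $NF  # last field is the sample value
    }
    END { printf "%d\n", total }'
}

# Usage:
# metrics="$(curl -s http://localhost:8888/metrics)"
# echo "accepted: $(printf '%s\n' "$metrics" | sum_metric otelcol_receiver_accepted_spans_total)"
# echo "sent:     $(printf '%s\n' "$metrics" | sum_metric otelcol_exporter_sent_spans_total)"
```

If the accepted total keeps growing while the sent total stalls, the problem is likely between the Collector and Tempo rather than between the services and the Collector.
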

Allow 30-60 Seconds After Restart

After changing environment variables and restarting BizFirstGO services, allow 30-60 seconds before running verification queries. The OTel SDK buffers spans and batches them — the first batch may take up to 10 seconds to flush. Prometheus scraping runs on a 15-second interval, so the first metric data point appears within 15-30 seconds of service startup.
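
Instead of waiting a fixed period, the verification queries can be polled until they succeed. This is a minimal sketch; `retry_until` and `target_up` are hypothetical helper names, and the 60-second timeout and 5-second interval in the usage example are illustrative values based on the guidance above.

```shell
# Hypothetical helper: retry_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or roughly TIMEOUT_SECONDS have elapsed.
# INTERVAL_SECONDS should be at least 1, otherwise the timeout never triggers.
retry_until() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  while ! "$@"; do
    elapsed=$((elapsed + interval))
    if [ "$elapsed" -gt "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# Usage: succeed once Prometheus reports the processengine target as up.
# target_up() {
#   curl -s http://localhost:9090/api/v1/targets \
#     | jq -e '.data.activeTargets[] | select(.labels.job == "processengine") | .health == "up"' >/dev/null
# }
# retry_until 60 5 target_up && echo "processengine target is up"
```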