Validation Checklist

1. All stack components are healthy

Run docker compose ps (or kubectl get pods -n observe). All components show "running" / "healthy". No restarts in the last hour. Expected components: otel-collector, loki, prometheus, tempo, grafana, alertmanager.
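This check can be scripted. The sketch below parses the output of docker compose ps --format json and reports any expected component that is missing, not running, or failing its health check. It assumes the Compose service names match the component names listed above and that the JSON output carries Service, State, and Health fields; adjust both if your setup differs.

```python
import json

# Component names taken from the checklist item above (assumed to equal
# the Compose service names).
EXPECTED = {"otel-collector", "loki", "prometheus", "tempo", "grafana", "alertmanager"}


def parse_compose_ps(raw: str) -> list:
    """Parse `docker compose ps --format json` output.

    Depending on the Compose version, the output is either a single JSON
    array or one JSON object per line, so both shapes are handled.
    """
    raw = raw.strip()
    if not raw:
        return []
    if raw.startswith("["):
        return json.loads(raw)
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def unhealthy_components(containers: list, expected: set = EXPECTED) -> dict:
    """Return {service: problem} for each expected service that is missing,
    not running, or reporting a non-healthy health check."""
    by_service = {c.get("Service"): c for c in containers}
    problems = {}
    for name in sorted(expected):
        c = by_service.get(name)
        if c is None:
            problems[name] = "missing"
        elif c.get("State") != "running":
            problems[name] = f"state={c.get('State')}"
        elif c.get("Health") not in ("", None, "healthy"):
            # An empty Health value means no health check is defined.
            problems[name] = f"health={c.get('Health')}"
    return problems
```

To use it, capture the command output (for example with subprocess.run(["docker", "compose", "ps", "--format", "json"], capture_output=True, text=True).stdout) and pass it through parse_compose_ps and unhealthy_components. An empty result means this item passes, apart from the restart count, which this sketch does not inspect.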

2. All Prometheus scrape targets are "up"

Open Prometheus at http://localhost:9090/targets. All BizFirst service targets show status "up" with a recent scrape timestamp. No targets in "down" state. Check that processengine, edgestream, octopus, and node-exporter are all present.
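The same check can be made against the Prometheus HTTP API instead of the web UI. The sketch below reads /api/v1/targets and reports required jobs that are absent and any job with a target that is not "up". It assumes the job label of each target equals the service name used above; if your scrape configs use different job names, change REQUIRED_JOBS.

```python
import json
from urllib.request import urlopen

# Assumed job label values, taken from the service names in the checklist.
REQUIRED_JOBS = {"processengine", "edgestream", "octopus", "node-exporter"}


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the target list from the Prometheus HTTP API."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)


def check_targets(payload: dict, required: set = REQUIRED_JOBS) -> tuple:
    """Return (missing_jobs, down_jobs).

    missing_jobs: required jobs with no active target at all.
    down_jobs: every job that has at least one target whose health is not "up".
    """
    active = payload.get("data", {}).get("activeTargets", [])
    seen, down = set(), set()
    for target in active:
        job = target.get("labels", {}).get("job")
        seen.add(job)
        if target.get("health") != "up":
            down.add(job)
    return required - seen, down
```

Calling check_targets(fetch_targets()) should return two empty sets when this item passes.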

3. Logs appear in Loki for all services

In Grafana Explore (Loki), run count_over_time({job=~"processengine|edgestream|octopus"}[5m]). Each service must return a non-zero count. Run the query within 5 minutes of a workflow execution to confirm live log flow.
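The query above can also be run against Loki's HTTP API. The sketch below sends it to /loki/api/v1/query, sums the per-stream counts by job label, and lists services with no log lines. The base URL http://localhost:3100 is an assumption (Loki's default port), not something this guide specifies.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SERVICES = ("processengine", "edgestream", "octopus")
# Same LogQL query as in the checklist item.
QUERY = 'count_over_time({job=~"processengine|edgestream|octopus"}[5m])'


def fetch_counts(base_url: str = "http://localhost:3100") -> dict:
    """Run the instant query against Loki (base URL is an assumption)."""
    url = f"{base_url}/loki/api/v1/query?{urlencode({'query': QUERY})}"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def counts_per_service(payload: dict, services=SERVICES) -> dict:
    """Sum the per-stream counts into one total per job label."""
    totals = {s: 0 for s in services}
    for series in payload.get("data", {}).get("result", []):
        job = series.get("metric", {}).get("job")
        if job in totals:
            # Each vector sample is [timestamp, "value-as-string"].
            totals[job] += int(float(series["value"][1]))
    return totals


def silent_services(payload: dict, services=SERVICES) -> list:
    """Return the services that produced no log lines in the window."""
    return [s for s, n in counts_per_service(payload, services).items() if n == 0]
```

This item passes when silent_services(fetch_counts()) returns an empty list.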

4. Traces appear in Tempo

In Grafana Explore (Tempo), search for service name "processengine". At least one trace exists from the test execution run during ingestion verification. The trace shows the workflow.execute root span and at least one node.execute child span.
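The span-shape requirement can be expressed as a small check. The sketch below works on a flattened list of spans, each a dict with name, spanId, and an optional parentSpanId; turning Tempo's trace response into that flat list is left out, and the dict layout is an assumption made for illustration. It accepts a node.execute span anywhere beneath the workflow.execute root, not only as a direct child.

```python
def has_expected_spans(spans: list,
                       root_name: str = "workflow.execute",
                       child_name: str = "node.execute") -> bool:
    """True if the trace has a parentless `root_name` span and at least one
    `child_name` span that descends from it (directly or indirectly)."""
    roots = {
        s["spanId"] for s in spans
        if s["name"] == root_name and not s.get("parentSpanId")
    }
    if not roots:
        return False
    by_id = {s["spanId"]: s for s in spans}

    def descends_from_root(span: dict) -> bool:
        seen = set()  # guards against malformed parent cycles
        parent = span.get("parentSpanId")
        while parent and parent not in seen:
            if parent in roots:
                return True
            seen.add(parent)
            parent = by_id.get(parent, {}).get("parentSpanId")
        return False

    return any(s["name"] == child_name and descends_from_root(s) for s in spans)
```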

5. Cross-signal correlation works

In Loki, find a log line that contains a traceId field. Click the TraceId Derived Field link. The trace opens in Tempo, and its spans match the timeline of the log entry. This confirms that the OTel SDK is injecting trace context into logs.
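If the derived-field link does not appear, it helps to confirm that the log lines really carry a usable trace ID. The sketch below extracts one from a raw log line. It assumes JSON-formatted logs with a "traceId" key and the W3C trace-context format of 32 lowercase hex characters; the regular expression is illustrative and is not necessarily the one configured in your Loki data source.

```python
import re
from typing import Optional

# Assumes JSON logs with a "traceId" key holding a 32-character hex trace ID.
TRACE_ID_RE = re.compile(r'"traceId"\s*:\s*"([0-9a-f]{32})"')


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the trace ID found in a log line, or None if there is none.

    An all-zero ID is treated as absent, since it is not a valid trace ID.
    """
    match = TRACE_ID_RE.search(log_line)
    if not match:
        return None
    trace_id = match.group(1)
    return None if set(trace_id) == {"0"} else trace_id
```

A line that yields None here has nothing for the derived field to match, which points at the SDK's log instrumentation and not at Grafana.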

6. All 10 pre-built dashboards load without errors

Open each of the 10 BizFirst dashboards. No panels show "Error" or "No data source". Time-series panels may show "No data" if there is no historical data yet; that is acceptable, but no panel may show a configuration error.

7. Dashboard variables populate correctly

On the Flow Studio Overview dashboard, click the $tenant dropdown. It should show "All" plus any tenant IDs that have run workflows. Click the $environment dropdown — it should show your environment names (production, staging, etc.).

8. Alert rules are loaded and in "Normal" state

In Grafana: Alerting → Alert rules. All 6 pre-built BizFirst alert rules are listed. All show "Normal" state (assuming the system is healthy). No rules show "Error" state (which indicates a bad PromQL query or missing data source).

9. Alert notification delivery confirmed

Use Grafana's "Send test" on the configured contact point. Confirm the test message arrives in Slack (or email, or PagerDuty). The message must actually arrive — not just show "test sent" in Grafana. Check Slack channel or inbox.

10. Data retention is configured

Verify that the Loki compactor is enabled with a retention policy (check loki-config.yaml for retention_enabled: true), that Prometheus is started with --storage.tsdb.retention.time set, and that Tempo has a retention setting configured. For production, confirm that S3 lifecycle rules are active on the storage buckets.
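The Loki and Prometheus checks can be sketched as simple text checks. The code below looks for an uncommented retention_enabled: true line in the Loki config text and pulls the value of --storage.tsdb.retention.time out of a Prometheus argument list. It is deliberately naive (a line match, not a YAML parse) and does not cover Tempo or the S3 lifecycle rules; how you obtain the config text and argument list is up to you.

```python
import re
from typing import Optional


def loki_retention_enabled(config_text: str) -> bool:
    """True if an uncommented `retention_enabled: true` line is present."""
    pattern = r"^\s*retention_enabled:\s*true\s*(#.*)?$"
    return re.search(pattern, config_text, re.MULTILINE) is not None


def prometheus_retention(args: list) -> Optional[str]:
    """Return the value of --storage.tsdb.retention.time, or None if unset.

    Handles both `--flag=value` and `--flag value` forms.
    """
    flag = "--storage.tsdb.retention.time"
    for i, arg in enumerate(args):
        if arg.startswith(flag + "="):
            return arg.split("=", 1)[1]
        if arg == flag and i + 1 < len(args):
            return args[i + 1]
    return None
```

For example, pass the contents of loki-config.yaml to loki_retention_enabled, and the command list from the Prometheus service definition to prometheus_retention.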

Checklist Sign-Off Template

# BizFirst Observe Production Readiness Sign-Off
# Date: YYYY-MM-DD
# Engineer: [name]
# Environment: production / staging

Checklist Results:
  [x] 1. All stack components healthy
  [x] 2. All Prometheus targets up
  [x] 3. Logs in Loki for all services
  [x] 4. Traces in Tempo
  [x] 5. Cross-signal correlation works
  [x] 6. All 10 dashboards load without errors
  [x] 7. Dashboard variables populate
  [x] 8. Alert rules in Normal state
  [x] 9. Alert notification delivered to Slack
  [x] 10. Data retention configured

Status: PASS / FAIL
Notes: [any observations]

After Passing: Set Up a Weekly Health Check

Schedule a weekly 5-minute check: verify all Prometheus targets are still "up", check that data is still flowing in each signal, and review any alerts that fired in the past week. The observability stack is infrastructure — it needs maintenance just like any other production component.
