Validation Checklist
Run through this 10-point checklist before declaring BizFirst Observe production-ready. Every item must pass. If any item fails, resolve it and re-run the full checklist — partial observability is more dangerous than no observability because it creates a false sense of coverage.
1
All stack components are healthy
Run docker compose ps (or kubectl get pods -n observe). All components should show "running" / "healthy", with no restarts in the last hour. Expected components: otel-collector, loki, prometheus, tempo, grafana, alertmanager.
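The component check can also be scripted. The following is a minimal sketch, assuming the output of docker compose ps --format json is captured as text and that the Compose service names match the component names listed above (that mapping is an assumption, not something the checklist states). It does not check restart counts, so the "no restarts in the last hour" condition still needs a manual look.

```python
import json

# Component names from the checklist; assumed to equal the Compose service names.
EXPECTED = {"otel-collector", "loki", "prometheus", "tempo", "grafana", "alertmanager"}


def parse_compose_ps(output):
    """Parse the text printed by `docker compose ps --format json`.

    Depending on the Compose version the output is either one JSON array
    or one JSON object per line, so both shapes are accepted.
    """
    text = output.strip()
    if not text:
        return []
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def unhealthy_components(containers, expected=EXPECTED):
    """Return {service: reason} for every expected service that is not OK."""
    by_service = {c.get("Service"): c for c in containers}
    problems = {}
    for name in sorted(expected):
        container = by_service.get(name)
        if container is None:
            problems[name] = "missing"
        elif container.get("State") != "running":
            problems[name] = f"state={container.get('State')}"
        elif container.get("Health") not in ("", None, "healthy"):
            # An empty Health value means the service defines no healthcheck.
            problems[name] = f"health={container.get('Health')}"
    return problems
```

An empty result from unhealthy_components means every expected component was found running and, where a healthcheck exists, healthy.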
2
All Prometheus scrape targets are "up"
Open Prometheus at http://localhost:9090/targets. All BizFirstGO service targets should show status "up" with a recent scrape timestamp, and no target should be in the "down" state. Check that processengine, edgestream, octopus, and node-exporter are all present.
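The same information is available from the Prometheus HTTP API at /api/v1/targets. The sketch below assumes that each service appears under a job label equal to its name (processengine, edgestream, octopus, node-exporter); adjust EXPECTED_JOBS if your scrape configuration names them differently.

```python
import json
from urllib.request import urlopen

# Assumed job label values; the checklist only names the services.
EXPECTED_JOBS = {"processengine", "edgestream", "octopus", "node-exporter"}


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the active scrape targets from the Prometheus HTTP API."""
    with urlopen(f"{base_url}/api/v1/targets?state=active", timeout=10) as resp:
        return json.load(resp)


def target_problems(payload, expected_jobs=EXPECTED_JOBS):
    """List targets that are not "up" and expected jobs with no target at all."""
    problems = []
    seen = set()
    for target in payload["data"]["activeTargets"]:
        job = target["labels"].get("job", "")
        seen.add(job)
        if target["health"] != "up":
            problems.append(f"{job} ({target.get('scrapeUrl', '?')}): {target['health']}")
    for job in sorted(set(expected_jobs) - seen):
        problems.append(f"{job}: no target registered")
    return problems
```

Calling target_problems(fetch_targets()) against a healthy stack should return an empty list.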
3
Logs appear in Loki for all services
In Grafana Explore (Loki), run count_over_time({job=~"processengine|edgestream|octopus"}[5m]). Each service should return a non-zero count. Run the query within 5 minutes of a workflow execution to confirm live log flow.
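To run this check outside Grafana, the query can be sent to Loki's HTTP API. This is a rough sketch under two assumptions: Loki is reachable at http://localhost:3100 (its default port, not stated in this guide), and the query is wrapped in sum by (job) so that each service yields one number instead of one per log stream.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SERVICES = ("processengine", "edgestream", "octopus")
# The checklist query, aggregated per job (the aggregation is an addition).
QUERY = 'sum by (job) (count_over_time({job=~"processengine|edgestream|octopus"}[5m]))'


def fetch_counts(base_url="http://localhost:3100"):
    """Run QUERY as an instant query against Loki and return the JSON payload."""
    url = f"{base_url}/loki/api/v1/query?{urlencode({'query': QUERY})}"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def counts_by_job(payload):
    """Map each job label to its log-line count from a vector result."""
    counts = {}
    for series in payload["data"]["result"]:
        job = series["metric"].get("job", "")
        # Sample values arrive as strings, e.g. [1700000000.0, "42"].
        counts[job] = counts.get(job, 0) + int(float(series["value"][1]))
    return counts


def silent_services(payload, services=SERVICES):
    """Return the services that produced no log lines in the window."""
    counts = counts_by_job(payload)
    return [name for name in services if counts.get(name, 0) == 0]
```

A service that logged nothing in the window is simply absent from the result, which is why silent_services treats a missing job the same as a zero count.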
4
Traces appear in Tempo
In Grafana Explore (Tempo), search for service name "processengine". At least one trace should exist from the test execution run during ingestion verification. The trace should show the workflow.execute root span and at least one node.execute child span.
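The span-shape requirement can be expressed as a small check over a trace exported as JSON. The sketch below assumes an OTLP-style layout (a top-level batches or resourceSpans list, each holding scopeSpans with a spans list); the exact key names vary between Tempo versions, so treat the layout as an assumption and adapt it to what your instance returns.

```python
def iter_spans(trace):
    """Yield every span dict from an OTLP-style trace document."""
    batches = trace.get("batches") or trace.get("resourceSpans") or []
    for batch in batches:
        scopes = batch.get("scopeSpans") or batch.get("instrumentationLibrarySpans") or []
        for scope in scopes:
            yield from scope.get("spans", [])


def trace_shape_ok(trace, root_name="workflow.execute", child_name="node.execute"):
    """True if the trace has a root span named root_name and at least one
    non-root span named child_name."""
    spans = list(iter_spans(trace))
    # A root span is one without a parent span ID.
    has_root = any(
        s.get("name") == root_name and not s.get("parentSpanId") for s in spans
    )
    has_child = any(
        s.get("name") == child_name and s.get("parentSpanId") for s in spans
    )
    return has_root and has_child
```

This only checks that spans with the right names exist; it does not verify that each node.execute span descends from the workflow.execute root.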
5
Cross-signal correlation works
In Loki, find a log line that contains a traceId field. Click the TraceId Derived Field link. The trace should open in Tempo, and its spans should match the timeline of the log entry. This confirms that the OTel SDK is injecting trace context into logs.
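A derived field works by applying a regular expression with one capture group to each log line. The sketch below illustrates that extraction step, assuming JSON-formatted log lines and 32-character hexadecimal trace IDs (the W3C Trace Context length); the pattern is an illustration, not the one shipped with BizFirst Observe.

```python
import re

# Hypothetical pattern: matches "traceId":"<32 hex chars>" in a JSON log line.
TRACE_ID_RE = re.compile(r'"traceId"\s*:\s*"([0-9a-fA-F]{32})"')


def extract_trace_id(log_line):
    """Return the trace ID captured from a log line, or None if absent."""
    match = TRACE_ID_RE.search(log_line)
    return match.group(1) if match else None
```

If extract_trace_id returns None for real log lines from your services, the derived field link will not appear, which usually points to a log format mismatch or to missing trace context in the logs.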
6
All 10 pre-built dashboards load without errors
Open each of the 10 BizFirstGO dashboards. No panels should show "Error" or "No data source". Time-series panels may show "No data" if there is no historical data yet; that is acceptable, but no panel may show a configuration error.
7
Dashboard variables populate correctly
On the Flow Studio Overview dashboard, click the $tenant dropdown. It should show "All" plus any tenant IDs that have run workflows. Click the $environment dropdown; it should show your environment names (production, staging, etc.).
8
Alert rules are loaded and in "Normal" state
In Grafana: Alerting → Alert rules. All 6 pre-built BizFirstGO alert rules are listed. All show "Normal" state (assuming the system is healthy). No rules show "Error" state (which indicates a bad PromQL query or missing data source).
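For a scripted version of this check, the sketch below evaluates a rules payload in the Prometheus rules format (data.groups[].rules[] with name, state, and health). Two assumptions apply: that your Grafana version exposes its managed rules in this format through a Prometheus-compatible rules endpoint, and that the "Normal" state in the UI corresponds to "inactive" in that payload. Verify both against your Grafana instance before relying on it.

```python
def rule_problems(payload, expected_count=6):
    """Return a list of problems found in a Prometheus-style rules payload.

    expected_count is the number of pre-built alert rules the checklist expects.
    """
    rules = [rule for group in payload["data"]["groups"] for rule in group["rules"]]
    problems = []
    if len(rules) != expected_count:
        problems.append(f"expected {expected_count} rules, found {len(rules)}")
    for rule in rules:
        if rule.get("health") == "error":
            # Typically a bad query or a missing data source.
            problems.append(f"{rule['name']}: evaluation error")
        elif rule.get("state") != "inactive":
            # Assumed mapping: "inactive" here is shown as "Normal" in the UI.
            problems.append(f"{rule['name']}: state={rule.get('state')}")
    return problems
```

An empty list means all expected rules are loaded, evaluating without errors, and not pending or firing.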
9
Alert notification delivery confirmed
Use Grafana's "Send test" on the configured contact point. Confirm the test message arrives in Slack (or email, or PagerDuty). The message must actually arrive; a "test sent" confirmation in Grafana is not enough. Check the Slack channel or inbox.
10
Data retention is configured
Verify that the Loki compactor is enabled with a retention policy (check loki-config.yaml for retention_enabled: true). Confirm that Prometheus has --storage.tsdb.retention.time set and that Tempo has a retention config. For production, confirm S3 lifecycle rules are active on the storage buckets.
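Two of the retention checks in item 10 lend themselves to a quick script. The sketch below is a deliberately simple text-level check, assuming you can read loki-config.yaml as text and obtain the Prometheus command-line arguments as a list; it does not parse YAML properly and does not cover Tempo or the S3 lifecycle rules.

```python
import re

# Matches an uncommented `retention_enabled: true` line at any indentation.
_RETENTION_RE = re.compile(
    r"^[ \t]*retention_enabled:[ \t]*true[ \t]*(#.*)?$", re.MULTILINE
)


def loki_retention_enabled(config_text):
    """True if the Loki config text contains an active retention_enabled: true."""
    return _RETENTION_RE.search(config_text) is not None


def prometheus_retention(args):
    """Return the value of --storage.tsdb.retention.time from a list of
    command-line arguments, or None if the flag is not set."""
    flag = "--storage.tsdb.retention.time"
    for index, arg in enumerate(args):
        if arg.startswith(flag + "="):
            return arg.split("=", 1)[1]
        if arg == flag and index + 1 < len(args):
            return args[index + 1]
    return None
```

For example, prometheus_retention(["--storage.tsdb.retention.time=15d"]) returns "15d", while a result of None means Prometheus is running with its default retention rather than an explicit setting.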
Checklist Sign-Off Template
# BizFirst Observe Production Readiness Sign-Off
# Date: YYYY-MM-DD
# Engineer: [name]
# Environment: production / staging
Checklist Results:
[x] 1. All stack components healthy
[x] 2. All Prometheus targets up
[x] 3. Logs in Loki for all services
[x] 4. Traces in Tempo
[x] 5. Cross-signal correlation works
[x] 6. All 10 dashboards load without errors
[x] 7. Dashboard variables populate
[x] 8. Alert rules in Normal state
[x] 9. Alert notification delivered to Slack
[x] 10. Data retention configured
Status: PASS / FAIL
Notes: [any observations]
Schedule a weekly 5-minute check: verify all Prometheus targets are still "up", check that data is still flowing in each signal, and review any alerts that fired in the past week. The observability stack is infrastructure — it needs maintenance just like any other production component.