Data Flow
A complete walkthrough of how telemetry flows from a BizFirstGO service — through the Collector, into storage — and how an engineer retrieves it in Grafana. Understanding the data flow is essential for diagnosing missing telemetry and optimizing ingestion.
Complete Data Flow — Write Path
When a workflow executes, telemetry is produced and flows through the system as follows:
Service emits telemetry
The BizFirstGO ProcessEngine service begins executing a workflow. The OTel SDK immediately starts a root workflow.execute Activity span. As the workflow progresses, the SDK emits log records (via Serilog), metric samples (counter increments, histogram observations), and span events. Log records and span events carry the same traceId; metric samples can be linked to the trace through exemplars.
SDK batches and exports
The OTel SDK's built-in batch export processor accumulates telemetry and flushes it every 5 seconds, or sooner if the batch reaches 512 items. It sends each batch to the OTel Collector via OTLP/gRPC on port 4317. Metrics are the exception: Prometheus scrapes the service's /metrics endpoint every 15 seconds instead.
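The size-or-time flush rule described in this step can be sketched in a few lines. This is an illustration only, not the OTel SDK's implementation: the class name `BatchBuffer`, its methods, and the explicit `now` argument are hypothetical, chosen so the behavior can be followed without a real clock. The 512-item and 5-second values are the ones quoted in the step.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BatchBuffer:
    """Size-or-time batcher. Hypothetical names, not the OTel SDK API."""
    export: Callable[[List[str]], None]
    max_batch_size: int = 512       # flush as soon as this many items are buffered
    flush_interval_s: float = 5.0   # flush when this long has passed since the last flush
    _items: List[str] = field(default_factory=list)
    _last_flush: float = 0.0

    def add(self, item: str, now: float) -> None:
        self._items.append(item)
        if len(self._items) >= self.max_batch_size:
            self._flush(now)

    def tick(self, now: float) -> None:
        # Called periodically by a timer; flushes a partial batch once it is old enough.
        if self._items and now - self._last_flush >= self.flush_interval_s:
            self._flush(now)

    def _flush(self, now: float) -> None:
        batch, self._items = self._items, []
        self._last_flush = now
        self.export(batch)


sent: List[List[str]] = []
buf = BatchBuffer(export=sent.append)

for i in range(600):
    buf.add(f"span-{i}", now=1.0)   # the 512th item triggers a size-based flush
buf.tick(now=3.0)                   # 2 s since the last flush: nothing is sent
buf.tick(now=6.5)                   # 5.5 s since the last flush: the remaining items go out

print([len(batch) for batch in sent])   # [512, 88]
```

A burst of 600 items therefore leaves the service as two batches: one full batch sent immediately and one partial batch sent when the interval expires.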
Collector receives and processes
The OTel Collector's OTLP receiver accepts the batch. Processors run in order: memory limiter check → resource attribute enrichment → PII redaction (for logs/traces) → batching. Processing is synchronous within the pipeline.
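The ordered processor chain can be pictured as a list of functions applied one after another to each record. The sketch below is a simplified model rather than Collector code: the function names, the email-only redaction rule, and the `deployment.environment` value are assumptions made for illustration. The final batching stage is represented by collecting the surviving records into a list.

```python
import re
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]
Processor = Callable[[Record], Optional[Record]]

# Assumed redaction rule for this sketch: mask anything that looks like an email address.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def memory_limiter(in_use_bytes: int, limit_bytes: int) -> Processor:
    """Refuse records (return None) when memory use has reached the limit."""
    def check(record: Record) -> Optional[Record]:
        return record if in_use_bytes < limit_bytes else None
    return check


def enrich(record: Record) -> Optional[Record]:
    # Hypothetical resource attribute added to every record.
    return {**record, "deployment.environment": "production"}


def redact_pii(record: Record) -> Optional[Record]:
    return {key: EMAIL.sub("[REDACTED]", value) for key, value in record.items()}


def run_pipeline(records: List[Record], processors: List[Processor]) -> List[Record]:
    """Apply processors in order; the returned list stands in for the batch stage."""
    batch: List[Record] = []
    for record in records:
        current: Optional[Record] = record
        for process in processors:
            current = process(current)
            if current is None:      # refused by an earlier stage
                break
        if current is not None:
            batch.append(current)
    return batch


records = [{"body": "user alice@example.com started workflow"}]
processors = [memory_limiter(in_use_bytes=100, limit_bytes=1000), enrich, redact_pii]
batch = run_pipeline(records, processors)
print(batch)
```

Order matters in this model: because the memory check runs first, a refused record never reaches the enrichment or redaction stages.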
Fan-out to storage backends
The Collector exports to three backends simultaneously: logs go to Loki via the Loki exporter, metrics go to Prometheus via remote write, and traces go to Tempo via OTLP/gRPC. These exports happen concurrently — a slow Loki write does not delay trace export.
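The claim that a slow Loki write does not hold up the other exports can be demonstrated with a small simulation. This is a sketch under stated assumptions: the exporters are stand-ins that only sleep, the delays are invented, and the same batch is handed to all three for brevity, whereas each signal type would carry its own data in practice.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

completed = []            # exporter names in the order they finish
lock = threading.Lock()


def make_exporter(name: str, delay_s: float):
    """Return a fake exporter that takes delay_s seconds per batch."""
    def export(batch):
        time.sleep(delay_s)   # stand-in for the network write
        with lock:
            completed.append(name)
    return export


exporters = [
    make_exporter("loki", 0.3),         # deliberately slow
    make_exporter("prometheus", 0.0),
    make_exporter("tempo", 0.0),
]

batch = ["record-1", "record-2"]
with ThreadPoolExecutor(max_workers=len(exporters)) as pool:
    for export in exporters:
        pool.submit(export, batch)
# Leaving the with-block waits for all three exports to finish.

print(completed)   # "loki" comes last; the other two did not wait for it
```

Although the Loki export is submitted first, it finishes last, which is the behavior described above: each backend's export proceeds on its own.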
Storage backends index and store
Loki indexes the label set and appends the log line to the appropriate stream chunk. Prometheus appends the metric sample to its in-memory head block and its WAL (Write-Ahead Log); the head block is periodically compacted into TSDB blocks on disk. Tempo writes the trace spans to its WAL, then flushes to object storage within minutes.
Complete Data Flow — Read Path
When an engineer opens Grafana to investigate an incident:
Engineer opens Grafana dashboard or Explore
Grafana renders the dashboard or Explore view. For each panel, it constructs a query (LogQL, PromQL, or TraceQL) with the selected time range and variable values.
Grafana queries the data source
Grafana sends the query to the appropriate backend: LogQL to Loki's /loki/api/v1/query_range, PromQL to Prometheus's /api/v1/query_range, or TraceQL to Tempo's /api/search.
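As a rough illustration of what these requests look like on the wire, the sketch below builds one URL per backend using the endpoints named above and each API's standard query parameters. The host names, ports, label names, and metric name are assumptions for the example; Grafana's own data source plugins add further parameters and headers that are not shown.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def loki_query_range(base: str, logql: str, start_ns: int, end_ns: int, limit: int = 100) -> str:
    # Loki accepts start/end as Unix timestamps in nanoseconds.
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base}/loki/api/v1/query_range?{urlencode(params)}"


def prometheus_query_range(base: str, promql: str, start_s: int, end_s: int, step: str = "15s") -> str:
    # Prometheus takes start/end in seconds plus a resolution step.
    params = {"query": promql, "start": start_s, "end": end_s, "step": step}
    return f"{base}/api/v1/query_range?{urlencode(params)}"


def tempo_search(base: str, traceql: str, start_s: int, end_s: int) -> str:
    # Tempo's search endpoint carries the TraceQL expression in the q parameter.
    params = {"q": traceql, "start": start_s, "end": end_s}
    return f"{base}/api/search?{urlencode(params)}"


start_s, end_s = 1_700_000_000, 1_700_000_900   # a 15-minute window

# Hypothetical hosts, labels, and metric names.
loki_url = loki_query_range(
    "http://loki:3100", '{service="process-engine"} |= "error"',
    start_s * 10**9, end_s * 10**9,
)
prom_url = prometheus_query_range(
    "http://prometheus:9090", "rate(workflow_executions_total[5m])", start_s, end_s,
)
tempo_url = tempo_search(
    "http://tempo:3200", '{ name = "workflow.execute" }', start_s, end_s,
)

for url in (loki_url, prom_url, tempo_url):
    print(url)

# The query expression survives URL encoding unchanged.
print(parse_qs(urlsplit(loki_url).query)["query"][0])
```

When telemetry appears to be missing, replaying a request of this shape directly against a backend helps separate a data problem from a dashboard problem.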
Backend executes the query
The storage backend scans its index to find matching series/streams, fetches the relevant data blocks, applies any filters, and returns the result set to Grafana.
Grafana renders the result
Grafana transforms the raw query result into the appropriate panel visualization — time-series graph, log list, trace timeline — and displays it to the engineer.
Cross-signal correlation
If a log line contains a traceId, Grafana renders a link button. Clicking it opens Tempo's trace detail for that traceId in a split pane. Similarly, clicking a Prometheus exemplar point on a histogram opens the linked trace in Tempo.
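The log-to-trace link depends on being able to pull a trace ID out of the log line with a pattern match. The sketch below shows that extraction step in isolation. The two log line formats and the regular expression are assumptions for illustration; the pattern actually configured for the Loki data source may differ. The `/api/traces/<traceid>` path is Tempo's trace-by-ID endpoint.

```python
import re
from typing import Optional

# Matches traceId=<32 hex chars> (logfmt) and "traceId": "<32 hex chars>" (JSON).
TRACE_ID = re.compile(r'traceId"?[=:]\s*"?([0-9a-f]{32})')


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the trace ID found in the log line, or None when there is none."""
    match = TRACE_ID.search(log_line)
    return match.group(1) if match else None


def tempo_trace_path(trace_id: str) -> str:
    """Path of Tempo's trace-by-ID endpoint for the given trace."""
    return f"/api/traces/{trace_id}"


# Hypothetical log lines in two common layouts.
logfmt_line = 'level=info traceId=4bf92f3577b34da6a3ce929d0e0e4736 msg="step completed"'
json_line = '{"level": "info", "traceId": "4bf92f3577b34da6a3ce929d0e0e4736"}'
plain_line = "level=info msg=\"service started\""

for line in (logfmt_line, json_line, plain_line):
    trace_id = extract_trace_id(line)
    print(tempo_trace_path(trace_id) if trace_id else "no trace link")
```

A log line without a matching traceId yields no link, which is why logs emitted outside an active span cannot be correlated this way.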
Write Latency Expectations
| Signal Type | From Emission to Queryable | Controlling Factor |
|---|---|---|
| Logs (Loki) | 5–30 seconds | OTel SDK batch interval (5s) + Loki ingester flush (up to 25s) |
| Metrics (Prometheus) | 15–30 seconds | Prometheus scrape interval (15s) + scrape duration |
| Traces (Tempo) | 5–60 seconds | OTel SDK batch interval (5s) + Tempo WAL flush (up to 55s) |
BizFirst Observe is a near-real-time observability system, not a real-time alerting system. Do not rely on it where telemetry must be visible within a second of being emitted. Under normal load, the minimum practical latency from a log line being emitted to it appearing in Grafana Explore is approximately 5–15 seconds.
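The upper bounds in the table for logs and traces are the sums of the stage delays listed in the "Controlling Factor" column. The short calculation below makes that arithmetic explicit. It covers only those two signals, because the table gives no figure for Prometheus scrape duration; the dictionary layout and function name are illustrative.

```python
from typing import Dict, List, Tuple

Stage = Tuple[str, float]   # (stage name, maximum delay in seconds)

# Stage delays taken from the table above.
WRITE_PATH: Dict[str, List[Stage]] = {
    "logs": [("OTel SDK batch interval", 5), ("Loki ingester flush", 25)],
    "traces": [("OTel SDK batch interval", 5), ("Tempo WAL flush", 55)],
}


def worst_case_seconds(stages: List[Stage]) -> float:
    """Worst case: every stage waits its full interval before passing data on."""
    return sum(delay for _, delay in stages)


for signal, stages in WRITE_PATH.items():
    print(f"{signal}: up to {worst_case_seconds(stages)} s")
# logs: up to 30 s
# traces: up to 60 s
```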
Failure Modes and Resilience
| Component Failure | Impact | Recovery |
|---|---|---|
| OTel Collector down | Services queue telemetry in the SDK buffer (default 2048 items); once the buffer is full, further items are dropped | SDK automatically reconnects when the Collector returns; dropped items are not replayed |
| Loki down | Collector retries log export with backoff; other signals (metrics, traces) continue normally | Loki restarts, Collector resumes; WAL on Loki prevents data loss for recent logs |
| Prometheus down | Metrics scrapes fail; existing TSDB data intact; Collector buffers remote write attempts | Prometheus restarts, resumes scraping; some metric samples lost during downtime |
| Tempo down | Trace export fails; Collector retries with backoff; logs and metrics continue | Tempo restarts; traces in flight may be lost; WAL protects recent data |
| Grafana down | No user access to dashboards or alerts; all write paths continue unaffected | Grafana restarts; stateless — reconnects to backends immediately |
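The "OTel Collector down" row describes a bounded buffer that drops data once it is full. A minimal model of that behavior, assuming the 2048-item default from the table, is sketched below. The class and method names are hypothetical and do not correspond to the SDK's API.

```python
from collections import deque
from typing import Deque, List


class BoundedExportQueue:
    """Fixed-capacity queue that rejects new items when full. Hypothetical sketch."""

    def __init__(self, max_size: int = 2048) -> None:
        self.max_size = max_size
        self.dropped = 0
        self._items: Deque[int] = deque()

    def __len__(self) -> int:
        return len(self._items)

    def offer(self, item: int) -> bool:
        """Queue the item, or count it as dropped if there is no room."""
        if len(self._items) >= self.max_size:
            self.dropped += 1
            return False
        self._items.append(item)
        return True

    def drain(self, max_items: int) -> List[int]:
        """Remove and return up to max_items, oldest first (one export batch)."""
        count = min(max_items, len(self._items))
        return [self._items.popleft() for _ in range(count)]


queue = BoundedExportQueue()

# Collector is down: nothing drains while the service keeps emitting.
for item in range(3000):
    queue.offer(item)
print(len(queue), queue.dropped)   # 2048 952

# Collector returns: queued items are exported, dropped ones are gone for good.
first_batch = queue.drain(512)
print(len(first_batch), len(queue))   # 512 1536
```

In this model, an outage that outlasts the buffer leaves a permanent gap covering everything emitted after the buffer filled, which matches the "no replay" recovery note in the table.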