Data Flow — Physical Architecture

Complete Data Flow — Write Path

When a workflow executes, telemetry is produced and flows through the system as follows:

Service emits telemetry

The BizFirst ProcessEngine service begins executing a workflow. The OTel SDK immediately starts a root workflow.execute Activity span. As the workflow progresses, the SDK emits log records (via Serilog), metric samples (counter increments, histogram observations), and span events — all tagged with the same traceId.

SDK batches and exports

The OTel SDK's built-in batch exporter accumulates telemetry and flushes it every 5 seconds or when the batch reaches 512 items, whichever comes first. It sends each batch to the OTel Collector via OTLP/gRPC on port 4317. Metrics are also exposed on the service's /metrics endpoint, which Prometheus scrapes every 15 seconds.
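The two flush triggers described above (elapsed time and batch size) can be sketched as a toy buffer. This is a minimal illustration, not the OTel SDK's implementation: the class name `BatchBuffer`, the `export` callback, and the explicit `now` argument are all hypothetical and exist only to make the timing easy to follow.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BatchBuffer:
    """Toy model of a batching exporter (hypothetical, not the real SDK class)."""
    export: Callable[[List[str]], None]  # hypothetical sink, e.g. an OTLP client
    max_batch_size: int = 512            # flush when this many items are queued
    flush_interval_s: float = 5.0        # flush when this much time has elapsed
    _queue: List[str] = field(default_factory=list)
    _last_flush: float = 0.0

    def add(self, item: str, now: float) -> None:
        self._queue.append(item)
        if len(self._queue) >= self.max_batch_size:
            self.flush(now)              # size trigger

    def tick(self, now: float) -> None:
        """Called periodically; flushes if the interval has elapsed."""
        if self._queue and now - self._last_flush >= self.flush_interval_s:
            self.flush(now)              # time trigger

    def flush(self, now: float) -> None:
        batch, self._queue = self._queue, []
        self._last_flush = now
        self.export(batch)


sent: List[List[str]] = []
buf = BatchBuffer(export=sent.append)
for i in range(600):
    buf.add(f"span-{i}", now=0.0)        # the 512th item triggers a flush
buf.tick(now=5.0)                        # the remaining 88 go out on the timer
print([len(b) for b in sent])            # prints [512, 88]
```

Passing `now` explicitly keeps the sketch deterministic; a real exporter would run the timer on a background thread.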

Collector receives and processes

The OTel Collector's OTLP receiver accepts the batch. Processors run in order: memory limiter check → resource attribute enrichment → PII redaction (for logs/traces) → batching. Processing is synchronous within the pipeline.
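The processor order above can be modelled as a chain of functions, each receiving the previous one's output. This is a sketch under stated assumptions: the real Collector is configured declaratively and its memory limiter watches process memory, whereas here a simple item count stands in for it. The e-mail regex, the `deployment.environment` attribute, and every function name are hypothetical.

```python
import re
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]
Processor = Callable[[Record], Optional[Record]]

# Assumed PII pattern for illustration only; real redaction rules are site-specific.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def make_memory_limiter(batch: List[Record], max_items: int) -> Processor:
    # Stand-in for the memory limiter: refuse records once the batch is full.
    def limit(record: Record) -> Optional[Record]:
        return record if len(batch) < max_items else None
    return limit


def enrich(record: Record) -> Record:
    # Resource attribute enrichment (hypothetical attribute and value).
    return {**record, "deployment.environment": "prod"}


def redact(record: Record) -> Record:
    # PII redaction applied to the record body.
    return {**record, "body": EMAIL.sub("[REDACTED]", record.get("body", ""))}


def make_batcher(batch: List[Record]) -> Processor:
    def add(record: Record) -> Optional[Record]:
        batch.append(record)
        return record
    return add


def run_pipeline(record: Optional[Record], processors: List[Processor]) -> Optional[Record]:
    for process in processors:
        record = process(record)
        if record is None:  # a processor rejected the record
            return None
    return record


batch: List[Record] = []
pipeline = [make_memory_limiter(batch, max_items=2), enrich, redact, make_batcher(batch)]
out = run_pipeline({"body": "login by alice@example.com"}, pipeline)
print(out["body"])  # prints: login by [REDACTED]
```

Placing the limiter first means a rejected record costs no enrichment or redaction work, which is the point of running that check at the head of the chain.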

Fan-out to storage backends

The Collector exports to three backends simultaneously: logs go to Loki via the Loki exporter, metrics go to Prometheus via remote write, and traces go to Tempo via OTLP/gRPC. These exports happen concurrently — a slow Loki write does not delay trace export.
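The concurrency claim above can be illustrated with a small asyncio sketch. The delays are invented stand-ins for network writes and the `export` and `fan_out` functions are hypothetical. It shows only the principle: each export runs independently, so the slowest backend finishes last without holding up the others.

```python
import asyncio
from typing import List


async def export(backend: str, delay_s: float, done: List[str]) -> None:
    await asyncio.sleep(delay_s)  # stand-in for a network write
    done.append(backend)          # record completion order


async def fan_out() -> List[str]:
    done: List[str] = []
    # All three exports start together; none waits for another to finish.
    await asyncio.gather(
        export("loki", 0.3, done),        # deliberately slow
        export("prometheus", 0.2, done),
        export("tempo", 0.1, done),
    )
    return done


order = asyncio.run(fan_out())
print(order)  # tempo and prometheus complete before the slow loki write
```

The total run time is close to the slowest export (about 0.3 s), not the sum of all three.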

Storage backends index and store

Loki indexes the label set and appends the log line to the appropriate stream chunk. Prometheus records the metric sample in its in-memory head block and its WAL (Write-Ahead Log); the head is periodically compacted into persistent TSDB blocks. Tempo writes the trace spans to its WAL, then flushes them to object storage within minutes.

Complete Data Flow — Read Path

When an engineer opens Grafana to investigate an incident:

Engineer opens Grafana dashboard or Explore

Grafana renders the dashboard or Explore view. For each panel, it constructs a query (LogQL, PromQL, or TraceQL) with the selected time range and variable values.

Grafana queries the data source

Grafana sends the query to the appropriate backend: LogQL to Loki's /loki/api/v1/query_range, PromQL to Prometheus's /api/v1/query_range, or TraceQL to Tempo's /api/search.
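As a rough illustration of the three requests above, the following sketch builds the query URLs by hand. The endpoint paths come from the text. The base URLs (each tool's default local port), the example queries, and the helper names are assumptions. Grafana adds further parameters of its own.

```python
from urllib.parse import urlencode

# Assumed base URLs (default local ports); adjust to your deployment.
LOKI = "http://localhost:3100"
PROMETHEUS = "http://localhost:9090"
TEMPO = "http://localhost:3200"


def loki_url(logql: str, start_s: int, end_s: int) -> str:
    # Loki accepts start/end as nanosecond Unix epochs.
    params = {"query": logql, "start": start_s * 10**9, "end": end_s * 10**9}
    return f"{LOKI}/loki/api/v1/query_range?{urlencode(params)}"


def prometheus_url(promql: str, start_s: int, end_s: int, step_s: int = 15) -> str:
    # Prometheus takes Unix seconds plus a resolution step.
    params = {"query": promql, "start": start_s, "end": end_s, "step": step_s}
    return f"{PROMETHEUS}/api/v1/query_range?{urlencode(params)}"


def tempo_url(traceql: str, start_s: int, end_s: int) -> str:
    # Tempo's search endpoint takes the TraceQL expression in the q parameter.
    params = {"q": traceql, "start": start_s, "end": end_s}
    return f"{TEMPO}/api/search?{urlencode(params)}"


print(loki_url('{app="x"}', 1, 2))
print(prometheus_url("up", 1, 2))
print(tempo_url("{ status = error }", 1, 2))
```

Note the differing time units: the same time range is sent as nanoseconds to Loki but as seconds to Prometheus and Tempo.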

Backend executes the query

The storage backend scans its index to find matching series/streams, fetches the relevant data blocks, applies any filters, and returns the result set to Grafana.

Grafana renders the result

Grafana transforms the raw query result into the appropriate panel visualization — time-series graph, log list, trace timeline — and displays it to the engineer.

Cross-signal correlation

If a log line contains a traceId, Grafana renders a link button next to it. Clicking the button opens Tempo's trace detail for that traceId in a split pane. Similarly, clicking a Prometheus exemplar point on a histogram opens the linked trace in Tempo.
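The log-to-trace link works by matching a trace ID in the log text and turning it into a trace lookup. A minimal sketch of that idea follows. It assumes a `traceId=<32 hex chars>` log format and a local Tempo base URL, both hypothetical. In Grafana itself this is configured on the Loki data source rather than written as code.

```python
import re
from typing import Optional

TEMPO = "http://localhost:3200"  # assumed base URL

# Assumed log format: traceId=<32 lowercase hex chars> (the W3C trace ID length).
TRACE_ID = re.compile(r'traceId[=:]\s*"?([0-9a-f]{32})')


def trace_link(log_line: str) -> Optional[str]:
    """Return a Tempo trace-by-ID URL if the line carries a trace ID."""
    match = TRACE_ID.search(log_line)
    return f"{TEMPO}/api/traces/{match.group(1)}" if match else None


line = 'level=error msg="step failed" traceId=4bf92f3577b34da6a3ce929d0e0e4736'
print(trace_link(line))
print(trace_link("level=info msg=started"))  # prints None: no trace ID present
```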

Write Latency Expectations

Signal Type	From Emission to Queryable	Controlling Factor
Logs (Loki)	5–30 seconds	OTel SDK batch interval (5s) + Loki ingester flush (up to 25s)
Metrics (Prometheus)	15–30 seconds	Prometheus scrape interval (15s) + scrape duration
Traces (Tempo)	5–60 seconds	OTel SDK batch interval (5s) + Tempo WAL flush (up to 55s)

Not Real-Time

BizFirst Observe is a near-real-time observability system, not a real-time alerting system. Do not use it for sub-second latency monitoring. Under normal load, a log line takes at least 5–15 seconds from emission to appear in Grafana Explore.

Failure Modes and Resilience

Component Failure	Impact	Recovery
OTel Collector down	Services queue telemetry in the SDK buffer (default 2048 items); once the buffer is full, new items are dropped	SDK automatically reconnects when the Collector returns; dropped items are not replayed
Loki down	Collector retries log export with backoff; other signals (metrics, traces) continue normally	Loki restarts, Collector resumes; WAL on Loki prevents data loss for recent logs
Prometheus down	Metrics scrapes fail; existing TSDB data intact; Collector buffers remote write attempts	Prometheus restarts, resumes scraping; some metric samples lost during downtime
Tempo down	Trace export fails; Collector retries with backoff; logs and metrics continue	Tempo restarts; traces in flight may be lost; WAL protects recent data
Grafana down	No user access to dashboards or alerts; all write paths continue unaffected	Grafana restarts; stateless — reconnects to backends immediately
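Two mechanisms recur in the table above: a bounded buffer that drops on overflow, and retry with backoff. Both can be sketched briefly. The 2048-item limit comes from the table. The backoff parameters (5 s initial delay, 1.5x multiplier, 30 s cap, 300 s give-up time) are illustrative assumptions rather than values from this document, and the class and function names are hypothetical.

```python
from collections import deque
from typing import List


class BoundedQueue:
    """Buffer that rejects new items once full, counting what it drops."""

    def __init__(self, max_size: int = 2048) -> None:
        self.items: deque = deque()
        self.max_size = max_size
        self.dropped = 0

    def offer(self, item: str) -> bool:
        if len(self.items) >= self.max_size:
            self.dropped += 1  # overflow: the item is lost, not replayed
            return False
        self.items.append(item)
        return True


def backoff_delays(initial_s: float = 5.0, multiplier: float = 1.5,
                   max_s: float = 30.0, max_elapsed_s: float = 300.0) -> List[float]:
    """Exponential backoff schedule (without jitter) until the give-up time."""
    delays: List[float] = []
    elapsed, delay = 0.0, initial_s
    while elapsed + delay <= max_elapsed_s:
        delays.append(delay)
        elapsed += delay
        delay = min(delay * multiplier, max_s)  # grow, but cap each wait
    return delays


queue = BoundedQueue()
for i in range(2050):
    queue.offer(f"item-{i}")
print(len(queue.items), queue.dropped)  # prints: 2048 2

print(backoff_delays()[:3])             # prints: [5.0, 7.5, 11.25]
```

Production retry logic usually adds random jitter to each delay so that many clients do not retry in lockstep; it is omitted here to keep the schedule deterministic.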
