Incident Response in TEE

What Is (and Is Not) Available During an Incident

Tool	Standard BizFirst	TEE BizFirst	TEE Alternative
SSH into process engine	Available	Not available	None: the TEE blocks interactive shell access
Attach debugger (gdb, dotnet-trace)	Available	Not available	Pre-defined trace spans in in-TEE Tempo
Live log tail (Grafana Live)	Available	Not available	Batch log queries in external Loki (30s delay)
Query full logs with ad-hoc LogQL	Available	External: sanitized only	In-TEE Loki via attested channel (full detail)
Inspect distributed traces	Full spans in Tempo	External: trace IDs only	In-TEE Tempo via attested channel (full spans)
Memory dump / heap inspection	Available (dotnet-dump)	Not available	None — explicitly blocked by TEE design
Restart/redeploy process engine	Available	Requires re-attestation	Plan restart during low-traffic window; verify attestation after restart
Audit log review	Standard log query	External: sanitized audit log	In-TEE Loki (full audit) + WORM S3 (compliance copy)

Incident Type 1: Workflow Error Rate Spike

# Alert fires: bizfirst_workflow_error_rate > 5% for 5 minutes

# Step 1: Query external Loki for error codes (available outside TEE)
sum by (errorCode, tenantHash) (
  count_over_time(
    {job="tee-processengine", level="Error"}
      | json
      | errorCode != ""
    [5m]
  )
)

# Step 2: Identify dominant error code
# Example result: E_CALCULATION_OVERFLOW x 47, tenantHash=t_8f3a2b1c

# Step 3: Find affected executionIds from external Loki
{job="tee-processengine", level="Error"} |= "E_CALCULATION_OVERFLOW"
  | json | executionId != ""
  | line_format "{{.executionId}} {{.nodeType}} {{.timestamp}}"

# Step 4: Query in-TEE Loki via attested channel for full detail
# (Use the bizfirstgo-tee-query CLI tool — attested channel only)
bizfirstgo-tee-query logs \
  --filter 'executionId="exec-abc-123"' \
  --time-from "2026-05-25T10:00:00Z" \
  --time-to "2026-05-25T10:30:00Z"
# Returns full in-TEE log including exception details (with computation context)

# Step 5: Check in-TEE Tempo for trace
bizfirstgo-tee-query trace \
  --trace-id "a1b2c3d4e5f6..." \
  --show-spans
# Returns full span waterfall inside TEE security boundary
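
Steps 2 and 3 above can also be done offline once the sanitized error lines have been exported from external Loki. The following is a minimal Python sketch, assuming each exported line is a JSON object carrying the `errorCode`, `tenantHash`, and `executionId` fields used in the queries above. The function name `summarize_errors` and the sample values are illustrative, not part of any BizFirst tooling.

```python
import json
from collections import Counter


def summarize_errors(log_lines):
    """Count (errorCode, tenantHash) pairs and collect executionIds per errorCode.

    Assumes one JSON object per line; lines that are not JSON objects are skipped.
    """
    counts = Counter()
    executions = {}
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not JSON: ignore
        if not isinstance(record, dict):
            continue
        code = record.get("errorCode")
        if not code:
            continue  # mirrors the errorCode != "" filter
        counts[(code, record.get("tenantHash", "unknown"))] += 1
        exec_id = record.get("executionId")
        if exec_id:
            executions.setdefault(code, set()).add(exec_id)
    return counts, executions


if __name__ == "__main__":
    # Illustrative sample lines; real input would be the exported Loki results.
    sample = [
        '{"errorCode": "E_CALCULATION_OVERFLOW", "tenantHash": "t_8f3a2b1c", "executionId": "exec-abc-123"}',
        '{"errorCode": "E_CALCULATION_OVERFLOW", "tenantHash": "t_8f3a2b1c", "executionId": "exec-abc-124"}',
        '{"errorCode": "E_EXAMPLE_OTHER", "tenantHash": "t_8f3a2b1c", "executionId": "exec-abc-200"}',
    ]
    counts, executions = summarize_errors(sample)
    (code, tenant), n = counts.most_common(1)[0]
    print(f"{code} x {n}, tenantHash={tenant}")
    print(sorted(executions[code]))
```

The executionIds collected for the dominant error code are the values to pass to the attested-channel query in Step 4.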

Incident Type 2: Telemetry Gap (Missing Logs)

# Alert fires: no logs from tee-processengine for > 2 minutes
# Possible causes: (a) TEE crash, (b) vsock channel failure, (c) OTel Collector failure,
#                  (d) audit log chain gap (tampering concern)

# Step 1: Check external OTel Collector health
{job="otel-collector-external"} |= "tee-processengine"
  | json | message =~ ".*receiver.*"

# Step 2: Check vsock proxy connectivity (host-side log)
{job="vsock-proxy"} | json | status = "error"

# Step 3: Check if TEE process engine is alive (health endpoint)
# Health endpoint is in the TEE's approved egress list:
curl http://tee-health-proxy.internal:8080/health
# Returns: {"processengine": "healthy"} or {"processengine": "not-responding"}

# Step 4: Check audit chain continuity
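{job="tee-audit"} | json | tee_instance_id = "nitro-001-us-east"
  | line_format "{{.seq}} {{.timestamp}}"
# LogQL cannot sort or truncate log lines inside the query; set the query limit to 10
# (results are returned newest first by default)
# Compare seq values: a gap means the TEE may have crashed and restarted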

# Step 5: If TEE restarted, verify new attestation
bizfirstgo-tee-query attestation --latest
# Returns: new TEE measurement + timestamp; compare PCR[2] to expected code version

# Step 6: If code measurement changed unexpectedly — SECURITY INCIDENT
# Escalate to security team; do not restart or alter the TEE instance
# Preserve all audit records and attestation reports for forensic analysis
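
The seq comparison in Step 4 is easy to get wrong by eye when many records are involved. A minimal Python sketch of the check follows, assuming the `seq` values have already been extracted from the audit query results as integers and that they increase by exactly one per record. The function name `find_seq_gaps` is hypothetical.

```python
def find_seq_gaps(seqs):
    """Return (first_missing, last_missing) ranges between observed seq values.

    Duplicates and input order are ignored; only gaps inside the observed
    range are reported.
    """
    ordered = sorted(set(seqs))
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps


if __name__ == "__main__":
    print(find_seq_gaps([1040, 1041, 1042, 1044, 1045]))  # [(1043, 1043)]
```

An empty result means the observed range is contiguous. It says nothing about records before the first or after the last seq value that was fetched.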

Incident Type 3: Audit Chain Integrity Failure

# Alert: audit chain verification failed (sequence gap or hash mismatch)
# This is a SECURITY INCIDENT — treat as potential log tampering

# Immediate actions:
# 1. Do NOT restart the TEE instance (preserve forensic state)
# 2. Do NOT modify or delete any log records
# 3. Notify security team and compliance officer

# Step 1: Identify the gap in the audit chain
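{job="tee-audit"} | json | tee_instance_id = "nitro-001-us-east"
  | line_format "{{.seq}}"
# LogQL cannot sort log lines by a field; run the query in forward (oldest-first) order
# and look for non-consecutive seq values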

# Step 2: Retrieve affected records from WORM S3 (tamper-proof copy)
aws s3 cp s3://bizfirstgo-tee-audit-logs/tee-audit/2026/05/25/ \
  /forensics/audit-records/ --recursive

# Step 3: Run integrity verification tool
bizfirstgo-audit-verify \
  --records-dir /forensics/audit-records/ \
  --tee-public-key tee-attestation-cert.pem \
  --from-seq 1000 --to-seq 2000

# Output example:
# seq 1042: VALID (signature ok, hash chain ok)
# seq 1043: MISSING (not present in WORM S3)
# seq 1044: VALID (signature ok, but prev_hash does not match seq 1042)
# CONCLUSION: Record 1043 never reached WORM S3; it was lost or removed before the S3 write

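# Step 4: Check S3 access logs for unauthorized deletion attempts
aws s3api get-bucket-logging --bucket bizfirstgo-tee-audit-logs
# This returns only the logging configuration (target bucket and prefix);
# search the access logs stored there for DELETE requests against the audit prefix
# COMPLIANCE: Object Lock in COMPLIANCE mode prevents deletion of locked object versions,
# so if a record is missing from WORM S3, it was lost or removed before it was written to S3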
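
To make the hash-chain part of the verification concrete, here is a minimal Python sketch of the idea. It is not the implementation of `bizfirstgo-audit-verify`. The record layout (`seq`, `payload`, `prev_hash`, `hash`), the rule `hash = SHA-256(prev_hash + payload)`, and the status names are all assumptions made for illustration, and signature verification against the TEE public key is left out entirely.

```python
import hashlib


def record_hash(prev_hash: str, payload: str) -> str:
    """Assumed chaining rule: SHA-256 over the previous hash followed by the payload."""
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def build_chain(start_seq, payloads, genesis="0" * 64):
    """Build a well-formed chain; used here only to produce demo data."""
    records, prev = [], genesis
    for offset, payload in enumerate(payloads):
        digest = record_hash(prev, payload)
        records.append({"seq": start_seq + offset, "payload": payload,
                        "prev_hash": prev, "hash": digest})
        prev = digest
    return records


def verify_chain(records, from_seq, to_seq):
    """Classify every seq in [from_seq, to_seq] (status names are hypothetical)."""
    by_seq = {r["seq"]: r for r in records}
    report = {}
    for seq in range(from_seq, to_seq + 1):
        rec = by_seq.get(seq)
        if rec is None:
            report[seq] = "MISSING"
        elif rec["hash"] != record_hash(rec["prev_hash"], rec["payload"]):
            report[seq] = "TAMPERED"      # content does not match its own hash
        elif seq - 1 not in by_seq:
            # No predecessor to link to: acceptable only at the start of the range.
            report[seq] = "VALID" if seq == from_seq else "UNLINKED"
        elif rec["prev_hash"] != by_seq[seq - 1]["hash"]:
            report[seq] = "CHAIN_BREAK"   # predecessor was replaced or reordered
        else:
            report[seq] = "VALID"
    return report


if __name__ == "__main__":
    chain = build_chain(1041, ["a", "b", "c", "d"])          # seq 1041..1044
    without_1043 = [r for r in chain if r["seq"] != 1043]
    for seq, status in verify_chain(without_1043, 1041, 1044).items():
        print(seq, status)
```

In this sketch a record whose predecessor is absent is reported as `UNLINKED` rather than valid, because its own hash can be recomputed but its position in the chain cannot be confirmed.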

Incident Type 4: TEE Attestation Failure

# Alert: TEE attestation verification failed
# Possible causes: (a) code was modified, (b) hardware failure,
#                  (c) unauthorized deployment, (d) attestation service outage

# Step 1: Retrieve current attestation report
bizfirstgo-tee-query attestation --latest --output json > current-attestation.json

# Step 2: Compare PCR values to expected baseline
# (jq -S sorts keys so that ordering differences do not show up as changes)
diff <(jq -S '.pcrs' current-attestation.json) <(jq -S '.pcrs' expected-attestation.json)
# Any difference in PCR[2] = application code was changed
# Difference in PCR[0]/PCR[1] = enclave image or kernel/boot image changed

# Step 3: If code change is authorized (expected after deployment):
# - Verify the new PCR[2] matches the expected deployment hash
# - Update the expected-attestation.json in the attestation baseline registry
# - Verify all in-flight executions completed or were safely cancelled

# Step 4: If code change is NOT authorized:
# - SECURITY INCIDENT — escalate immediately
# - Block all new workflow executions (update load balancer health check)
# - Preserve the running TEE instance for forensic analysis
# - Do not restart or terminate the TEE instance

# Step 5: After security investigation, redeploy from known-good container image
# - Rebuild from git commit with verified hash
# - New attestation report will have expected PCR[2] value
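
The PCR comparison in Step 2 can also be scripted so that the output names exactly which registers changed. The sketch below assumes the `pcrs` field in both JSON files is an object mapping a register name to a hex digest. The key names `PCR0`, `PCR1`, `PCR2` and the function names are assumptions, since the actual output format of `bizfirstgo-tee-query attestation` is not shown here.

```python
import json


def load_pcrs(path):
    """Read the 'pcrs' object from an attestation JSON file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)["pcrs"]


def diff_pcrs(current: dict, expected: dict) -> dict:
    """Return {name: (expected_value, current_value)} for every PCR that differs.

    A register present on only one side is reported with None on the other.
    """
    changed = {}
    for name in sorted(set(current) | set(expected)):
        if current.get(name) != expected.get(name):
            changed[name] = (expected.get(name), current.get(name))
    return changed


if __name__ == "__main__":
    # Placeholder digests for illustration; real values are much longer hex strings.
    expected = {"PCR0": "aa", "PCR1": "bb", "PCR2": "cc"}
    current = {"PCR0": "aa", "PCR1": "bb", "PCR2": "dd"}
    for name, (want, got) in diff_pcrs(current, expected).items():
        print(f"{name}: expected {want}, got {got}")
```

With real files, the inputs would come from `load_pcrs("current-attestation.json")` and `load_pcrs("expected-attestation.json")`. An empty result means the measurements match the baseline.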

Pre-Defined Investigation Procedures Checklist

# TEE Incident Response Readiness Checklist
# Complete before first TEE production deployment

Tooling:
[ ] bizfirstgo-tee-query CLI installed and tested (attested channel access)
[ ] bizfirstgo-audit-verify tool available with TEE public key loaded
[ ] Expected attestation baseline (expected-attestation.json) stored in secure registry
[ ] S3 audit bucket access credentials configured for forensic team

Runbooks:
[ ] Error rate spike runbook written and tested in TEE staging environment
[ ] Telemetry gap runbook written and tested
[ ] Audit chain integrity failure escalation procedure approved by security + compliance
[ ] Attestation failure security incident procedure approved by CISO

Monitoring:
[ ] Alert: bizfirst_workflow_error_rate > 5% for 5m (fires on external Prometheus)
[ ] Alert: no tee-processengine logs for 2m (fires on external Loki)
[ ] Alert: audit chain sequence gap detected (fires on audit verification job)
[ ] Alert: attestation verification failed (fires on attestation monitor)

Access:
[ ] Forensic team members have attested channel credentials
[ ] WORM S3 read-only access for forensic team confirmed
[ ] Escalation contacts (security, compliance, CISO) documented and tested

Practice Incident Response Before Production

TEE incident response works under constraints that most operations teams have not encountered before. Teams that have never operated a TEE deployment are likely to be caught off guard by the absence of SSH and debugger access, especially under the time pressure of a live incident. Run tabletop exercises and simulated incidents in a TEE staging environment before going to production. Rehearse the runbooks and the tooling: discovering that you cannot connect to in-TEE Loki during an actual incident is too late.
