Working Memory — Memory Overview

What Is Working Memory?

Every call to an LLM is stateless. The model has no persistent memory between calls — it only knows what appears in the messages array you send it. Working memory is the name Octopus gives to the assembled messages array that is constructed before each LLM call:

// Working memory is the messages[] sent to the LLM on every turn:
[
  { "role": "system",    "content": "You are Aria, the HR specialist..." },
  { "role": "user",      "content": "[Retrieved Knowledge]\nSource: HR Policy..." },
  { "role": "user",      "content": "Hi, how many leave days do I have?" },
  { "role": "assistant", "content": "You have 12 annual leave days remaining." },
  { "role": "user",      "content": "Can I take 5 days in June?" }
]

Key Characteristics

Property	Value	Implication
Storage location	In-process RAM only	Nothing persists — lost when the request completes
Scope	Single conversation turn	Rebuilt from scratch on every LLM call
Size limit	Model context window (e.g. 200K tokens)	Must stay within budget; pruning kicks in when exceeded
Content sources	System prompt + matched procedure + retrieved knowledge + episodic snippets + message history and current message	Procedural, semantic, and episodic memory all feed into working memory
Managed by	`WorkingMemoryManager`	Assembles, injects, prunes, and validates context

What Goes Into Working Memory

The WorkingMemoryManager assembles working memory from five sources, in this order:

1. System Prompt: Agent persona, goals, and instructions. Always first, the position where the model gives content the most attention.

2. Matched Procedure: If procedural memory matched a skill, its step-by-step instructions appear here.

3. Retrieved Knowledge: Semantic memory chunks retrieved by vector similarity search on the current message.

4. Episodic Snippets: Past conversation summaries recalled from SQL, providing cross-session user context.

5. Message History + Current Message: Conversation turns from the current session, pruned to fit the token budget.
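
The assembly order above can be sketched in a few lines of Python. This is an illustrative sketch only, not the WorkingMemoryManager implementation: the `TurnInputs` container, the `assemble_working_memory` function, and the `[Matched Procedure]` and `[Episodic Memory]` labels are assumptions made for the example. Only the `[Retrieved Knowledge]` label and the use of the `user` role for injected content come from the sample array shown earlier.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Message = Dict[str, str]


@dataclass
class TurnInputs:
    """Hypothetical container for one turn's inputs (not an Octopus type)."""
    system_prompt: str
    current_message: str
    matched_procedure: Optional[str] = None
    retrieved_knowledge: List[str] = field(default_factory=list)
    episodic_snippets: List[str] = field(default_factory=list)
    history: List[Message] = field(default_factory=list)


def assemble_working_memory(inputs: TurnInputs) -> List[Message]:
    """Build the messages list in the documented source order."""
    # 1. System prompt always comes first.
    messages: List[Message] = [{"role": "system", "content": inputs.system_prompt}]
    # 2. Matched procedure, only if procedural memory found a skill.
    if inputs.matched_procedure:
        messages.append({"role": "user",
                         "content": "[Matched Procedure]\n" + inputs.matched_procedure})
    # 3. Retrieved knowledge chunks (label taken from the sample array).
    if inputs.retrieved_knowledge:
        messages.append({"role": "user",
                         "content": "[Retrieved Knowledge]\n" + "\n".join(inputs.retrieved_knowledge)})
    # 4. Episodic snippets (label is an assumption for this sketch).
    if inputs.episodic_snippets:
        messages.append({"role": "user",
                         "content": "[Episodic Memory]\n" + "\n".join(inputs.episodic_snippets)})
    # 5. Session history, then the current message last.
    messages.extend(inputs.history)
    messages.append({"role": "user", "content": inputs.current_message})
    return messages
```

Sources with no content for a turn are skipped here, so the array stays as short as the turn allows. Whether the real manager omits empty sources or emits placeholders is not stated on this page.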

Token Budget and Pruning

Every agent has a MaxWorkingMemoryTokens budget. When the assembled context exceeds that budget, the pruner trims the message history until the context fits. Three pruning strategies are available:

Strategy	How It Works	Cost
FIFO	Drops the oldest turns first	Zero
Summarize	Condenses old turns via a secondary LLM call before dropping them	Extra LLM call
SlidingWindow	Always keeps only the last N turn pairs	Zero
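
As a rough illustration of the two zero-cost strategies, the sketch below prunes a history list in Python. It rests on several assumptions that are not part of Octopus: the function names are hypothetical, token counts are approximated by counting whitespace-separated words rather than using a real tokenizer, and history is assumed to alternate user and assistant messages. Summarize is left out because it depends on a secondary LLM call.

```python
from typing import Dict, List

Message = Dict[str, str]


def estimate_tokens(messages: List[Message]) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return sum(len(m["content"].split()) for m in messages)


def prune_fifo(fixed: List[Message], history: List[Message], budget: int) -> List[Message]:
    """Drop the oldest user/assistant pair until fixed + history fits the budget.

    `fixed` holds the messages that are never pruned in this sketch
    (system prompt, injected knowledge, current message).
    """
    kept = list(history)  # copy so the caller's list is not mutated
    while kept and estimate_tokens(fixed + kept) > budget:
        del kept[:2]  # remove the oldest pair
    return kept


def prune_sliding_window(history: List[Message], max_pairs: int) -> List[Message]:
    """Keep only the last `max_pairs` user/assistant pairs, regardless of budget."""
    if max_pairs <= 0:
        return []
    return history[-2 * max_pairs:]
```

The difference in behavior is that the FIFO sketch only drops turns when the budget is exceeded, while the sliding window always caps history at N pairs even when there is room to spare.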

Why Working Memory Is Transient

Working memory intentionally does not persist between turns. This design means:

No stale context — each turn is assembled fresh from the authoritative sources (SQL, vector DB, config)
Consistent retrieval — if a new document is indexed mid-conversation, it appears in retrieval on the next turn
Stateless agents — any server can handle any turn; working memory can be reconstructed from the conversation ID
No memory leak risk — the context window can never grow beyond the configured budget across turns
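
The properties listed above follow from rebuilding the context on every turn instead of caching it. The Python sketch below is a toy model of that idea under loudly stated assumptions: `HISTORY_STORE` and `KNOWLEDGE_INDEX` are in-memory stand-ins for the SQL and vector stores, `retrieve` uses keyword overlap in place of vector similarity, and every function name is hypothetical.

```python
from typing import Dict, List

Message = Dict[str, str]

# Hypothetical stand-ins for the authoritative stores (SQL history, vector index).
HISTORY_STORE: Dict[str, List[Message]] = {}
KNOWLEDGE_INDEX: List[str] = []


def retrieve(query: str) -> List[str]:
    """Toy retrieval: keyword overlap instead of vector similarity search."""
    words = set(query.lower().split())
    return [doc for doc in KNOWLEDGE_INDEX if words & set(doc.lower().split())]


def build_working_memory(conversation_id: str, user_message: str) -> List[Message]:
    """Rebuild the full context from the stores; nothing is cached between calls."""
    messages: List[Message] = [
        {"role": "system", "content": "You are Aria, the HR specialist."}
    ]
    chunks = retrieve(user_message)  # queried fresh on every turn
    if chunks:
        messages.append({"role": "user",
                         "content": "[Retrieved Knowledge]\n" + "\n".join(chunks)})
    messages.extend(HISTORY_STORE.get(conversation_id, []))
    messages.append({"role": "user", "content": user_message})
    return messages


def record_turn(conversation_id: str, user_message: str, reply: str) -> None:
    """Persist the finished turn so any server can rebuild it later."""
    HISTORY_STORE.setdefault(conversation_id, []).extend([
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": reply},
    ])
```

In this model, a document appended to `KNOWLEDGE_INDEX` between two calls to `build_working_memory` is picked up by the second call, because retrieval runs again on each turn. The conversation ID is the only state a caller has to carry.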

Full Guide

This is a summary page. The Working Memory full guide covers context window mechanics, token budgeting, all three pruning strategies with code, knowledge injection position, tool call history management, and the Context Inspector debugging panel.
