Working Memory
Working memory is the active LLM context window — what the agent is "thinking about" right now. It is transient, token-constrained, and rebuilt fresh on every conversation turn. This page summarises the key concepts; see the full Working Memory guide for deep implementation detail.
What Is Working Memory?
Every call to an LLM is stateless. The model has no persistent memory between calls — it only knows what appears in the messages array you send it. Working memory is the name Octopus gives to the assembled messages array that is constructed before each LLM call:
// Working memory is the messages[] sent to the LLM on every turn:
[
  { "role": "system", "content": "You are Aria, the HR specialist..." },
  { "role": "user", "content": "[Retrieved Knowledge]\nSource: HR Policy..." },
  { "role": "user", "content": "Hi, how many leave days do I have?" },
  { "role": "assistant", "content": "You have 12 annual leave days remaining." },
  { "role": "user", "content": "Can I take 5 days in June?" }
]
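As a rough illustration of how such an array could be put together, here is a minimal Python sketch. The function name `assemble_working_memory` and its parameters are hypothetical and not part of the Octopus API. The ordering simply mirrors the example above, and the real WorkingMemoryManager may differ.

```python
from typing import Dict, List

Message = Dict[str, str]


def assemble_working_memory(
    system_prompt: str,
    retrieved_knowledge: List[str],
    history: List[Message],
    current_message: str,
) -> List[Message]:
    """Build the messages array for one LLM call (hypothetical helper)."""
    messages: List[Message] = [{"role": "system", "content": system_prompt}]
    if retrieved_knowledge:
        # Injected knowledge travels as a user-role message, as in the example above.
        knowledge = "[Retrieved Knowledge]\n" + "\n".join(retrieved_knowledge)
        messages.append({"role": "user", "content": knowledge})
    # Prior turns come next, oldest first, followed by the current message.
    messages.extend(history)
    messages.append({"role": "user", "content": current_message})
    return messages


messages = assemble_working_memory(
    system_prompt="You are Aria, the HR specialist...",
    retrieved_knowledge=["Source: HR Policy..."],
    history=[
        {"role": "user", "content": "Hi, how many leave days do I have?"},
        {"role": "assistant", "content": "You have 12 annual leave days remaining."},
    ],
    current_message="Can I take 5 days in June?",
)
```

Because the function takes everything it needs as arguments and keeps no state, calling it again on the next turn produces a fresh array, which is the behaviour this page describes.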
Key Characteristics
| Property | Value | Implication |
|---|---|---|
| Storage location | In-process RAM only | Nothing persists — lost when the request completes |
| Scope | Single conversation turn | Rebuilt from scratch on every LLM call |
| Size limit | Model context window (e.g. 200K tokens) | Must stay within budget; pruning kicks in when exceeded |
| Content sources | System prompt + injected knowledge + message history + current message | All four memory types contribute to working memory |
| Managed by | WorkingMemoryManager | Assembles, injects, prunes, and validates context |
What Goes Into Working Memory
The WorkingMemoryManager assembles working memory from five sources, in this order:
Token Budget and Pruning
Every agent has a MaxWorkingMemoryTokens budget. When the assembled context exceeds that budget, the pruner trims the oldest message-history turns until the context fits. Three pruning strategies are available:
| Strategy | How It Works | Cost |
|---|---|---|
| FIFO | Drops the oldest turns first | Zero |
| Summarize | Condenses old turns via a secondary LLM call before dropping them | Extra LLM call |
| SlidingWindow | Always keeps only the last N turn pairs | Zero |
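The three strategies in the table can be sketched in Python as follows. This is an illustrative approximation under stated assumptions, not the Octopus implementation. All function names are hypothetical, `estimate_tokens` is a crude stand-in for a real tokenizer (about 4 characters per token), and `summarize` is a caller-supplied function that would wrap the secondary LLM call.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def estimate_tokens(messages: List[Message]) -> int:
    """Crude stand-in for a tokenizer: roughly 4 characters per token."""
    return sum(len(m["content"]) // 4 + 1 for m in messages)


def prune_fifo(history: List[Message], fixed_tokens: int, budget: int) -> List[Message]:
    """FIFO: drop the oldest messages until fixed context plus history fits."""
    pruned = list(history)
    while pruned and fixed_tokens + estimate_tokens(pruned) > budget:
        pruned.pop(0)
    return pruned


def prune_sliding_window(history: List[Message], n_pairs: int) -> List[Message]:
    """SlidingWindow: keep only the last n_pairs user/assistant pairs."""
    return history[-2 * n_pairs:] if n_pairs > 0 else []


def prune_summarize(
    history: List[Message],
    fixed_tokens: int,
    budget: int,
    summarize: Callable[[List[Message]], str],
    summary_reserve: int = 15,
) -> List[Message]:
    """Summarize: condense the turns FIFO would drop into one summary message.

    Assumes the summarizer keeps its output within summary_reserve tokens.
    """
    if fixed_tokens + estimate_tokens(history) <= budget:
        return list(history)
    # Leave room for the summary before deciding which turns survive verbatim.
    kept = prune_fifo(history, fixed_tokens, budget - summary_reserve)
    dropped = history[: len(history) - len(kept)]
    summary = {
        "role": "user",
        "content": "[Summary of earlier turns]\n" + summarize(dropped),
    }
    return [summary] + kept


# Six 40-character messages (11 estimated tokens each), alternating roles.
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"{i:02d}" + "x" * 38}
    for i in range(6)
]
fifo = prune_fifo(history, fixed_tokens=20, budget=60)
window = prune_sliding_window(history, n_pairs=2)
summarized = prune_summarize(
    history, fixed_tokens=20, budget=60,
    summarize=lambda turns: f"{len(turns)} earlier messages",
)
```

Here `fixed_tokens` stands for the part of the context that is not pruned (system prompt, injected knowledge, current message). FIFO and SlidingWindow are pure list operations, which is why the table lists their cost as zero, while Summarize pays for one extra call through `summarize`.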
Why Working Memory Is Transient
Working memory intentionally does not persist between turns. This design means:
- No stale context — each turn is assembled fresh from the authoritative sources (SQL, vector DB, config)
- Consistent retrieval — if a new document is indexed mid-conversation, it appears in retrieval on the next turn
- Stateless agents — any server can handle any turn; working memory can be reconstructed from the conversation ID
- No memory leak risk — the context window can never grow beyond the configured budget across turns
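The stateless-agent property above can be sketched as a turn handler that rebuilds context from persisted history on every call. This is a minimal Python sketch under assumptions. `ConversationStore` is an in-memory stand-in for whatever persistent store holds conversation history, and `handle_turn` and `call_llm` are hypothetical names, not Octopus APIs.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class ConversationStore:
    """In-memory stand-in for the persistent history store (hypothetical)."""

    def __init__(self) -> None:
        self._turns: Dict[str, List[Message]] = {}

    def append(self, conversation_id: str, message: Message) -> None:
        self._turns.setdefault(conversation_id, []).append(message)

    def load(self, conversation_id: str) -> List[Message]:
        # Return a copy so callers cannot mutate stored history.
        return list(self._turns.get(conversation_id, []))


def handle_turn(
    store: ConversationStore,
    conversation_id: str,
    system_prompt: str,
    user_message: str,
    call_llm: Callable[[List[Message]], str],
) -> str:
    # Rebuild working memory from stored state; nothing is cached between turns.
    messages: List[Message] = [{"role": "system", "content": system_prompt}]
    messages += store.load(conversation_id)
    messages.append({"role": "user", "content": user_message})
    reply = call_llm(messages)
    # Persist the turn so any server can reconstruct the context next time.
    store.append(conversation_id, {"role": "user", "content": user_message})
    store.append(conversation_id, {"role": "assistant", "content": reply})
    return reply


def fake_llm(messages: List[Message]) -> str:
    """Placeholder model that reports how much context it received."""
    return f"saw {len(messages)} messages"


store = ConversationStore()
first = handle_turn(store, "conv-1", "You are Aria.", "Hi", fake_llm)
second = handle_turn(store, "conv-1", "You are Aria.", "Can I take leave?", fake_llm)
```

The handler holds no per-conversation state of its own, so the second call could run on a different server as long as that server can reach the same store.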
This is a summary page. The full Working Memory guide covers context window mechanics, token budgeting, all three pruning strategies with code, knowledge injection position, tool call history management, and the Context Inspector debugging panel.