Working Memory Storage
Working memory is deliberately not stored persistently — it exists only in process RAM for the duration of a single conversation turn. This page explains the in-memory data structures, the conversation history cache, and how working memory is reconstructed when needed.
In-Process Only: By Design
Working memory contains the complete LLM context for the current turn. It is never written to disk, SQL, or a cache store. This is a deliberate architectural choice with several benefits:
- Freshness: Each turn is assembled from the authoritative sources — SQL episodes, vector knowledge, agent config. There is no stale cache to invalidate.
- Horizontal scaling: Any Octopus API server can handle any turn for any conversation. No server affinity or sticky sessions needed.
- No sensitive data at rest: Working memory may contain the full system prompt and retrieved PII-containing knowledge. Not persisting it means it cannot be exfiltrated from storage.
- Simplicity: No cache consistency problems, no eviction strategies, no distributed cache synchronisation.
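A minimal sketch of what "in-process only" looks like in code. It assumes a hypothetical `LLMMessage` record with `Role` and `Content`, and it assumes retrieved knowledge is injected as a second system message. The real message shape and assembly order may differ. Every input is loaded fresh for the turn, and the result is held in no field, so the list becomes unreachable, and can be garbage-collected, once the caller has finished with the turn.

```csharp
using System.Collections.Generic;

// Hypothetical message shape; the real LLMMessage type may carry more fields.
public sealed record LLMMessage(string Role, string Content);

public static class TurnContext
{
    // Builds the full LLM context for one turn from freshly loaded inputs.
    // No static or instance state is touched: the returned list is the only copy.
    public static List<LLMMessage> Build(
        string systemPrompt,                 // from agent config
        IEnumerable<string> knowledgeChunks, // from vector search
        IEnumerable<LLMMessage> history,     // from SQL episode messages
        string userMessage)                  // the current request
    {
        var messages = new List<LLMMessage> { new LLMMessage("system", systemPrompt) };

        // Assumption: retrieved knowledge is passed as one extra system message
        // with a hypothetical "Relevant knowledge:" prefix.
        var knowledge = string.Join("\n\n", knowledgeChunks);
        if (knowledge.Length > 0)
            messages.Add(new LLMMessage("system", "Relevant knowledge:\n" + knowledge));

        messages.AddRange(history);
        messages.Add(new LLMMessage("user", userMessage));
        return messages;
    }
}
```

Because `Build` is a pure function of its inputs, two servers given the same SQL and vector data produce the same context, which is what makes the horizontal-scaling property hold.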
In-Memory Conversation History
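Within a single turn, the conversation message history is held in an in-memory ConversationHistoryCache keyed by conversation ID. The cache is registered with a scoped DI lifetime, which in ASP.NET Core means one instance per HTTP request: it is discarded when the turn's request completes and does not carry history into the next turn. Earlier turns are reloaded from SQL at the start of each request, which is what lets any API server handle any turn: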
```csharp
using System;
using System.Collections.Generic;

// In-memory conversation history cache (per-request scope in ASP.NET Core)
public class ConversationHistoryCache
{
    // Key: ConversationId (Guid)
    // Value: ordered list of messages accumulated during this request
    private readonly Dictionary<Guid, List<LLMMessage>> _history = new();

    // Returns a read-only view, or an empty list for an unknown conversation.
    // The cast makes both branches share the IReadOnlyList<T> type.
    public IReadOnlyList<LLMMessage> GetHistory(Guid conversationId)
        => _history.TryGetValue(conversationId, out var msgs)
            ? (IReadOnlyList<LLMMessage>)msgs.AsReadOnly()
            : Array.Empty<LLMMessage>();

    public void AppendTurn(Guid conversationId, LLMMessage userMsg, LLMMessage assistantMsg)
    {
        // Single lookup: create the list on first use, then append both messages.
        if (!_history.TryGetValue(conversationId, out var msgs))
        {
            msgs = new List<LLMMessage>();
            _history[conversationId] = msgs;
        }
        msgs.Add(userMsg);
        msgs.Add(assistantMsg);
    }
}
```
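To make the per-request lifetime concrete, the sketch below simulates two consecutive turns for the same conversation, each served by a new cache instance, as a scoped registration would provide. It repeats a functionally equivalent copy of `ConversationHistoryCache` and assumes a minimal `LLMMessage` record so that it compiles on its own; `ScopeDemo` is a hypothetical helper for illustration only.

```csharp
using System;
using System.Collections.Generic;

// Assumed minimal message shape, for illustration only.
public sealed record LLMMessage(string Role, string Content);

// Functionally equivalent copy of the cache shown above.
public sealed class ConversationHistoryCache
{
    private readonly Dictionary<Guid, List<LLMMessage>> _history = new();

    public IReadOnlyList<LLMMessage> GetHistory(Guid conversationId)
        => _history.TryGetValue(conversationId, out var msgs)
            ? (IReadOnlyList<LLMMessage>)msgs.AsReadOnly()
            : Array.Empty<LLMMessage>();

    public void AppendTurn(Guid conversationId, LLMMessage userMsg, LLMMessage assistantMsg)
    {
        if (!_history.TryGetValue(conversationId, out var msgs))
        {
            msgs = new List<LLMMessage>();
            _history[conversationId] = msgs;
        }
        msgs.Add(userMsg);
        msgs.Add(assistantMsg);
    }
}

public static class ScopeDemo
{
    // Simulates two HTTP requests for one conversation. A scoped registration
    // hands each request its own instance, modelled here with `new`.
    public static (int firstRequest, int secondRequest) Run(Guid conversationId)
    {
        var request1 = new ConversationHistoryCache();
        request1.AppendTurn(conversationId,
            new LLMMessage("user", "Hi"), new LLMMessage("assistant", "Hello!"));

        var request2 = new ConversationHistoryCache();
        return (request1.GetHistory(conversationId).Count,
                request2.GetHistory(conversationId).Count);
    }
}
```

`Run` returns `(2, 0)`: the second request starts empty, so anything it needs from earlier turns has to come from the durable store rather than from RAM.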
Session Persistence via EpisodicMessages
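Although working memory itself is not persisted, each message in the current session is also written to Octopus_EpisodeMessages in SQL. Since the history cache lives only for one request, this table is the durable copy of the conversation: each new request reloads history from it, so a conversation survives a server restart mid-session and can continue on any server: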
```csharp
// On every turn: write each message to SQL for durability
await _episodicStore.AddMessageAsync(episode.EpisodeId, new EpisodeMessage
{
    Role = "user",
    Content = userMessage,
    TokenCount = _tokenCounter.Count(userMessage),
    CreatedAt = DateTime.UtcNow
}, ct);
// The assistant reply is stored the same way, with Role = "assistant".

// On the next request (always a fresh scope, including after a restart),
// reload the conversation from SQL:
var messages = await _episodicStore.GetSessionMessagesAsync(conversationId, ct);
// → Re-populate the in-memory history from these rows
```
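The re-population step is only indicated by a comment above. One way it could look is sketched below, assuming hypothetical `EpisodeMessageRow` and `LLMMessage` records; the real `EpisodeMessage` type, and whatever `GetSessionMessagesAsync` returns, may differ. Rows are sorted by `CreatedAt` before mapping, because SQL guarantees no row order without an `ORDER BY`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shapes standing in for the real persistence and LLM types.
public sealed record EpisodeMessageRow(string Role, string Content, DateTime CreatedAt);
public sealed record LLMMessage(string Role, string Content);

public static class HistoryRehydrator
{
    // Rebuilds the ordered, in-memory history for one conversation from the
    // rows persisted in Octopus_EpisodeMessages.
    public static List<LLMMessage> Rehydrate(IEnumerable<EpisodeMessageRow> rows)
        => rows
            .OrderBy(r => r.CreatedAt) // chronological; stable for equal timestamps
            .Select(r => new LLMMessage(r.Role, r.Content))
            .ToList();
}
```

Sorting in the query itself (an `ORDER BY CreatedAt` in SQL) would work equally well; doing it here simply keeps the sketch independent of the store.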
Token Budget Management in RAM
The token budget calculation and pruning happen entirely in process, before the messages array is sent to the LLM provider:
```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public class WorkingMemoryManager
{
    // _counter (token counter) and _pruner are injected via the constructor;
    // their declarations are omitted here.

    public async Task<IReadOnlyList<LLMMessage>> BuildAsync(
        AgentComposite agent,
        ConversationComposite conversation,
        MemoryAssembly memoryAssembly,
        string userMessage,
        CancellationToken ct)
    {
        // 1. Assemble all messages (system + knowledge + history + current)
        IReadOnlyList<LLMMessage> messages =
            Assemble(agent, conversation, memoryAssembly, userMessage);

        // 2. Count total tokens
        int totalTokens = _counter.CountMessages(messages);

        // 3. Prune if over budget (operates only on the in-memory list)
        int budget = agent.MemoryConfig.MaxWorkingMemoryTokens;
        if (totalTokens > budget)
        {
            messages = await _pruner.PruneAsync(messages, budget, ct);
        }

        // Returned to the caller, which sends it to the LLM provider;
        // the list is never written to disk.
        return messages;
    }
}
```
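The pruner itself is not shown above. Below is a minimal sketch of one plausible strategy, oldest-first truncation, assuming a hypothetical `LLMMessage` record and a caller-supplied token-counting function. The actual `_pruner` may use a different strategy, and this sketch is synchronous for brevity, whereas `PruneAsync` is asynchronous.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed minimal message shape, for illustration only.
public sealed record LLMMessage(string Role, string Content);

public static class OldestFirstPruner
{
    // Drops the oldest conversational messages until the total fits the budget.
    // System messages (prompt, knowledge) and the final message (the current
    // user turn) are always kept, so the result can still exceed the budget
    // if those alone are too large.
    public static List<LLMMessage> Prune(
        IReadOnlyList<LLMMessage> messages,
        int maxTokens,
        Func<LLMMessage, int> countTokens)
    {
        var result = messages.ToList(); // copy: the input list is never modified
        int total = result.Sum(countTokens);

        int i = 0;
        // i < Count - 1 keeps the last message out of reach.
        while (total > maxTokens && i < result.Count - 1)
        {
            if (result[i].Role == "system") { i++; continue; }
            total -= countTokens(result[i]);
            result.RemoveAt(i); // the next message shifts into index i
        }
        return result;
    }
}
```

Keeping system messages protects the prompt and retrieved knowledge, and keeping the last message guarantees the current question is always sent. A production pruner might instead drop whole user/assistant pairs so the model never sees an orphaned reply.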
RAM Footprint Estimate
| Component | Typical Size | Notes |
|---|---|---|
| System prompt | 2–5 KB | Text only |
| Retrieved knowledge (5 chunks) | 5–20 KB | Depends on chunk size |
| Episodic snippets (3 episodes) | 2–6 KB | Summaries, not full messages |
| Message history (20 turns) | 20–100 KB | Wide variance based on message verbosity |
| Total working memory | 30–130 KB per active turn | Released when request completes |
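The total row is the sum of the four component rows, and aggregate memory grows linearly with the number of turns being processed at once. As a worked example, assuming a hypothetical load of 1,000 concurrent turns on a single server:

```latex
\begin{aligned}
\text{min per turn} &= 2 + 5 + 2 + 20 = 29~\text{KB} \approx 30~\text{KB} \\
\text{max per turn} &= 5 + 20 + 6 + 100 = 131~\text{KB} \approx 130~\text{KB} \\
1{,}000 \times 30~\text{KB} &= 30~\text{MB}, \qquad 1{,}000 \times 130~\text{KB} = 130~\text{MB}
\end{aligned}
```

If the table's sizes are measured as UTF-8 text, the in-process figures will be somewhat higher: .NET strings store UTF-16 code units, so ASCII-heavy text takes about two bytes per character, before per-object overhead.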
Because working memory is rebuilt on every turn, there is no cache warm-up period after a deployment or server restart. The first turn a freshly started server handles for a conversation costs the same as it would on a long-running server.