Working Memory Storage — Memory Storage

In-Process Only: By Design

Working memory contains the complete LLM context for the current turn. It is never written to disk, SQL, or an external cache store; it exists only in the memory of the process handling the request. This is a deliberate architectural choice with several benefits:

Freshness: Each turn is assembled from the authoritative sources — SQL episodes, vector knowledge, agent config. There is no stale cache to invalidate.
Horizontal scaling: Any Octopus API server can handle any turn for any conversation. No server affinity or sticky sessions needed.
No sensitive data at rest: Working memory may contain the full system prompt and retrieved PII-containing knowledge. Not persisting it means it cannot be exfiltrated from storage.
Simplicity: No cache consistency problems, no eviction strategies, no distributed cache synchronisation.

In-Memory Conversation History

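While a request is being processed, the conversation message history is held in an in-memory store keyed by conversation ID. This is a ConversationHistoryCache registered with a scoped DI lifetime. In ASP.NET Core a scoped service is created once per HTTP request, so the cache does not outlive the request; history from earlier turns is reloaded from SQL, as described in the next section: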

// In-memory conversation history cache (per-request scope in ASP.NET Core)
public class ConversationHistoryCache
{
    // Key: ConversationId (Guid)
    // Value: ordered list of messages accumulated this session
    private readonly Dictionary<Guid, List<LLMMessage>> _history = new();

    public IReadOnlyList<LLMMessage> GetHistory(Guid conversationId)
        => _history.TryGetValue(conversationId, out var msgs)
            ? msgs.AsReadOnly()
            : Array.Empty<LLMMessage>();

    public void AppendTurn(Guid conversationId, LLMMessage userMsg, LLMMessage assistantMsg)
    {
        // Single lookup: create the list on first use, then append both messages in order
        if (!_history.TryGetValue(conversationId, out var msgs))
        {
            msgs = new List<LLMMessage>();
            _history[conversationId] = msgs;
        }

        msgs.Add(userMsg);
        msgs.Add(assistantMsg);
    }
}

Session Persistence via EpisodicMessages

Although working memory itself is not persisted, each message in the current session is written to the Octopus_EpisodeMessages table in SQL on every turn. This allows the conversation history to be fully reconstructed if the server restarts mid-session:

// On every turn: write message to SQL for durability
await _episodicStore.AddMessageAsync(episode.EpisodeId, new EpisodeMessage
{
    Role       = "user",
    Content    = userMessage,
    TokenCount = _tokenCounter.Count(userMessage),
    CreatedAt  = DateTime.UtcNow
}, ct);

// If server restarts, reconstruct from SQL on next request:
var messages = await _episodicStore.GetSessionMessagesAsync(conversationId, ct);
// → Re-populate in-memory history from DB
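
The re-population step is only hinted at in the comment above. A minimal sketch of how it might look, assuming the persisted rows carry a role, content, and creation timestamp: the record shapes below and the HistoryRehydrator helper are hypothetical stand-ins, not the actual Octopus types, and whether GetSessionMessagesAsync already returns rows in order is not stated in the source, so the sketch sorts defensively.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shapes; the real EpisodeMessage and LLMMessage types may differ.
public record EpisodeMessage(string Role, string Content, int TokenCount, DateTime CreatedAt);
public record LLMMessage(string Role, string Content);

// Hypothetical helper, not part of the documented API.
public static class HistoryRehydrator
{
    // Converts persisted rows back into the ordered message list held in memory.
    public static List<LLMMessage> Rehydrate(IEnumerable<EpisodeMessage> rows)
        => rows
            .OrderBy(r => r.CreatedAt)                        // restore chronological order
            .Select(r => new LLMMessage(r.Role, r.Content))   // token counts are not needed in history
            .ToList();
}
```

The resulting list can then be handed to whatever holds the in-memory history for the request. Sorting by CreatedAt assumes timestamps are distinct enough to order a session; a sequence column would be a more robust key if the table has one.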

Token Budget Management in RAM

The token budget calculation and pruning happen entirely in process, before the messages array is sent to the LLM provider:

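public class WorkingMemoryManager
{
    // Injected collaborators. The interface names are assumptions; the original shows only the fields in use.
    private readonly ITokenCounter _counter;
    private readonly IMessagePruner _pruner;

    public WorkingMemoryManager(ITokenCounter counter, IMessagePruner pruner)
    {
        _counter = counter;
        _pruner = pruner;
    }
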
    public async Task<IReadOnlyList<LLMMessage>> BuildAsync(
        AgentComposite agent,
        ConversationComposite conversation,
        MemoryAssembly memoryAssembly,
        string userMessage,
        CancellationToken ct)
    {
        // 1. Assemble all messages (system + knowledge + history + current)
        var messages = Assemble(agent, conversation, memoryAssembly, userMessage);

        // 2. Count total tokens
        int totalTokens = _counter.CountMessages(messages);

        // 3. Prune if over budget (modifies only the in-memory list)
        if (totalTokens > agent.MemoryConfig.MaxWorkingMemoryTokens)
        {
            messages = await _pruner.PruneAsync(
                messages,
                agent.MemoryConfig.MaxWorkingMemoryTokens,
                ct);
        }

        return messages;  // Sent directly to LLM provider — never written to disk
    }
}
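
The pruning strategy behind _pruner.PruneAsync is not described here. As an illustration only, one common approach is to drop the oldest non-system messages until the total fits the budget. The sketch below assumes that strategy; OldestFirstPruner, the LLMMessage record, and the countTokens delegate are hypothetical stand-ins for the real types, and the method is synchronous for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical message shape; the real LLMMessage type may differ.
public record LLMMessage(string Role, string Content);

// Hypothetical pruner illustrating an oldest-first strategy (an assumption, not the documented one).
public static class OldestFirstPruner
{
    // Removes the oldest non-system messages until the total fits the budget.
    // countTokens stands in for the real token counter.
    public static List<LLMMessage> Prune(
        IReadOnlyList<LLMMessage> messages,
        int maxTokens,
        Func<LLMMessage, int> countTokens)
    {
        var kept = messages.ToList();   // work on a copy; the caller's list is untouched
        int total = kept.Sum(countTokens);

        int i = 0;
        // Stop before the last entry so the current user message is never dropped.
        while (total > maxTokens && i < kept.Count - 1)
        {
            if (kept[i].Role == "system")
            {
                i++;                    // system messages are always kept
                continue;
            }
            total -= countTokens(kept[i]);
            kept.RemoveAt(i);           // the next candidate shifts into index i
        }
        return kept;                    // may still exceed the budget if only protected messages remain
    }
}
```

Because the method returns a new list, nothing outside the request is modified, which matches the in-process-only design described above. A production pruner might summarise dropped messages instead of discarding them.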

RAM Footprint Estimate

Component	Typical Size	Notes
System prompt	2–5 KB	Text only
Retrieved knowledge (5 chunks)	5–20 KB	Depends on chunk size
Episodic snippets (3 episodes)	2–6 KB	Summaries, not full messages
Message history (20 turns)	20–100 KB	Wide variance based on message verbosity
Total working memory	≈30–130 KB per active turn	Rounded sum of the rows above; released when the request completes
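
To turn the per-turn figures into a capacity estimate, the component ranges in the table can be summed and multiplied by the number of turns in flight at once. The sketch below does that arithmetic; FootprintEstimate is a hypothetical helper and the concurrency figure passed to it is whatever the deployment expects, not a number from this document.

```csharp
using System;

// Hypothetical helper for back-of-envelope sizing.
public static class FootprintEstimate
{
    // Per-turn size ranges in KB, taken from the table above.
    private static readonly (string Name, int MinKb, int MaxKb)[] Components =
    {
        ("System prompt", 2, 5),
        ("Retrieved knowledge (5 chunks)", 5, 20),
        ("Episodic snippets (3 episodes)", 2, 6),
        ("Message history (20 turns)", 20, 100),
    };

    // Sums the per-component bounds and scales by the number of concurrent turns.
    public static (int MinKb, int MaxKb) ForConcurrentTurns(int concurrentTurns)
    {
        int min = 0, max = 0;
        foreach (var c in Components)
        {
            min += c.MinKb;
            max += c.MaxKb;
        }
        return (min * concurrentTurns, max * concurrentTurns);
    }
}
```

For a single turn the component rows sum to 29–131 KB, which the table rounds to 30–130 KB. For an illustrative 1,000 concurrent turns the same sum gives 29,000–131,000 KB, roughly 29–131 MB of message text held in memory at once.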

No Cache Warm-Up Needed

Because working memory is rebuilt on every turn, there is no cache warm-up period after a deployment or server restart. The first turn for any conversation is as fast as any subsequent turn.
