Knowledge Indexing — Semantic Memory

Ingestion Pipeline

Document Upload

The user uploads a file (PDF, TXT, or MD) or provides a URL, which triggers a call to IDocumentIngester.IngestAsync().

Text Extraction

Text is extracted according to the source format: PDF files go through PDFium text extraction, HTML pages behind a URL are loaded in a Playwright headless browser and their text is read, and MD files are treated as plain text.

Chunking

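The configured IChunker splits the text into overlapping chunks. Each chunk is 200–500 tokens, with a 50-token overlap between consecutive chunks (the ChunkingConfig defaults are a 400-token target and a 50-token overlap).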
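
The overlap can be pictured as a sliding window over the token sequence. The following is only a minimal sketch, not the IChunker implementation: the class name OverlapChunkingSketch is hypothetical, tokens are assumed to be already produced by some tokenizer, and paragraph or sentence boundaries are ignored.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OverlapChunkingSketch
{
    // Splits a token sequence into windows of at most chunkSize tokens.
    // Each window starts (chunkSize - overlap) tokens after the previous one,
    // so it repeats the last `overlap` tokens of its predecessor.
    public static List<string[]> Split(IReadOnlyList<string> tokens, int chunkSize, int overlap)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));
        if (overlap < 0 || overlap >= chunkSize) throw new ArgumentOutOfRangeException(nameof(overlap));

        var chunks = new List<string[]>();
        int step = chunkSize - overlap;
        for (int start = 0; start < tokens.Count; start += step)
        {
            int length = Math.Min(chunkSize, tokens.Count - start);
            chunks.Add(tokens.Skip(start).Take(length).ToArray());
            if (start + length >= tokens.Count) break; // this window reached the end
        }
        return chunks;
    }
}
```

With 1,000 tokens, a chunk size of 400 and an overlap of 50, this sketch yields three chunks starting at tokens 0, 350 and 700.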

Embedding

Chunks are embedded via IEmbeddingProvider.EmbedBatchAsync(), in batches of 50–100 chunks per API call.
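
A rough sketch of the batching loop follows. The signature of EmbedBatchAsync is not shown in this document, so the embedBatch delegate below is an assumed stand-in for it, and EmbeddingBatchSketch and the default batch size of 64 are illustrative choices only.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class EmbeddingBatchSketch
{
    // embedBatch stands in for IEmbeddingProvider.EmbedBatchAsync; its real
    // signature is not documented here, so this delegate shape is an assumption.
    public static async Task<List<float[]>> EmbedAllAsync(
        IReadOnlyList<string> chunks,
        Func<IReadOnlyList<string>, CancellationToken, Task<IReadOnlyList<float[]>>> embedBatch,
        int batchSize = 64,
        CancellationToken ct = default)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var vectors = new List<float[]>(chunks.Count);
        for (int offset = 0; offset < chunks.Count; offset += batchSize)
        {
            ct.ThrowIfCancellationRequested();

            // Copy the next slice of chunks into one request-sized batch.
            int count = Math.Min(batchSize, chunks.Count - offset);
            var batch = new List<string>(count);
            for (int i = 0; i < count; i++) batch.Add(chunks[offset + i]);

            var result = await embedBatch(batch, ct);
            if (result.Count != batch.Count)
                throw new InvalidOperationException("Provider returned a different number of vectors than inputs.");

            vectors.AddRange(result); // keeps vectors in the same order as the chunks
        }
        return vectors;
    }
}
```

Checking that each call returns one vector per input keeps chunks and embeddings aligned by index, which the later store step relies on.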

Vector Store Write

Each (chunk, embedding, metadata) tuple is written to the agent's vector collection via ISemanticMemoryStore.StoreAsync().
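
To make the tuple concrete, here is one way the records could be assembled before they are handed to the store. This is a sketch under assumptions: ChunkRecord, RecordBuilderSketch and the metadata keys (documentId, chunkIndex, category) are hypothetical, and the type that StoreAsync actually accepts is not shown in this document.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical record shape; the real StoreAsync parameter type may differ.
public sealed record ChunkRecord(
    string Text,
    float[] Embedding,
    IReadOnlyDictionary<string, string> Metadata);

public static class RecordBuilderSketch
{
    // Pairs chunk i with embedding i and attaches per-chunk metadata.
    public static List<ChunkRecord> Build(
        IReadOnlyList<string> chunks,
        IReadOnlyList<float[]> embeddings,
        string documentId,
        string? category)
    {
        if (chunks.Count != embeddings.Count)
            throw new ArgumentException("Each chunk needs exactly one embedding.");

        var records = new List<ChunkRecord>(chunks.Count);
        for (int i = 0; i < chunks.Count; i++)
        {
            var metadata = new Dictionary<string, string>
            {
                ["documentId"] = documentId, // lets a single document be removed later
                ["chunkIndex"] = i.ToString(CultureInfo.InvariantCulture),
            };
            if (category is not null) metadata["category"] = category;

            records.Add(new ChunkRecord(chunks[i], embeddings[i], metadata));
        }
        return records;
    }
}
```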

IDocumentIngester

public interface IDocumentIngester
{
    Task<IngestionResult> IngestAsync(
        IngestionRequest request,
        CancellationToken ct = default);
}

public class IngestionRequest
{
    public Guid AgentId { get; set; }
    public string TenantId { get; set; } = string.Empty;  // non-nullable, so initialize it
    public IngestionSource Source { get; set; }  // File, Url, Text
    public string? FileName { get; set; }        // File sources
    public Stream? FileContent { get; set; }     // File sources
    public string? Url { get; set; }             // Url sources
    public string? TextContent { get; set; }     // Text sources
    public string? Category { get; set; }        // for metadata filtering
    public ChunkingConfig? ChunkingOverride { get; set; }  // per-request chunking settings
}
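
Because the request carries three optional payloads, it can help to check that the one matching Source is actually populated before ingestion starts. The validator below is a sketch, not part of the documented API: the rules and the IngestionRequestValidator name are assumptions, the enum values are taken from the comment on Source, and the request class is a trimmed copy so the example compiles on its own.

```csharp
using System;
using System.IO;

public enum IngestionSource { File, Url, Text }

// Trimmed copy of the request class, included only to keep the sketch self-contained.
public class IngestionRequest
{
    public Guid AgentId { get; set; }
    public string TenantId { get; set; } = string.Empty;
    public IngestionSource Source { get; set; }
    public string? FileName { get; set; }
    public Stream? FileContent { get; set; }
    public string? Url { get; set; }
    public string? TextContent { get; set; }
}

public static class IngestionRequestValidator
{
    // Returns null when the request is consistent, otherwise a short error message.
    public static string? Validate(IngestionRequest r)
    {
        if (r.AgentId == Guid.Empty) return "AgentId is required.";
        if (string.IsNullOrWhiteSpace(r.TenantId)) return "TenantId is required.";

        return r.Source switch
        {
            IngestionSource.File when r.FileContent is null || string.IsNullOrWhiteSpace(r.FileName)
                => "File sources need FileName and FileContent.",
            IngestionSource.Url when !Uri.TryCreate(r.Url, UriKind.Absolute, out _)
                => "Url sources need an absolute Url.",
            IngestionSource.Text when string.IsNullOrWhiteSpace(r.TextContent)
                => "Text sources need TextContent.",
            _ => null,
        };
    }
}
```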

Chunking Configuration

public class ChunkingConfig
{
    public ChunkingStrategy Strategy { get; set; } = ChunkingStrategy.FixedSize;
    public int ChunkSize { get; set; } = 400;      // target tokens per chunk
    public int ChunkOverlap { get; set; } = 50;    // overlap between consecutive chunks
    public bool SplitOnParagraphs { get; set; } = true;   // prefer paragraph boundaries
    public bool SplitOnSentences { get; set; } = true;    // then sentence boundaries
}

// Available strategies:
// FixedSize: chunk by token count, split at paragraph/sentence boundaries
// ParagraphChunker: each paragraph is one chunk (variable size)
// SentenceChunker: groups of N sentences per chunk
// SemanticChunker: similarity-based chunking (groups semantically similar sentences)
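
As an illustration of the "groups of N sentences per chunk" idea, the sketch below groups sentences with a deliberately naive splitter. It is not the SentenceChunker implementation; SentenceGroupingSketch and its regular expression are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SentenceGroupingSketch
{
    // Naive splitter: breaks on whitespace that follows '.', '!' or '?'.
    // It will mis-split abbreviations such as "e.g. this".
    private static readonly Regex SentenceBreak = new Regex(@"(?<=[.!?])\s+");

    // Joins every run of sentencesPerChunk consecutive sentences into one chunk.
    public static List<string> Chunk(string text, int sentencesPerChunk)
    {
        if (sentencesPerChunk <= 0) throw new ArgumentOutOfRangeException(nameof(sentencesPerChunk));

        string[] sentences = SentenceBreak.Split(text.Trim());
        var chunks = new List<string>();
        for (int i = 0; i < sentences.Length; i += sentencesPerChunk)
        {
            int count = Math.Min(sentencesPerChunk, sentences.Length - i);
            string chunk = string.Join(" ", sentences, i, count);
            if (chunk.Length > 0) chunks.Add(chunk); // skip the empty result for empty input
        }
        return chunks;
    }
}
```

Unlike the fixed-size approach, chunk length here varies with sentence length, so very long sentences can produce chunks well above the usual token target.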

Re-Indexing

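Re-indexing is required when a document changes or when you switch embedding models. Vectors produced by different models are not comparable and may differ in dimension, so a model change means rebuilding the whole collection: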

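// Re-index a specific document: delete it first, then ingest it again.
// (DeleteDocumentAsync is not shown in the IDocumentIngester excerpt above.)
await _ingester.DeleteDocumentAsync(agentId, tenantId, documentId);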
await _ingester.IngestAsync(new IngestionRequest { AgentId = agentId, ... });

// Full knowledge base re-index (after model change)
var collectionName = $"agent_{agentId}_{tenantId}";
await _semanticStore.DropCollectionAsync(collectionName);
await _semanticStore.CreateCollectionAsync(collectionName, newVectorSize);
// Then re-ingest all documents
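
The full re-index can be sketched as a single routine. The delegates below are stand-ins for DropCollectionAsync, CreateCollectionAsync and a per-document ingest call; their shapes, the FullReindexSketch name, and the idea of tracking documents by a string ID are assumptions for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class FullReindexSketch
{
    // Drops and recreates the collection with the new vector size, then
    // re-ingests every known document. Returns the number of documents ingested.
    public static async Task<int> RunAsync(
        string collectionName,
        int newVectorSize,
        IReadOnlyList<string> documentIds,
        Func<string, Task> dropCollection,                    // stand-in for DropCollectionAsync
        Func<string, int, Task> createCollection,             // stand-in for CreateCollectionAsync
        Func<string, CancellationToken, Task> ingestDocument, // stand-in for building a request and calling IngestAsync
        CancellationToken ct = default)
    {
        await dropCollection(collectionName);
        await createCollection(collectionName, newVectorSize);

        int ingested = 0;
        foreach (var documentId in documentIds)
        {
            ct.ThrowIfCancellationRequested();
            await ingestDocument(documentId, ct);
            ingested++;
        }
        return ingested;
    }
}
```

In this sketch the collection is empty between the drop and the end of re-ingestion, so queries against it during that window would find nothing.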
