Ingestion Pipeline

1. Document Upload

The user uploads a file (PDF, TXT, or MD) or provides a URL, and IDocumentIngester.IngestAsync() is called with the resulting request.

2. Text Extraction

Text is extracted according to the source format: PDF → PDFium text extraction; HTML URL → text rendered by a Playwright headless browser; MD → plain text.

3. Chunking

The configured IChunker splits the text into overlapping chunks. Each chunk is 200–500 tokens, with a 50-token overlap between consecutive chunks.

4. Embedding

Each chunk is embedded via IEmbeddingProvider.EmbedBatchAsync(), with 50–100 chunks sent per API call.

5. Vector Store Write

Each (chunk, embedding, metadata) tuple is written to the agent's vector collection via ISemanticMemoryStore.StoreAsync().
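
The batching in step 4 can be sketched roughly as follows. This is a minimal illustration, not the portal's implementation: the signature of IEmbeddingProvider.EmbedBatchAsync() is not shown in this document, so the `embedBatch` delegate, the `EmbeddingBatcher` class, and the `EmbedAllAsync` name are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: groups chunks into batches before calling the embedder.
public static class EmbeddingBatcher
{
    // embedBatch stands in for IEmbeddingProvider.EmbedBatchAsync(); the delegate
    // shape is assumed because the real signature is not documented here.
    public static async Task<List<(string Chunk, float[] Embedding)>> EmbedAllAsync(
        IReadOnlyList<string> chunks,
        Func<IReadOnlyList<string>, CancellationToken, Task<IReadOnlyList<float[]>>> embedBatch,
        int batchSize = 100,
        CancellationToken ct = default)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var results = new List<(string Chunk, float[] Embedding)>(chunks.Count);

        // Enumerable.Chunk (.NET 6+) yields arrays of at most batchSize items.
        foreach (string[] batch in chunks.Chunk(batchSize))
        {
            ct.ThrowIfCancellationRequested();

            IReadOnlyList<float[]> vectors = await embedBatch(batch, ct);
            if (vectors.Count != batch.Length)
                throw new InvalidOperationException("Embedding count does not match batch size.");

            // Pair each chunk with its embedding, preserving the original order.
            for (int i = 0; i < batch.Length; i++)
                results.Add((batch[i], vectors[i]));
        }

        return results;
    }
}
```

With 230 chunks and a batch size of 100, this makes three embedding calls (100, 100, and 30 chunks). The resulting pairs would then be combined with metadata and written in step 5.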

IDocumentIngester

using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public interface IDocumentIngester
{
    // Entry point of the ingestion pipeline described above.
    Task<IngestionResult> IngestAsync(
        IngestionRequest request,
        CancellationToken ct = default);
}

public class IngestionRequest
{
    public Guid AgentId { get; set; }
    public string TenantId { get; set; } = string.Empty;  // initialized to avoid a nullable-reference warning
    public IngestionSource Source { get; set; }  // File, Url, Text
    public string? FileName { get; set; }
    public Stream? FileContent { get; set; }
    public string? Url { get; set; }
    public string? TextContent { get; set; }
    public string? Category { get; set; }        // for metadata filtering
    public ChunkingConfig? ChunkingOverride { get; set; }  // per-request override of the chunking settings
}
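
Because IngestionRequest carries optional fields for all three source kinds, it can help to check that the field matching Source is populated before calling IngestAsync(). The sketch below is one way to do that. The `IngestionRequestValidator` helper is hypothetical and not part of the documented API, the rule that each source kind requires its matching field is an assumption, and the enum and request class are trimmed copies included only so the example compiles on its own.

```csharp
using System;
using System.IO;

// Enum values taken from the comment on IngestionRequest.Source.
public enum IngestionSource { File, Url, Text }

// Trimmed copy of IngestionRequest so the sketch is self-contained.
public class IngestionRequest
{
    public Guid AgentId { get; set; }
    public string TenantId { get; set; } = string.Empty;
    public IngestionSource Source { get; set; }
    public string? FileName { get; set; }
    public Stream? FileContent { get; set; }
    public string? Url { get; set; }
    public string? TextContent { get; set; }
}

// Hypothetical helper, not part of the documented API.
public static class IngestionRequestValidator
{
    // Returns null when the request looks consistent, otherwise a description of the problem.
    public static string? Validate(IngestionRequest r)
    {
        if (string.IsNullOrWhiteSpace(r.TenantId))
            return "TenantId is required.";

        return r.Source switch
        {
            IngestionSource.File when r.FileContent is null
                => "FileContent is required for File sources.",
            IngestionSource.Url when !Uri.TryCreate(r.Url, UriKind.Absolute, out _)
                => "Url must be an absolute URI for Url sources.",
            IngestionSource.Text when string.IsNullOrEmpty(r.TextContent)
                => "TextContent is required for Text sources.",
            _ => null
        };
    }
}
```

A request with Source set to Url and Url set to an absolute address passes this check; the same request with Url left null is rejected before any extraction work starts.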

Chunking Configuration

public class ChunkingConfig
{
    public ChunkingStrategy Strategy { get; set; } = ChunkingStrategy.FixedSize;
    public int ChunkSize { get; set; } = 400;      // target tokens per chunk
    public int ChunkOverlap { get; set; } = 50;    // overlap between consecutive chunks
    public bool SplitOnParagraphs { get; set; } = true;   // prefer paragraph boundaries
    public bool SplitOnSentences { get; set; } = true;    // then sentence boundaries
}

// Available strategies:
// FixedSize: chunk by token count, preferring paragraph and sentence boundaries when splitting
// ParagraphChunker: one chunk per paragraph (variable size)
// SentenceChunker: groups of N sentences per chunk
// SemanticChunker: groups semantically similar sentences into a chunk
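
To make the interaction between ChunkSize and ChunkOverlap concrete, here is a minimal sliding-window sketch of the FixedSize idea. It is an illustration under simplifying assumptions, not the portal's chunker: tokens are approximated by whitespace-separated words instead of a real tokenizer, paragraph and sentence boundaries are ignored, and the `FixedSizeChunkerSketch` class is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of fixed-size chunking with overlap.
public static class FixedSizeChunkerSketch
{
    // Tokens are whitespace-separated words here, a stand-in for a real tokenizer.
    public static List<string> Split(string text, int chunkSize = 400, int chunkOverlap = 50)
    {
        if (chunkSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(chunkSize));
        if (chunkOverlap < 0 || chunkOverlap >= chunkSize)
            throw new ArgumentOutOfRangeException(nameof(chunkOverlap), "Overlap must be in [0, chunkSize).");

        string[] tokens = text.Split(
            new[] { ' ', '\t', '\r', '\n' },
            StringSplitOptions.RemoveEmptyEntries);

        var chunks = new List<string>();
        int step = chunkSize - chunkOverlap;   // how far the window advances each time

        for (int start = 0; start < tokens.Length; start += step)
        {
            int length = Math.Min(chunkSize, tokens.Length - start);
            chunks.Add(string.Join(" ", tokens, start, length));

            // Stop once a window has reached the end, so no trailing
            // chunk is emitted that contains only overlap.
            if (start + length >= tokens.Length) break;
        }

        return chunks;
    }
}
```

With the defaults, each window advances by 350 tokens (400 minus 50), so the last 50 tokens of one chunk reappear at the start of the next. The overlap must stay below the chunk size, otherwise the window would never advance.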

Re-Indexing

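Re-indexing is required when a document's content changes or when you switch embedding models. Vectors produced by different models are not comparable and may differ in dimension, so a model change means rebuilding the whole collection: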

// Re-index a specific document
await _ingester.DeleteDocumentAsync(agentId, tenantId, documentId);
await _ingester.IngestAsync(new IngestionRequest { AgentId = agentId, ... });

// Full knowledge base re-index (after model change)
var collection = $"agent_{agentId}_{tenantId}";
await _semanticStore.DropCollectionAsync(collection);
await _semanticStore.CreateCollectionAsync(collection, newVectorSize);
// Then re-ingest all documents
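
The final "re-ingest all documents" step could be driven by a loop along these lines. This is a sketch under stated assumptions: `ReindexSketch`, `recreateCollection`, and `ingestDocument` are hypothetical names, with the two delegates standing in for the DropCollectionAsync/CreateCollectionAsync pair and for a call to IngestAsync. How document IDs are enumerated is not covered by this document, so they are passed in.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical orchestration of a full re-index.
public static class ReindexSketch
{
    // Returns the IDs of documents that failed to re-ingest.
    public static async Task<List<string>> ReindexAllAsync(
        IEnumerable<string> documentIds,
        Func<CancellationToken, Task> recreateCollection,
        Func<string, CancellationToken, Task> ingestDocument,
        CancellationToken ct = default)
    {
        // Recreate the collection first so no vectors from the old model remain.
        await recreateCollection(ct);

        var failed = new List<string>();
        foreach (string id in documentIds)
        {
            ct.ThrowIfCancellationRequested();
            try
            {
                await ingestDocument(id, ct);
            }
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                // Keep going and report the failure to the caller.
                failed.Add(id);
            }
        }

        return failed;
    }
}
```

Collecting failures instead of aborting on the first error is a design choice for this example: after the collection has been dropped, a partial index is usually more useful than none, and the returned IDs can be retried.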