Knowledge Indexing
Knowledge indexing is the ingestion pipeline: document → extract text → chunk → embed → store in vector DB. The knowledge-app provides a UI for this process; the API allows programmatic ingestion.
Ingestion Pipeline
Document Upload
The user uploads a file (PDF, TXT, MD), provides a URL, or submits raw text. The request is then passed to IDocumentIngester.IngestAsync().
Text Extraction
Text is extracted according to the source format: PDF → PDFium text extraction; HTML URL → page text rendered by a Playwright headless browser; MD → plain text.
Chunking
Text is split into overlapping chunks by the configured IChunker. Chunks are typically 200–500 tokens long, and consecutive chunks share a 50-token overlap by default (see ChunkingConfig).
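The overlap behaviour can be sketched as a sliding window. This is a minimal illustration, not the shipped IChunker: the class name OverlapChunkerSketch is hypothetical, and tokens are approximated by whitespace-separated words, whereas a real chunker would presumably use the embedding model's tokenizer and prefer paragraph or sentence boundaries.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the documented API.
public static class OverlapChunkerSketch
{
    // Splits text into windows of up to `chunkSize` tokens; consecutive
    // windows share `overlap` tokens. Tokens are approximated by
    // whitespace-separated words (an assumption made for this sketch).
    public static List<string> Chunk(string text, int chunkSize = 400, int overlap = 50)
    {
        if (chunkSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(chunkSize));
        if (overlap < 0 || overlap >= chunkSize)
            throw new ArgumentOutOfRangeException(nameof(overlap));

        string[] tokens = text.Split(
            new[] { ' ', '\t', '\r', '\n' },
            StringSplitOptions.RemoveEmptyEntries);

        var chunks = new List<string>();
        int stride = chunkSize - overlap; // how far the window advances each step

        for (int start = 0; start < tokens.Length; start += stride)
        {
            int length = Math.Min(chunkSize, tokens.Length - start);
            chunks.Add(string.Join(" ", tokens, start, length));

            // Stop once the window has reached the end of the text, so no
            // trailing chunk made up only of overlap tokens is emitted.
            if (start + length >= tokens.Length) break;
        }
        return chunks;
    }
}
```

With chunkSize 4 and overlap 1, a ten-word input yields three chunks, each starting with the last word of the previous one.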
Embedding
Each chunk is embedded via IEmbeddingProvider.EmbedBatchAsync(). Chunks are sent in batches of 50–100 per API call.
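Grouping chunks into batches before the embedding call might look like the sketch below. The helper name EmbeddingBatcherSketch and the default batch size of 100 are assumptions chosen for illustration. The signature of EmbedBatchAsync() is not shown in this section, so the sketch stops at producing the batches.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the documented API.
public static class EmbeddingBatcherSketch
{
    // Groups chunks into consecutive batches of at most `batchSize` items,
    // preserving order. The last batch may be smaller.
    public static List<List<string>> ToBatches(IReadOnlyList<string> chunks, int batchSize = 100)
    {
        if (batchSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(batchSize));

        var batches = new List<List<string>>();
        for (int i = 0; i < chunks.Count; i += batchSize)
        {
            int count = Math.Min(batchSize, chunks.Count - i);
            var batch = new List<string>(count);
            for (int j = 0; j < count; j++)
            {
                batch.Add(chunks[i + j]);
            }
            batches.Add(batch);
        }
        return batches;
    }
}
```

Each batch would then be passed to the embedding provider in one call. Keeping order intact makes it straightforward to pair each returned vector with its source chunk.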
Vector Store Write
Each (chunk, embedding, metadata) tuple is written to the agent's vector collection via ISemanticMemoryStore.StoreAsync().
IDocumentIngester
public interface IDocumentIngester
{
    Task<IngestionResult> IngestAsync(
        IngestionRequest request,
        CancellationToken ct = default);
}

public class IngestionRequest
{
    public Guid AgentId { get; set; }
    public string TenantId { get; set; }
    public IngestionSource Source { get; set; }   // File, Url, Text
    public string? FileName { get; set; }
    public Stream? FileContent { get; set; }
    public string? Url { get; set; }
    public string? TextContent { get; set; }
    public string? Category { get; set; }         // for metadata filtering
    public ChunkingConfig? ChunkingOverride { get; set; }
}
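Because IngestionRequest carries optional fields for all three source kinds, a caller may want to check that the populated fields match Source before calling IngestAsync(). The sketch below is one possible pre-flight check. The validator class is hypothetical, the request type is a trimmed copy holding only the fields the check needs, and the IngestionSource members are taken from the comment above (File, Url, Text).

```csharp
#nullable enable
using System;
using System.IO;

public enum IngestionSource { File, Url, Text }

// Trimmed copy of the request, reduced to the fields this check reads.
public class IngestionRequest
{
    public IngestionSource Source { get; set; }
    public string? FileName { get; set; }
    public Stream? FileContent { get; set; }
    public string? Url { get; set; }
    public string? TextContent { get; set; }
}

// Hypothetical helper, not part of the documented API.
public static class IngestionRequestValidator
{
    // Returns null when the request is consistent with its Source,
    // otherwise a short description of what is missing.
    public static string? Validate(IngestionRequest r) => r.Source switch
    {
        IngestionSource.File when r.FileContent is null || string.IsNullOrWhiteSpace(r.FileName)
            => "File source requires FileName and FileContent.",
        IngestionSource.Url when !Uri.TryCreate(r.Url, UriKind.Absolute, out _)
            => "Url source requires an absolute URL.",
        IngestionSource.Text when string.IsNullOrWhiteSpace(r.TextContent)
            => "Text source requires TextContent.",
        _ => null
    };
}
```

Whether the real ingester performs an equivalent check internally is not stated here. Validating on the caller's side simply surfaces mistakes before any extraction or embedding work starts.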
Chunking Configuration
public class ChunkingConfig
{
    public ChunkingStrategy Strategy { get; set; } = ChunkingStrategy.FixedSize;
    public int ChunkSize { get; set; } = 400;             // target tokens per chunk
    public int ChunkOverlap { get; set; } = 50;           // overlap between consecutive chunks
    public bool SplitOnParagraphs { get; set; } = true;   // prefer paragraph boundaries
    public bool SplitOnSentences { get; set; } = true;    // then sentence boundaries
}

// Available strategies:
//   FixedSize:        chunk by token count, split at paragraph/sentence boundaries
//   ParagraphChunker: each paragraph is one chunk (variable size)
//   SentenceChunker:  groups of N sentences per chunk
//   SemanticChunker:  similarity-based chunking (groups semantically similar sentences)
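To get a feel for how ChunkSize and ChunkOverlap interact, the chunk count for a fixed-size strategy can be estimated from the stride (ChunkSize minus ChunkOverlap). The helper below is a hypothetical back-of-the-envelope estimator. It assumes exact token windows and ignores the paragraph and sentence boundary preferences, so real counts may differ slightly.

```csharp
using System;

// Hypothetical helper, not part of the documented API.
public static class ChunkEstimator
{
    // Estimates how many fixed-size chunks a document of `totalTokens`
    // tokens produces when each window advances by (chunkSize - chunkOverlap).
    public static int EstimateChunkCount(int totalTokens, int chunkSize = 400, int chunkOverlap = 50)
    {
        if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize)
            throw new ArgumentException("Require chunkSize > 0 and 0 <= chunkOverlap < chunkSize.");

        if (totalTokens <= 0) return 0;
        if (totalTokens <= chunkSize) return 1;

        int stride = chunkSize - chunkOverlap;
        // One full first chunk, plus one chunk per stride needed to cover
        // the remaining tokens (integer ceiling division).
        return 1 + (totalTokens - chunkSize + stride - 1) / stride;
    }
}
```

With the defaults (400-token chunks, 50-token overlap), the stride is 350, so a 10,000-token document comes to 29 chunks, which fits in a single embedding batch of 50–100.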
Re-Indexing
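Re-indexing is required when documents change or when you switch embedding models. Vectors produced by different models are not comparable and may have a different dimension, so a model change means rebuilding the whole collection: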
// Re-index a specific document
await _ingester.DeleteDocumentAsync(agentId, tenantId, documentId);
await _ingester.IngestAsync(new IngestionRequest { AgentId = agentId, ... });

// Full knowledge base re-index (after model change)
await _semanticStore.DropCollectionAsync($"agent_{agentId}_{tenantId}");
await _semanticStore.CreateCollectionAsync($"agent_{agentId}_{tenantId}", newVectorSize);
// Then re-ingest all documents