Document Ingestion — RAG

IDocumentIngester Interface

public interface IDocumentIngester
{
    // Returns true if this ingester handles the given MIME type / extension
    bool CanHandle(string fileExtension, string mimeType);

    // Extract text content from the uploaded file stream
    Task<DocumentContent> IngestAsync(
        Stream fileStream,
        DocumentMetadata metadata,
        CancellationToken ct);
}

public class DocumentContent
{
    public string            RawText   { get; init; } = string.Empty;
    public List<TextSection> Sections  { get; init; } = new();  // Optional: heading-aware sections
    public DocumentMetadata  Metadata  { get; init; } = new();
}

public class DocumentMetadata
{
    public string Source     { get; set; } = string.Empty;  // Filename
    public string Category   { get; set; } = string.Empty;  // User-assigned tag
    public Guid   AgentId    { get; set; }
    public Guid   TenantId   { get; set; }
    public int    PageCount  { get; set; }
}
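
The `CanHandle` contract implies that the host picks an ingester per upload by asking each registered implementation in turn. The sketch below shows one way that selection could work. `IngesterSelector` and `ExtensionIngester` are hypothetical names, and the interface is redeclared with only `CanHandle` so the sample compiles on its own; the real dispatch logic is not shown in this document.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Trimmed stand-in for the interface above: IngestAsync is omitted because
// only CanHandle matters for dispatch.
public interface IDocumentIngester
{
    bool CanHandle(string fileExtension, string mimeType);
}

// Hypothetical ingester that matches on one extension or one MIME type.
public sealed class ExtensionIngester : IDocumentIngester
{
    private readonly string _ext;
    private readonly string _mime;

    public ExtensionIngester(string ext, string mime) { _ext = ext; _mime = mime; }

    public bool CanHandle(string fileExtension, string mimeType)
        => string.Equals(fileExtension, _ext, StringComparison.OrdinalIgnoreCase)
        || string.Equals(mimeType, _mime, StringComparison.OrdinalIgnoreCase);
}

public static class IngesterSelector
{
    // Returns the first ingester that claims the file, or null when none does.
    // A null result is the case a caller would presumably report as
    // "Unsupported file format".
    public static IDocumentIngester? Select(
        IEnumerable<IDocumentIngester> ingesters, string fileName, string mimeType)
    {
        var ext = Path.GetExtension(fileName);
        return ingesters.FirstOrDefault(i => i.CanHandle(ext, mimeType));
    }
}
```

Because the first match wins in this sketch, registration order would decide which ingester handles a format that two implementations both claim.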

Upload API

# POST /api/octopus/knowledge/{agentId}/documents
# Content-Type: multipart/form-data
# {agentId} and {token} are placeholders; substitute real values before running.
curl -X POST https://api.bizfirstai.com/api/octopus/knowledge/{agentId}/documents \
  -H "Authorization: Bearer {token}" \
  -F "file=@HR_Policy_2025.pdf" \
  -F "category=Leave" \
  -F "waitForIndexing=true"

// Response (synchronous if waitForIndexing=true)
{
  "documentId": "doc_7f3a8b2c...",
  "fileName":   "HR_Policy_2025.pdf",
  "chunkCount": 42,
  "tokenCount": 18320,
  "status":     "Indexed",
  "indexedAt":  "2025-03-01T08:03:22Z"
}
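
For callers working in C# rather than curl, the same upload can be sketched with `HttpClient`. The form field names (`file`, `category`, `waitForIndexing`) and the URL come from the curl call above. `KnowledgeUpload`, `BuildForm`, and `UploadAsync` are hypothetical helper names, and the sketch returns the raw JSON body instead of assuming a response type.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading;
using System.Threading.Tasks;

public static class KnowledgeUpload
{
    // Builds the same three multipart fields that the curl example sends.
    public static MultipartFormDataContent BuildForm(
        Stream file, string fileName, string category, bool waitForIndexing)
    {
        var form = new MultipartFormDataContent();
        form.Add(new StreamContent(file), "file", fileName);
        form.Add(new StringContent(category), "category");
        form.Add(new StringContent(waitForIndexing ? "true" : "false"), "waitForIndexing");
        return form;
    }

    // Posts the form with a bearer token and returns the response body as text.
    // Throws HttpRequestException on any non-success status code.
    public static async Task<string> UploadAsync(
        HttpClient http, Guid agentId, string token,
        MultipartFormDataContent form, CancellationToken ct = default)
    {
        using var request = new HttpRequestMessage(
            HttpMethod.Post,
            $"https://api.bizfirstai.com/api/octopus/knowledge/{agentId}/documents")
        {
            Content = form
        };
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", token);

        using var response = await http.SendAsync(request, ct);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync(ct);
    }
}
```

A caller would open the file with `File.OpenRead`, pass the stream to `BuildForm`, and hand the result to `UploadAsync`.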

Built-In Ingesters

Format	Class	Library	Notes
PDF	`PdfDocumentIngester`	PdfPig (NuGet)	Text-layer extraction; scanned PDFs require OCR pre-processing
DOCX	`DocxDocumentIngester`	DocumentFormat.OpenXml	Paragraphs, tables, and headers extracted; images skipped
TXT	`PlainTextIngester`	BCL	UTF-8 text; no transformation
Markdown	`MarkdownIngester`	Markdig	Stripped to plain text; headings converted to section boundaries
HTML	`HtmlIngester`	HtmlAgilityPack	Body text extracted; `script`, `style`, and `nav` elements removed

Custom Ingesters

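// 1. Implement IDocumentIngester (this example reads .xlsx files with the ClosedXML NuGet package)
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using ClosedXML.Excel;

public class ExcelIngester : IDocumentIngester
{
    // Match case-insensitively so ".XLSX" uploads are also accepted
    public bool CanHandle(string ext, string mime)
        => string.Equals(ext, ".xlsx", StringComparison.OrdinalIgnoreCase)
        || string.Equals(mime, "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", StringComparison.OrdinalIgnoreCase);

    public Task<DocumentContent> IngestAsync(Stream stream, DocumentMetadata meta, CancellationToken ct)
    {
        using var wb = new XLWorkbook(stream);
        var sb = new StringBuilder();
        foreach (var ws in wb.Worksheets)
        {
            foreach (var row in ws.RowsUsed())
            {
                ct.ThrowIfCancellationRequested();  // Honor cancellation on large workbooks
                // One line per row, cells tab-separated; CellsUsed() skips empty cells
                sb.AppendLine(string.Join("\t", row.CellsUsed().Select(c => c.Value.ToString())));
            }
        }

        return Task.FromResult(new DocumentContent { RawText = sb.ToString(), Metadata = meta });
    }
}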

// 2. Register with DI
services.AddSingleton<IDocumentIngester, ExcelIngester>();

Ingestion Error Handling

Error	HTTP Status	Resolution
Unsupported file format	415 Unsupported Media Type	Register a custom ingester or convert to a supported format
Empty document (no text extracted)	422 Unprocessable Entity	Check whether the PDF is a scanned image; if so, pre-process it with OCR
File too large (>50 MB default)	413 Request Entity Too Large	Increase `MaxUploadSizeMb` in config or split the document
Embedding API error during indexing	500 Internal Server Error (async failure)	Document is saved in a pending state; retry via the PATCH endpoint
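
On the client side, the table above can be turned into a small lookup that maps a failed upload's status code to the suggested resolution. This is only an illustrative sketch: `UploadErrorAdvice` is a hypothetical helper, and the messages paraphrase the table without adding any behavior of the API.

```csharp
using System.Net;

public static class UploadErrorAdvice
{
    // Maps the status codes listed in the error table to their resolutions.
    public static string For(HttpStatusCode status) => status switch
    {
        HttpStatusCode.UnsupportedMediaType =>   // 415
            "Register a custom ingester or convert the file to a supported format.",
        HttpStatusCode.UnprocessableEntity =>    // 422
            "No text was extracted; if the PDF is a scanned image, run OCR first.",
        HttpStatusCode.RequestEntityTooLarge =>  // 413
            "Increase MaxUploadSizeMb in config or split the document.",
        HttpStatusCode.InternalServerError =>    // 500
            "Indexing failed; the document is pending and can be retried.",
        _ => "Unexpected status; inspect the response body."
    };
}
```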

Asynchronous Indexing

For large documents, set `waitForIndexing=false` in the upload request. The document is accepted immediately and indexed in the background. Poll `GET /api/octopus/knowledge/{agentId}/documents/{documentId}` to check its status; the document becomes searchable once `status` reaches `Indexed`.
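
The polling step can be sketched as a small loop. The shape of the GET response is not documented here, so the sketch assumes it is JSON with a top-level `status` string, like the upload response. `IndexingPoller`, `WaitForIndexedAsync`, and `HttpStatusGetter` are hypothetical names, and the interval and attempt limit are arbitrary choices for the caller.

```csharp
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

public static class IndexingPoller
{
    // Calls getStatus until it returns "Indexed" or maxAttempts is exhausted.
    // Returns true if the document reached "Indexed", false otherwise.
    public static async Task<bool> WaitForIndexedAsync(
        Func<CancellationToken, Task<string>> getStatus,
        TimeSpan interval,
        int maxAttempts,
        CancellationToken ct = default)
    {
        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            var status = await getStatus(ct);
            if (string.Equals(status, "Indexed", StringComparison.OrdinalIgnoreCase))
                return true;

            // Wait between polls, but not after the final attempt
            if (attempt < maxAttempts)
                await Task.Delay(interval, ct);
        }
        return false;
    }

    // Builds a status getter for the GET endpoint described above.
    // Assumption: the body is JSON with a top-level "status" string.
    public static Func<CancellationToken, Task<string>> HttpStatusGetter(
        HttpClient http, string baseUrl, Guid agentId, string documentId)
    {
        var url = $"{baseUrl}/api/octopus/knowledge/{agentId}/documents/{Uri.EscapeDataString(documentId)}";
        return async ct =>
        {
            using var response = await http.GetAsync(url, ct);
            response.EnsureSuccessStatusCode();
            await using var body = await response.Content.ReadAsStreamAsync(ct);
            using var json = await JsonDocument.ParseAsync(body, cancellationToken: ct);
            return json.RootElement.GetProperty("status").GetString() ?? string.Empty;
        };
    }
}
```

Passing the status lookup as a delegate keeps the loop independent of HTTP, so it can be exercised with a fake getter; in production code the delegate would come from `HttpStatusGetter` with an authenticated `HttpClient`.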
