Octopus
Document Ingestion
Document ingestion is the first stage of the RAG pipeline. It accepts a file upload and extracts clean, structured text for downstream chunking and embedding. The IDocumentIngester abstraction supports PDF, DOCX, TXT, Markdown, and HTML.
IDocumentIngester Interface
public interface IDocumentIngester
{
    // Returns true if this ingester handles the given MIME type / extension
    bool CanHandle(string fileExtension, string mimeType);

    // Extract text content from the uploaded file stream
    Task<DocumentContent> IngestAsync(
        Stream fileStream,
        DocumentMetadata metadata,
        CancellationToken ct);
}

public class DocumentContent
{
    public string RawText { get; init; } = string.Empty;
    public List<TextSection> Sections { get; init; } = new(); // Optional: heading-aware sections
    public DocumentMetadata Metadata { get; init; } = new();
}

public class DocumentMetadata
{
    public string Source { get; set; } = string.Empty;   // Filename
    public string Category { get; set; } = string.Empty; // User-assigned tag
    public Guid AgentId { get; set; }
    public Guid TenantId { get; set; }
    public int PageCount { get; set; }
}
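One plausible way for a pipeline to use CanHandle is to ask each registered ingester in turn and take the first one that claims the file. The sketch below assumes that dispatch rule. `IngesterSelector` and `Utf8TextIngester` are hypothetical names introduced for illustration, and the stand-in types are trimmed copies of the ones above so the block compiles on its own. The actual Octopus pipeline may resolve ingesters differently.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

// Trimmed stand-ins for the types defined above, so the sketch compiles on its own.
public class DocumentMetadata { public string Source { get; set; } = string.Empty; }
public class DocumentContent { public string RawText { get; init; } = string.Empty; }

public interface IDocumentIngester
{
    bool CanHandle(string fileExtension, string mimeType);
    Task<DocumentContent> IngestAsync(Stream fileStream, DocumentMetadata metadata, CancellationToken ct);
}

// Hypothetical example ingester: reads the stream as UTF-8 text.
public sealed class Utf8TextIngester : IDocumentIngester
{
    public bool CanHandle(string fileExtension, string mimeType)
        => fileExtension == ".txt" || mimeType == "text/plain";

    public async Task<DocumentContent> IngestAsync(Stream fileStream, DocumentMetadata metadata, CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();
        // leaveOpen: true so the caller keeps ownership of the upload stream.
        using var reader = new StreamReader(fileStream, Encoding.UTF8, true, 1024, leaveOpen: true);
        var text = await reader.ReadToEndAsync();
        return new DocumentContent { RawText = text };
    }
}

// Hypothetical helper: returns the first ingester that claims the file, or null if none does.
public static class IngesterSelector
{
    public static IDocumentIngester? Select(
        IEnumerable<IDocumentIngester> ingesters, string fileName, string mimeType)
    {
        // Path.GetExtension keeps the leading dot (".pdf"); lower-case it before matching.
        var ext = Path.GetExtension(fileName).ToLowerInvariant();
        return ingesters.FirstOrDefault(i => i.CanHandle(ext, mimeType));
    }
}
```

With this rule, registration order decides which ingester wins when two of them claim the same extension, and a null result corresponds to the unsupported-format case.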
Upload API
// POST /api/octopus/knowledge/{agentId}/documents
// Content-Type: multipart/form-data
curl -X POST https://api.bizfirstai.com/api/octopus/knowledge/{agentId}/documents \
  -H "Authorization: Bearer {token}" \
  -F "file=@HR_Policy_2025.pdf" \
  -F "category=Leave" \
  -F "waitForIndexing=true"

// Response (synchronous if waitForIndexing=true)
{
  "documentId": "doc_7f3a8b2c...",
  "fileName": "HR_Policy_2025.pdf",
  "chunkCount": 42,
  "tokenCount": 18320,
  "status": "Indexed",
  "indexedAt": "2025-03-01T08:03:22Z"
}
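The same upload can be issued from C# with HttpClient. The following is a minimal sketch, not an official client. It reuses the form field names from the curl example (file, category, waitForIndexing). `KnowledgeUploadClient`, `BuildForm`, and `UploadAsync` are hypothetical names, and the code assumes the caller has set `HttpClient.BaseAddress` to the API host.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading;
using System.Threading.Tasks;

public static class KnowledgeUploadClient
{
    // Builds the multipart body with the three form fields used in the curl example.
    public static MultipartFormDataContent BuildForm(
        Stream file, string fileName, string category, bool waitForIndexing)
    {
        var form = new MultipartFormDataContent();
        var filePart = new StreamContent(file);
        filePart.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");
        form.Add(filePart, "file", fileName);
        form.Add(new StringContent(category), "category");
        form.Add(new StringContent(waitForIndexing ? "true" : "false"), "waitForIndexing");
        return form;
    }

    // Posts the form and returns the raw JSON response body.
    // Assumes http.BaseAddress points at the API host, since the request URI is relative.
    public static async Task<string> UploadAsync(
        HttpClient http, Guid agentId, string token,
        Stream file, string fileName, string category, bool waitForIndexing,
        CancellationToken ct = default)
    {
        using var request = new HttpRequestMessage(
            HttpMethod.Post, $"/api/octopus/knowledge/{agentId}/documents");
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", token);
        request.Content = BuildForm(file, fileName, category, waitForIndexing);

        using var response = await http.SendAsync(request, ct);
        response.EnsureSuccessStatusCode(); // throws HttpRequestException on 4xx/5xx
        return await response.Content.ReadAsStringAsync();
    }
}
```

Keeping the form construction separate from the network call lets the request body be inspected in a unit test without contacting the server.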
Built-In Ingesters
| Format | Class | Library | Notes |
|---|---|---|---|
| PDF | PdfDocumentIngester | PdfPig (NuGet) | Text-layer extraction; scanned PDFs require OCR pre-processing |
| DOCX | DocxDocumentIngester | DocumentFormat.OpenXml | Paragraphs, tables, and headers extracted; images skipped |
| TXT | PlainTextIngester | BCL | UTF-8 text; no transformation |
| Markdown | MarkdownIngester | Markdig | Stripped to plain text; headings converted to section boundaries |
| HTML | HtmlIngester | HtmlAgilityPack | Body text extracted; script, style, nav elements removed |
Custom Ingesters
Register a custom ingester to support proprietary formats or any format the built-in ingesters do not cover, such as Excel workbooks:
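// 1. Implement IDocumentIngester (XLWorkbook is the workbook type from the ClosedXML package)
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using ClosedXML.Excel;

public class ExcelIngester : IDocumentIngester
{
    // Match the extension case-insensitively so ".XLSX" is accepted too
    public bool CanHandle(string ext, string mime)
        => string.Equals(ext, ".xlsx", StringComparison.OrdinalIgnoreCase)
        || mime == "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

    public Task<DocumentContent> IngestAsync(Stream stream, DocumentMetadata meta, CancellationToken ct)
    {
        using var wb = new XLWorkbook(stream);
        var sb = new StringBuilder();
        foreach (var ws in wb.Worksheets)
        {
            ct.ThrowIfCancellationRequested(); // honor cancellation between worksheets
            // One line per used row, cell values separated by tabs
            foreach (var row in ws.RowsUsed())
                sb.AppendLine(string.Join("\t", row.CellsUsed().Select(c => c.Value.ToString())));
        }
        return Task.FromResult(new DocumentContent { RawText = sb.ToString(), Metadata = meta });
    }
}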
// 2. Register with DI
services.AddSingleton<IDocumentIngester, ExcelIngester>();
Ingestion Error Handling
| Error | HTTP Status | Resolution |
|---|---|---|
| Unsupported file format | 415 Unsupported Media Type | Register a custom ingester or convert to a supported format |
| Empty document (no text extracted) | 422 Unprocessable Entity | Check whether the PDF is a scanned image; if so, pre-process it with OCR |
| File too large (>50 MB default) | 413 Request Entity Too Large | Increase MaxUploadSizeMb in config or split the document |
| Embedding API error during indexing | 500 (async failure) | Document saved in pending state; retry via PATCH endpoint |
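
On the client side, the statuses in the table above can be translated into actionable messages. The sketch below is one way to do that under the assumption that the upload endpoint returns exactly these status codes. `IngestionErrors` and `Describe` are hypothetical names, not part of the Octopus API.

```csharp
using System;
using System.Net;

public static class IngestionErrors
{
    // Maps an upload response status to the resolution listed in the error table.
    public static string Describe(HttpStatusCode status) => status switch
    {
        HttpStatusCode.UnsupportedMediaType =>   // 415
            "Unsupported format: register a custom ingester or convert the file.",
        HttpStatusCode.UnprocessableEntity =>    // 422
            "No text extracted: check for a scanned PDF and pre-process with OCR.",
        HttpStatusCode.RequestEntityTooLarge =>  // 413
            "File too large: increase MaxUploadSizeMb or split the document.",
        HttpStatusCode.InternalServerError =>    // 500
            "Indexing failed: the document is saved in a pending state and can be retried.",
        _ => $"Unexpected status {(int)status}."
    };
}
```

A caller would typically pass `response.StatusCode` from the upload request and surface the returned message to the user or a log.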
Asynchronous Indexing
For large documents, set waitForIndexing=false in the upload request. The document is accepted immediately and indexed in the background. Poll GET /api/octopus/knowledge/{agentId}/documents/{documentId} to check status. The document becomes searchable when status reaches Indexed.
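
The polling step can be sketched as a small retry loop. This is an illustration under stated assumptions rather than a prescribed client: `IndexingPoller` and `WaitUntilIndexedAsync` are hypothetical names, the status lookup is passed in as a delegate so the loop does not depend on the exact shape of the GET response, and any status other than Indexed is treated as "keep waiting".

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class IndexingPoller
{
    // Calls getStatus until it returns "Indexed", waiting `interval` between attempts.
    // Returns true if the document became searchable within maxAttempts, false otherwise.
    public static async Task<bool> WaitUntilIndexedAsync(
        Func<CancellationToken, Task<string>> getStatus,
        TimeSpan interval, int maxAttempts, CancellationToken ct = default)
    {
        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            var status = await getStatus(ct);
            if (string.Equals(status, "Indexed", StringComparison.OrdinalIgnoreCase))
                return true;

            // Skip the delay after the final attempt; there is nothing left to wait for.
            if (attempt < maxAttempts)
                await Task.Delay(interval, ct);
        }
        return false;
    }
}
```

In practice, `getStatus` would wrap a GET to /api/octopus/knowledge/{agentId}/documents/{documentId} and read the status value, assuming that response carries the same status field as the upload response. Bounding the loop with `maxAttempts` keeps a document that never finishes indexing from blocking the caller indefinitely.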