Octopus
Reranking
Reranking is an optional post-retrieval step that uses a cross-encoder model to score each retrieved chunk directly against the query, producing more precise relevance scores than the bi-encoder embeddings used at retrieval time. It is the highest-quality but also the highest-latency retrieval option.
Bi-Encoder vs. Cross-Encoder
| Property | Bi-Encoder (retrieval) | Cross-Encoder (reranking) |
|---|---|---|
| How it works | Encodes query and document separately; compares vectors | Encodes query + document together; outputs relevance score |
| Speed | Fast: one vector comparison per document against precomputed embeddings | Slow: one full model forward pass per query-document pair, so cost grows linearly with the number of candidates |
| Precision | Good: approximate relevance | Better: full attention over both texts |
| Use in pipeline | First stage: narrow millions of documents to top-K candidates | Second stage: reorder the top-K candidates by relevance to the query |
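To make the "compares vectors" step in the bi-encoder column concrete, here is a minimal sketch of a similarity comparison between a query embedding and a precomputed document embedding. It assumes cosine similarity is the metric; the source does not state which metric the store uses, and `VectorMath` and `CosineSimilarity` are hypothetical names chosen for illustration.

```csharp
using System;

public static class VectorMath
{
    // Cosine similarity between two equal-length embedding vectors.
    // Assumption: the store compares embeddings with cosine similarity.
    public static float CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        // Guard against zero vectors to avoid dividing by zero.
        if (normA == 0 || normB == 0) return 0f;
        return (float)(dot / (Math.Sqrt(normA) * Math.Sqrt(normB)));
    }
}
```

Because the document embedding is computed once at indexing time, only this cheap comparison runs per document at query time. A cross-encoder cannot precompute anything, since its input is the query and the document together.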
IReranker Interface
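using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public interface IReranker
{
    // Scores each candidate against the query and returns the topN best results.
    Task<IReadOnlyList<RankedResult>> RerankAsync(
        string query,
        IReadOnlyList<MemoryRecord> candidates,
        int topN,
        CancellationToken ct);
}

public class RankedResult
{
    public MemoryRecord Record { get; init; } = new();
    public float RerankerScore { get; init; }   // Relevance score assigned by the reranker
    public int OriginalRank { get; init; }      // Position in the retrieval-stage ordering
    public int NewRank { get; init; }           // Position after reranking
}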
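The contract above can be exercised without a real model. The following is a sketch of a hypothetical test double, `TermOverlapReranker`, that scores candidates by the fraction of query terms they contain. It is not a cross-encoder and is not part of Octopus; it only illustrates how `OriginalRank`, `NewRank`, and `topN` might be populated. The `MemoryRecord` stand-in with a `Text` property and the zero-based ranks are assumptions, since the source shows neither.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Stand-in for the real MemoryRecord; the Text property is an assumption.
public class MemoryRecord
{
    public string Text { get; init; } = "";
}

public class RankedResult
{
    public MemoryRecord Record { get; init; } = new();
    public float RerankerScore { get; init; }
    public int OriginalRank { get; init; }
    public int NewRank { get; init; }
}

public interface IReranker
{
    Task<IReadOnlyList<RankedResult>> RerankAsync(
        string query, IReadOnlyList<MemoryRecord> candidates, int topN, CancellationToken ct);
}

// Hypothetical test double: a lexical scorer, not a cross-encoder.
public class TermOverlapReranker : IReranker
{
    public Task<IReadOnlyList<RankedResult>> RerankAsync(
        string query, IReadOnlyList<MemoryRecord> candidates, int topN, CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();

        var terms = query.ToLowerInvariant()
            .Split(' ', StringSplitOptions.RemoveEmptyEntries)
            .Distinct()
            .ToArray();

        IReadOnlyList<RankedResult> ranked = candidates
            .Select((record, index) => (record, index, score: Score(terms, record.Text)))
            .OrderByDescending(x => x.score)
            .ThenBy(x => x.index)            // ties keep the retrieval order
            .Take(topN)
            .Select((x, newRank) => new RankedResult
            {
                Record = x.record,
                RerankerScore = x.score,
                OriginalRank = x.index,      // zero-based (assumption)
                NewRank = newRank            // zero-based (assumption)
            })
            .ToList();

        return Task.FromResult(ranked);
    }

    // Fraction of distinct query terms that appear in the text.
    private static float Score(string[] terms, string text)
    {
        if (terms.Length == 0) return 0f;
        var lower = text.ToLowerInvariant();
        return terms.Count(t => lower.Contains(t)) / (float)terms.Length;
    }
}
```

A double like this may be useful in unit tests for the retrieval pipeline, where calling Cohere or loading an ONNX model would be slow or unavailable.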
Supported Rerankers
| Provider | Model | Notes |
|---|---|---|
| Cohere Rerank | rerank-english-v3.0 | API call; best quality; per-request cost |
| Cohere Rerank | rerank-multilingual-v3.0 | Multi-language support |
| ONNX (local) | cross-encoder/ms-marco-MiniLM-L-6-v2 | On-premise; no API cost; CPU-intensive |
| ONNX (local) | cross-encoder/ms-marco-MiniLM-L-12-v2 | Higher quality local reranker; more CPU |
Reranking in the Retrieval Pipeline
public class RetrievalPipeline
{
    public async Task<IReadOnlyList<MemoryRecord>> RetrieveAndRerankAsync(
        AgentComposite agent,
        string query,
        float[] queryEmbedding,
        CancellationToken ct)
    {
        // Step 1: Retrieve a larger candidate set (TopK * 3 for reranking)
        int candidateK = agent.MemoryConfig.RerankerEnabled
            ? agent.MemoryConfig.SemanticTopK * 3
            : agent.MemoryConfig.SemanticTopK;

        var candidates = await _store.SearchAsync(
            $"agent_{agent.Id:N}", queryEmbedding, candidateK, 0.6f, ct);

        if (!agent.MemoryConfig.RerankerEnabled || !candidates.Any())
            return candidates.Take(agent.MemoryConfig.SemanticTopK).ToList();

        // Step 2: Rerank with cross-encoder
        var ranked = await _reranker.RerankAsync(
            query: query,
            candidates: candidates,
            topN: agent.MemoryConfig.SemanticTopK,
            ct: ct);

        return ranked.Select(r => r.Record).ToList();
    }
}
Configuration
// appsettings.json
{
  "Octopus": {
    "Reranker": {
      "Provider": "Cohere",            // Cohere | ONNX
      "Model": "rerank-english-v3.0",
      "CredentialId": 44,              // API key via ICredentialResolver
      "TopN": 5                        // Final number of results after reranking
    }
  }
}
// Agent memory config
{
  "rerankerEnabled": true
}
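As an illustration of how the `Octopus:Reranker` section could be read into a typed object, here is a minimal sketch using `System.Text.Json`. The `RerankerOptions` class and `RerankerConfigReader` helper are hypothetical names; the source does not show how Octopus actually loads this section.

```csharp
using System.Text.Json;

// Hypothetical options class mirroring the "Octopus:Reranker" section.
public class RerankerOptions
{
    public string Provider { get; set; } = "";
    public string Model { get; set; } = "";
    public int CredentialId { get; set; }
    public int TopN { get; set; }
}

public static class RerankerConfigReader
{
    public static RerankerOptions Parse(string json)
    {
        // The sample file carries // comments, so skip them while parsing.
        var documentOptions = new JsonDocumentOptions
        {
            CommentHandling = JsonCommentHandling.Skip
        };

        using var document = JsonDocument.Parse(json, documentOptions);
        var section = document.RootElement
            .GetProperty("Octopus")
            .GetProperty("Reranker");

        return new RerankerOptions
        {
            Provider = section.GetProperty("Provider").GetString() ?? "",
            Model = section.GetProperty("Model").GetString() ?? "",
            CredentialId = section.GetProperty("CredentialId").GetInt32(),
            TopN = section.GetProperty("TopN").GetInt32()
        };
    }
}
```

In an application that already uses `Microsoft.Extensions.Configuration`, binding the same section through the configuration binder would likely be the more conventional route; the manual parse here just keeps the example free of external packages.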
Reranking Adds Latency
Each reranking call runs one cross-encoder inference pass per candidate, so its cost scales with the candidate count N. For 15 candidates (for example, a SemanticTopK of 5 with the 3x candidate multiplier used in the pipeline), expect roughly +100–200 ms per turn with Cohere Rerank and +200–800 ms with local ONNX reranking, depending on hardware. Enable reranking only if retrieval precision is the bottleneck; for most agents, bi-encoder retrieval alone is sufficient.
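The latency figures above are estimates, so it may be worth measuring the reranking step in your own environment before enabling it. A minimal sketch, assuming a hypothetical `LatencyProbe` helper that is not part of Octopus, wraps any async call in a `Stopwatch`:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical helper for timing an async operation.
public static class LatencyProbe
{
    // Runs the operation and returns its result with the elapsed wall-clock time.
    public static async Task<(T Result, TimeSpan Elapsed)> MeasureAsync<T>(
        Func<Task<T>> operation)
    {
        var stopwatch = Stopwatch.StartNew();
        T result = await operation();
        stopwatch.Stop();
        return (result, stopwatch.Elapsed);
    }
}
```

Inside the pipeline, the reranking call could then be wrapped as `var (ranked, elapsed) = await LatencyProbe.MeasureAsync(() => _reranker.RerankAsync(query, candidates, topN, ct));` and `elapsed` logged per turn to compare against the ranges quoted above.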