Reranking — RAG | BizFirstAI

Bi-Encoder vs. Cross-Encoder

Property	Bi-Encoder (retrieval)	Cross-Encoder (reranking)
How it works	Encodes query and document separately; compares vectors	Encodes query + document together; outputs relevance score
Speed	Fast: document embeddings are precomputed, so query time is one query encoding plus a vector search	Slow: one full model forward pass per query-document pair, so cost grows linearly with the number of candidates
Precision	Good: approximate relevance	Better: full attention over both texts
Use in pipeline	First stage: narrow millions to top-K candidates	Second stage: reorder the top-K candidates by a more precise relevance score
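The difference in cost between the two stages can be sketched as follows. This is an illustration only: ScoringSketch, CosineSimilarity, Rerank, and the scorePair delegate are hypothetical names, and scorePair stands in for whatever cross-encoder model call is actually used.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ScoringSketch
{
    // Bi-encoder stage: document vectors are computed once at index time,
    // so query-time scoring is only a vector comparison.
    public static float CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        float dot = 0f, normA = 0f, normB = 0f;
        for (int i = 0; i < a.Length; i++)
        {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        if (normA == 0f || normB == 0f) return 0f;
        return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
    }

    // Cross-encoder stage: the scorer sees the query and document text together,
    // so it runs once per (query, document) pair and cannot be precomputed.
    // 'scorePair' is a hypothetical stand-in for the model call.
    public static List<(string Doc, float Score)> Rerank(
        string query,
        IEnumerable<string> docs,
        Func<string, string, float> scorePair,
        int topN)
    {
        return docs
            .Select(d => (Doc: d, Score: scorePair(query, d)))
            .OrderByDescending(x => x.Score)
            .Take(topN)
            .ToList();
    }
}
```

The first method touches only two vectors, which is why it scales to millions of documents; the second calls the scorer once per candidate, which is why it is applied only to the short list from the first stage.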

IReranker Interface

public interface IReranker
{
    Task<IReadOnlyList<RankedResult>> RerankAsync(
        string query,
        IReadOnlyList<MemoryRecord> candidates,
        int topN,
        CancellationToken ct);
}

public class RankedResult
{
    public MemoryRecord Record        { get; init; } = new();
    public float        RerankerScore { get; init; }
    public int          OriginalRank  { get; init; }
    public int          NewRank       { get; init; }
}
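As a rough illustration of the contract above, the sketch below implements IReranker with a pluggable scoring function. Several things here are assumptions rather than part of the documented API: the MemoryRecord stand-in and its Text property, the DelegateReranker class, the scorePair delegate, and the choice of 1-based ranks. The interface and RankedResult are repeated only so the block compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Stand-in: the real MemoryRecord is not shown here; 'Text' is an assumed property.
public class MemoryRecord { public string Text { get; init; } = ""; }

public class RankedResult
{
    public MemoryRecord Record        { get; init; } = new();
    public float        RerankerScore { get; init; }
    public int          OriginalRank  { get; init; }
    public int          NewRank       { get; init; }
}

public interface IReranker
{
    Task<IReadOnlyList<RankedResult>> RerankAsync(
        string query, IReadOnlyList<MemoryRecord> candidates, int topN, CancellationToken ct);
}

// Hypothetical implementation: scores each (query, document) pair with a delegate.
public sealed class DelegateReranker : IReranker
{
    private readonly Func<string, string, float> _scorePair;

    public DelegateReranker(Func<string, string, float> scorePair) => _scorePair = scorePair;

    public Task<IReadOnlyList<RankedResult>> RerankAsync(
        string query, IReadOnlyList<MemoryRecord> candidates, int topN, CancellationToken ct)
    {
        var scored = new List<(MemoryRecord Record, float Score, int OriginalRank)>();
        for (int i = 0; i < candidates.Count; i++)
        {
            ct.ThrowIfCancellationRequested();
            // Ranks are 1-based in this sketch (an assumption).
            scored.Add((candidates[i], _scorePair(query, candidates[i].Text), i + 1));
        }

        IReadOnlyList<RankedResult> ranked = scored
            .OrderByDescending(s => s.Score)
            .ThenBy(s => s.OriginalRank)   // ties keep retrieval order
            .Take(topN)
            .Select((s, index) => new RankedResult
            {
                Record        = s.Record,
                RerankerScore = s.Score,
                OriginalRank  = s.OriginalRank,
                NewRank       = index + 1
            })
            .ToList();

        return Task.FromResult(ranked);
    }
}
```

Keeping both OriginalRank and NewRank on each result makes it easy to log how far the reranker moved a record, which helps when judging whether reranking is worth its cost.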

Supported Rerankers

Provider	Model	Notes
Cohere Rerank	rerank-english-v3.0	Hosted API call; best quality of the options listed here; per-request cost
Cohere Rerank	rerank-multilingual-v3.0	Multi-language support
ONNX (local)	cross-encoder/ms-marco-MiniLM-L-6-v2	On-premise; no API cost; CPU-intensive
ONNX (local)	cross-encoder/ms-marco-MiniLM-L-12-v2	Higher quality local reranker; more CPU

Reranking in the Retrieval Pipeline

public class RetrievalPipeline
{
    public async Task<IReadOnlyList<MemoryRecord>> RetrieveAndRerankAsync(
        AgentComposite agent,
        string query,
        float[] queryEmbedding,
        CancellationToken ct)
    {
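        // Step 1: Retrieve candidates. When reranking is enabled, over-fetch
        // (SemanticTopK * 3) so the reranker has more than the final count to choose from.
        int candidateK = agent.MemoryConfig.RerankerEnabled
            ? agent.MemoryConfig.SemanticTopK * 3
            : agent.MemoryConfig.SemanticTopK;

        // 0.6f is the fourth SearchAsync argument (assumed to be a minimum similarity score).
        var candidates = await _store.SearchAsync(
            $"agent_{agent.Id:N}", queryEmbedding, candidateK, 0.6f, ct);

        // Skip reranking when it is disabled or there is nothing to rerank.
        if (!agent.MemoryConfig.RerankerEnabled || !candidates.Any())
            return candidates.Take(agent.MemoryConfig.SemanticTopK).ToList();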

        // Step 2: Rerank with cross-encoder
        var ranked = await _reranker.RerankAsync(
            query:      query,
            candidates: candidates,
            topN:       agent.MemoryConfig.SemanticTopK,
            ct:         ct);

        return ranked.Select(r => r.Record).ToList();
    }
}

Configuration

// appsettings.json
{
  "Octopus": {
    "Reranker": {
      "Provider":     "Cohere",       // Cohere | ONNX
      "Model":        "rerank-english-v3.0",
      "CredentialId": 44,             // API key via ICredentialResolver
      "TopN":         5               // Final number of results after reranking
    }
  }
}

// Agent memory config
{
  "rerankerEnabled": true
}
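How these settings are loaded is not shown here. As a hedged sketch, the block below reads the Octopus:Reranker section into a typed record and validates it. RerankerOptions and RerankerConfigSketch are hypothetical names, and System.Text.Json is used only to keep the example dependency-free; an application would more likely bind the section through its normal configuration system.

```csharp
using System;
using System.Text.Json;

// Hypothetical typed view of the "Octopus:Reranker" section.
public sealed record RerankerOptions(string Provider, string Model, int CredentialId, int TopN);

public static class RerankerConfigSketch
{
    public static RerankerOptions Parse(string json)
    {
        // The sample config contains // comments, so comment skipping is enabled.
        var parseOptions = new JsonDocumentOptions
        {
            CommentHandling = JsonCommentHandling.Skip,
            AllowTrailingCommas = true
        };

        using var doc = JsonDocument.Parse(json, parseOptions);
        var section = doc.RootElement.GetProperty("Octopus").GetProperty("Reranker");

        var parsed = new RerankerOptions(
            Provider:     section.GetProperty("Provider").GetString() ?? "",
            Model:        section.GetProperty("Model").GetString() ?? "",
            CredentialId: section.GetProperty("CredentialId").GetInt32(),
            TopN:         section.GetProperty("TopN").GetInt32());

        // Only the two providers named in the config comment are accepted.
        if (parsed.Provider is not ("Cohere" or "ONNX"))
            throw new InvalidOperationException($"Unknown reranker provider '{parsed.Provider}'.");
        if (parsed.TopN <= 0)
            throw new InvalidOperationException("TopN must be a positive number.");

        return parsed;
    }
}
```

Validating at startup means a misspelled provider or a zero TopN fails fast instead of surfacing as empty retrieval results at query time.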

Reranking Adds Latency

Each reranking call scores every candidate against the query, so its cost grows linearly with the candidate count N. As a rough guide, for 15 candidates expect about +100–200 ms per turn with Cohere Rerank (a single API request covering all candidates) and +200–800 ms with local ONNX reranking, depending on hardware. Enable reranking only if retrieval precision is the bottleneck; for most agents, bi-encoder retrieval alone is sufficient.
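One way to keep this added latency bounded is to give the reranker a time budget and fall back to the original retrieval order when the budget is exceeded. The sketch below is not part of the pipeline described above: LatencyGuard, RerankWithBudgetAsync, and the rerank delegate are hypothetical names, and the approach assumes the reranker honors its CancellationToken.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class LatencyGuard
{
    // Runs 'rerank' under a time budget. If the budget expires, returns the first
    // topN candidates in their original (bi-encoder) order instead.
    public static async Task<(IReadOnlyList<T> Results, bool Reranked, TimeSpan Elapsed)>
        RerankWithBudgetAsync<T>(
            IReadOnlyList<T> candidates,
            Func<IReadOnlyList<T>, CancellationToken, Task<IReadOnlyList<T>>> rerank,
            int topN,
            TimeSpan budget,
            CancellationToken ct)
    {
        var stopwatch = Stopwatch.StartNew();

        // Linked source: cancels on caller cancellation or when the budget elapses.
        using var timeout = CancellationTokenSource.CreateLinkedTokenSource(ct);
        timeout.CancelAfter(budget);

        try
        {
            var reranked = await rerank(candidates, timeout.Token);
            return (reranked.Take(topN).ToList(), true, stopwatch.Elapsed);
        }
        catch (OperationCanceledException) when (!ct.IsCancellationRequested)
        {
            // Budget exceeded (not a caller cancellation): keep retrieval order.
            return (candidates.Take(topN).ToList(), false, stopwatch.Elapsed);
        }
    }
}
```

The returned Elapsed value can be logged per turn to check real latency against the estimates above before deciding whether to keep reranking enabled.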
