Embedding Generation — Data Enrichment & Enhancement

What Are Embeddings?

A vector embedding represents text content as a high-dimensional array of floating-point numbers (commonly 1024, 1536, or 3072 values, depending on the model). Two pieces of text that are semantically similar have embeddings that lie close together in vector space, usually measured by cosine similarity, even if they share no keywords.

For example, the sentences "I need to cancel my subscription" and "Please close my account" have very different words but very similar meanings — their embeddings will be close in vector space. This enables semantic search that keyword-based search cannot achieve.
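To make "close in vector space" concrete, here is a minimal sketch of cosine similarity, a common closeness measure for embeddings. The three-dimensional vectors are made-up stand-ins for real model output, which would have hundreds or thousands of dimensions, and the function name CosineSimilarity is chosen for illustration only.

```csharp
using System;

// Toy 3-dimensional vectors standing in for real embeddings (hypothetical values).
float[] cancelSubscription = { 0.90f, 0.10f, 0.05f };
float[] closeAccount       = { 0.85f, 0.15f, 0.10f };
float[] weatherReport      = { 0.05f, 0.20f, 0.95f };

Console.WriteLine($"cancel vs close:   {CosineSimilarity(cancelSubscription, closeAccount):F3}");
Console.WriteLine($"cancel vs weather: {CosineSimilarity(cancelSubscription, weatherReport):F3}");

// Cosine similarity: dot(a, b) / (|a| * |b|). Ranges from -1 to 1; higher means closer.
static double CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same number of dimensions.");

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot   += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}
```

The first pair scores much higher than the second, which is the property semantic search relies on.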

The IEmbeddingProvider Interface

The Octopus framework provides an IEmbeddingProvider abstraction that workflow nodes use to generate embeddings. The implementation is swappable (Azure OpenAI, OpenAI, Ollama) without changing the workflow:

using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// IEmbeddingProvider interface
public interface IEmbeddingProvider
{
    Task<float[]> GenerateEmbeddingAsync(
        string text,
        string? modelId = null,
        CancellationToken ct = default);

    Task<string> StoreEmbeddingAsync(
        string collectionName,
        float[] embedding,
        Dictionary<string, object> metadata,
        CancellationToken ct = default);  // Returns embeddingRef (vector store ID)
}

// Configured provider options:
// - AzureOpenAIEmbeddingProvider: Uses text-embedding-3-large (3072 dims)
// - OpenAIEmbeddingProvider: Uses text-embedding-3-small (1536 dims)
// - OllamaEmbeddingProvider: Uses mxbai-embed-large (local, 1024 dims)
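As a rough illustration of how a caller might use the two interface methods together, the sketch below generates a vector and then stores it. InMemoryEmbeddingProvider is a hypothetical test double written for this example; it is not part of the Octopus framework and does not call any model. The interface is repeated only so the sketch compiles on its own.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

IEmbeddingProvider provider = new InMemoryEmbeddingProvider();

// Generate the vector first, then store it and keep the returned reference.
float[] vector = await provider.GenerateEmbeddingAsync("Lead: Jane Smith\nCompany: Acme Corp");
string embeddingRef = await provider.StoreEmbeddingAsync(
    "leads",
    vector,
    new Dictionary<string, object> { ["recordType"] = "Lead", ["tenantId"] = 42 });

Console.WriteLine($"Stored {vector.Length}-dim vector as {embeddingRef}");

// Same shape as the interface shown above.
public interface IEmbeddingProvider
{
    Task<float[]> GenerateEmbeddingAsync(
        string text, string? modelId = null, CancellationToken ct = default);

    Task<string> StoreEmbeddingAsync(
        string collectionName, float[] embedding,
        Dictionary<string, object> metadata, CancellationToken ct = default);
}

// Hypothetical stand-in: NOT a real embedding model, just a deterministic fake.
public sealed class InMemoryEmbeddingProvider : IEmbeddingProvider
{
    private readonly Dictionary<string, (float[] Vector, Dictionary<string, object> Metadata)> _store = new();

    public Task<float[]> GenerateEmbeddingAsync(
        string text, string? modelId = null, CancellationToken ct = default)
    {
        ct.ThrowIfCancellationRequested();
        // Fake 8-dim "embedding": count characters per bucket, so equal text gives equal vectors.
        var vector = new float[8];
        foreach (char c in text) vector[c % 8] += 1f;
        return Task.FromResult(vector);
    }

    public Task<string> StoreEmbeddingAsync(
        string collectionName, float[] embedding,
        Dictionary<string, object> metadata, CancellationToken ct = default)
    {
        ct.ThrowIfCancellationRequested();
        // The ID format here is invented for the example.
        string id = $"{collectionName}_{Guid.NewGuid():N}";
        _store[id] = (embedding, metadata);
        return Task.FromResult(id);
    }
}
```

Because callers depend only on the interface, a fake like this can stand in for a real provider in unit tests of workflow logic.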

Embedding Workflow Node Configuration

// EmbeddingGenerationNode configuration
{
  "nodeType": "EmbeddingGenerationNode",
  "nodeId": "generate-lead-embedding",

  // What text to embed — combine relevant fields for richer semantic representation
  "inputText": "Lead Name: {{lead.FullName}}\nCompany: {{lead.CompanyName}}\nIndustry: {{lead.Industry}}\nSource: {{lead.Source}}\nNotes: {{lead.Notes}}\nSummary: {{variables.summaryResult.summary}}",

  "collectionName": "leads",     // Vector store collection/index name
  "metadata": {
    "tenantId": "{{workflow.tenantId}}",
    "recordId": "{{input.leadId}}",
    "recordType": "Lead",
    "datasourceId": "sales-data-db"
  },
  "outputVariable": "embeddingRef"   // Returns the vector store entry ID
}

// After this node:
// embeddingRef = "qdrant_abc123..."
// Write embeddingRef back to Lead table via SqlUpdateNode

What Goes Into the Embedding Text

The quality of semantic search depends on what content you embed. Best practices:

Content to Include	Content to Exclude	Reason
Business content fields (name, description, notes)	Technical IDs (LeadId, TenantId)	IDs are meaningless to the embedding model
AI-generated summary (SummaryText)	Status/classification labels	Summary captures semantic meaning; labels are searchable as exact matches in SQL
Category/domain context (industry, product, type)	Timestamps and audit columns	Dates don't contribute to semantic meaning
Long-form notes and descriptions	Empty/null fields	Null fields add noise without value

Embedding Text Template Example

// Good embedding text for a Lead record:
"Lead: Jane Smith
Company: Acme Corp (Software Industry)
Source: Conference
Summary: Senior VP of Engineering at a mid-market SaaS company. Expressed strong interest in enterprise workflow automation. Currently using a competitor product but evaluating alternatives. High purchase intent based on conversation notes.
Notes: Met at Workflow Summit 2026. Asked specifically about multi-tenant support and compliance features. Wants a demo with their IT team."

// This captures the semantic meaning of who this lead is and what they need
// A search for "enterprise workflow buyer interested in compliance" will find this record
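One way to apply the guidelines above in code is to assemble the embedding text from labelled fields and drop any that are null or blank. This is a sketch under assumptions: the Lead record and the EmbeddingText helper are hypothetical, with property names borrowed from the {{lead.*}} placeholders in the node configuration.

```csharp
#nullable enable
using System;
using System.Linq;

var lead = new Lead("Jane Smith", "Acme Corp", "Software", "Conference", Notes: null,
                    Summary: "Senior VP of Engineering evaluating workflow automation.");

Console.WriteLine(EmbeddingText.Build(lead));

// Hypothetical record; only business-content fields, no IDs or timestamps.
public record Lead(string? FullName, string? CompanyName, string? Industry,
                   string? Source, string? Notes, string? Summary);

public static class EmbeddingText
{
    // Builds "Label: value" lines, skipping fields that are null, empty, or whitespace.
    public static string Build(Lead lead)
    {
        var fields = new (string Label, string? Value)[]
        {
            ("Lead", lead.FullName),
            ("Company", lead.CompanyName),
            ("Industry", lead.Industry),
            ("Source", lead.Source),
            ("Summary", lead.Summary),
            ("Notes", lead.Notes),
        };

        var lines = fields
            .Where(f => !string.IsNullOrWhiteSpace(f.Value))
            .Select(f => $"{f.Label}: {f.Value!.Trim()}");

        return string.Join("\n", lines);
    }
}
```

In this run the Notes field is null, so the output contains no "Notes:" line at all rather than an empty label.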

Vector Store Architecture

Embeddings are stored in a separate vector store — not in SQL Server. The two stores are linked by the EmbeddingRef column:

SQL Server (DataOcean_SalesData)          Vector Store (Qdrant)
┌──────────────────────────────────┐    ┌─────────────────────────────┐
│ Lead table                       │    │ leads collection             │
│   LeadId (PK GUID)               │    │   id: "qdrant_abc123"        │
│   ...business columns...         │    │   vector: [0.023, -0.891...] │
│   EmbeddingRef: "qdrant_abc123" ──┼───►│   metadata.recordId: LeadId │
│   AiProcessedAt: 2026-05-25      │    │   metadata.tenantId: 42      │
└──────────────────────────────────┘    └─────────────────────────────┘

Semantic Search Flow

User Query

User types a natural language query: "Find leads from the healthcare industry interested in compliance features"

Query Embedding

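The query text is embedded with the same model used for the records. Vectors produced by different models (for example, 3072-dimensional text-embedding-3-large versus 1024-dimensional mxbai-embed-large) are not comparable, so switching models means re-embedding the stored records. Result: a query vector.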

Vector Similarity Search

Qdrant finds the top-K most similar record embeddings using ANN (Approximate Nearest Neighbor) search. Returns embedding IDs with similarity scores.

SQL Record Retrieval

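The embedding IDs are mapped back to Lead records via WHERE EmbeddingRef IN (...). Because an IN clause does not guarantee row order, the application re-sorts the returned rows by the similarity scores from the previous step before returning the full lead records.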
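The last two stages of the flow can be sketched in memory as follows. This is an illustration, not how Qdrant works internally: it scores every vector exactly (brute force) rather than using ANN, and the dictionaries are hypothetical stand-ins for the vector store and the Lead table. The names SemanticSearch and TopK are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-ins for the vector store and the Lead table.
var vectorStore = new Dictionary<string, float[]>
{
    ["ref-1"] = new[] { 0.9f, 0.1f, 0.0f },
    ["ref-2"] = new[] { 0.1f, 0.9f, 0.1f },
    ["ref-3"] = new[] { 0.8f, 0.3f, 0.1f },
};
var leadsByEmbeddingRef = new Dictionary<string, string>
{
    ["ref-1"] = "Lead A", ["ref-2"] = "Lead B", ["ref-3"] = "Lead C",
};

// In a real system this vector would come from embedding the user's query text.
float[] queryVector = { 1.0f, 0.2f, 0.0f };

// Vector similarity search: score every stored vector, keep the top K.
var hits = SemanticSearch.TopK(vectorStore, queryVector, k: 2);

// Record retrieval: map refs back to records, preserving the similarity order.
foreach (var (embeddingRef, score) in hits)
    Console.WriteLine($"{leadsByEmbeddingRef[embeddingRef]}  score={score:F3}");

public static class SemanticSearch
{
    public static List<(string EmbeddingRef, double Score)> TopK(
        IReadOnlyDictionary<string, float[]> store, float[] query, int k) =>
        store.Select(kv => (EmbeddingRef: kv.Key, Score: Cosine(kv.Value, query)))
             .OrderByDescending(hit => hit.Score)
             .Take(k)
             .ToList();

    public static double Cosine(float[] a, float[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

Iterating over the ranked hits, rather than over whatever order the record lookup returns, is what keeps the final list sorted by similarity.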

Keep Embedding and SQL Data in Sync

When a Lead record is updated (especially Notes or other fields that feed the embedding text), its embedding should be regenerated. Trigger the embedding workflow on update as well as on create. A stale embedding still represents the old content, so similarity search will match the record against text it no longer contains. The AiProcessedAt timestamp helps identify records whose embedding may be older than the last update (WHERE UpdatedAt > AiProcessedAt).
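The staleness check can be sketched in application code as well. The LeadRow record and StaleEmbeddings helper below are hypothetical, and the UpdatedAt column is assumed to exist as in the WHERE clause above. Unlike that SQL predicate, which skips rows where AiProcessedAt is NULL, this sketch also flags records that were never embedded; that is a design choice for the example, not something the framework prescribes.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

var leads = new List<LeadRow>
{
    // Embedded after the last edit: up to date.
    new("lead-1", UpdatedAt: new DateTime(2026, 5, 20), AiProcessedAt: new DateTime(2026, 5, 25)),
    // Edited after the embedding was generated: stale.
    new("lead-2", UpdatedAt: new DateTime(2026, 5, 27), AiProcessedAt: new DateTime(2026, 5, 25)),
    // Never embedded.
    new("lead-3", UpdatedAt: new DateTime(2026, 5, 26), AiProcessedAt: null),
};

foreach (var lead in StaleEmbeddings.Find(leads))
    Console.WriteLine($"Re-embed {lead.LeadId}");

// Hypothetical projection of the Lead table with only the columns the check needs.
public record LeadRow(string LeadId, DateTime UpdatedAt, DateTime? AiProcessedAt);

public static class StaleEmbeddings
{
    // A row needs re-embedding if it was never processed or was updated after processing.
    public static IEnumerable<LeadRow> Find(IEnumerable<LeadRow> leads) =>
        leads.Where(l => l.AiProcessedAt is null || l.UpdatedAt > l.AiProcessedAt);
}
```

A scheduled job could run a check like this as a safety net for updates that bypassed the on-update trigger.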
