AI Inference Server Node
An AI inference server node hosts a local or self-managed AI model (LLM, embedding model, classifier) as a server group member. Octopus agents and workflows call the inference node over HTTP — the same as any other server group service.
Why Run Inference as a Server Node?
- Cold start elimination — GPU model loading takes 30-120 seconds. A persistent server node loads once and stays warm.
- Cost control — Run inference on your own hardware instead of paying per-token to a cloud provider for high-volume workloads.
- Data sovereignty — Sensitive prompts never leave your network when using a locally-hosted model.
- Model versioning — Deploy a specific model checkpoint and roll it forward on your own schedule, independent of cloud provider updates.
- Batching — A persistent server node can batch inference requests across concurrent callers, increasing GPU utilisation.
Common Inference Server Backends
| Backend | Use Case | API Surface |
|---|---|---|
| Ollama | Local LLM inference (Llama, Mistral, Gemma) | OpenAI-compatible REST API |
| vLLM | High-throughput LLM serving with PagedAttention | OpenAI-compatible REST API |
| Triton Inference Server | NVIDIA GPU inference for ONNX, TensorRT, PyTorch, and TensorFlow models | HTTP/REST + gRPC (KServe v2 protocol) |
| Hugging Face TGI | Text Generation Inference, with streaming LLM output | OpenAI-compatible REST API |
| LM Studio | Desktop local inference for developers (Windows/macOS/Linux) | OpenAI-compatible REST API |
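For the backends listed as OpenAI-compatible, the request body follows the chat completions shape. A minimal sketch of building that body with `System.Text.Json`; the `ChatRequest` class and its `Build` method are hypothetical names used only for illustration, and only the common `model`, `messages`, and `stream` fields are shown:

```csharp
using System.Text.Json;

public static class ChatRequest
{
    // Serializes a single-turn chat completion request.
    // Property names come from the anonymous type: model, messages, stream.
    public static string Build(string model, string userPrompt, bool stream) =>
        JsonSerializer.Serialize(new
        {
            model,
            messages = new[] { new { role = "user", content = userPrompt } },
            stream
        });
}
```

The resulting string can be posted as `application/json` to the backend's `/v1/chat/completions` path.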
Wrapping an Inference Backend as a Server Node
Most inference backends expose an OpenAI-compatible API. The simplest server node is a thin proxy that adds health registration and tenant header forwarding:
// Minimal inference proxy server node
using System.Net.Http.Headers;
using Microsoft.Extensions.Options;

var builder = WebApplication.CreateBuilder(args);
builder.Services.Configure<InferenceNodeConfig>(
    builder.Configuration.GetSection("InferenceNode"));
builder.Services.AddHttpClient("inference");
builder.Services.AddSingleton<IServerGroupRegistrar, ServerGroupRegistrar>();
var app = builder.Build();

// Health check: this node is healthy only if the backend answers.
// The /health path suits vLLM and TGI; Ollama has no /health route,
// so probe "/" instead when the backend is Ollama.
app.MapGet("/health", async (IHttpClientFactory factory,
    IOptions<InferenceNodeConfig> cfg) =>
{
    try
    {
        var client = factory.CreateClient("inference");
        using var response = await client.GetAsync($"{cfg.Value.BackendUrl}/health");
        return response.IsSuccessStatusCode
            ? Results.Ok(new { status = "ok" })
            : Results.Json(new { status = "degraded" }, statusCode: 503);
    }
    catch
    {
        // Backend unreachable or timed out
        return Results.Json(new { status = "unhealthy" }, statusCode: 503);
    }
});

// Proxy inference calls, adding audit headers
app.MapPost("/infer", async (HttpRequest req,
    IHttpClientFactory factory,
    IOptions<InferenceNodeConfig> cfg) =>
{
    var tenantId = req.Headers["X-Octopus-Tenant-Id"].FirstOrDefault();
    var corrId = req.Headers["X-Octopus-Correlation-Id"].FirstOrDefault();
    var client = factory.CreateClient("inference");
    var upstream = new HttpRequestMessage(HttpMethod.Post,
        $"{cfg.Value.BackendUrl}/v1/chat/completions");
    upstream.Headers.TryAddWithoutValidation("X-Tenant-Id", tenantId ?? "");
    upstream.Headers.TryAddWithoutValidation("X-Correlation-Id", corrId ?? "");
    upstream.Content = new StreamContent(req.Body);
    // Parse instead of constructing: the MediaTypeHeaderValue constructor
    // rejects values that carry parameters such as "; charset=utf-8".
    if (MediaTypeHeaderValue.TryParse(req.ContentType, out var contentType))
        upstream.Content.Headers.ContentType = contentType;

    var response = await client.SendAsync(upstream,
        HttpCompletionOption.ResponseHeadersRead, req.HttpContext.RequestAborted);
    req.HttpContext.Response.RegisterForDispose(response);

    // Stream the response back (for SSE / token streaming), keeping the
    // upstream status code and content type (text/event-stream when streaming).
    req.HttpContext.Response.StatusCode = (int)response.StatusCode;
    return Results.Stream(await response.Content.ReadAsStreamAsync(),
        contentType: response.Content.Headers.ContentType?.ToString() ?? "application/json");
});

app.Run();
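The proxy binds `InferenceNodeConfig` from the `InferenceNode` configuration section, but the class itself is not shown. A minimal sketch, assuming only the two keys that appear on this page (`BackendUrl` and `GroupName`); the default values and the `ChatCompletionsUri` helper are hypothetical additions for illustration, not part of Octopus:

```csharp
using System;

public sealed class InferenceNodeConfig
{
    // Bound from InferenceNode:BackendUrl (env var InferenceNode__BackendUrl).
    // The default here is an assumption: Ollama's standard local port.
    public string BackendUrl { get; set; } = "http://localhost:11434";

    // Bound from InferenceNode:GroupName (env var InferenceNode__GroupName).
    public string GroupName { get; set; } = "";

    // Hypothetical helper: builds the upstream URI and tolerates a
    // trailing slash on BackendUrl, which plain string interpolation does not.
    public Uri ChatCompletionsUri() =>
        new Uri(new Uri(BackendUrl.TrimEnd('/') + "/"), "v1/chat/completions");
}
```

Normalising the trailing slash in one place avoids producing `//v1/chat/completions` when an operator configures the backend URL with a trailing `/`.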
GPU Allocation Configuration
# Kubernetes: request GPU resources for the inference server node
spec:
  containers:
    # Thin HTTP proxy: it needs no GPU of its own. Requesting one here
    # would reserve a second GPU that sits idle.
    - name: inference-node
      image: mycompany/inference-proxy:1.0.0
      env:
        - name: InferenceNode__BackendUrl
          value: "http://localhost:11434" # Ollama sidecar
        - name: InferenceNode__GroupName
          value: "inference-cluster"
    # Ollama sidecar: runs the actual model on the GPU
    - name: ollama
      image: ollama/ollama:latest
      resources:
        limits:
          nvidia.com/gpu: 1 # Request 1 GPU (requests default to limits)
      volumeMounts:
        # model-cache must also be declared under the pod's volumes (not shown)
        - name: model-cache
          mountPath: /root/.ollama
Token Streaming from Inference Nodes
For Octopus agents that stream LLM token output to the chatbot UI, the inference node must support SSE streaming. Configure the Octopus LLM provider to point at the server group endpoint with streaming enabled:
{
  "OctopusConfig": {
    "SemanticKernelPlugin": {
      "LLM": {
        "Provider": "CustomEndpoint",
        "BaseUrl": "https://server-group.internal/inference-cluster",
        "ModelId": "llama3.1:70b",
        "EnableStream": true
      }
    }
  }
}
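On the consuming side, OpenAI-compatible backends typically deliver a streamed response as SSE lines of the form `data: {json}`, ending with `data: [DONE]`. The following is a minimal sketch of extracting the token text from such a stream, assuming the chat completions chunk shape (`choices[0].delta.content`); the `SseTokens` class is a hypothetical helper, not an Octopus API:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

public static class SseTokens
{
    // Reads OpenAI-style SSE lines and yields each content delta in order.
    public static IEnumerable<string> Read(TextReader reader)
    {
        string? line;
        while ((line = reader.ReadLine()) != null)
        {
            // Skip blank separator lines and SSE comments.
            if (!line.StartsWith("data:", StringComparison.Ordinal)) continue;

            var payload = line.Substring(5).Trim();
            if (payload == "[DONE]") yield break; // end-of-stream sentinel

            using var doc = JsonDocument.Parse(payload);
            var choices = doc.RootElement.GetProperty("choices");
            if (choices.GetArrayLength() == 0) continue;

            // Some chunks (role announcements, finish markers) carry no content.
            if (choices[0].TryGetProperty("delta", out var delta) &&
                delta.TryGetProperty("content", out var content) &&
                content.ValueKind == JsonValueKind.String)
            {
                yield return content.GetString()!;
            }
        }
    }
}
```

In practice the reader would be a `StreamReader` over the HTTP response stream, obtained with `HttpCompletionOption.ResponseHeadersRead` so tokens arrive as they are generated. The synchronous `ReadLine` keeps the sketch short; an async variant would suit a server under load.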
Inference Node Capacity Planning
| Model Size | Min GPU VRAM | Typical Throughput | Batching Benefit |
|---|---|---|---|
| 7B parameters (Q4) | 8 GB | 40-80 tokens/sec | Moderate |
| 13B parameters (Q4) | 12 GB | 25-50 tokens/sec | Moderate |
| 34B parameters (Q4) | 24 GB | 10-20 tokens/sec | High |
| 70B parameters (Q4) | 48 GB (2x GPU) | 5-12 tokens/sec | High |
GPU memory is not shared: each inference server node claims its GPU VRAM exclusively. To host multiple models, either run a separate node per model or use a multiplexing backend such as Ollama, which can swap models on demand at the cost of a cold-start delay.
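The VRAM column in the capacity table can be sanity-checked with simple arithmetic: weight memory is roughly parameters × bits per weight ÷ 8. The sketch below assumes exactly 4 bits per weight for Q4 (real Q4 formats often store slightly more once scaling metadata is included) and a hypothetical 20% overhead for KV cache and runtime buffers; both figures are illustrative assumptions, not measured values:

```csharp
using System;

public static class VramEstimate
{
    // Weight memory in GB (10^9 bytes) for a model with the given
    // parameter count in billions: params × bits ÷ 8.
    public static double WeightsGb(double billionParams, double bitsPerWeight) =>
        billionParams * bitsPerWeight / 8.0;

    // Rule-of-thumb fit check. The default overhead fraction is an assumption;
    // long contexts and large batches need more KV-cache headroom.
    public static bool Fits(double billionParams, double bitsPerWeight,
                            double gpuVramGb, double overhead = 0.2) =>
        WeightsGb(billionParams, bitsPerWeight) * (1 + overhead) <= gpuVramGb;
}
```

Under these assumptions a 7B Q4 model needs about 3.5 GB for weights and a 70B Q4 model about 35 GB, which is consistent with the 8 GB and 48 GB minimums in the table once headroom is added.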