AI Inference Server Node — Workflow vs Server Nodes

Why Run Inference as a Server Node?

Cold start elimination — GPU model loading takes 30-120 seconds. A persistent server node loads once and stays warm.
Cost control — Run inference on your own hardware instead of paying per-token to a cloud provider for high-volume workloads.
Data sovereignty — Sensitive prompts never leave your network when using a locally-hosted model.
Model versioning — Deploy a specific model checkpoint and roll it forward on your own schedule, independent of cloud provider updates.
Batching — A persistent server node can batch inference requests across concurrent callers, increasing GPU utilisation.

Common Inference Server Backends

Backend	Use Case	API Surface
Ollama	Local LLM inference (Llama, Mistral, Gemma)	OpenAI-compatible REST API
vLLM	High-throughput LLM serving with PagedAttention	OpenAI-compatible REST API
Triton Inference Server	NVIDIA GPU inference for any ONNX/TensorRT model	gRPC + HTTP/2
HuggingFace TGI	Text Generation Inference — streaming LLM output	OpenAI-compatible REST API
LM Studio	Developer local inference (Windows/Mac)	OpenAI-compatible REST API

Wrapping an Inference Backend as a Server Node

Most inference backends expose an OpenAI-compatible API. The simplest server node is a thin proxy that adds health registration and tenant header forwarding:

// Minimal inference proxy server node
var builder = WebApplication.CreateBuilder(args);

builder.Services.Configure<InferenceNodeConfig>(
    builder.Configuration.GetSection("InferenceNode"));
builder.Services.AddHttpClient("inference");
builder.Services.AddSingleton<IServerGroupRegistrar, ServerGroupRegistrar>();

var app = builder.Build();

app.MapGet("/health", async (IHttpClientFactory factory,
                              IOptions<InferenceNodeConfig> cfg) =>
{
    try
    {
        var client   = factory.CreateClient("inference");
        var response = await client.GetAsync($"{cfg.Value.BackendUrl}/health");
        return response.IsSuccessStatusCode
            ? Results.Ok(new { status = "ok" })
            : Results.Json(new { status = "degraded" }, statusCode: 503);
    }
    catch
    {
        return Results.Json(new { status = "unhealthy" }, statusCode: 503);
    }
});

// Proxy inference calls, adding audit headers
app.MapPost("/infer", async (HttpRequest req,
                              IHttpClientFactory factory,
                              IOptions<InferenceNodeConfig> cfg) =>
{
    var tenantId  = req.Headers["X-Octopus-Tenant-Id"].FirstOrDefault();
    var corrId    = req.Headers["X-Octopus-Correlation-Id"].FirstOrDefault();

    var client    = factory.CreateClient("inference");
    var upstream  = new HttpRequestMessage(HttpMethod.Post,
                        $"{cfg.Value.BackendUrl}/v1/chat/completions");

    upstream.Headers.TryAddWithoutValidation("X-Tenant-Id",      tenantId ?? "");
    upstream.Headers.TryAddWithoutValidation("X-Correlation-Id", corrId   ?? "");
    upstream.Content = new StreamContent(req.Body)
    {
        Headers = { ContentType = req.ContentType is null ? null
                    : new MediaTypeHeaderValue(req.ContentType) }
    };

    var response = await client.SendAsync(upstream,
                       HttpCompletionOption.ResponseHeadersRead);

    // Stream the response back (for SSE / token streaming)
    return Results.Stream(await response.Content.ReadAsStreamAsync(),
                          contentType: "application/json");
});

app.Run();

GPU Allocation Configuration

# Kubernetes: request GPU resources for the inference server node
spec:
  containers:
  - name: inference-node
    image: mycompany/inference-proxy:1.0.0
    resources:
      limits:
        nvidia.com/gpu: 1        # Request 1 GPU
      requests:
        nvidia.com/gpu: 1
    env:
    - name: InferenceNode__BackendUrl
      value: "http://localhost:11434"   # Ollama sidecar
    - name: InferenceNode__GroupName
      value: "inference-cluster"

  # Ollama sidecar — runs the actual model on the GPU
  - name: ollama
    image: ollama/ollama:latest
    resources:
      limits:
        nvidia.com/gpu: 1
    volumeMounts:
    - name: model-cache
      mountPath: /root/.ollama

Token Streaming from Inference Nodes

For Octopus agents that stream LLM token output to the chatbot UI, the inference node must support SSE streaming. Configure the Octopus LLM provider to point at the server group endpoint with streaming enabled:

{
  "OctopusConfig": {
    "SemanticKernelPlugin": {
      "LLM": {
        "Provider":    "CustomEndpoint",
        "BaseUrl":     "https://server-group.internal/inference-cluster",
        "ModelId":     "llama3.1:70b",
        "EnableStream": true
      }
    }
  }
}

Inference Node Capacity Planning

Model Size	Min GPU VRAM	Typical Throughput	Batching Benefit
7B parameters (Q4)	8 GB	40-80 tokens/sec	Moderate
13B parameters (Q4)	12 GB	25-50 tokens/sec	Moderate
34B parameters (Q4)	24 GB	10-20 tokens/sec	High
70B parameters (Q4)	48 GB (2x GPU)	5-12 tokens/sec	High

GPU memory is not shared. Each inference server node claims its GPU VRAM exclusively. If you need to host multiple models, either use separate nodes per model or a multiplexing backend like Ollama (which can swap models on-demand at the cost of cold-start delay).

← Server Node as a Service Next: Decision Framework →