Portal Community

Why Run Inference as a Server Node?

Common Inference Server Backends

BackendUse CaseAPI Surface
OllamaLocal LLM inference (Llama, Mistral, Gemma)OpenAI-compatible REST API
vLLMHigh-throughput LLM serving with PagedAttentionOpenAI-compatible REST API
Triton Inference ServerNVIDIA GPU inference for any ONNX/TensorRT modelgRPC + HTTP/2
HuggingFace TGIText Generation Inference — streaming LLM outputOpenAI-compatible REST API
LM StudioDeveloper local inference (Windows/Mac)OpenAI-compatible REST API

Wrapping an Inference Backend as a Server Node

Most inference backends expose an OpenAI-compatible API. The simplest server node is a thin proxy that adds health registration and tenant header forwarding:

// Minimal inference proxy server node
var builder = WebApplication.CreateBuilder(args);

builder.Services.Configure<InferenceNodeConfig>(
    builder.Configuration.GetSection("InferenceNode"));
builder.Services.AddHttpClient("inference");
builder.Services.AddSingleton<IServerGroupRegistrar, ServerGroupRegistrar>();

var app = builder.Build();

app.MapGet("/health", async (IHttpClientFactory factory,
                              IOptions<InferenceNodeConfig> cfg) =>
{
    try
    {
        var client   = factory.CreateClient("inference");
        var response = await client.GetAsync($"{cfg.Value.BackendUrl}/health");
        return response.IsSuccessStatusCode
            ? Results.Ok(new { status = "ok" })
            : Results.Json(new { status = "degraded" }, statusCode: 503);
    }
    catch
    {
        return Results.Json(new { status = "unhealthy" }, statusCode: 503);
    }
});

// Proxy inference calls, adding audit headers
app.MapPost("/infer", async (HttpRequest req,
                              IHttpClientFactory factory,
                              IOptions<InferenceNodeConfig> cfg) =>
{
    var tenantId  = req.Headers["X-Octopus-Tenant-Id"].FirstOrDefault();
    var corrId    = req.Headers["X-Octopus-Correlation-Id"].FirstOrDefault();

    var client    = factory.CreateClient("inference");
    var upstream  = new HttpRequestMessage(HttpMethod.Post,
                        $"{cfg.Value.BackendUrl}/v1/chat/completions");

    upstream.Headers.TryAddWithoutValidation("X-Tenant-Id",      tenantId ?? "");
    upstream.Headers.TryAddWithoutValidation("X-Correlation-Id", corrId   ?? "");
    upstream.Content = new StreamContent(req.Body)
    {
        Headers = { ContentType = req.ContentType is null ? null
                    : new MediaTypeHeaderValue(req.ContentType) }
    };

    var response = await client.SendAsync(upstream,
                       HttpCompletionOption.ResponseHeadersRead);

    // Stream the response back (for SSE / token streaming)
    return Results.Stream(await response.Content.ReadAsStreamAsync(),
                          contentType: "application/json");
});

app.Run();

GPU Allocation Configuration

# Kubernetes: request GPU resources for the inference server node
spec:
  containers:
  - name: inference-node
    image: mycompany/inference-proxy:1.0.0
    resources:
      limits:
        nvidia.com/gpu: 1        # Request 1 GPU
      requests:
        nvidia.com/gpu: 1
    env:
    - name: InferenceNode__BackendUrl
      value: "http://localhost:11434"   # Ollama sidecar
    - name: InferenceNode__GroupName
      value: "inference-cluster"

  # Ollama sidecar — runs the actual model on the GPU
  - name: ollama
    image: ollama/ollama:latest
    resources:
      limits:
        nvidia.com/gpu: 1
    volumeMounts:
    - name: model-cache
      mountPath: /root/.ollama

Token Streaming from Inference Nodes

For Octopus agents that stream LLM token output to the chatbot UI, the inference node must support SSE streaming. Configure the Octopus LLM provider to point at the server group endpoint with streaming enabled:

{
  "OctopusConfig": {
    "SemanticKernelPlugin": {
      "LLM": {
        "Provider":    "CustomEndpoint",
        "BaseUrl":     "https://server-group.internal/inference-cluster",
        "ModelId":     "llama3.1:70b",
        "EnableStream": true
      }
    }
  }
}

Inference Node Capacity Planning

Model SizeMin GPU VRAMTypical ThroughputBatching Benefit
7B parameters (Q4)8 GB40-80 tokens/secModerate
13B parameters (Q4)12 GB25-50 tokens/secModerate
34B parameters (Q4)24 GB10-20 tokens/secHigh
70B parameters (Q4)48 GB (2x GPU)5-12 tokens/secHigh
GPU memory is not shared. Each inference server node claims its GPU VRAM exclusively. If you need to host multiple models, either use separate nodes per model or a multiplexing backend like Ollama (which can swap models on-demand at the cost of cold-start delay).