Principle:Ollama Ollama ServerArchitecture

From Leeroopedia
Knowledge Sources
Domains Server, Architecture
Last Updated 2025-02-15 00:00 GMT

Overview

The Ollama Server Architecture defines the HTTP server infrastructure: request routing, middleware chains, model scheduling, and streaming response delivery. Together, these components provide the runtime environment for serving LLM inference requests.

Core Concepts

HTTP Routing

The server exposes a REST API with routes for model management (create, pull, push, delete, list, show), inference (generate, chat), and compatibility layers (OpenAI-compatible endpoints). Routes are registered in a central routing table that maps HTTP method and path combinations to handler functions. Each route can have its own middleware stack for authentication, request validation, and format translation.
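The idea of a central routing table can be sketched with Go's standard library. This is a minimal illustration, not Ollama's actual router: the `Router` type, the `named` placeholder handlers, and the `dispatch` helper are hypothetical, and only a few paths from Ollama's public API are registered.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// routeKey identifies a route by HTTP method and path.
type routeKey struct {
	Method string
	Path   string
}

// Router is a hypothetical central routing table (an assumption for illustration).
type Router struct {
	routes map[routeKey]http.HandlerFunc
}

func NewRouter() *Router {
	return &Router{routes: make(map[routeKey]http.HandlerFunc)}
}

// Handle registers a handler for one method and path combination.
func (r *Router) Handle(method, path string, h http.HandlerFunc) {
	r.routes[routeKey{method, path}] = h
}

// ServeHTTP dispatches to the registered handler, or replies 405 or 404.
func (r *Router) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	if h, ok := r.routes[routeKey{req.Method, req.URL.Path}]; ok {
		h(w, req)
		return
	}
	// The path exists under a different method: 405 rather than 404.
	for k := range r.routes {
		if k.Path == req.URL.Path {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
	}
	http.NotFound(w, req)
}

// named returns a placeholder handler that reports which route matched.
func named(name string) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, name)
	}
}

// buildRouter registers a small subset of routes with placeholder handlers.
func buildRouter() *Router {
	r := NewRouter()
	r.Handle(http.MethodPost, "/api/generate", named("generate"))
	r.Handle(http.MethodPost, "/api/chat", named("chat"))
	r.Handle(http.MethodGet, "/api/tags", named("list"))
	r.Handle(http.MethodPost, "/v1/chat/completions", named("openai-chat"))
	return r
}

// dispatch runs one in-memory request through the router.
func dispatch(r *Router, method, path string) (int, string) {
	rec := httptest.NewRecorder()
	r.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	r := buildRouter()
	code, body := dispatch(r, http.MethodPost, "/api/chat")
	fmt.Println(code, body)
}
```

Keying the table on both method and path lets the same path carry different handlers per method, and makes a wrong-method request distinguishable from an unknown path.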

Middleware Architecture

Middleware functions intercept requests before they reach handlers and responses before they are sent to clients. The middleware chain handles cross-cutting concerns including authentication verification, request logging, CORS headers, content negotiation, and API format translation (e.g., translating OpenAI-format requests to Ollama's internal format). Middleware is composable, allowing different endpoint groups to use different middleware stacks.
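Composable middleware of this kind is commonly written in Go as functions that wrap an `http.Handler`. The sketch below assumes that pattern; `Chain`, `CORS`, `RequireToken`, `okHandler`, and `call` are hypothetical names chosen for illustration and do not come from the Ollama codebase.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Middleware wraps a handler with additional behavior.
type Middleware func(http.Handler) http.Handler

// Chain applies middlewares so that the first one listed runs outermost.
func Chain(h http.Handler, mws ...Middleware) http.Handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// CORS sets an allow-origin header before calling the next handler.
func CORS(origin string) Middleware {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			w.Header().Set("Access-Control-Allow-Origin", origin)
			next.ServeHTTP(w, r)
		})
	}
}

// RequireToken rejects requests whose Authorization header does not match.
// The bearer-token scheme is an assumption made for this example.
func RequireToken(token string) Middleware {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			if r.Header.Get("Authorization") != "Bearer "+token {
				http.Error(w, "unauthorized", http.StatusUnauthorized)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}

// okHandler stands in for a real endpoint handler.
var okHandler = http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
	fmt.Fprint(w, "ok")
})

// call sends an in-memory GET request with an optional Authorization header.
func call(h http.Handler, auth string) *httptest.ResponseRecorder {
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	if auth != "" {
		req.Header.Set("Authorization", auth)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec
}

func main() {
	// Two endpoint groups with different middleware stacks.
	public := Chain(okHandler, CORS("*"))
	private := Chain(okHandler, CORS("*"), RequireToken("secret"))

	fmt.Println(call(public, "").Code)
	fmt.Println(call(private, "").Code)
	fmt.Println(call(private, "Bearer secret").Code)
}
```

Because each middleware only knows about the next handler, endpoint groups can share some layers (here, CORS) while adding others (here, authentication) without changing the handlers themselves.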

Model Scheduler

The scheduler manages the lifecycle of loaded models, determining when to load, unload, or share model instances across concurrent requests. It tracks GPU and CPU memory usage, enforces concurrency limits, and implements eviction policies when memory pressure requires unloading models. The scheduler provides a request queue that ensures fair access to model instances and prevents resource exhaustion.
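The load, share, and evict decisions can be illustrated with a deliberately simplified model. The sketch below assumes a single memory budget, least-recently-used eviction of idle models, and single-goroutine use; the `Scheduler` type and its methods are hypothetical and omit the request queue, locking, and GPU/CPU split that a real scheduler would need.

```go
package main

import (
	"errors"
	"fmt"
)

// loaded tracks one resident model. All fields are illustrative.
type loaded struct {
	size     uint64 // estimated memory footprint in bytes
	refs     int    // in-flight requests using this model
	lastUsed int    // logical clock value of the most recent use
}

// Scheduler is a hypothetical sketch; it is not safe for concurrent use.
type Scheduler struct {
	budget uint64
	used   uint64
	clock  int
	models map[string]*loaded
}

// ErrNoMemory is returned when a model cannot be made to fit.
var ErrNoMemory = errors.New("not enough memory available")

func NewScheduler(budget uint64) *Scheduler {
	return &Scheduler{budget: budget, models: make(map[string]*loaded)}
}

// Acquire shares an already loaded model, or loads it after evicting
// idle models in least-recently-used order until it fits.
func (s *Scheduler) Acquire(name string, size uint64) error {
	s.clock++
	if m, ok := s.models[name]; ok {
		m.refs++
		m.lastUsed = s.clock
		return nil
	}
	if size > s.budget {
		return ErrNoMemory
	}
	for s.used+size > s.budget {
		victim := s.idleLRU()
		if victim == "" {
			return ErrNoMemory // everything still resident is busy
		}
		s.used -= s.models[victim].size
		delete(s.models, victim)
	}
	s.models[name] = &loaded{size: size, refs: 1, lastUsed: s.clock}
	s.used += size
	return nil
}

// Release marks one request on the model as finished.
func (s *Scheduler) Release(name string) {
	if m, ok := s.models[name]; ok && m.refs > 0 {
		m.refs--
	}
}

// idleLRU returns the least recently used model with no active requests, or "".
func (s *Scheduler) idleLRU() string {
	victim := ""
	for name, m := range s.models {
		if m.refs == 0 && (victim == "" || m.lastUsed < s.models[victim].lastUsed) {
			victim = name
		}
	}
	return victim
}

// Loaded reports whether the model is currently resident.
func (s *Scheduler) Loaded(name string) bool {
	_, ok := s.models[name]
	return ok
}

func main() {
	s := NewScheduler(10)
	_ = s.Acquire("a", 6)
	s.Release("a") // "a" is now idle and eligible for eviction
	_ = s.Acquire("b", 4)
	err := s.Acquire("c", 5) // does not fit until "a" is evicted
	fmt.Println(err, s.Loaded("a"), s.Loaded("b"), s.Loaded("c"))
}
```

Reference counting is what allows one loaded instance to be shared across concurrent requests while still protecting busy models from eviction.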

Streaming Responses

Inference responses are streamed token-by-token using chunked HTTP transfer encoding or server-sent events. This allows clients to begin processing output before generation is complete, reducing perceived latency. The streaming infrastructure handles backpressure from slow clients and releases connection resources when a client disconnects or generation is cancelled.
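A minimal sketch of token-by-token streaming with Go's `http.Flusher` follows. It assumes newline-delimited JSON as the wire format; the `chunk` struct is a simplified, hypothetical response shape, and `streamTokens` and `collect` are illustrative helpers that replace real generation with a fixed token list.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// chunk is a simplified, hypothetical shape for one streamed message.
type chunk struct {
	Response string `json:"response"`
	Done     bool   `json:"done"`
}

// streamTokens writes one JSON object per line and flushes after each,
// so the client receives every token as soon as it is produced.
func streamTokens(tokens []string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/x-ndjson")
		enc := json.NewEncoder(w) // Encode appends a newline after each value
		for _, tok := range tokens {
			// Stop early if the client disconnected or cancelled the request.
			select {
			case <-r.Context().Done():
				return
			default:
			}
			if err := enc.Encode(chunk{Response: tok}); err != nil {
				return // the write failed, most likely a closed connection
			}
			flusher.Flush()
		}
		_ = enc.Encode(chunk{Done: true})
		flusher.Flush()
	}
}

// collect runs the handler against an in-memory recorder and returns the body.
func collect(tokens []string) string {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/stream", nil)
	streamTokens(tokens).ServeHTTP(rec, req)
	return rec.Body.String()
}

func main() {
	fmt.Print(collect([]string{"Hel", "lo"}))
}
```

Checking the request context between tokens is what lets generation stop promptly after a disconnect instead of running to completion for a client that is no longer listening.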

Server Initialization

At startup, the server initializes the model store, discovers available hardware (GPUs, Apple Silicon), configures memory limits, starts the scheduler, and begins listening for HTTP connections. Environment variables and command-line flags control binding address, allowed origins, model storage path, and GPU configuration.
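Environment-driven configuration can be sketched as below. The variable names `OLLAMA_HOST`, `OLLAMA_ORIGINS`, and `OLLAMA_MODELS` and the default address `127.0.0.1:11434` follow Ollama's public documentation; the `Config` struct, `LoadConfig`, and the simplified parsing are assumptions for illustration and do not mirror the server's actual configuration code.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Config is a hypothetical holder for a few startup settings.
type Config struct {
	Host      string   // address to bind, host:port
	Origins   []string // additional allowed CORS origins
	ModelsDir string   // model storage path; empty means "use the built-in default"
}

// LoadConfig reads settings through getenv, which makes it easy to test
// with a fake environment. Pass os.Getenv in production code.
func LoadConfig(getenv func(string) string) Config {
	cfg := Config{
		Host:      "127.0.0.1:11434",
		ModelsDir: getenv("OLLAMA_MODELS"),
	}
	if h := getenv("OLLAMA_HOST"); h != "" {
		cfg.Host = h
	}
	// Treat OLLAMA_ORIGINS as a comma-separated list, ignoring blanks.
	for _, o := range strings.Split(getenv("OLLAMA_ORIGINS"), ",") {
		if o = strings.TrimSpace(o); o != "" {
			cfg.Origins = append(cfg.Origins, o)
		}
	}
	return cfg
}

func main() {
	cfg := LoadConfig(os.Getenv)
	fmt.Printf("bind=%s origins=%v models=%q\n", cfg.Host, cfg.Origins, cfg.ModelsDir)
}
```

Injecting the lookup function keeps the parsing logic free of global state, so defaults and overrides can be checked without touching the process environment.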

Implementation Notes

Route registration is in server/routes.go with the main server initialization in the server/ package. The scheduler implementation is in server/sched.go. The OpenAI-compatible API layer is in openai/ and registers its own routes through server/. Middleware for authentication is in server/auth.go, and the general middleware infrastructure is in middleware/. Streaming response delivery uses Go's http.Flusher interface for chunked encoding.
