Principle:Ollama Ollama ServerArchitecture
| Knowledge Sources | |
|---|---|
| Domains | Server, Architecture |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
The Ollama Server Architecture defines the HTTP server infrastructure: request routing, middleware chains, model scheduling, and streaming response delivery. Together, these components provide the runtime environment for serving LLM inference requests.
Core Concepts
HTTP Routing
The server exposes a REST API with routes for model management (create, pull, push, delete, list, show), inference (generate, chat), and compatibility layers (OpenAI-compatible endpoints). Routes are registered in a central routing table that maps HTTP method and path combinations to handler functions. Each route can have its own middleware stack for authentication, request validation, and format translation.
Middleware Architecture
Middleware functions intercept requests before they reach handlers and responses before they are sent to clients. The middleware chain handles cross-cutting concerns including authentication verification, request logging, CORS headers, content negotiation, and API format translation (e.g., translating OpenAI-format requests to Ollama's internal format). Middleware is composable, allowing different endpoint groups to use different middleware stacks.
Model Scheduler
The scheduler manages the lifecycle of loaded models, determining when to load, unload, or share model instances across concurrent requests. It tracks GPU and CPU memory usage, enforces concurrency limits, and implements eviction policies when memory pressure requires unloading models. The scheduler provides a request queue that ensures fair access to model instances and prevents resource exhaustion.
Streaming Responses
Inference responses are streamed token-by-token using chunked HTTP transfer encoding or server-sent events. This allows clients to begin processing output before generation is complete, reducing perceived latency. The streaming infrastructure handles backpressure from slow clients and releases connection resources when a client disconnects or generation is cancelled.
Server Initialization
At startup, the server initializes the model store, discovers available hardware (GPUs, Apple Silicon), configures memory limits, starts the scheduler, and begins listening for HTTP connections. Environment variables and command-line flags control binding address, allowed origins, model storage path, and GPU configuration.
Implementation Notes
Route registration is in server/routes.go, and the main server initialization is in the server/ package. The scheduler implementation is in server/sched.go. The OpenAI-compatible API layer is in openai/ and registers its own routes through server/. Middleware for authentication is in server/auth.go, and the general middleware infrastructure is in middleware/. Streaming response delivery uses Go's http.Flusher interface for chunked encoding.