
Principle:Ollama Inference Pipeline

From Leeroopedia
Domains: Inference, Pipeline
Last updated: 2025-02-15 00:00 GMT

Overview

The Inference Pipeline is the end-to-end flow that transforms a user's natural language request into a generated response. It encompasses request validation, model loading, prompt construction, tokenization, forward pass execution, token sampling, detokenization, and response streaming, orchestrating these stages into a cohesive system.

Core Concepts

Request Processing

The inference pipeline begins with request processing: receiving an HTTP or RPC request, validating the input parameters (model name, messages, temperature, max tokens, stop sequences), and resolving the model to a loaded instance. This stage handles model loading on demand (pulling from registry if necessary, loading weights into memory, allocating GPU resources) or routing to an already-loaded model instance managed by a scheduler. Request processing also handles request queuing and concurrency control when multiple requests compete for limited GPU resources.
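The validation and concurrency-control steps above can be sketched as follows. This is a minimal illustration, not Ollama's actual handler code; the parameter bounds and the two-slot semaphore are arbitrary choices for the example.

```python
import threading

# Hypothetical parameter bounds for illustration; real servers define their own.
VALID_PARAM_RANGES = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
}

def validate_request(req: dict, available_models: set) -> list:
    """Return a list of validation errors (empty means the request is valid)."""
    errors = []
    if req.get("model") not in available_models:
        errors.append(f"unknown model: {req.get('model')!r}")
    if not req.get("messages"):
        errors.append("messages must be a non-empty list")
    for name, (lo, hi) in VALID_PARAM_RANGES.items():
        if name in req and not (lo <= req[name] <= hi):
            errors.append(f"{name} must be in [{lo}, {hi}]")
    return errors

# Concurrency control: a semaphore caps how many requests run on the GPU at
# once; excess requests queue at the acquire until a slot frees up.
gpu_slots = threading.Semaphore(2)

def run_inference(req: dict) -> str:
    with gpu_slots:
        return f"response for {req['model']}"
```

A real scheduler would also decide whether the requested model is already resident or must be loaded first; this sketch only shows the admission side.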

Prompt Construction

Once the model is resolved, the conversation messages are formatted into the raw text prompt expected by the model. This involves applying the model's chat template to the structured message list, inserting system prompts, formatting tool schemas, and handling multimodal inputs (images encoded as base64 or file references). The constructed prompt must respect the model's maximum context length, potentially truncating older messages to fit within the token budget while preserving the most recent context.
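A minimal sketch of templating plus oldest-first truncation follows. The `<|role|>` markers are a generic stand-in, not any specific model's real chat template, and the word-count tokenizer is a deliberately crude placeholder for a real one.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: ~1 token per whitespace-split word.
    return len(text.split())

def build_prompt(system: str, messages: list, max_tokens: int) -> str:
    """Format messages with a simple template, dropping the oldest turns
    until the prompt fits the token budget."""
    kept = list(messages)
    while kept:
        parts = [f"<|system|>\n{system}"]
        for m in kept:
            parts.append(f"<|{m['role']}|>\n{m['content']}")
        parts.append("<|assistant|>\n")
        prompt = "\n".join(parts)
        if count_tokens(prompt) <= max_tokens:
            return prompt
        kept.pop(0)  # truncate the oldest message first, keep recent context
    raise ValueError("system prompt alone exceeds the context window")
```

Note the system prompt is pinned: only conversation turns are eligible for truncation, matching the goal of preserving the most recent context.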

Tokenization and Batching

The text prompt is converted to token IDs using the model's tokenizer (typically a BPE, SentencePiece, or Unigram tokenizer loaded from the model's vocabulary). For the initial prompt (prefill stage), all tokens are batched together for parallel processing. For subsequent generation steps (decode stage), tokens are processed one at a time. The batching strategy significantly impacts throughput: continuous batching allows multiple requests to share a single forward pass, amortizing the overhead of GPU kernel launches.
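The decode-stage batching idea can be illustrated with a toy continuous-batching loop. The `fake_forward` function is a stand-in for the model (it just counts each token down to 0, which acts as EOS); the point is the batch shape: every step runs one shared pass over all active sequences, and finished sequences leave the batch immediately rather than holding a slot until the longest request completes.

```python
def fake_forward(batch_tokens: list) -> list:
    # Stand-in for the model: "predicts" the next integer token, stopping at 0.
    return [max(t - 1, 0) for t in batch_tokens]

def continuous_batch_decode(prompts: dict) -> dict:
    """prompts maps request-id -> last prefill token; returns generated tokens."""
    active = dict(prompts)                 # requests currently in the batch
    outputs = {rid: [] for rid in prompts}
    while active:
        rids = list(active)
        # One shared forward pass for every active sequence this step.
        next_tokens = fake_forward([active[r] for r in rids])
        for rid, tok in zip(rids, next_tokens):
            outputs[rid].append(tok)
            if tok == 0:                   # stop condition: EOS token
                del active[rid]            # sequence exits; batch shrinks in place
            else:
                active[rid] = tok
    return outputs
```

In a real server, new requests would also join `active` mid-loop after their prefill pass, which is what amortizes kernel-launch overhead across requests.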

Forward Pass and Sampling

The tokenized input is fed through the model's neural network (the forward pass) to produce logit scores for each token in the vocabulary. The sampling stage converts these logits into a probability distribution and selects the next token. Sampling strategies include greedy (argmax), temperature-scaled random sampling, top-k filtering (considering only the k highest-probability tokens), top-p/nucleus sampling (considering the smallest set of tokens whose cumulative probability exceeds p), and min-p sampling. Repetition penalties and frequency penalties may also be applied to reduce degenerate repetitive outputs.
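The chained sampling strategies above can be sketched in a single function. This is an illustrative implementation operating on plain Python lists, not the backend's actual sampler; applying temperature, then top-k, then top-p in that order mirrors one common convention, though real samplers vary in ordering and add repetition penalties before this stage.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Pick a token id from raw logits via temperature, top-k, then top-p."""
    rng = rng or random.Random()
    if temperature == 0:                          # greedy: plain argmax
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                               # softmax (shifted for stability)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [(i, e / total) for i, e in enumerate(exps)]
    probs.sort(key=lambda ip: ip[1], reverse=True)
    if top_k > 0:
        probs = probs[:top_k]                     # keep only k most likely tokens
    if top_p < 1.0:                               # nucleus: smallest set with mass >= p
        kept, cum = [], 0.0
        for i, p in probs:
            kept.append((i, p))
            cum += p
            if cum >= top_p:
                break
        probs = kept
    total = sum(p for _, p in probs)              # renormalize the surviving set
    r = rng.random() * total
    for i, p in probs:
        r -= p
        if r <= 0:
            return i
    return probs[-1][0]
```

Setting `temperature=0` recovers greedy decoding, and `top_k=1` makes the random draw deterministic regardless of temperature.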

Streaming Response

Rather than waiting for the entire response to be generated, the pipeline streams tokens to the client as they are produced. Each generated token is detokenized (converted back to text) and sent as a chunk in a streaming response (typically Server-Sent Events or newline-delimited JSON). Streaming provides lower perceived latency since the user sees text appearing incrementally. The streaming loop continues until a stop condition is met: generating an end-of-sequence token, reaching the maximum token limit, matching a stop sequence, or receiving a cancellation signal.
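The streaming loop and its stop conditions can be sketched as a generator yielding newline-delimited JSON. The chunk fields and the `"<eos>"` marker are illustrative, not a documented wire format; a production server would also handle stop sequences that span token boundaries by holding back partial matches.

```python
import json

def stream_response(tokens, stop_sequences=(), max_tokens=256):
    """Yield NDJSON chunks, ending on EOS, a stop sequence, or the token limit."""
    generated = ""
    for n, tok in enumerate(tokens):
        if n >= max_tokens:
            break                                  # stop: maximum token limit
        if tok == "<eos>":
            break                                  # stop: end-of-sequence token
        generated += tok
        if any(generated.endswith(s) for s in stop_sequences):
            break                                  # stop: matched a stop sequence
        yield json.dumps({"response": tok, "done": False}) + "\n"
    yield json.dumps({"response": "", "done": True}) + "\n"  # final chunk
```

Because each chunk is flushed as soon as its token is sampled, the client sees text incrementally even though total generation time is unchanged.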

Context Management

Across multi-turn conversations, the pipeline must manage the growing context efficiently. This includes maintaining KV cache state between turns to avoid recomputing attention for previously processed tokens, detecting shared prefixes between consecutive requests from the same conversation, and evicting cached state for idle conversations to free memory for new requests. Efficient context management is critical for interactive chat applications where each turn builds on the previous conversation history.
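Prefix reuse and eviction can be sketched with a small LRU cache keyed by conversation. Here plain token lists stand in for real KV-cache state, and the class and its two-entry capacity are hypothetical; the sketch only shows how a shared prefix lets the pipeline skip recomputation for the tokens it has already seen.

```python
def common_prefix_len(cached, new):
    """Number of leading tokens shared between the cached and new sequences."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

class ConversationCache:
    """Per-conversation token state with least-recently-used eviction."""
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.entries = {}   # conv_id -> token list (stands in for KV state)
        self.order = []     # LRU order, oldest first

    def lookup(self, conv_id, new_tokens):
        """Return how many leading tokens can skip recomputation."""
        cached = self.entries.get(conv_id, [])
        reused = common_prefix_len(cached, new_tokens)
        self.entries[conv_id] = list(new_tokens)
        if conv_id in self.order:
            self.order.remove(conv_id)
        self.order.append(conv_id)
        while len(self.entries) > self.capacity:   # evict idle conversations
            victim = self.order.pop(0)
            del self.entries[victim]
        return reused
```

In a chat session, each turn's prompt extends the previous one, so `lookup` typically reuses the entire prior context and only the new turn's tokens need a prefill pass.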

Implementation Notes

In the Ollama codebase, the inference pipeline is orchestrated by a handler layer that receives API requests, resolves models via the scheduler, applies chat templates, and invokes the inference backend. The scheduler manages model loading/unloading and GPU memory allocation, keeping hot models loaded for fast subsequent requests. The inference backend (llama.cpp via CGo) handles tokenization, KV cache management, forward pass execution, and token sampling. Responses are streamed back through the handler as newline-delimited JSON chunks. The pipeline supports both chat completions and raw text completions, with the chat path applying template formatting and the raw path passing text directly to the tokenizer.
