Principle: Ollama LLM Inference Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Inference, llama.cpp |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
The LLM Inference Pipeline at the native library level encompasses the low-level operations of context management, batch construction, and decode execution within a C/C++ inference engine. This pipeline operates below the application-level orchestration, directly managing GPU resources, token buffers, and the iterative decode loop that produces one token per step.
Core Concepts
Inference Context
An inference context (often called llama_context in llama.cpp) is the runtime state container for a model inference session. It holds references to the loaded model weights, the KV cache, the compute graph allocator, the backend (CPU/GPU) configuration, and runtime parameters such as the number of threads and batch size. Creating a context allocates memory for the KV cache and compute buffers on the appropriate devices. A single model can have multiple contexts for concurrent inference, each with its own independent KV cache and state.
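The relationship between a shared model and its independent contexts can be sketched as follows. This is an illustrative Python model of the concept, not the llama.cpp API; the class and field names are hypothetical.

```python
class Model:
    """Shared, read-only weights: loaded once, referenced by many contexts."""
    def __init__(self, n_layers, n_heads, head_dim):
        self.n_layers, self.n_heads, self.head_dim = n_layers, n_heads, head_dim

class Context:
    """Per-session runtime state: KV cache plus thread/batch configuration."""
    def __init__(self, model, n_ctx, n_batch, n_threads):
        self.model = model          # reference to shared weights
        self.n_ctx = n_ctx          # max tokens the KV cache can hold
        self.n_batch = n_batch      # max tokens per forward pass
        self.n_threads = n_threads
        self.kv_cache = []          # filled as tokens are decoded

model = Model(n_layers=32, n_heads=32, head_dim=128)
ctx_a = Context(model, n_ctx=4096, n_batch=512, n_threads=8)
ctx_b = Context(model, n_ctx=2048, n_batch=256, n_threads=4)

ctx_a.kv_cache.append("token-0 projections")  # advancing one context...
assert ctx_b.kv_cache == []                   # ...leaves the other untouched
```

The key property is that the weights are referenced, not copied: contexts are cheap relative to the model, so concurrency scales with KV cache memory rather than model size.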
Batch Construction
A batch (llama_batch) is the unit of work submitted to the inference engine for a single forward pass. It contains an array of token IDs, their corresponding sequence positions, sequence IDs (for multi-sequence batching), and flags indicating which tokens require logit output (typically only the last token in each sequence during generation). Batch construction is critical for performance: during the prefill phase, the batch contains all prompt tokens; during the decode phase, it contains one token per active sequence. The batch structure enables continuous batching, where tokens from multiple independent sequences are processed in a single GPU kernel invocation.
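The difference between a prefill batch and a decode batch can be made concrete with a small sketch. This is a simplified Python stand-in for the llama_batch structure (field names and helper functions are hypothetical), showing how the logits flags differ between the two phases.

```python
from dataclasses import dataclass

@dataclass
class Batch:
    tokens: list   # token IDs
    pos: list      # position of each token within its sequence
    seq_id: list   # which sequence each token belongs to
    logits: list   # True where logit output is required

def prefill_batch(prompt_tokens, seq=0):
    """All prompt tokens at once; logits needed only for the last one."""
    n = len(prompt_tokens)
    return Batch(
        tokens=list(prompt_tokens),
        pos=list(range(n)),
        seq_id=[seq] * n,
        logits=[i == n - 1 for i in range(n)],
    )

def decode_batch(next_tokens):
    """next_tokens: {seq_id: (token, position)} — one token per active
    sequence, enabling continuous batching across independent requests."""
    items = sorted(next_tokens.items())
    return Batch(
        tokens=[tok for _, (tok, _) in items],
        pos=[p for _, (_, p) in items],
        seq_id=[s for s, _ in items],
        logits=[True] * len(items),  # every sequence samples from its own logits
    )

pb = prefill_batch([101, 2003, 1996, 3007])
db = decode_batch({0: (5, 4), 1: (9, 7)})
```

Skipping logit computation for all but the last prefill token avoids materializing a vocabulary-sized output for every prompt position.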
Prefill and Decode Phases
LLM inference has two distinct computational phases. The prefill phase processes all input tokens in parallel, populating the KV cache with key-value projections for the entire prompt. This phase is compute-bound and benefits from large batch sizes. The decode phase generates tokens one at a time (per sequence), appending each new token's KV projections to the cache and producing logits for sampling. This phase is memory-bandwidth-bound because each step reads the entire KV cache but produces only one new token. Optimizing each phase requires different strategies: prefill benefits from operation fusion and large matrix multiplications, while decode benefits from memory access optimization and KV cache compression.
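A back-of-envelope calculation illustrates why decode is memory-bandwidth-bound. The shapes below are assumed (roughly a 7B-class model with fp16 KV entries), not taken from the text.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Total KV cache size: 2x for keys and values, fp16 elements assumed."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Each decode step re-reads the entire cache to produce a single new token:
step_read = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(step_read / 2**30, "GiB read per generated token")  # 2.0 GiB at 4K context
```

At these assumed shapes, every generated token costs about 2 GiB of KV cache traffic on top of the weight reads, which is why decode throughput tracks memory bandwidth while prefill (many tokens amortizing each read) tracks compute.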
Token Decode Loop
The core generation loop repeatedly constructs a single-token batch, submits it to the engine for a forward pass (decode), retrieves the output logits, applies the sampling strategy to select the next token, and checks for stop conditions. Each iteration through this loop produces one token and takes time proportional to the model size and current context length (due to KV cache reads). The loop continues until an end-of-sequence token is sampled, the maximum generation length is reached, a stop sequence is detected, or the request is cancelled.
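The loop described above can be sketched with a stubbed forward pass. The `forward` and `sample` stubs and the token IDs here are placeholders for the real engine calls, not actual API.

```python
EOS = 2  # assumed end-of-sequence token ID

def decode_loop(forward, sample, prompt, max_new_tokens, stop_seq=None):
    """forward: tokens -> logits; sample: logits -> token ID."""
    tokens = list(prompt)
    generated = []
    for _ in range(max_new_tokens):        # max generation length
        logits = forward(tokens)           # single-token batch in the real engine
        tok = sample(logits)
        if tok == EOS:                     # end-of-sequence sampled: stop, don't emit
            break
        generated.append(tok)
        tokens.append(tok)                 # next step's KV cache grows by one
        if stop_seq and generated[-len(stop_seq):] == stop_seq:
            break                          # stop sequence detected
    return generated

# Stub "model": always predicts last token + 1, which sample maps to EOS past 7.
def forward(tokens): return tokens[-1] + 1
def sample(logits): return logits if logits <= 7 else EOS

print(decode_loop(forward, sample, prompt=[3], max_new_tokens=100))  # [4, 5, 6, 7]
```

Note that each iteration's cost grows with context length: the forward pass reads a KV cache that is one entry longer than on the previous step.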
Logit Extraction and Sampling
After each forward pass, the engine produces a logit vector of size equal to the model's vocabulary (typically 32K-128K entries). The sampling pipeline transforms these raw logits into a token selection through a series of steps: applying temperature scaling, top-k filtering, top-p (nucleus) filtering, min-p filtering, repetition penalty, frequency penalty, presence penalty, and finally sampling from the resulting probability distribution. The sampled token ID is then converted back to text using the tokenizer's decode function. Some implementations support grammar-constrained sampling that masks out tokens that would violate a specified grammar.
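The logits-to-token chain can be sketched with a simplified pipeline covering temperature, top-k, and top-p. This is an illustrative reimplementation, not llama.cpp's sampler code, and it omits the penalty steps for brevity.

```python
import math, random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random.random):
    cand = list(enumerate(logits))                  # (token_id, logit) pairs
    cand.sort(key=lambda c: c[1], reverse=True)
    if temperature == 0:
        return cand[0][0]                           # greedy: argmax
    cand = [(t, l / temperature) for t, l in cand]  # temperature scaling
    if top_k > 0:
        cand = cand[:top_k]                         # keep k most likely tokens
    probs = softmax([l for _, l in cand])
    if top_p < 1.0:                                 # nucleus: smallest prefix with mass >= top_p
        cum, cut = 0.0, len(cand)
        for i, p in enumerate(probs):
            cum += p
            if cum >= top_p:
                cut = i + 1
                break
        cand, probs = cand[:cut], probs[:cut]
        total = sum(probs)
        probs = [p / total for p in probs]          # renormalize the survivors
    r, cum = rng(), 0.0                             # multinomial draw
    for (tok, _), p in zip(cand, probs):
        cum += p
        if r <= cum:
            return tok
    return cand[-1][0]
```

Order matters in real samplers: applying top-k before top-p (as here) yields different distributions than the reverse, which is why sampler chains expose the stages as configurable, ordered steps.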
Implementation Notes
In the Ollama codebase, the native inference pipeline is implemented through llama.cpp's context and batch APIs, accessed via the CGo bridge. The Go layer creates a llama_context with configured parameters (context size, batch size, number of GPU layers, thread count), constructs llama_batch objects for prefill and decode operations, and calls llama_decode to execute forward passes. Token sampling uses llama.cpp's sampling chain, which is configured with temperature, top-k, top-p, min-p, and repetition penalty parameters from the API request. The decode loop runs in a goroutine, streaming tokens through a channel to the HTTP response handler. Batch sizes during prefill are chunked to avoid exceeding GPU memory limits for very long prompts.