
Principle:Ollama LLM Inference Pipeline

From Leeroopedia
Knowledge Sources
Domains: Inference, llama.cpp
Last Updated: 2025-02-15 00:00 GMT

Overview

The LLM Inference Pipeline at the native library level encompasses the low-level operations of context management, batch construction, and decode execution within a C/C++ inference engine. This pipeline operates below the application-level orchestration, directly managing GPU resources, token buffers, and the iterative decode loop that produces one token per step.

Core Concepts

Inference Context

An inference context (often called llama_context in llama.cpp) is the runtime state container for a model inference session. It holds references to the loaded model weights, the KV cache, the compute graph allocator, the backend (CPU/GPU) configuration, and runtime parameters such as the number of threads and batch size. Creating a context allocates memory for the KV cache and compute buffers on the appropriate devices. A single model can have multiple contexts for concurrent inference, each with its own independent KV cache and state.
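The ownership model described above can be sketched in Go. This is a toy stand-in, not the real C API: the struct and field names (ContextParams, NCtx, nPast, and so on) are illustrative analogues of the knobs llama.cpp exposes when a context is created, and the "KV cache" is reduced to a position counter to keep the example self-contained.

```go
package main

import "fmt"

// ContextParams mirrors, in spirit, the parameters supplied when a
// context is created (context size, batch size, threads, GPU layers).
// The names are illustrative, not llama.cpp's actual C API.
type ContextParams struct {
	NCtx       int // max tokens the KV cache can hold
	NBatch     int // max tokens per forward-pass call
	NThreads   int // CPU threads for the compute graph
	NGPULayers int // layers offloaded to the GPU backend
}

type Model struct{ Name string } // stand-in for shared, read-only weights

// Context is a toy analogue of the runtime state a context owns: a
// reference to the shared model plus private per-session state.
type Context struct {
	Params ContextParams
	model  *Model
	nPast  int // tokens currently in this context's KV cache
}

// NewContext allocates per-context state; many contexts can share one
// model, each with an independent KV cache (here, just nPast).
func NewContext(m *Model, p ContextParams) *Context {
	return &Context{Params: p, model: m}
}

func main() {
	m := &Model{Name: "llama-7b"}
	a := NewContext(m, ContextParams{NCtx: 4096, NBatch: 512, NThreads: 8})
	b := NewContext(m, ContextParams{NCtx: 2048, NBatch: 256, NThreads: 4})
	a.nPast = 100 // advancing one context's cache...
	fmt.Println(a.nPast, b.nPast) // ...leaves the other untouched: 100 0
}
```

The key property the sketch demonstrates is that weights are shared while decode state is not, which is what makes concurrent sessions over one loaded model possible.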

Batch Construction

A batch (llama_batch) is the unit of work submitted to the inference engine for a single forward pass. It contains an array of token IDs, their corresponding sequence positions, sequence IDs (for multi-sequence batching), and flags indicating which tokens require logit output (typically only the last token in each sequence during generation). Batch construction is critical for performance: during the prefill phase, the batch contains all prompt tokens; during the decode phase, it contains one token per active sequence. The batch structure enables continuous batching, where tokens from multiple independent sequences are processed in a single GPU kernel invocation.
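A minimal sketch of the two batch shapes, assuming a simplified struct of parallel slices in place of llama_batch's C arrays. Note the logits flags: prefill requests logits only for the final prompt token, while a decode batch requests them for every entry, since each active sequence needs to sample its next token.

```go
package main

import "fmt"

// Batch is a toy analogue of llama_batch: parallel arrays of token IDs,
// positions, sequence IDs, and a per-token flag requesting logit output.
type Batch struct {
	Token  []int
	Pos    []int
	SeqID  []int
	Logits []bool
}

// PrefillBatch packs an entire prompt for one sequence; only the last
// token asks for logits, since intermediate logits are never sampled.
func PrefillBatch(prompt []int, seq int) Batch {
	var b Batch
	for i, tok := range prompt {
		b.Token = append(b.Token, tok)
		b.Pos = append(b.Pos, i)
		b.SeqID = append(b.SeqID, seq)
		b.Logits = append(b.Logits, i == len(prompt)-1)
	}
	return b
}

// DecodeBatch packs one new token per active sequence (continuous
// batching): tokens from independent requests share one forward pass.
// next maps sequence ID -> token to feed; pos maps sequence ID -> its
// current position.
func DecodeBatch(next, pos map[int]int) Batch {
	var b Batch
	for seq, tok := range next {
		b.Token = append(b.Token, tok)
		b.Pos = append(b.Pos, pos[seq])
		b.SeqID = append(b.SeqID, seq)
		b.Logits = append(b.Logits, true)
	}
	return b
}

func main() {
	p := PrefillBatch([]int{11, 22, 33, 44}, 0)
	fmt.Println(len(p.Token), p.Logits) // 4 [false false false true]
}
```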

Prefill and Decode Phases

LLM inference has two distinct computational phases. The prefill phase processes all input tokens in parallel, populating the KV cache with key-value projections for the entire prompt. This phase is compute-bound and benefits from large batch sizes. The decode phase generates tokens one at a time (per sequence), appending each new token's KV projections to the cache and producing logits for sampling. This phase is memory-bandwidth-bound because each step reads the entire KV cache but produces only one new token. Optimizing each phase requires different strategies: prefill benefits from operation fusion and large matrix multiplications, while decode benefits from memory access optimization and KV cache compression.
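The memory-bandwidth bound on decode can be made concrete with a back-of-envelope calculation: every step must read keys and values for every layer, KV head, head dimension, and cached position. The model dimensions below are illustrative (Llama-2-7B-like: 32 layers, 32 KV heads, head dimension 128) with an f16 cache.

```go
package main

import "fmt"

// KVCacheBytes estimates the bytes a single decode step must read from
// the KV cache: keys and values (factor 2) across every layer, KV head,
// head dimension, and cached token position.
func KVCacheBytes(layers, kvHeads, headDim, ctxLen, bytesPerElem int) int {
	return 2 * layers * kvHeads * headDim * ctxLen * bytesPerElem
}

func main() {
	// Illustrative Llama-2-7B-like shape, f16 (2 bytes/element), 4096 ctx.
	b := KVCacheBytes(32, 32, 128, 4096, 2)
	fmt.Printf("%d bytes (%.1f GiB) read per decode step\n", b, float64(b)/(1<<30))
	// prints: 2147483648 bytes (2.0 GiB) read per decode step
}
```

Reading on the order of gigabytes to emit one token is why decode speed tracks memory bandwidth rather than FLOPs, and why KV cache quantization directly improves decode throughput.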

Token Decode Loop

The core generation loop repeatedly constructs a single-token batch, submits it to the engine for a forward pass (decode), retrieves the output logits, applies the sampling strategy to select the next token, and checks for stop conditions. Each iteration through this loop produces one token and takes time proportional to the model size and current context length (due to KV cache reads). The loop continues until an end-of-sequence token is sampled, the maximum generation length is reached, a stop sequence is detected, or the request is cancelled.
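The loop structure can be sketched as follows. The forward function stands in for the real engine's decode-plus-sampling path (in llama.cpp terms, llama_decode followed by logit retrieval and sampling); here it is a caller-supplied mock so the control flow, the position bookkeeping, and the stop conditions are visible on their own.

```go
package main

import "fmt"

// Generate runs the core decode loop: feed the current token at the
// current position, receive the next sampled token, check stop
// conditions, repeat. forward is a stand-in for a real forward pass
// plus sampling; eos is the end-of-sequence token ID.
func Generate(prompt []int, forward func(tok, pos int) int, eos, maxNew int) []int {
	out := []int{}
	pos := len(prompt)            // prefill already consumed positions 0..len-1
	cur := prompt[len(prompt)-1]  // last prompt token seeds the first step
	for len(out) < maxNew {       // stop condition: max generation length
		next := forward(cur, pos) // one forward pass -> one sampled token
		if next == eos {          // stop condition: end-of-sequence
			break
		}
		out = append(out, next)
		cur, pos = next, pos+1
	}
	return out
}

func main() {
	// Mock "model": always emits tok+1, so EOS id 5 is reached from 4.
	mock := func(tok, pos int) int { return tok + 1 }
	fmt.Println(Generate([]int{1, 2}, mock, 5, 10)) // [3 4]
}
```

Stop-sequence detection and cancellation are omitted for brevity; in a real engine both are additional exits from the same loop.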

Logit Extraction and Sampling

After each forward pass, the engine produces a logit vector of size equal to the model's vocabulary (typically 32K-128K entries). The sampling pipeline transforms these raw logits into a token selection through a series of steps: applying temperature scaling, top-k filtering, top-p (nucleus) filtering, min-p filtering, repetition penalty, frequency penalty, presence penalty, and finally sampling from the resulting probability distribution. The sampled token ID is then converted back to text using the tokenizer's decode function. Some implementations support grammar-constrained sampling that masks out tokens that would violate a specified grammar.
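A self-contained sketch of the core of that pipeline: temperature scaling, softmax, top-k, then top-p, then inverse-CDF sampling from the survivors. The repetition, frequency, and presence penalties are omitted to keep it short; the function takes the random draw u as a parameter so its behavior is deterministic and testable.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// SampleTopKTopP applies temperature scaling, top-k, and top-p (nucleus)
// filtering to raw logits, then samples from the surviving distribution
// using u in [0,1). Pass a random u in practice; a fixed u is deterministic.
func SampleTopKTopP(logits []float64, temp float64, topK int, topP, u float64) int {
	type cand struct {
		id int
		p  float64
	}
	// Temperature + softmax (subtract the max for numerical stability).
	maxL := math.Inf(-1)
	for _, l := range logits {
		if l/temp > maxL {
			maxL = l / temp
		}
	}
	cands := make([]cand, len(logits))
	sum := 0.0
	for i, l := range logits {
		p := math.Exp(l/temp - maxL)
		cands[i] = cand{i, p}
		sum += p
	}
	for i := range cands {
		cands[i].p /= sum
	}
	// Top-k: keep only the k most probable tokens.
	sort.Slice(cands, func(a, b int) bool { return cands[a].p > cands[b].p })
	if topK > 0 && topK < len(cands) {
		cands = cands[:topK]
	}
	// Top-p: keep the smallest prefix whose cumulative mass reaches topP.
	cum, cut := 0.0, len(cands)
	for i, c := range cands {
		cum += c.p
		if cum >= topP {
			cut = i + 1
			break
		}
	}
	cands = cands[:cut]
	// Renormalize the survivors and sample by inverse CDF.
	total := 0.0
	for _, c := range cands {
		total += c.p
	}
	acc := 0.0
	for _, c := range cands {
		acc += c.p / total
		if u < acc {
			return c.id
		}
	}
	return cands[len(cands)-1].id
}

func main() {
	logits := []float64{2.0, 1.0, 0.1, -1.0}
	// With u=0 the most probable surviving token is always chosen.
	fmt.Println(SampleTopKTopP(logits, 0.8, 2, 0.95, 0)) // 0
}
```

Grammar-constrained sampling slots in before the final draw: it sets the probability of any grammar-violating token to zero before renormalization.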

Implementation Notes

In the Ollama codebase, the native inference pipeline is implemented through llama.cpp's context and batch APIs, accessed via the CGo bridge. The Go layer creates a llama_context with configured parameters (context size, batch size, number of GPU layers, thread count), constructs llama_batch objects for prefill and decode operations, and calls llama_decode to execute forward passes. Token sampling uses llama.cpp's sampling chain, which is configured with temperature, top-k, top-p, min-p, and repetition penalty parameters from the API request. The decode loop runs in a goroutine, streaming tokens through a channel to the HTTP response handler. Batch sizes during prefill are chunked to avoid exceeding GPU memory limits for very long prompts.
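The prefill chunking mentioned above reduces to slicing the prompt into pieces no larger than the configured batch size and submitting them in order; the KV cache accumulates across chunks, so positions must continue rather than restart. A minimal sketch of the slicing step (the batch-size value is illustrative):

```go
package main

import "fmt"

// ChunkPrompt splits a long prompt into batch-sized chunks so that each
// forward-pass call stays within the configured batch limit. Positions
// continue across chunks, so the KV cache sees one contiguous sequence.
func ChunkPrompt(prompt []int, nBatch int) [][]int {
	var chunks [][]int
	for start := 0; start < len(prompt); start += nBatch {
		end := start + nBatch
		if end > len(prompt) {
			end = len(prompt)
		}
		chunks = append(chunks, prompt[start:end])
	}
	return chunks
}

func main() {
	prompt := make([]int, 1000)
	chunks := ChunkPrompt(prompt, 512) // illustrative n_batch of 512
	fmt.Println(len(chunks), len(chunks[0]), len(chunks[1])) // 2 512 488
}
```

Only the final chunk's last token needs to request logits, since sampling begins after the whole prompt has been ingested.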
