Principle:Ollama Ollama Compute Graph System
| Knowledge Sources | |
|---|---|
| Domains | ML Framework, Compute Graph |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
A Compute Graph System is a framework for representing and executing machine learning computations as directed acyclic graphs (DAGs), where nodes represent tensor operations and edges represent data dependencies. This abstraction enables automatic memory planning, operation fusion, hardware-specific kernel dispatch, and efficient scheduling of LLM inference workloads.
Core Concepts
Directed Acyclic Graph Representation
In a compute graph, each node represents a mathematical operation (matrix multiplication, attention, layer normalization, activation functions) and each directed edge represents a tensor flowing from one operation's output to another's input. The DAG structure captures the full dependency chain of a neural network forward pass. This declarative representation separates the specification of what to compute from how to compute it, allowing the framework to optimize execution order, memory allocation, and hardware utilization independently of the model definition.
Graph Construction for Transformer Models
Constructing a compute graph for a transformer-based LLM involves building nodes for each architectural component: token embedding lookup, positional encoding (RoPE, ALiBi), multi-head self-attention (Q/K/V projections, scaled dot-product attention, output projection), feed-forward networks (up/down/gate projections with activation functions), layer normalization (RMSNorm, LayerNorm), and the final language model head. The graph is typically built dynamically per forward pass, as the sequence length and batch composition may vary.
Memory Planning and Buffer Allocation
The compute graph framework analyzes the DAG to determine optimal memory allocation for intermediate tensors. Through liveness analysis, the framework identifies when each tensor is first produced and last consumed, enabling memory reuse: buffers for tensors whose lifetimes do not overlap can share the same physical memory. This is critical for LLM inference where activation tensors can consume gigabytes of memory. The framework may also support in-place operations where an operation's output overwrites one of its inputs.
Backend Dispatch
A single compute graph can be executed on different hardware backends (CPU, CUDA, Metal, ROCm, Vulkan) by dispatching each node to the appropriate kernel implementation. The dispatch layer selects optimized kernels based on the operation type, tensor shapes, data types (FP32, FP16, BF16, quantized INT4/INT8), and available hardware. This allows the same model definition to run on different hardware without modification, with the framework handling data transfer between devices when operations span multiple accelerators.
Operation Fusion
Graph-level optimization can fuse sequences of simple operations into single, more efficient composite kernels. Common fusions in LLM inference include fusing layer normalization with the subsequent linear projection, fusing the attention score computation with the softmax, and fusing element-wise operations (bias add, activation function, residual connection) into a single kernel launch. Fusion reduces memory bandwidth requirements by avoiding materialization of intermediate tensors and reduces kernel launch overhead.
Implementation Notes
In the Ollama codebase, the compute graph system is built on GGML (Georgi Gerganov's Machine Learning library), a C-based tensor computation framework. Each LLM forward pass constructs a GGML compute graph by sequentially adding operation nodes for each transformer layer. The graph builder receives the model weights, input token embeddings, KV cache state, and positional information, and outputs a graph that computes logits for the next token. GGML handles memory planning, buffer allocation, and backend dispatch to CPU (with SIMD optimizations), CUDA, Metal, ROCm, or Vulkan kernels depending on the available hardware. The graph construction is defined in C/C++ source files that specify the exact sequence of tensor operations for each supported model architecture.