Principle:Ollama Ollama Model Loading
| Knowledge Sources | |
|---|---|
| Domains | Model Loading, Memory Mapping |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Model Loading is the process of reading GGUF-formatted model files from disk into memory, parsing metadata and tensor descriptors, mapping weight data into addressable memory regions, and initializing the compute backend for inference.
Core Concepts
GGUF File Structure
GGUF (GGML Universal Format) is a binary format that stores model metadata, tensor descriptors, and tensor data in a single file. The header contains a magic number (the ASCII bytes "GGUF"), a format version, the tensor count, and the metadata key-value count, followed by the metadata key-value pairs themselves. Tensor descriptors follow, each specifying a name, shape, data type, and byte offset relative to the start of the tensor data section. The tensor data is stored after the descriptors, padded to the file's alignment boundary, which makes memory-mapped access efficient.
Memory Mapping
Rather than reading entire model files into heap memory, the loader uses memory-mapped I/O (mmap) to map the tensor data region directly into the process's virtual address space. This approach offers several advantages: the operating system manages physical memory allocation on demand, multiple processes can share the same physical pages for the same model, and the total memory footprint is reduced since only actively accessed pages are resident in RAM.
GPU Layer Offloading
During loading, the system decides which model layers are placed in GPU memory and which stay in CPU memory, based on available VRAM, model size, and user configuration. The GPU layers list specifies per-device layer assignments, allowing the model to be split across multiple GPUs or kept partially on the CPU when GPU memory is insufficient for the full model.
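One simple way to picture a placement decision is a greedy pass over the layers. The sketch below is an illustrative assumption, not Ollama's actual policy: the device type, the assignLayers function, and the in-order strategy are all hypothetical, and a real scheduler would also reserve room for scratch buffers and the KV cache.

```go
package main

// device describes one GPU for this sketch; the fields are hypothetical.
type device struct {
	ID        string
	FreeBytes uint64
}

// assignLayers walks the layers in order and fills each device in turn.
// It returns the layer indices held by each device and the indices of the
// layers that did not fit on any device and therefore stay on the CPU.
func assignLayers(layerSizes []uint64, devices []device) (map[string][]int, []int) {
	perDevice := make(map[string][]int)
	free := make([]uint64, len(devices))
	for i, d := range devices {
		free[i] = d.FreeBytes
	}

	var cpu []int
	cur := 0 // index of the device currently being filled
	for layer, size := range layerSizes {
		// Move on to the next device once the current one is full.
		for cur < len(devices) && free[cur] < size {
			cur++
		}
		if cur == len(devices) {
			cpu = append(cpu, layer)
			continue
		}
		free[cur] -= size
		id := devices[cur].ID
		perDevice[id] = append(perDevice[id], layer)
	}
	return perDevice, cpu
}
```

For example, five layers of 100 bytes each with 250 bytes free on one GPU and 150 on another yields layers 0 and 1 on the first GPU, layer 2 on the second, and layers 3 and 4 on the CPU.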
Backend Initialization
After tensor data is mapped and layer placement is determined, the backend creates compute contexts and allocates device-specific buffers. For GPU layers, tensor data is transferred from the memory-mapped region to GPU memory. The backend also pre-allocates scratch space for intermediate computations and reserves memory for the KV cache based on the configured context length.
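To see why the KV cache reservation depends on context length, a back-of-the-envelope estimate helps. The sketch below assumes a standard transformer cache that stores one key and one value vector per layer, per position, per KV head; kvCacheBytes is a hypothetical helper, and an actual backend may add padding or use a quantized cache type.

```go
package main

// kvCacheBytes estimates the KV cache size in bytes under the assumption of
// one key and one value vector (hence the factor of 2) per layer, per
// context position, per KV head. Hypothetical helper for illustration.
func kvCacheBytes(layers, contextLen, kvHeads, headDim, bytesPerElem uint64) uint64 {
	return 2 * layers * contextLen * kvHeads * headDim * bytesPerElem
}
```

For a hypothetical configuration with 32 layers, 8 KV heads, a head dimension of 128, 16-bit cache entries, and a context length of 8192, kvCacheBytes(32, 8192, 8, 128, 2) gives 1,073,741,824 bytes (1 GiB). Doubling the context length doubles the reservation.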
Progress Reporting
Model loading can be time-consuming for large models. The loader provides a progress callback that reports the fraction of loading completed, enabling the UI to display a progress bar during model initialization.
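A progress callback of this kind can be sketched as follows. The loadAll function and its loadOne parameter are hypothetical names used for illustration only; the callback takes the completed fraction as a float32, weighted by tensor size so that large tensors advance the bar more than small ones.

```go
package main

// loadAll loads each tensor through loadOne and, after every tensor, reports
// the fraction of total bytes loaded so far. Hypothetical sketch; the sizes
// slice holds the byte size of each tensor.
func loadAll(sizes []uint64, loadOne func(index int) error, progress func(float32)) error {
	var total, done uint64
	for _, s := range sizes {
		total += s
	}
	for i, s := range sizes {
		if err := loadOne(i); err != nil {
			return err
		}
		done += s
		// Guard against a nil callback and division by zero.
		if progress != nil && total > 0 {
			progress(float32(done) / float32(total))
		}
	}
	return nil
}
```

A caller would pass a closure that forwards the fraction to whatever renders the progress bar.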
Implementation Notes
The GGUF reading infrastructure lives in the fs/ package, which handles parsing of the binary format. The ML backend's Load method (on the Backend interface defined in ml/backend.go) orchestrates the full loading process, including memory mapping, GPU offloading decisions, and backend initialization. The GGML backend implementation under ml/backend/ggml/ provides the concrete loading logic that interfaces with the llama.cpp library for tensor allocation and compute graph setup.