Principle:Ollama Ollama Model Loading

From Leeroopedia
Knowledge Sources
Domains Model Loading, Memory Mapping
Last Updated 2025-02-15 00:00 GMT

Overview

Model Loading is the process of reading GGUF-formatted model files from disk into memory, parsing metadata and tensor descriptors, mapping weight data into addressable memory regions, and initializing the compute backend for inference.

Core Concepts

GGUF File Structure

GGUF (GGML Universal Format) is a binary format that stores model metadata, tensor descriptors, and tensor data in a single file. The file begins with a header holding a magic number, a format version, the tensor count, and the metadata key-value count, followed by the metadata key-value pairs themselves. Tensor descriptors come next, each giving a tensor's name, shape, data type, and byte offset relative to the start of the tensor data region. The tensor data follows the descriptors, padded to an alignment boundary, so each tensor can be located and memory-mapped without copying.
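As a rough illustration of the layout described above, the following Python sketch parses only the fixed-size part of a GGUF header. It assumes GGUF version 2 or later (little-endian, 64-bit counts); the function names parse_gguf_header and align_offset are hypothetical and are not taken from Ollama's fs/ package. The 32-byte default alignment follows the GGUF specification and can be overridden by the general.alignment metadata key.

```python
import struct

GGUF_MAGIC = b"GGUF"
# Fixed header fields: magic, version, tensor count, metadata key-value count.
HEADER = struct.Struct("<4sIQQ")


def parse_gguf_header(buf: bytes) -> dict:
    """Parse the fixed-size GGUF header (hypothetical helper, v2+ only)."""
    if len(buf) < HEADER.size:
        raise ValueError("buffer too short for a GGUF header")
    magic, version, tensor_count, kv_count = HEADER.unpack_from(buf, 0)
    if magic != GGUF_MAGIC:
        raise ValueError(f"bad magic: {magic!r}")
    if version < 2:
        raise ValueError("this sketch only handles 64-bit counts (GGUF v2+)")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
        "header_size": HEADER.size,
    }


def align_offset(offset: int, alignment: int = 32) -> int:
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


# Build a synthetic header instead of reading a real model file.
sample = HEADER.pack(GGUF_MAGIC, 3, 291, 24)
header = parse_gguf_header(sample)
print(header)
```

A full reader would go on to decode the metadata key-value pairs and tensor descriptors, then apply align_offset to find where the tensor data region begins.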

Memory Mapping

Rather than reading the entire model file into heap memory, the loader uses memory-mapped I/O (mmap) to map the tensor data region directly into the process's virtual address space. This has three advantages: the operating system pages weight data in from disk on demand, multiple processes that load the same model can share the same physical pages, and resident memory stays lower because only pages that have actually been accessed occupy RAM.
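The idea can be sketched with Python's standard mmap module. This is a minimal illustration, not Ollama's loader: read_tensor_bytes is a hypothetical helper, and the temporary file stands in for a model file with a 64-byte header region followed by tensor data.

```python
import mmap
import os
import tempfile


def read_tensor_bytes(path: str, data_offset: int, tensor_offset: int, size: int) -> bytes:
    """Map the file read-only and return one tensor's bytes (hypothetical helper)."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are faulted in only when touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = data_offset + tensor_offset
            if start + size > len(mm):
                raise ValueError("tensor extends past end of file")
            # Slicing copies the bytes out for this demo; a real loader would
            # hand the mapped region to the backend without copying.
            return mm[start:start + size]


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 64)       # stand-in for header and descriptors
    tmp.write(bytes(range(16)))   # stand-in for tensor data
    path = tmp.name

try:
    chunk = read_tensor_bytes(path, data_offset=64, tensor_offset=4, size=4)
finally:
    os.unlink(path)

print(chunk)
```

Only the pages covering the requested slice need to be resident, which is the property that keeps the footprint low for large model files.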

GPU Layer Offloading

During loading, the system determines which model layers should be placed in GPU memory and which should stay in system memory, based on available VRAM, model size, and user configuration. The GPU layers list specifies per-device layer assignments, allowing the model to be split across multiple GPUs or kept partially on the CPU when GPU memory is insufficient for the full model.
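One way to picture the placement decision is a greedy fill. The sketch below is a hypothetical policy for illustration only and is not Ollama's actual algorithm: assign_layers, its parameters, and the sizes used are all assumptions. It walks the layers in order, fills each device until the next layer no longer fits, and leaves the rest on the CPU.

```python
def assign_layers(layer_sizes, free_vram, max_gpu_layers=None):
    """Greedy per-device layer assignment (hypothetical policy).

    layer_sizes: bytes needed per layer, in model order.
    free_vram: free bytes per GPU, in device order.
    max_gpu_layers: optional user cap on the number of offloaded layers.
    Returns (per_device, cpu_layers) as lists of layer indices.
    """
    per_device = [[] for _ in free_vram]
    remaining = list(free_vram)
    cpu_layers = []
    device = 0
    placed = 0
    for idx, size in enumerate(layer_sizes):
        limit_hit = max_gpu_layers is not None and placed >= max_gpu_layers
        # Move on to the next device once the current one cannot hold this layer.
        while not limit_hit and device < len(remaining) and remaining[device] < size:
            device += 1
        if limit_hit or device >= len(remaining):
            cpu_layers.append(idx)
            continue
        per_device[device].append(idx)
        remaining[device] -= size
        placed += 1
    return per_device, cpu_layers


# Six layers of 400 units each, two GPUs with 1000 and 900 units free.
split, on_cpu = assign_layers([400] * 6, [1000, 900])
print(split, on_cpu)

# The same model with a user cap of three offloaded layers.
capped_split, capped_cpu = assign_layers([400] * 6, [1000, 900], max_gpu_layers=3)
print(capped_split, capped_cpu)
```

A production scheduler would also budget for scratch buffers and the KV cache on each device before placing weights.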

Backend Initialization

After tensor data is mapped and layer placement is determined, the backend creates compute contexts and allocates device-specific buffers. For GPU layers, tensor data is transferred from the memory-mapped region to GPU memory. The backend also pre-allocates scratch space for intermediate computations and reserves memory for the KV cache based on the configured context length.
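To see how the KV cache reservation scales with context length, the following back-of-the-envelope estimate may help. It assumes a standard attention cache that stores one key and one value vector per KV head, per token, per layer, with 2-byte (f16) elements; the function name and the model dimensions are illustrative and not read from any particular model.

```python
def kv_cache_bytes(n_layers, context_length, n_kv_heads, head_dim, bytes_per_element=2):
    """Estimate KV cache size in bytes (illustrative formula)."""
    # The leading factor of 2 accounts for storing both keys and values.
    return 2 * n_layers * context_length * n_kv_heads * head_dim * bytes_per_element


# Hypothetical dimensions: 32 layers, 8 KV heads of size 128, 4096-token context.
size = kv_cache_bytes(n_layers=32, context_length=4096, n_kv_heads=8, head_dim=128)
print(size / 2**20, "MiB")
```

Because the estimate is linear in context_length, doubling the configured context doubles the reservation.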

Progress Reporting

Model loading can be time-consuming for large models. The loader provides a progress callback that reports the fraction of loading completed, enabling the UI to display a progress bar during model initialization.
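A progress callback of this kind can be sketched as follows. The names load_tensors and on_progress are hypothetical, and the loop only simulates the work by summing byte counts; it is meant to show the callback contract, not Ollama's implementation.

```python
from typing import Callable, Sequence


def load_tensors(tensor_sizes: Sequence[int], on_progress: Callable[[float], None]) -> None:
    """Simulate loading tensors and report the fraction of bytes completed."""
    total = sum(tensor_sizes)
    if total == 0:
        on_progress(1.0)
        return
    loaded = 0
    for size in tensor_sizes:
        # Stand-in for copying this tensor into its backend buffer.
        loaded += size
        on_progress(loaded / total)


reports = []
load_tensors([100, 300, 600], reports.append)
print(reports)
```

Reporting by bytes rather than by tensor count keeps the progress bar roughly proportional to elapsed time when tensors differ widely in size.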

Implementation Notes

The GGUF reading infrastructure lives in the fs/ package, which handles parsing of the binary format. The ML backend's Load method (on the Backend interface defined in ml/backend.go) orchestrates the full loading process, including memory mapping, GPU offloading decisions, and backend initialization. The GGML backend implementation under ml/backend/ggml/ provides the concrete loading logic that interfaces with the llama.cpp library for tensor allocation and compute graph setup.
