Principle:Ggml_org_Llama_cpp_GGUF_Model_Loading
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| ggml-org/llama.cpp | Model Deserialization, Quantization, Memory-Mapped I/O, GPU Layer Offloading | 2026-02-14 |
Overview
Description
GGUF Model Loading is the core step in the llama.cpp text generation pipeline that reads quantized model weights from a GGUF container file and materializes them as tensors in host and/or device memory. This process transforms a static on-disk representation of a large language model into a runtime data structure ready for inference.
The GGUF (GGML Universal File) format is a binary container that stores model metadata (architecture, hyperparameters, tokenizer vocabulary) alongside quantized tensor data. The loading process must parse this container, validate its contents, allocate appropriate memory buffers across available compute backends, and populate those buffers with deserialized tensor data.
Usage
Model loading is performed after backend loading and before context creation. It is typically the most time-consuming and memory-intensive initialization step in the pipeline. The caller configures loading behavior through a parameters struct that controls GPU layer offloading, memory mapping, and progress reporting.
```cpp
llama_model_params params = llama_model_default_params();
params.n_gpu_layers = 35;  // offload 35 layers to GPU
params.use_mmap = true;    // use memory-mapped I/O (the default)
llama_model * model = llama_model_load_from_file("model.gguf", params);
if (model == NULL) {
    // loading failed: missing file, corrupt container, or insufficient memory
}
```
Theoretical Basis
The GGUF Container Format
GGUF is a self-describing binary format designed for efficient storage and loading of quantized neural network models. A GGUF file consists of:
- Magic number and version -- identifies the file as GGUF and its format version
- Metadata key-value pairs -- stores model architecture name, hyperparameters (hidden size, number of layers, number of attention heads, vocabulary size), tokenizer data, and other configuration
- Tensor descriptors -- for each tensor, stores the name, shape, quantization type, and offset within the file
- Tensor data -- the raw quantized weight data, laid out contiguously for efficient sequential reading
This design allows the loader to first read all metadata and tensor descriptors (a small header), then selectively load tensor data either sequentially or via memory mapping.
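That first pass over the small header can be sketched as follows. This is a simplified, self-contained illustration rather than llama.cpp's actual reader: the struct and helper names are hypothetical, but the field order (magic, version, tensor count, metadata key-value count) follows the GGUF specification.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Fixed-size prefix of a GGUF v3 file, per the GGUF spec:
// magic, version, tensor count, then metadata key-value count.
struct GgufHeader {
    uint32_t magic;             // the bytes 'G','G','U','F'
    uint32_t version;           // format version (3 for current files)
    uint64_t tensor_count;      // number of tensor descriptors that follow
    uint64_t metadata_kv_count; // number of metadata key-value pairs
};

// Parse the 24-byte header from a raw byte buffer, as a loader's first
// pass would before touching any tensor data.
static GgufHeader parse_gguf_header(const std::vector<uint8_t> & buf) {
    GgufHeader h{};
    assert(buf.size() >= sizeof(h));
    std::memcpy(&h, buf.data(), sizeof(h));
    // "GGUF" read as a little-endian uint32 is 0x46554747
    assert(h.magic == 0x46554747u && "not a GGUF file");
    return h;
}

// Fabricate a header buffer for demonstration purposes only.
static std::vector<uint8_t> make_gguf_header(uint32_t version, uint64_t n_tensors, uint64_t n_kv) {
    GgufHeader h{0x46554747u, version, n_tensors, n_kv};
    std::vector<uint8_t> buf(sizeof(h));
    std::memcpy(buf.data(), &h, sizeof(h));
    return buf;
}
```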
Memory-Mapped I/O
When use_mmap is enabled (the default), llama.cpp uses the operating system's memory mapping facilities (mmap on POSIX, MapViewOfFile on Windows) to map the GGUF file's tensor data region directly into the process address space. This provides several advantages:
- Lazy loading -- pages are loaded from disk on demand as they are accessed, reducing initial load time
- Shared memory -- multiple processes loading the same model file can share physical memory pages through the OS page cache
- Reduced memory copies -- tensors that remain on the CPU can be used directly from the mapped region without an additional copy
- Efficient large model handling -- models larger than available RAM can still be partially loaded, with the OS managing page eviction
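The mechanics behind these advantages can be sketched with a POSIX-only toy (the helper name is hypothetical, not llama.cpp API): write a small stand-in "model file", map its bytes read-only, and use them in place.

```cpp
#include <cstring>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Write payload to path, map it read-only, and verify the mapped bytes.
static bool mmap_roundtrip(const char * path, const std::string & payload) {
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
    if (fd < 0) return false;
    if (write(fd, payload.data(), payload.size()) != (ssize_t) payload.size()) {
        close(fd);
        return false;
    }
    // Map read-only: pages are faulted in lazily on first access, and
    // MAP_SHARED lets other processes share the same page-cache pages.
    void * addr = mmap(nullptr, payload.size(), PROT_READ, MAP_SHARED, fd, 0);
    close(fd); // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) return false;
    // A CPU-resident tensor could be used straight from this region, no copy.
    bool ok = std::memcmp(addr, payload.data(), payload.size()) == 0;
    munmap(addr, payload.size());
    unlink(path);
    return ok;
}
```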
When use_direct_io is enabled instead, the loader bypasses the OS page cache and reads tensor data directly from disk, which can be beneficial for very large models whose working set would otherwise thrash the page cache.
Tensor Deserialization
Each tensor in the GGUF file is stored in a quantized format (e.g., Q4_0, Q4_K_M, Q8_0, F16, F32). The deserialization process:
- Reads the tensor descriptor -- obtains the tensor name, dimensions, and quantization type
- Allocates a buffer -- creates a ggml tensor object and allocates memory in the appropriate backend buffer (CPU or GPU)
- Copies the data -- reads the raw quantized data from the file (or maps it directly) into the allocated buffer
- Validates (optionally) -- if check_tensors is enabled, verifies that the tensor data contains no NaN or Inf values
The quantization type determines how the raw bytes are interpreted. For example, Q4_0 packs two 4-bit values per byte with a shared scale factor per block of 32 values; at 18 bytes per block (about 4.5 bits per weight) this achieves roughly 7x compression relative to FP32.
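The Q4_0 layout described above can be sketched as follows. One simplification: ggml stores the per-block scale as FP16, while this sketch uses a plain float to stay self-contained; the nibble packing and the recentering by 8 match ggml's Q4_0 dequantization.

```cpp
#include <array>
#include <cstdint>

// Simplified Q4_0 block: 32 weights packed two-per-byte into 16 bytes,
// plus a shared scale, giving 18 bytes per block in the real format.
struct BlockQ4_0 {
    float d;                     // per-block scale (FP16 in the real format)
    std::array<uint8_t, 16> qs;  // 32 x 4-bit quantized values
};

// Dequantize one block: each nibble holds an unsigned value in [0, 15],
// recentered by subtracting 8 before scaling. The low nibbles map to the
// first half of the block, the high nibbles to the second half.
static void dequantize_q4_0(const BlockQ4_0 & b, float out[32]) {
    for (int i = 0; i < 16; ++i) {
        out[i]      = (static_cast<int>(b.qs[i] & 0x0F) - 8) * b.d;
        out[i + 16] = (static_cast<int>(b.qs[i] >> 4)   - 8) * b.d;
    }
}
```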
GPU Layer Offloading
The n_gpu_layers parameter controls how many transformer layers are placed in GPU memory versus host (CPU) memory. The offloading strategy works as follows:
- n_gpu_layers = 0 -- all layers remain on CPU; no GPU memory is used for weights
- n_gpu_layers = N -- the last N transformer layers (those closest to the output) are placed on GPU; layers nearer the input embedding remain on CPU
- n_gpu_layers < 0 -- all layers are placed on GPU (equivalent to full offloading)
When layers are split across CPU and GPU, the inference engine automatically handles data transfers between devices during the forward pass. Full GPU offloading eliminates these transfers and provides maximum inference speed, but requires sufficient GPU VRAM to hold all model weights plus the KV cache.
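The placement rule above reduces to a one-line computation. A sketch, assuming the last-N convention and a hypothetical helper name:

```cpp
#include <algorithm>

// Return the index of the first GPU-resident layer: with n_layer total
// transformer layers, the last n_gpu_layers of them go to the GPU, so
// layer i is offloaded iff i >= gpu_start_layer(...).
static int gpu_start_layer(int n_layer, int n_gpu_layers) {
    if (n_gpu_layers < 0) {
        return 0; // negative means full offload: every layer on GPU
    }
    return std::max(0, n_layer - n_gpu_layers);
}
```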
Multi-GPU Split Modes
For systems with multiple GPUs, the split_mode parameter controls how the model is distributed:
- LLAMA_SPLIT_MODE_NONE -- the entire model is placed on a single GPU (specified by main_gpu)
- LLAMA_SPLIT_MODE_LAYER -- individual layers are assigned to different GPUs
- LLAMA_SPLIT_MODE_ROW -- individual tensor rows are split across GPUs, allowing a single layer to span multiple devices
The tensor_split array provides fine-grained control over the proportion of the model assigned to each GPU.
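One way to interpret such an array is as relative weights that are normalized into per-device fractions. A minimal sketch of that interpretation (the helper name is hypothetical and this is not llama.cpp's actual split logic):

```cpp
#include <cstddef>
#include <vector>

// Normalize relative split weights into per-GPU fractions that sum to 1,
// e.g. {3, 1} -> {0.75, 0.25} for a 2-GPU system.
static std::vector<float> normalize_split(const std::vector<float> & split) {
    float total = 0.0f;
    for (float s : split) total += s;
    std::vector<float> frac(split.size(), 0.0f);
    if (total <= 0.0f) return frac; // no split specified
    for (size_t i = 0; i < split.size(); ++i) {
        frac[i] = split[i] / total;
    }
    return frac;
}
```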
Vocabulary-Only Loading
When vocab_only is set to true, only the metadata and tokenizer vocabulary are loaded from the GGUF file. No tensor weight data is read. This mode is useful for applications that only need tokenization capabilities (e.g., dataset preprocessing) without the overhead of loading multi-gigabyte weight files.
Related Pages
- Implementation:Ggml_org_Llama_cpp_Llama_Model_Load_From_File
- Principle:Ggml_org_Llama_cpp_Backend_Loading -- backends must be loaded before model loading
- Principle:Ggml_org_Llama_cpp_Inference_Context_Creation -- the next step after model loading
- Heuristic:Ggml_org_Llama_cpp_GPU_Layer_Offloading_Verification