Principle:Ggml_org_Llama_cpp_GGUF_Model_Loading
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| ggml-org/llama.cpp | Model Deserialization, Quantization, Memory-Mapped I/O, GPU Layer Offloading | 2026-02-14 |
Overview
Description
GGUF Model Loading is the core step in the llama.cpp text generation pipeline that reads quantized model weights from a GGUF container file and materializes them as tensors in host and/or device memory. This process transforms a static on-disk representation of a large language model into a runtime data structure ready for inference.
The GGUF (GGML Universal File) format is a binary container that stores model metadata (architecture, hyperparameters, tokenizer vocabulary) alongside quantized tensor data. The loading process must parse this container, validate its contents, allocate appropriate memory buffers across available compute backends, and populate those buffers with deserialized tensor data.
Usage
Model loading is performed after backend loading and before context creation. It is typically the most time-consuming and memory-intensive initialization step in the pipeline. The caller configures loading behavior through a parameters struct that controls GPU layer offloading, memory mapping, and progress reporting.
```cpp
llama_model_params params = llama_model_default_params();
params.n_gpu_layers = 35;  // offload 35 layers to GPU
params.use_mmap = true;    // use memory-mapped I/O (the default)
llama_model * model = llama_model_load_from_file("model.gguf", params);
if (model == NULL) {
    // loading failed: missing file, corrupt container, or insufficient memory
}
```
Theoretical Basis
The GGUF Container Format
GGUF is a self-describing binary format designed for efficient storage and loading of quantized neural network models. A GGUF file consists of:
- Magic number and version -- identifies the file as GGUF and its format version
- Metadata key-value pairs -- stores model architecture name, hyperparameters (hidden size, number of layers, number of attention heads, vocabulary size), tokenizer data, and other configuration
- Tensor descriptors -- for each tensor, stores the name, shape, quantization type, and offset within the file
- Tensor data -- the raw quantized weight data, laid out contiguously for efficient sequential reading
This design allows the loader to first read all metadata and tensor descriptors (a small header), then selectively load tensor data either sequentially or via memory mapping.
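That first pass over the small header can be sketched as follows. This is a simplified, self-contained illustration rather than llama.cpp's actual reader: the struct and helper names are hypothetical, but the field order (magic, version, tensor count, metadata key-value count) follows the GGUF specification.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Fixed-size prefix of a GGUF v3 file, per the GGUF spec:
// magic, version, tensor count, then metadata key-value count.
struct GgufHeader {
    uint32_t magic;             // the bytes 'G','G','U','F'
    uint32_t version;           // format version (3 for current files)
    uint64_t tensor_count;      // number of tensor descriptors that follow
    uint64_t metadata_kv_count; // number of metadata key-value pairs
};

// Parse the 24-byte header from a raw byte buffer, as a loader's first
// pass would before touching any tensor data.
static GgufHeader parse_gguf_header(const std::vector<uint8_t> & buf) {
    GgufHeader h{};
    assert(buf.size() >= sizeof(h));
    std::memcpy(&h, buf.data(), sizeof(h));
    // "GGUF" read as a little-endian uint32 is 0x46554747
    assert(h.magic == 0x46554747u && "not a GGUF file");
    return h;
}

// Fabricate a header buffer for demonstration purposes only.
static std::vector<uint8_t> make_gguf_header(uint32_t version, uint64_t n_tensors, uint64_t n_kv) {
    GgufHeader h{0x46554747u, version, n_tensors, n_kv};
    std::vector<uint8_t> buf(sizeof(h));
    std::memcpy(buf.data(), &h, sizeof(h));
    return buf;
}
```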
Memory-Mapped I/O
When use_mmap is enabled (the default), llama.cpp uses the operating system's memory mapping facilities (mmap on POSIX, MapViewOfFile on Windows) to map the GGUF file's tensor data region directly into the process address space. This provides several advantages:
- Lazy loading -- pages are loaded from disk on demand as they are accessed, reducing initial load time
- Shared memory -- multiple processes loading the same model file can share physical memory pages through the OS page cache
- Reduced memory copies -- tensors that remain on the CPU can be used directly from the mapped region without an additional copy
- Efficient large model handling -- models larger than available RAM can still be partially loaded, with the OS managing page eviction
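The mechanics behind these advantages can be sketched with a POSIX-only toy (the helper name is hypothetical, not llama.cpp API): write a small stand-in "model file", map its bytes read-only, and use them in place.

```cpp
#include <cstring>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Write payload to path, map it read-only, and verify the mapped bytes.
static bool mmap_roundtrip(const char * path, const std::string & payload) {
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
    if (fd < 0) return false;
    if (write(fd, payload.data(), payload.size()) != (ssize_t) payload.size()) {
        close(fd);
        return false;
    }
    // Map read-only: pages are faulted in lazily on first access, and
    // MAP_SHARED lets other processes share the same page-cache pages.
    void * addr = mmap(nullptr, payload.size(), PROT_READ, MAP_SHARED, fd, 0);
    close(fd); // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) return false;
    // A CPU-resident tensor could be used straight from this region, no copy.
    bool ok = std::memcmp(addr, payload.data(), payload.size()) == 0;
    munmap(addr, payload.size());
    unlink(path);
    return ok;
}
```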
When use_direct_io is enabled instead, the loader bypasses the OS page cache and reads tensor data directly from disk, which can be beneficial for very large models whose working set would otherwise thrash the page cache.
Tensor Deserialization
Each tensor in the GGUF file is stored in a quantized format (e.g., Q4_0, Q4_K_M, Q8_0, F16, F32). The deserialization process:
- Reads the tensor descriptor -- obtains the tensor name, dimensions, and quantization type
- Allocates a buffer -- creates a ggml tensor object and allocates memory in the appropriate backend buffer (CPU or GPU)
- Copies the data -- reads the raw quantized data from the file (or maps it directly) into the allocated buffer
- Validates (optionally) -- if check_tensors is enabled, verifies that the tensor data contains no NaN or Inf values
The quantization type determines how the raw bytes are interpreted. For example, Q4_0 packs two 4-bit values per byte with a shared scale factor per block of 32 values; at 18 bytes per block (about 4.5 bits per weight) this achieves roughly 7x compression relative to FP32.
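The Q4_0 layout described above can be sketched as follows. One simplification: ggml stores the per-block scale as FP16, while this sketch uses a plain float to stay self-contained; the nibble packing and the recentering by 8 match ggml's Q4_0 dequantization.

```cpp
#include <array>
#include <cstdint>

// Simplified Q4_0 block: 32 weights packed two-per-byte into 16 bytes,
// plus a shared scale, giving 18 bytes per block in the real format.
struct BlockQ4_0 {
    float d;                     // per-block scale (FP16 in the real format)
    std::array<uint8_t, 16> qs;  // 32 x 4-bit quantized values
};

// Dequantize one block: each nibble holds an unsigned value in [0, 15],
// recentered by subtracting 8 before scaling. The low nibbles map to the
// first half of the block, the high nibbles to the second half.
static void dequantize_q4_0(const BlockQ4_0 & b, float out[32]) {
    for (int i = 0; i < 16; ++i) {
        out[i]      = (static_cast<int>(b.qs[i] & 0x0F) - 8) * b.d;
        out[i + 16] = (static_cast<int>(b.qs[i] >> 4)   - 8) * b.d;
    }
}
```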
GPU Layer Offloading
The n_gpu_layers parameter controls how many transformer layers are placed in GPU memory versus host (CPU) memory. The offloading strategy works as follows:
- n_gpu_layers = 0 -- all layers remain on CPU; no GPU memory is used for weights
- n_gpu_layers = N -- the last N transformer layers (those closest to the output) are placed on GPU; layers nearer the input embedding remain on CPU
- n_gpu_layers < 0 -- all layers are placed on GPU (equivalent to full offloading)
When layers are split across CPU and GPU, the inference engine automatically handles data transfers between devices during the forward pass. Full GPU offloading eliminates these transfers and provides maximum inference speed, but requires sufficient GPU VRAM to hold all model weights plus the KV cache.
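The placement rule above reduces to a one-line computation. A sketch, assuming the last-N convention and a hypothetical helper name:

```cpp
#include <algorithm>

// Return the index of the first GPU-resident layer: with n_layer total
// transformer layers, the last n_gpu_layers of them go to the GPU, so
// layer i is offloaded iff i >= gpu_start_layer(...).
static int gpu_start_layer(int n_layer, int n_gpu_layers) {
    if (n_gpu_layers < 0) {
        return 0; // negative means full offload: every layer on GPU
    }
    return std::max(0, n_layer - n_gpu_layers);
}
```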
Multi-GPU Split Modes
For systems with multiple GPUs, the split_mode parameter controls how the model is distributed:
- LLAMA_SPLIT_MODE_NONE -- the entire model is placed on a single GPU (specified by main_gpu)
- LLAMA_SPLIT_MODE_LAYER -- individual layers are assigned to different GPUs
- LLAMA_SPLIT_MODE_ROW -- individual tensor rows are split across GPUs, allowing a single layer to span multiple devices
The tensor_split array provides fine-grained control over the proportion of the model assigned to each GPU.
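One way to interpret such an array is as relative weights that are normalized into per-device fractions. A minimal sketch of that interpretation (the helper name is hypothetical and this is not llama.cpp's actual split logic):

```cpp
#include <cstddef>
#include <vector>

// Normalize relative split weights into per-GPU fractions that sum to 1,
// e.g. {3, 1} -> {0.75, 0.25} for a 2-GPU system.
static std::vector<float> normalize_split(const std::vector<float> & split) {
    float total = 0.0f;
    for (float s : split) total += s;
    std::vector<float> frac(split.size(), 0.0f);
    if (total <= 0.0f) return frac; // no split specified
    for (size_t i = 0; i < split.size(); ++i) {
        frac[i] = split[i] / total;
    }
    return frac;
}
```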
Vocabulary-Only Loading
When vocab_only is set to true, only the metadata and tokenizer vocabulary are loaded from the GGUF file. No tensor weight data is read. This mode is useful for applications that only need tokenization capabilities (e.g., dataset preprocessing) without the overhead of loading multi-gigabyte weight files.
Related Pages
- Implementation:Ggml_org_Llama_cpp_Llama_Model_Load_From_File
- Principle:Ggml_org_Llama_cpp_Backend_Loading -- backends must be loaded before model loading
- Principle:Ggml_org_Llama_cpp_Inference_Context_Creation -- the next step after model loading
- Heuristic:Ggml_org_Llama_cpp_GPU_Layer_Offloading_Verification