Implementation: llama_model_load_from_file (ggml-org/llama.cpp)

From Leeroopedia
Knowledge Sources | Domains | Last Updated
ggml-org/llama.cpp | GGUF Parsing, Model Deserialization, GPU Offloading | 2026-02-14

Overview

Description

llama_model_load_from_file loads a quantized language model from a GGUF file on disk and returns an opaque model handle. This function reads the model's metadata (architecture, hyperparameters, tokenizer vocabulary), allocates tensor buffers across available compute backends, and populates those buffers with the deserialized weight data. The returned llama_model pointer is then used to create one or more inference contexts.

If the model is split across multiple files following the naming pattern <name>-00001-of-00005.gguf, the function automatically discovers and loads all split files. For custom naming schemes, use llama_model_load_from_splits instead.
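
For custom naming schemes, llama_model_load_from_splits accepts an explicit list of shard paths together with a path count and the usual params struct. The sketch below is a minimal illustration of that call; the file names are placeholder assumptions, not files shipped with the project.

// Load a model from explicitly named split files (hypothetical paths)
const char * split_paths[] = {
    "weights/part-a.gguf",
    "weights/part-b.gguf",
};

llama_model_params split_params = llama_model_default_params();
llama_model * split_model = llama_model_load_from_splits(split_paths, 2, split_params);
if (split_model == NULL) {
    fprintf(stderr, "Failed to load split model\n");
}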

Usage

#include "llama.h"

// Get default parameters and customize
llama_model_params params = llama_model_default_params();
params.n_gpu_layers = 99;  // offload all possible layers to GPU
params.use_mmap = true;    // enable memory-mapped I/O

// Load the model
llama_model * model = llama_model_load_from_file("path/to/model.gguf", params);
if (model == NULL) {
    fprintf(stderr, "Failed to load model\n");
    return 1;
}

// Use the model to create contexts, get vocab, etc.
const llama_vocab * vocab = llama_model_get_vocab(model);

// Free when done
llama_model_free(model);

Code Reference

Source Location

File | Line(s) | Type
include/llama.h | 450-452 | Declaration
src/llama.cpp | 1029-1034 | Implementation

Signature

LLAMA_API struct llama_model * llama_model_load_from_file(
                         const char * path_model,
          struct llama_model_params   params);

Import

#include "llama.h"

I/O Contract

Inputs

Parameter | Type | Description
path_model | const char * | Path to the GGUF model file. If the model is split into multiple parts, the file name must follow the pattern <name>-%05d-of-%05d.gguf.
params | struct llama_model_params | Configuration struct controlling loading behavior. Key fields documented below.

llama_model_params fields (defined at include/llama.h:282-318):

Field | Type | Default | Description
devices | ggml_backend_dev_t * | NULL | NULL-terminated list of devices for offloading. NULL uses all available devices.
tensor_buft_overrides | const struct llama_model_tensor_buft_override * | NULL | NULL-terminated list of buffer type overrides for tensors matching a pattern.
n_gpu_layers | int32_t | 0 | Number of layers to offload to GPU. Negative value means all layers.
split_mode | enum llama_split_mode | LLAMA_SPLIT_MODE_LAYER | How to split the model across multiple GPUs.
main_gpu | int32_t | 0 | GPU index for LLAMA_SPLIT_MODE_NONE.
tensor_split | const float * | NULL | Proportion of model to offload to each GPU.
progress_callback | llama_progress_callback | NULL | Called with progress between 0.0 and 1.0. Return false to abort loading.
progress_callback_user_data | void * | NULL | Context pointer passed to the progress callback.
kv_overrides | const struct llama_model_kv_override * | NULL | Override key-value pairs of the model metadata.
vocab_only | bool | false | Only load vocabulary, no weights.
use_mmap | bool | true | Use memory-mapped I/O if possible.
use_direct_io | bool | false | Use direct I/O, takes precedence over use_mmap when supported.
use_mlock | bool | false | Force system to keep model in RAM (prevent swapping).
check_tensors | bool | false | Validate model tensor data for NaN/Inf values.
use_extra_bufts | bool | false | Use extra buffer types (for weight repacking).
no_host | bool | false | Bypass host buffer allowing extra buffers to be used.
no_alloc | bool | false | Only load metadata and simulate memory allocations.
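
As an illustration of the multi-GPU fields above, the sketch below distributes layers across two GPUs by proportion. It only uses fields listed in the table; the 60/40 split and the model path are placeholder assumptions.

// Split layers across two GPUs: ~60% to device 0, ~40% to device 1
static const float gpu_splits[2] = {0.6f, 0.4f};

llama_model_params mgpu_params = llama_model_default_params();
mgpu_params.n_gpu_layers = 99;                     // offload all layers
mgpu_params.split_mode   = LLAMA_SPLIT_MODE_LAYER; // split whole layers across GPUs
mgpu_params.tensor_split = gpu_splits;             // per-GPU proportions

llama_model * mgpu_model = llama_model_load_from_file("model.gguf", mgpu_params);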

Outputs

Return | Type | Description
model handle | struct llama_model * | Opaque pointer to the loaded model. Returns NULL on failure (file not found, invalid format, out of memory, or aborted by progress callback).

Usage Examples

Basic Model Loading (from examples/simple/simple.cpp)

#include "llama.h"

// Load dynamic backends first
ggml_backend_load_all();

// Model path (parsed from the command line in the original example)
std::string model_path = "model.gguf";

// Set up model parameters
llama_model_params model_params = llama_model_default_params();
model_params.n_gpu_layers = 99;  // offload all layers to GPU

// Load the model
llama_model * model = llama_model_load_from_file(model_path.c_str(), model_params);
if (model == NULL) {
    fprintf(stderr, "error: unable to load model\n");
    return 1;
}

// Get vocabulary handle for tokenization
const llama_vocab * vocab = llama_model_get_vocab(model);
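
Creating an Inference Context

The Overview notes that the returned model handle is used to create inference contexts. A minimal sketch of that next step follows, assuming the llama_init_from_model and llama_context_default_params API from llama.h (older releases expose the same step as llama_new_context_with_params); the n_ctx value is an illustrative choice.

llama_context_params ctx_params = llama_context_default_params();
ctx_params.n_ctx = 2048;  // context window size (illustrative value)

llama_context * ctx = llama_init_from_model(model, ctx_params);
if (ctx == NULL) {
    fprintf(stderr, "error: failed to create context\n");
    return 1;
}

// ... run inference with ctx ...

llama_free(ctx);          // free the context before the model
llama_model_free(model);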

Loading with Progress Callback

bool my_progress(float progress, void * user_data) {
    printf("Loading: %.1f%%\r", progress * 100.0f);
    fflush(stdout);
    return true;  // return false to abort loading
}

llama_model_params params = llama_model_default_params();
params.n_gpu_layers = 35;
params.progress_callback = my_progress;
params.progress_callback_user_data = NULL;

llama_model * model = llama_model_load_from_file("model.gguf", params);

Vocabulary-Only Loading

llama_model_params params = llama_model_default_params();
params.vocab_only = true;  // skip loading weights

llama_model * model = llama_model_load_from_file("model.gguf", params);
const llama_vocab * vocab = llama_model_get_vocab(model);
// Use vocab for tokenization only, no inference possible
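
A follow-up sketch of what a vocab-only model is good for: tokenizing text without any weights in memory. It assumes the vocab-based llama_tokenize signature from llama.h (text, text length, output buffer, buffer capacity, add_special, parse_special); the input string and buffer size are arbitrary.

// Tokenize with only the vocabulary loaded (no inference possible)
const char * text = "Hello, world!";
llama_token tokens[64];

int32_t n_tokens = llama_tokenize(vocab, text, (int32_t) strlen(text),
                                  tokens, 64,
                                  /*add_special=*/true, /*parse_special=*/false);
if (n_tokens < 0) {
    // a negative result means the buffer was too small; -n_tokens is the required size
    fprintf(stderr, "token buffer too small, need %d tokens\n", -n_tokens);
}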
