| Knowledge Sources | Domains | Last Updated |
| --- | --- | --- |
| ggml-org/llama.cpp | GGUF Parsing, Model Deserialization, GPU Offloading | 2026-02-14 |
## Overview

### Description
`llama_model_load_from_file` loads a quantized language model from a GGUF file on disk and returns an opaque model handle. The function reads the model's metadata (architecture, hyperparameters, tokenizer vocabulary), allocates tensor buffers across the available compute backends, and populates those buffers with the deserialized weight data. The returned `llama_model` pointer is then used to create one or more inference contexts.

If the model is split across multiple files following the naming pattern `<name>-00001-of-00005.gguf`, the function automatically discovers and loads all split files. For custom naming schemes, use `llama_model_load_from_splits` instead.
### Usage

```c
#include "llama.h"

// Get default parameters and customize
llama_model_params params = llama_model_default_params();
params.n_gpu_layers = 99;   // offload all possible layers to GPU
params.use_mmap     = true; // enable memory-mapped I/O

// Load the model
llama_model * model = llama_model_load_from_file("path/to/model.gguf", params);
if (model == NULL) {
    fprintf(stderr, "Failed to load model\n");
    return 1;
}

// Use the model to create contexts, get vocab, etc.
const llama_vocab * vocab = llama_model_get_vocab(model);

// Free when done
llama_model_free(model);
```
## Code Reference

### Source Location

| File | Line(s) | Type |
| --- | --- | --- |
| include/llama.h | 450-452 | Declaration |
| src/llama.cpp | 1029-1034 | Implementation |
### Signature

```c
LLAMA_API struct llama_model * llama_model_load_from_file(
                         const char * path_model,
              struct llama_model_params params);
```
### Import

```c
#include "llama.h"
```
## I/O Contract

### Inputs
| Parameter | Type | Description |
| --- | --- | --- |
| `path_model` | `const char *` | Path to the GGUF model file. If the model is split into multiple parts, the file name must follow the pattern `<name>-%05d-of-%05d.gguf`. |
| `params` | `struct llama_model_params` | Configuration struct controlling loading behavior. Key fields documented below. |
`llama_model_params` fields (defined at include/llama.h:282-318):
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `devices` | `ggml_backend_dev_t *` | `NULL` | NULL-terminated list of devices for offloading. `NULL` uses all available devices. |
| `tensor_buft_overrides` | `const struct llama_model_tensor_buft_override *` | `NULL` | NULL-terminated list of buffer-type overrides for tensors matching a pattern. |
| `n_gpu_layers` | `int32_t` | `0` | Number of layers to offload to the GPU. A negative value offloads all layers. |
| `split_mode` | `enum llama_split_mode` | `LLAMA_SPLIT_MODE_LAYER` | How to split the model across multiple GPUs. |
| `main_gpu` | `int32_t` | `0` | GPU index used when `split_mode` is `LLAMA_SPLIT_MODE_NONE`. |
| `tensor_split` | `const float *` | `NULL` | Proportion of the model to offload to each GPU. |
| `progress_callback` | `llama_progress_callback` | `NULL` | Called with progress between 0.0 and 1.0. Return `false` to abort loading. |
| `progress_callback_user_data` | `void *` | `NULL` | Context pointer passed to the progress callback. |
| `kv_overrides` | `const struct llama_model_kv_override *` | `NULL` | Overrides for key-value pairs of the model metadata. |
| `vocab_only` | `bool` | `false` | Load only the vocabulary, no weights. |
| `use_mmap` | `bool` | `true` | Use memory-mapped I/O if possible. |
| `use_direct_io` | `bool` | `false` | Use direct I/O; takes precedence over `use_mmap` when supported. |
| `use_mlock` | `bool` | `false` | Force the system to keep the model in RAM (prevent swapping). |
| `check_tensors` | `bool` | `false` | Validate model tensor data for NaN/Inf values. |
| `use_extra_bufts` | `bool` | `false` | Use extra buffer types (for weight repacking). |
| `no_host` | `bool` | `false` | Bypass the host buffer, allowing extra buffers to be used. |
| `no_alloc` | `bool` | `false` | Load only metadata and simulate memory allocations. |
### Outputs

| Return | Type | Description |
| --- | --- | --- |
| model handle | `struct llama_model *` | Opaque pointer to the loaded model. Returns `NULL` on failure (file not found, invalid format, out of memory, or aborted by the progress callback). |
## Usage Examples

### Basic Model Loading (from examples/simple/simple.cpp)

```cpp
#include "llama.h"

// Load dynamic backends first
ggml_backend_load_all();

// Set up model parameters
llama_model_params model_params = llama_model_default_params();
model_params.n_gpu_layers = 99; // offload all layers to GPU

// Load the model
llama_model * model = llama_model_load_from_file(model_path.c_str(), model_params);
if (model == NULL) {
    fprintf(stderr, "error: unable to load model\n");
    return 1;
}

// Get vocabulary handle for tokenization
const llama_vocab * vocab = llama_model_get_vocab(model);
```
### Loading with Progress Callback

```c
bool my_progress(float progress, void * user_data) {
    printf("Loading: %.1f%%\r", progress * 100.0f);
    fflush(stdout);
    return true; // return false to abort loading
}

llama_model_params params = llama_model_default_params();
params.n_gpu_layers = 35;
params.progress_callback = my_progress;
params.progress_callback_user_data = NULL;

llama_model * model = llama_model_load_from_file("model.gguf", params);
```
### Vocabulary-Only Loading

```c
llama_model_params params = llama_model_default_params();
params.vocab_only = true; // skip loading weights

llama_model * model = llama_model_load_from_file("model.gguf", params);
const llama_vocab * vocab = llama_model_get_vocab(model);
// Use vocab for tokenization only, no inference possible
```