| Knowledge Sources | Domains | Last Updated |
| --- | --- | --- |
| ggml-org/llama.cpp | GGUF Parsing, Model Deserialization, GPU Offloading | 2026-02-14 |
## Overview

### Description
`llama_model_load_from_file` loads a quantized language model from a GGUF file on disk and returns an opaque model handle. The function reads the model's metadata (architecture, hyperparameters, tokenizer vocabulary), allocates tensor buffers across the available compute backends, and populates those buffers with the deserialized weight data. The returned `llama_model` pointer is then used to create one or more inference contexts.

If the model is split across multiple files following the naming pattern `<name>-00001-of-00005.gguf`, the function automatically discovers and loads all split files. For custom naming schemes, use `llama_model_load_from_splits` instead.
### Usage

```c
#include "llama.h"

// Get default parameters and customize
llama_model_params params = llama_model_default_params();
params.n_gpu_layers = 99;   // offload all possible layers to GPU
params.use_mmap     = true; // enable memory-mapped I/O

// Load the model
llama_model * model = llama_model_load_from_file("path/to/model.gguf", params);
if (model == NULL) {
    fprintf(stderr, "Failed to load model\n");
    return 1;
}

// Use the model to create contexts, get vocab, etc.
const llama_vocab * vocab = llama_model_get_vocab(model);

// Free when done
llama_model_free(model);
```
## Code Reference

### Source Location

| File | Line(s) | Type |
| --- | --- | --- |
| include/llama.h | 450-452 | Declaration |
| src/llama.cpp | 1029-1034 | Implementation |
### Signature

```c
LLAMA_API struct llama_model * llama_model_load_from_file(
                         const char * path_model,
              struct llama_model_params params);
```
### Import

```c
#include "llama.h"
```
## I/O Contract

### Inputs
| Parameter | Type | Description |
| --- | --- | --- |
| `path_model` | `const char *` | Path to the GGUF model file. If the model is split into multiple parts, the file name must follow the pattern `<name>-%05d-of-%05d.gguf`. |
| `params` | `struct llama_model_params` | Configuration struct controlling loading behavior. Key fields documented below. |
`llama_model_params` fields (defined at include/llama.h:282-318):
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `devices` | `ggml_backend_dev_t *` | `NULL` | NULL-terminated list of devices for offloading. `NULL` uses all available devices. |
| `tensor_buft_overrides` | `const struct llama_model_tensor_buft_override *` | `NULL` | NULL-terminated list of buffer-type overrides for tensors matching a pattern. |
| `n_gpu_layers` | `int32_t` | `0` | Number of layers to offload to the GPU. A negative value offloads all layers. |
| `split_mode` | `enum llama_split_mode` | `LLAMA_SPLIT_MODE_LAYER` | How to split the model across multiple GPUs. |
| `main_gpu` | `int32_t` | `0` | GPU index used when `split_mode` is `LLAMA_SPLIT_MODE_NONE`. |
| `tensor_split` | `const float *` | `NULL` | Proportion of the model to offload to each GPU. |
| `progress_callback` | `llama_progress_callback` | `NULL` | Called with progress between 0.0 and 1.0. Return `false` to abort loading. |
| `progress_callback_user_data` | `void *` | `NULL` | Context pointer passed to the progress callback. |
| `kv_overrides` | `const struct llama_model_kv_override *` | `NULL` | Overrides for key-value pairs of the model metadata. |
| `vocab_only` | `bool` | `false` | Load only the vocabulary, no weights. |
| `use_mmap` | `bool` | `true` | Use memory-mapped I/O if possible. |
| `use_direct_io` | `bool` | `false` | Use direct I/O; takes precedence over `use_mmap` when supported. |
| `use_mlock` | `bool` | `false` | Force the system to keep the model in RAM (prevent swapping). |
| `check_tensors` | `bool` | `false` | Validate model tensor data for NaN/Inf values. |
| `use_extra_bufts` | `bool` | `false` | Use extra buffer types (for weight repacking). |
| `no_host` | `bool` | `false` | Bypass the host buffer, allowing extra buffers to be used. |
| `no_alloc` | `bool` | `false` | Load only metadata and simulate memory allocations. |
### Outputs

| Return | Type | Description |
| --- | --- | --- |
| model handle | `struct llama_model *` | Opaque pointer to the loaded model. Returns `NULL` on failure (file not found, invalid format, out of memory, or aborted by the progress callback). |
## Usage Examples

### Basic Model Loading (from examples/simple/simple.cpp)

```cpp
#include "llama.h"

// Load dynamic backends first
ggml_backend_load_all();

// Set up model parameters
llama_model_params model_params = llama_model_default_params();
model_params.n_gpu_layers = 99; // offload all layers to GPU

// Load the model
llama_model * model = llama_model_load_from_file(model_path.c_str(), model_params);
if (model == NULL) {
    fprintf(stderr, "error: unable to load model\n");
    return 1;
}

// Get vocabulary handle for tokenization
const llama_vocab * vocab = llama_model_get_vocab(model);
```
### Loading with Progress Callback

```c
bool my_progress(float progress, void * user_data) {
    printf("Loading: %.1f%%\r", progress * 100.0f);
    fflush(stdout);
    return true; // return false to abort loading
}

llama_model_params params = llama_model_default_params();
params.n_gpu_layers = 35;
params.progress_callback = my_progress;
params.progress_callback_user_data = NULL;

llama_model * model = llama_model_load_from_file("model.gguf", params);
```
### Vocabulary-Only Loading

```c
llama_model_params params = llama_model_default_params();
params.vocab_only = true; // skip loading weights

llama_model * model = llama_model_load_from_file("model.gguf", params);
const llama_vocab * vocab = llama_model_get_vocab(model);
// Use vocab for tokenization only, no inference possible
```