Implementation:Ggml org Llama cpp Model Loader Header

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Model_Loading
Last Updated	2026-02-15 00:00 GMT

Overview

Declares the `llama_model_loader` struct and its supporting types for reading GGUF model files and managing tensor weight data.

Description

This header defines the core model loading interface that llama.cpp uses to deserialize GGUF files into in-memory model representations. It declares three main pieces: the `llama_tensor_weight` struct, which tracks the source file index and data offset of each tensor; the `weight_name_comparer`, which sorts tensor names in a layer-aware order; and the main `llama_model_loader` struct. The loader exposes template methods for reading typed metadata (`get_key`, `get_arr`, `get_key_or_arr`), tensor lifecycle methods (`create_tensor`, `create_tensor_as_view`, `done_getting_tensors`), and data loading methods (`init_mappings`, `load_all_data`). It also supports KV overrides and tensor buffer type overrides via configurable parameters.
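To make "layer-aware sorting" concrete, the following is a minimal standalone sketch, not the header's actual code. It assumes tensor names of the form `blk.<N>.<rest>` and orders them by the numeric layer `N` before falling back to plain string comparison. The names `layer_aware_less` and `sorted_names` are hypothetical and exist only for this illustration.

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Sketch of a layer-aware comparator: names of the form "blk.<N>.<rest>" are
// ordered by the numeric layer N first, then lexicographically.
// Names without a "blk.<N>." prefix get layer -1 and therefore sort first.
struct layer_aware_less {
    static int layer_of(const std::string & name) {
        int layer = -1;
        // sscanf leaves `layer` untouched when the prefix does not match
        std::sscanf(name.c_str(), "blk.%d.", &layer);
        return layer;
    }
    bool operator()(const std::string & a, const std::string & b) const {
        const int la = layer_of(a);
        const int lb = layer_of(b);
        if (la != lb) {
            return la < lb;
        }
        return a < b;
    }
};

// Returns the names in the order a std::map keyed with the comparator yields them.
std::vector<std::string> sorted_names(const std::vector<std::string> & names) {
    std::map<std::string, int, layer_aware_less> m;
    for (const auto & n : names) {
        m[n] = 0;
    }
    std::vector<std::string> out;
    for (const auto & kv : m) {
        out.push_back(kv.first);
    }
    return out;
}
```

With a plain `std::less<std::string>`, `blk.10.attn_q.weight` would sort before `blk.2.attn_q.weight`; comparing the parsed layer number first keeps tensors grouped in layer order.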

Usage

Use this header when implementing model loading from GGUF files, accessing model metadata, or creating tensor objects from serialized weight data. It is the primary interface through which all model weights and metadata flow into the `llama_model` object.
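As an illustration of the metadata access pattern, the sketch below shows one plausible reading of what a `get_key_or_arr`-style accessor does: the key may hold either a single scalar or an array, and the caller always receives `n` filled slots. The semantics shown (broadcasting a scalar, rejecting a length mismatch) are an assumption, and `meta_value`, `meta_store`, and `get_key_or_arr_sketch` are hypothetical stand-ins rather than llama.cpp or GGUF types.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for GGUF metadata: a value is either a scalar or an array.
using meta_value = std::variant<uint32_t, std::vector<uint32_t>>;
using meta_store = std::map<std::string, meta_value>;

// Fill the first n slots of `result` from a key that may hold a scalar or an array.
// Returns false when an optional key is missing; throws when a required key is
// missing, when n exceeds the capacity, or when the stored array length is not n.
template <size_t N_MAX>
bool get_key_or_arr_sketch(const meta_store & meta, const std::string & key,
                           std::array<uint32_t, N_MAX> & result, uint32_t n,
                           bool required = true) {
    if (n > N_MAX) {
        throw std::runtime_error("n exceeds array capacity for key " + key);
    }
    const auto it = meta.find(key);
    if (it == meta.end()) {
        if (required) {
            throw std::runtime_error("key not found: " + key);
        }
        return false;
    }
    if (const auto * arr = std::get_if<std::vector<uint32_t>>(&it->second)) {
        if (arr->size() != n) {
            throw std::runtime_error("array length mismatch for key " + key);
        }
        for (uint32_t i = 0; i < n; ++i) {
            result[i] = (*arr)[i];
        }
    } else {
        const uint32_t value = std::get<uint32_t>(it->second);
        for (uint32_t i = 0; i < n; ++i) {
            result[i] = value; // broadcast the scalar to every requested slot
        }
    }
    return true;
}
```

A pattern like this lets per-layer hyperparameters be stored either as one value shared by all layers or as an explicit per-layer array, without the caller needing to know which form the file uses.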

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: src/llama-model-loader.h
Lines: 1-176

Signature

struct llama_model_loader {
    struct llama_tensor_weight { ... };
    struct weight_name_comparer { ... };

    static const int TENSOR_NOT_REQUIRED = 1 << 0;
    static const int TENSOR_DUPLICATED   = 1 << 1;
    static const int TENSOR_SKIP         = 1 << 2;

    llama_model_loader(
        const std::string & fname,
        std::vector<std::string> & splits,
        bool use_mmap, bool use_direct_io,
        bool check_tensors, bool no_alloc,
        const llama_model_kv_override * param_overrides_p,
        const llama_model_tensor_buft_override * param_tensor_buft_overrides_p);

    template<typename T> bool get_key(const std::string & key, T & result, bool required = true);
    template<typename T> bool get_arr(const std::string & key, std::vector<T> & result, bool required = true);
    template<typename T, size_t N_MAX> bool get_key_or_arr(const std::string & key, std::array<T, N_MAX> & result, uint32_t n, bool required = true);

    struct ggml_tensor * create_tensor(struct ggml_context * ctx, const std::string & name, const std::initializer_list<int64_t> & ne, int flags = 0);
    struct ggml_tensor * create_tensor_as_view(struct ggml_context * ctx, struct ggml_tensor * base, const std::string & name, const std::initializer_list<int64_t> & ne, size_t offset, bool required = true);
    void done_getting_tensors() const;

    void init_mappings(bool prefetch = true, llama_mlocks * mlock_mmaps = nullptr);
    bool load_all_data(struct ggml_context * ctx, llama_buf_map & bufs, llama_mlocks * lmlocks, llama_progress_callback progress_callback, void * progress_callback_user_data);
};
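The three `TENSOR_*` constants in the signature are single-bit values, so they can be combined with bitwise OR and passed as the `flags` argument of `create_tensor`. The standalone sketch below copies the constant values from the signature and shows how such a bitmask can be combined and tested; `describe_flags` is a hypothetical helper for illustration, not part of the header.

```cpp
#include <string>

// Flag values copied from the signature; each occupies its own bit.
constexpr int TENSOR_NOT_REQUIRED = 1 << 0;
constexpr int TENSOR_DUPLICATED   = 1 << 1;
constexpr int TENSOR_SKIP         = 1 << 2;

// Hypothetical helper: render a flags value as text, e.g. for logging.
std::string describe_flags(int flags) {
    std::string out;
    auto add = [&](int bit, const char * label) {
        if (flags & bit) {
            if (!out.empty()) {
                out += "|";
            }
            out += label;
        }
    };
    add(TENSOR_NOT_REQUIRED, "NOT_REQUIRED");
    add(TENSOR_DUPLICATED,   "DUPLICATED");
    add(TENSOR_SKIP,         "SKIP");
    return out.empty() ? "none" : out;
}
```

For example, `describe_flags(TENSOR_NOT_REQUIRED | TENSOR_SKIP)` yields `NOT_REQUIRED|SKIP`, and the default `flags = 0` yields `none`.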

Import

#include "llama-model-loader.h"

I/O Contract

Inputs

Name	Type	Required	Description
fname	const std::string &	Yes	Path to the GGUF model file
splits	std::vector<std::string> &	Yes	Explicit split file paths, for splits that do not follow the naming scheme; the vector may be empty otherwise
use_mmap	bool	Yes	Whether to use memory-mapped I/O for loading
use_direct_io	bool	Yes	Whether to use direct I/O (bypass OS cache)
check_tensors	bool	Yes	Whether to validate tensor data integrity
no_alloc	bool	Yes	Whether to skip memory allocation for tensor data
param_overrides_p	const llama_model_kv_override *	No	Optional KV metadata overrides
param_tensor_buft_overrides_p	const llama_model_tensor_buft_override *	No	Optional tensor buffer type overrides

Outputs

Name	Type	Description
llama_model_loader (object)	llama_model_loader	Initialized loader with parsed GGUF metadata, tensor map, and file mappings
load_all_data (return)	bool	Returns false if loading was cancelled by the progress callback
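The cancellation contract of `load_all_data` can be sketched in isolation. The code below assumes the progress callback has the shape `bool (*)(float progress, void * user_data)` and that returning `false` requests cancellation; `progress_callback_t`, `load_chunks_sketch`, and `cancel_after` are hypothetical names used only here, and the loop is a stand-in for the real data loading.

```cpp
#include <cstddef>

// Assumed to mirror the shape of llama_progress_callback: return false to cancel.
using progress_callback_t = bool (*)(float progress, void * user_data);

// Hypothetical loader loop: "loads" n_chunks pieces and reports progress after each.
// Returns false as soon as the callback asks to cancel, true otherwise.
bool load_chunks_sketch(size_t n_chunks, progress_callback_t cb, void * user_data) {
    for (size_t i = 0; i < n_chunks; ++i) {
        // ... a real loader would read tensor data here ...
        if (cb != nullptr) {
            const float progress = static_cast<float>(i + 1) / static_cast<float>(n_chunks);
            if (!cb(progress, user_data)) {
                return false; // cancelled by the caller
            }
        }
    }
    return true;
}

// Example callback: keep going while progress is at or below the limit in user_data.
bool cancel_after(float progress, void * user_data) {
    const float limit = *static_cast<const float *>(user_data);
    return progress <= limit;
}
```

Passing `nullptr` as the callback, as the usage example on this page does, means there is nothing to request cancellation, so under this sketch the function always runs to completion.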

Usage Examples

// Construct a model loader from a GGUF file
std::vector<std::string> splits; // empty: no explicit split paths
llama_model_loader ml("model.gguf", splits,
    /*use_mmap=*/true, /*use_direct_io=*/false,
    /*check_tensors=*/true, /*no_alloc=*/false,
    /*param_overrides_p=*/nullptr, /*param_tensor_buft_overrides_p=*/nullptr);

// Read a metadata key
uint32_t n_embd = 0;
ml.get_key(LLM_KV_EMBEDDING_LENGTH, n_embd);

// Create a tensor from the loaded weights (ctx is an existing ggml_context *)
auto * tensor = ml.create_tensor(ctx, "blk.0.attn_q.weight", {n_embd, n_embd});

// Finalize tensor creation and load data (bufs is an existing llama_buf_map)
ml.done_getting_tensors();
ml.init_mappings();
if (!ml.load_all_data(ctx, bufs, nullptr, nullptr, nullptr)) {
    // loading was cancelled by the progress callback
}

Related Pages

Principle:Ggml_org_Llama_cpp_Model_Loading
