
Implementation:Ggml org Llama cpp Model Header

From Leeroopedia
Knowledge Sources
Domains Model_Architecture
Last Updated 2026-02-15 00:00 GMT

Overview

Declares the `llama_model` struct and all supporting layer/tensor structures that define the in-memory representation of a loaded model.

Description

This header defines the `llm_type` enum for model size classification (14M through 405B and beyond), per-layer tensor structures (`llama_layer` with attention, FFN, SSM, MoE, and specialized tensors), auxiliary structures (`llama_layer_posnet`, `llama_layer_convnext`, `llama_layer_shortconv`, `llama_layer_nextn`), and the main `llama_model` struct. The model holds hyperparameters, vocabulary, global tensors (embeddings, output norms), a vector of layers, device assignments, and LoRA tracking. It provides methods for loading (`load_arch`, `load_hparams`, `load_vocab`, `load_tensors`), querying model properties (size, desc, n_tensors), creating memory backends, and building compute graphs.
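
To make the layout described above more concrete, here is a minimal sketch built from stand-in types. The names `sketch_tensor`, `sketch_layer`, `sketch_model`, `count_layer_tensors`, and `count_model_tensors` are hypothetical and are not the real `ggml_tensor`, `llama_layer`, or `llama_model` definitions. The sketch assumes that a layer carries one pointer per possible tensor and leaves the pointers an architecture does not use as null.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for ggml_tensor (hypothetical, illustration only).
struct sketch_tensor {
    std::string name;
};

// Stand-in for llama_layer: one pointer per possible tensor.
// Assumption: tensors an architecture does not use stay null.
struct sketch_layer {
    sketch_tensor * wq       = nullptr;
    sketch_tensor * wk       = nullptr;
    sketch_tensor * wv       = nullptr;
    sketch_tensor * wo       = nullptr;
    sketch_tensor * ffn_gate = nullptr;
    sketch_tensor * ffn_down = nullptr;
    sketch_tensor * ffn_up   = nullptr;
};

// Stand-in for llama_model: global metadata plus a vector of layers.
struct sketch_model {
    std::string name;
    std::vector<sketch_layer> layers;
};

// Count the tensor slots that a single layer actually populates.
inline std::size_t count_layer_tensors(const sketch_layer & layer) {
    const sketch_tensor * slots[] = {
        layer.wq, layer.wk, layer.wv, layer.wo,
        layer.ffn_gate, layer.ffn_down, layer.ffn_up,
    };
    std::size_t n = 0;
    for (const sketch_tensor * t : slots) {
        if (t != nullptr) {
            ++n;
        }
    }
    return n;
}

// Sum the populated slots over the first n_layer layers of the model.
inline std::size_t count_model_tensors(const sketch_model & model, std::size_t n_layer) {
    assert(n_layer <= model.layers.size());
    std::size_t n = 0;
    for (std::size_t il = 0; il < n_layer; ++il) {
        n += count_layer_tensors(model.layers[il]);
    }
    return n;
}
```

Code that walks layers in this style would check each pointer before use, since which slots are populated depends on the architecture.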

Usage

Include this header when working with model data structures. It defines the foundational data structure that every model architecture populates during loading and that the inference pipeline reads from during graph construction and evaluation.
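
The populate-then-read flow can be sketched with mock types. `demo_loader`, `demo_layer`, `demo_model`, and `load_model` are hypothetical stand-ins, not llama.cpp types. The call order (architecture, then hyperparameters, then vocabulary, then tensors) is an assumption inferred from the order in which the methods are declared, and the error handling (exceptions from the early stages, a `bool` from `load_tensors`) is illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for llama_model_loader: holds already-parsed file data.
struct demo_loader {
    std::string arch;
    int n_layer;
    std::vector<std::string> tokens;
};

struct demo_layer {};

// Hypothetical stand-in for llama_model with the four load stages.
struct demo_model {
    std::string arch;
    int n_layer = 0;
    std::vector<std::string> vocab;
    std::vector<demo_layer> layers;
    std::vector<std::string> steps; // records the order in which stages ran

    void load_arch(const demo_loader & ml) {
        if (ml.arch.empty()) {
            throw std::runtime_error("unknown model architecture");
        }
        arch = ml.arch;
        steps.push_back("arch");
    }

    void load_hparams(const demo_loader & ml) {
        assert(!arch.empty() && "load_arch must run first");
        n_layer = ml.n_layer;
        steps.push_back("hparams");
    }

    void load_vocab(const demo_loader & ml) {
        vocab = ml.tokens;
        steps.push_back("vocab");
    }

    // Returns false instead of throwing, mirroring the bool return type
    // shown in the signature listing.
    bool load_tensors(const demo_loader &) {
        if (n_layer <= 0) {
            return false;
        }
        layers.resize(static_cast<std::size_t>(n_layer));
        steps.push_back("tensors");
        return true;
    }
};

// Run the stages in order; any failure yields false.
inline bool load_model(demo_model & model, const demo_loader & ml) {
    try {
        model.load_arch(ml);
        model.load_hparams(ml);
        model.load_vocab(ml);
    } catch (const std::exception &) {
        return false;
    }
    return model.load_tensors(ml);
}
```

The point of the staging is that each step can rely on the previous one: the layer vector is sized only after the hyperparameters are known.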

Code Reference

Source Location

Signature

enum llm_type { LLM_TYPE_UNKNOWN, LLM_TYPE_14M, ..., LLM_TYPE_405B, ... };

struct llama_layer_posnet { /* posnet tensor pointers */ };
struct llama_layer_convnext { /* convnext tensor pointers */ };
struct llama_layer_shortconv { /* short convolution tensor pointers */ };
struct llama_layer_nextn { /* nextn tensor pointers */ };

struct llama_layer {
    // Attention tensors: wq, wk, wv, wo, ...
    // FFN tensors: ffn_gate, ffn_down, ffn_up, ...
    // SSM tensors for Mamba/RWKV
    // MoE expert tensors
    // Normalization tensors
};

struct llama_model {
    llm_type type = LLM_TYPE_UNKNOWN;
    llm_arch arch = LLM_ARCH_UNKNOWN;
    std::string name;
    llama_hparams hparams;
    llama_vocab vocab;
    std::vector<llama_layer> layers;
    std::vector<ggml_backend_dev_t> devices;

    void load_arch(llama_model_loader & ml);
    void load_hparams(llama_model_loader & ml);
    void load_vocab(llama_model_loader & ml);
    bool load_tensors(llama_model_loader & ml);

    std::string arch_name() const;
    std::string type_name() const;
    std::string desc() const;
    size_t size() const;
    size_t n_tensors() const;
    size_t n_devices() const;

    const ggml_tensor * get_tensor(const char * name) const;
    ggml_tensor * get_rope_factors(const llama_cparams & cparams, int il) const;
    llama_memory_i * create_memory(const llama_memory_params & params, const llama_cparams & cparams) const;
};
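
As an illustration of how the `type` field and the name accessors might fit together, the sketch below maps a size enum to a label and composes a description string. Everything here is hypothetical: `demo_llm_type`, `demo_type_name`, and `typed_model` are stand-ins, the enum lists only two sizes, and both the `"unknown"` fallback and the `"<arch> <type>"` format are assumptions rather than the strings llama.cpp produces.

```cpp
#include <cassert>
#include <string>

// Hypothetical, shortened stand-in for llm_type.
enum demo_llm_type {
    DEMO_TYPE_UNKNOWN,
    DEMO_TYPE_14M,
    DEMO_TYPE_405B,
};

// Map a size class to a short label; the fallback text is an assumption.
inline const char * demo_type_name(demo_llm_type type) {
    switch (type) {
        case DEMO_TYPE_14M:     return "14M";
        case DEMO_TYPE_405B:    return "405B";
        case DEMO_TYPE_UNKNOWN: break;
    }
    return "unknown";
}

// Hypothetical stand-in for the descriptive part of llama_model.
struct typed_model {
    demo_llm_type type = DEMO_TYPE_UNKNOWN;
    std::string   arch;

    std::string type_name() const {
        return demo_type_name(type);
    }

    // Compose a human-readable description (format is illustrative).
    std::string desc() const {
        assert(!arch.empty() && "architecture must be set before describing");
        return arch + " " + type_name();
    }
};
```

Defaulting `type` to the unknown value, as the real struct does with `LLM_TYPE_UNKNOWN`, means a model whose size class is never recognized still has a well-defined label.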

Import

#pragma once
#include "llama.h"
#include "llama-arch.h"
#include "llama-graph.h"
#include "llama-hparams.h"
#include "llama-memory.h"
#include "llama-vocab.h"
#include <map>
#include <memory>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| ml | llama_model_loader & | Yes | Model loader providing GGUF file data for populating model structures |
| params | llama_model_params | Yes | Loading parameters (GPU layers, device assignments, mmap settings) |

Outputs

| Name | Type | Description |
|------|------|-------------|
| model | llama_model | Fully loaded model with hyperparameters, vocabulary, layers, and tensors |
| desc | std::string | Human-readable model description string |
| size | size_t | Total size of the model's tensors in bytes |
| n_tensors | size_t | Total number of tensors in the model |
| memory | llama_memory_i * | Created memory backend (KV cache or recurrent) for inference |
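
The query outputs listed above (`size`, `n_tensors`, and lookup by name) can be pictured with a small registry. This is a sketch under assumptions: `registry_tensor` and `tensor_registry` are hypothetical types, and storing tensors as a flat name/pointer list with a linear search is one plausible layout, not a statement about how `llama_model` stores them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for ggml_tensor: only the byte size matters here.
struct registry_tensor {
    std::size_t n_bytes;
};

// Hypothetical name -> tensor registry backing the query methods.
struct tensor_registry {
    std::vector<std::pair<std::string, registry_tensor *>> by_name;

    void add(const std::string & name, registry_tensor * t) {
        assert(t != nullptr);
        by_name.emplace_back(name, t);
    }

    // Linear search by name; returns nullptr when no tensor matches.
    const registry_tensor * get_tensor(const char * name) const {
        auto it = std::find_if(by_name.begin(), by_name.end(),
            [name](const std::pair<std::string, registry_tensor *> & entry) {
                return entry.first == name;
            });
        return it == by_name.end() ? nullptr : it->second;
    }

    // Number of registered tensors.
    std::size_t n_tensors() const {
        return by_name.size();
    }

    // Sum of tensor sizes in bytes.
    std::size_t size() const {
        std::size_t total = 0;
        for (const auto & entry : by_name) {
            total += entry.second->n_bytes;
        }
        return total;
    }
};
```

With this layout a caller should treat the result of a name lookup as possibly null before dereferencing it.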

Usage Examples

// Access model properties.
// `model_ptr` is a placeholder for a pointer to an already loaded llama_model.
const llama_model & model = *model_ptr;
std::string description = model.desc();
size_t model_size = model.size();
size_t tensor_count = model.n_tensors();

// Access layers
for (const auto & layer : model.layers) {
    // layer.wq, layer.wk, layer.wv, layer.wo -- attention tensors
    // layer.ffn_gate, layer.ffn_down, layer.ffn_up -- FFN tensors
}

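// Create memory backend for inference.
// `mem_params` (llama_memory_params) and `cparams` (llama_cparams) are assumed
// to have been set up by the caller beforehand.
llama_memory_i * memory = model.create_memory(mem_params, cparams);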

// Get a specific tensor by name
const ggml_tensor * tensor = model.get_tensor("blk.0.attn_q.weight");
