Implementation:Ggml org Llama cpp Model Header

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Model_Architecture
Last Updated	2026-02-15 00:00 GMT

Overview

Declares the `llama_model` struct and all supporting layer/tensor structures that define the in-memory representation of a loaded model.

Description

This header defines the `llm_type` enum, which classifies models by size (14M through 405B and beyond); the per-layer `llama_layer` struct, which holds attention, FFN, SSM, MoE, and specialized tensors; the auxiliary layer structs `llama_layer_posnet`, `llama_layer_convnext`, `llama_layer_shortconv`, and `llama_layer_nextn`; and the main `llama_model` struct. The model holds hyperparameters, the vocabulary, global tensors (embeddings, output norms), a vector of layers, device assignments, and LoRA tracking. Its methods cover loading (`load_arch`, `load_hparams`, `load_vocab`, `load_tensors`), querying model properties (`size`, `desc`, `n_tensors`), creating memory backends, and building compute graphs.
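The layout described above can be pictured with a reduced sketch. The types `layout_tensor`, `layout_layer`, and `layout_model` are hypothetical stand-ins, not the real declarations; the sketch assumes that each layer stores non-owning tensor pointers and that a pointer stays null when an architecture does not use that tensor.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for ggml_tensor; the real type comes from ggml.
struct layout_tensor {
    std::string name;
};

// Reduced stand-in for llama_layer: each member is a non-owning pointer
// that stays nullptr when the architecture does not use that tensor.
struct layout_layer {
    layout_tensor * wq       = nullptr; // attention query projection
    layout_tensor * wo       = nullptr; // attention output projection
    layout_tensor * ffn_gate = nullptr; // FFN gate projection
    layout_tensor * ssm_in   = nullptr; // SSM input projection
};

// Reduced stand-in for llama_model: global metadata plus one entry per block.
struct layout_model {
    std::string               name;
    std::vector<layout_layer> layers;
};

// Count the layers that carry attention weights, skipping absent tensors.
inline std::size_t count_attention_layers(const layout_model & model) {
    std::size_t n = 0;
    for (const layout_layer & layer : model.layers) {
        if (layer.wq != nullptr && layer.wo != nullptr) {
            ++n;
        }
    }
    return n;
}

// Build a two-layer model in which only the first layer has attention tensors.
inline std::size_t demo_count() {
    layout_tensor q{"blk.0.attn_q.weight"};
    layout_tensor o{"blk.0.attn_output.weight"};
    layout_model model;
    model.name = "sketch";
    model.layers.resize(2);
    assert(model.layers.size() == 2);
    model.layers[0].wq = &q;
    model.layers[0].wo = &o;
    return count_attention_layers(model);
}
```

Code that reads a layer should check the relevant pointers before use, since a hybrid architecture may populate attention tensors in some layers and SSM tensors in others.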

Usage

Include this header when working with model data structures. It defines the foundational data structure that every model architecture populates during loading and that the inference pipeline reads from during graph construction and evaluation.
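The loading stages can be sketched as below, assuming they run in the order in which the header declares them. `staged_loader` and `staged_model` are hypothetical stand-ins for `llama_model_loader` and `llama_model`; only the method names and the `bool` return of `load_tensors` are taken from the signature listing on this page.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for llama_model_loader; it only records what a file would provide.
struct staged_loader {
    std::string arch       = "llama";
    int         n_layer    = 2;
    bool        tensors_ok = true;
};

// Hypothetical stand-in for llama_model with the four loading stages.
struct staged_model {
    std::string              arch;
    int                      n_layer = 0;
    std::vector<std::string> steps; // records the order in which the stages ran

    void load_arch(staged_loader & ml)    { arch = ml.arch;       steps.push_back("arch"); }
    void load_hparams(staged_loader & ml) { n_layer = ml.n_layer; steps.push_back("hparams"); }
    void load_vocab(staged_loader &)      { steps.push_back("vocab"); }

    bool load_tensors(staged_loader & ml) {
        // In this sketch the layer count must be known before tensors are created.
        assert(n_layer > 0);
        steps.push_back("tensors");
        return ml.tensors_ok;
    }
};

// Run the four stages in declaration order; only load_tensors reports
// failure through its return value.
inline bool load_all(staged_model & model, staged_loader & ml) {
    model.load_arch(ml);
    model.load_hparams(ml);
    model.load_vocab(ml);
    return model.load_tensors(ml);
}
```

A caller would treat a `false` result from `load_all` as a failed load and discard the partially populated model.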

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: src/llama-model.h
Lines: 1-571

Signature

enum llm_type { LLM_TYPE_UNKNOWN, LLM_TYPE_14M, ..., LLM_TYPE_405B, ... };

struct llama_layer_posnet { /* posnet tensor pointers */ };
struct llama_layer_convnext { /* convnext tensor pointers */ };
struct llama_layer_shortconv { /* short convolution tensor pointers */ };
struct llama_layer_nextn { /* nextn tensor pointers */ };

struct llama_layer {
    // Attention tensors: wq, wk, wv, wo, ...
    // FFN tensors: ffn_gate, ffn_down, ffn_up, ...
    // SSM tensors for Mamba/RWKV
    // MoE expert tensors
    // Normalization tensors
};

struct llama_model {
    llm_type type = LLM_TYPE_UNKNOWN;
    llm_arch arch = LLM_ARCH_UNKNOWN;
    std::string name;
    llama_hparams hparams;
    llama_vocab vocab;
    std::vector<llama_layer> layers;
    std::vector<ggml_backend_dev_t> devices;

    void load_arch(llama_model_loader & ml);
    void load_hparams(llama_model_loader & ml);
    void load_vocab(llama_model_loader & ml);
    bool load_tensors(llama_model_loader & ml);

    std::string arch_name() const;
    std::string type_name() const;
    std::string desc() const;
    size_t size() const;
    size_t n_tensors() const;
    size_t n_devices() const;

    const ggml_tensor * get_tensor(const char * name) const;
    ggml_tensor * get_rope_factors(const llama_cparams & cparams, int il) const;
    llama_memory_i * create_memory(const llama_memory_params & params, const llama_cparams & cparams) const;
};

Import

#pragma once
#include "llama.h"
#include "llama-arch.h"
#include "llama-graph.h"
#include "llama-hparams.h"
#include "llama-memory.h"
#include "llama-vocab.h"
#include <map>
#include <memory>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

I/O Contract

Inputs

Name	Type	Required	Description
ml	llama_model_loader &	Yes	Model loader providing GGUF file data for populating model structures
params	llama_model_params	Yes	Loading parameters (GPU layers, device assignments, mmap settings)

Outputs

Name	Type	Description
model	llama_model	Fully loaded model with hyperparameters, vocabulary, layers, and tensors
desc	std::string	Human-readable model description string
size	size_t	Total model file size in bytes
n_tensors	size_t	Total number of tensors in the model
memory	llama_memory_i *	Created memory backend (KV cache or recurrent) for inference
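The `memory` output is a raw `llama_memory_i *`. The following is a minimal sketch of one way to handle such a pointer, assuming the caller takes ownership of it (this page does not state the ownership rule, so verify it against the source). `memory_iface`, `kv_cache_memory`, `recurrent_memory`, and `memory_model` are hypothetical stand-ins.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the llama_memory_i interface.
struct memory_iface {
    virtual ~memory_iface() = default;
    virtual bool is_recurrent() const = 0;
};

struct kv_cache_memory : memory_iface {
    bool is_recurrent() const override { return false; }
};

struct recurrent_memory : memory_iface {
    bool is_recurrent() const override { return true; }
};

// Stand-in for llama_model::create_memory: returns a raw pointer to a new
// object and picks the implementation from a model property.
struct memory_model {
    bool recurrent_arch = false;

    memory_iface * create_memory() const {
        if (recurrent_arch) {
            return new recurrent_memory();
        }
        return new kv_cache_memory();
    }
};

// Wrap the raw pointer immediately so it is released exactly once.
inline std::unique_ptr<memory_iface> make_memory(const memory_model & model) {
    return std::unique_ptr<memory_iface>(model.create_memory());
}

// Create memory for a recurrent stand-in model and report which kind was built.
inline bool demo_recurrent() {
    memory_model model;
    model.recurrent_arch = true;
    std::unique_ptr<memory_iface> mem = make_memory(model);
    assert(mem != nullptr); // holds for this stand-in only
    return mem->is_recurrent();
}
```

Wrapping the result at the call site keeps the factory signature unchanged while avoiding a manual `delete` on every exit path.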

Usage Examples

// Access model properties
// (llama_model_ptr is assumed to be a valid pointer to a loaded llama_model)
const llama_model & model = *llama_model_ptr;
std::string description = model.desc();
size_t model_size = model.size();
size_t tensor_count = model.n_tensors();

// Access layers
for (const auto & layer : model.layers) {
    // layer.wq, layer.wk, layer.wv, layer.wo -- attention tensors
    // layer.ffn_gate, layer.ffn_down, layer.ffn_up -- FFN tensors
}

// Create memory backend for inference
// (mem_params and cparams are assumed to be initialized by the caller)
llama_memory_i * memory = model.create_memory(mem_params, cparams);

// Get a specific tensor by name
const ggml_tensor * tensor = model.get_tensor("blk.0.attn_q.weight");
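A lookup by name can miss, so the result of `get_tensor` is worth checking before use. The sketch below assumes that a missing name yields a null pointer and that tensors are kept in a name-indexed list; both are assumptions about the implementation, and `named_tensor`, `lookup_model`, and `has_tensor` are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for ggml_tensor.
struct named_tensor {
    std::string name;
};

// Stand-in for the name lookup on llama_model.
struct lookup_model {
    std::vector<std::pair<std::string, named_tensor *>> tensors_by_name;

    // Linear search by name; returns nullptr when no tensor matches.
    const named_tensor * get_tensor(const char * name) const {
        for (const auto & entry : tensors_by_name) {
            if (entry.first == name) {
                return entry.second;
            }
        }
        return nullptr;
    }
};

// Report whether a tensor with the given name exists in the model.
inline bool has_tensor(const lookup_model & model, const char * name) {
    assert(name != nullptr);
    return model.get_tensor(name) != nullptr;
}
```

A caller that needs the tensor itself would test the returned pointer against `nullptr` before dereferencing it.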

Related Pages

Principle:Ggml_org_Llama_cpp_ModelArchitecture
