Implementation:Ggml org Llama cpp Model Header
| Knowledge Sources | |
|---|---|
| Domains | Model_Architecture |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Declares the `llama_model` struct and all supporting layer/tensor structures that define the in-memory representation of a loaded model.
Description
This header defines the pieces that make up a loaded model: the `llm_type` enum, which classifies models by size (from `LLM_TYPE_14M` through `LLM_TYPE_405B` and larger); the per-layer `llama_layer` struct, which holds attention, FFN, SSM, MoE, and specialized tensors; the auxiliary layer structs `llama_layer_posnet`, `llama_layer_convnext`, `llama_layer_shortconv`, and `llama_layer_nextn`; and the main `llama_model` struct. The model holds hyperparameters, the vocabulary, global tensors (embeddings, output norms), a vector of layers, device assignments, and LoRA tracking. Its methods cover loading (`load_arch`, `load_hparams`, `load_vocab`, `load_tensors`), querying model properties (`size`, `desc`, `n_tensors`), creating the memory backend, and building compute graphs.
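To make the per-layer layout more concrete, the following is a minimal, self-contained sketch rather than the real header. `demo_tensor`, `demo_layer`, and `describe_layer` are hypothetical stand-ins for `ggml_tensor` and `llama_layer`, reduced to a handful of fields. The sketch assumes that tensors an architecture does not use are left as null pointers, so code can test for a block's presence by checking its tensors.

```cpp
#include <string>

// Hypothetical stand-in for ggml_tensor; the real type is defined by ggml.
struct demo_tensor {
    std::string name;
};

// Hypothetical, heavily reduced stand-in for llama_layer.
// Assumption: tensors that an architecture does not use stay null.
struct demo_layer {
    // attention projections
    demo_tensor * wq = nullptr;
    demo_tensor * wk = nullptr;
    demo_tensor * wv = nullptr;
    demo_tensor * wo = nullptr;
    // dense feed-forward
    demo_tensor * ffn_gate = nullptr;
    demo_tensor * ffn_down = nullptr;
    demo_tensor * ffn_up   = nullptr;
    // mixture-of-experts feed-forward
    demo_tensor * ffn_gate_exps = nullptr;
};

// Report which blocks a layer carries by checking for non-null tensors.
inline std::string describe_layer(const demo_layer & layer) {
    std::string out;
    auto add = [&out](const char * label) {
        if (!out.empty()) { out += '+'; }
        out += label;
    };
    if (layer.wq && layer.wk && layer.wv && layer.wo)     { add("attention"); }
    if (layer.ffn_gate && layer.ffn_down && layer.ffn_up) { add("dense-ffn"); }
    if (layer.ffn_gate_exps)                              { add("moe-ffn"); }
    return out.empty() ? std::string("empty") : out;
}
```

A single struct with many optional pointers keeps one layer type usable across very different architectures, at the cost of requiring null checks wherever a tensor may be absent.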
Usage
Include this header when working with model data structures. It defines the foundational data structure that every model architecture populates during loading and that the inference pipeline reads from during graph construction and evaluation.
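The loading methods can be pictured as stages that each fill in one part of the struct. The sketch below is illustrative only: `staged_loader`, `staged_model`, and `load_model` are hypothetical names, and the call order shown (architecture, then hyperparameters, then vocabulary, then tensors) is an assumption based on each stage needing data from the previous one, not a statement about the actual call site in llama.cpp.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for llama_model_loader.
struct staged_loader {
    std::string   arch;
    std::uint32_t n_layer;
};

struct staged_layer {};  // placeholder for per-layer tensors

// Hypothetical reduced model: each load step fills one part of the struct.
struct staged_model {
    std::string               arch;
    std::uint32_t             n_layer = 0;
    bool                      has_vocab = false;
    std::vector<staged_layer> layers;
    std::vector<std::string>  steps;  // records the order of the calls

    void load_arch(const staged_loader & ml) {
        arch = ml.arch;
        steps.push_back("arch");
    }
    void load_hparams(const staged_loader & ml) {
        assert(!arch.empty());  // hyperparameter keys depend on the architecture
        n_layer = ml.n_layer;
        steps.push_back("hparams");
    }
    void load_vocab(const staged_loader &) {
        has_vocab = true;
        steps.push_back("vocab");
    }
    // Returns false when there is nothing to load, mirroring the bool return
    // type of load_tensors in the signature above.
    bool load_tensors(const staged_loader &) {
        if (n_layer == 0) { return false; }
        layers.resize(n_layer);  // one entry per layer, sized from hparams
        steps.push_back("tensors");
        return true;
    }
};

// Run the stages in the assumed order.
inline bool load_model(staged_model & model, const staged_loader & ml) {
    model.load_arch(ml);
    model.load_hparams(ml);
    model.load_vocab(ml);
    return model.load_tensors(ml);
}
```

After a successful `load_model` call, the layer vector has one entry per layer, which is the state that graph construction would then read from.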
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: src/llama-model.h
- Lines: 1-571
Signature
enum llm_type { LLM_TYPE_UNKNOWN, LLM_TYPE_14M, ..., LLM_TYPE_405B, ... };
struct llama_layer_posnet { /* posnet tensor pointers */ };
struct llama_layer_convnext { /* convnext tensor pointers */ };
struct llama_layer_shortconv { /* short convolution tensor pointers */ };
struct llama_layer_nextn { /* nextn tensor pointers */ };
struct llama_layer {
    // Attention tensors: wq, wk, wv, wo, ...
    // FFN tensors: ffn_gate, ffn_down, ffn_up, ...
    // SSM tensors for Mamba/RWKV
    // MoE expert tensors
    // Normalization tensors
};
struct llama_model {
    llm_type type = LLM_TYPE_UNKNOWN;
    llm_arch arch = LLM_ARCH_UNKNOWN;
    std::string name;
    llama_hparams hparams;
    llama_vocab vocab;
    std::vector<llama_layer> layers;
    std::vector<ggml_backend_dev_t> devices;

    void load_arch(llama_model_loader & ml);
    void load_hparams(llama_model_loader & ml);
    void load_vocab(llama_model_loader & ml);
    bool load_tensors(llama_model_loader & ml);

    std::string arch_name() const;
    std::string type_name() const;
    std::string desc() const;
    size_t size() const;
    size_t n_tensors() const;
    size_t n_devices() const;

    const ggml_tensor * get_tensor(const char * name) const;
    ggml_tensor * get_rope_factors(const llama_cparams & cparams, int il) const;
    llama_memory_i * create_memory(const llama_memory_params & params, const llama_cparams & cparams) const;
};
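The relationship between the size enum and the string accessors can be sketched as follows. This is a hypothetical, truncated illustration: `size_class`, `size_class_name`, and `described_model` are invented for this example, the `"?B"` fallback label is arbitrary, and the real `desc()` may include more information than the architecture and size class shown here.

```cpp
#include <string>

// Hypothetical, truncated stand-in for the llm_type enum.
enum size_class {
    SIZE_UNKNOWN,
    SIZE_14M,
    SIZE_7B,
    SIZE_405B,
};

// Map a size class to a short label; unknown values get a placeholder.
inline const char * size_class_name(size_class type) {
    switch (type) {
        case SIZE_14M:  return "14M";
        case SIZE_7B:   return "7B";
        case SIZE_405B: return "405B";
        default:        return "?B";
    }
}

// Hypothetical reduced model exposing type_name() and desc().
struct described_model {
    std::string arch_name;
    size_class  type = SIZE_UNKNOWN;

    std::string type_name() const { return size_class_name(type); }

    // Assumption: the description joins architecture and size class.
    std::string desc() const { return arch_name + " " + type_name(); }
};
```

Keeping the size class as an enum lets code branch on it cheaply, while the string form is only produced for logging and display.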
Import
#pragma once
#include "llama.h"
#include "llama-arch.h"
#include "llama-graph.h"
#include "llama-hparams.h"
#include "llama-memory.h"
#include "llama-vocab.h"
#include <map>
#include <memory>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| ml | llama_model_loader & | Yes | Model loader providing GGUF file data for populating model structures |
| params | llama_model_params | Yes | Loading parameters (GPU layers, device assignments, mmap settings) |
Outputs
| Name | Type | Description |
|---|---|---|
| model | llama_model | Fully loaded model with hyperparameters, vocabulary, layers, and tensors |
| desc | std::string | Human-readable model description string |
| size | size_t | Total size of all tensors in the model, in bytes |
| n_tensors | size_t | Total number of tensors in the model |
| memory | llama_memory_i * | Created memory backend (KV cache or recurrent) for inference |
Usage Examples
// Access model properties
// (llama_model_ptr is assumed to point to an already loaded model)
const llama_model & model = *llama_model_ptr;
std::string description = model.desc();
size_t model_size = model.size();
size_t tensor_count = model.n_tensors();

// Access layers
for (const auto & layer : model.layers) {
    // layer.wq, layer.wk, layer.wv, layer.wo -- attention tensors
    // layer.ffn_gate, layer.ffn_down, layer.ffn_up -- FFN tensors
}

// Create memory backend for inference
// (mem_params and cparams are assumed to be initialized by the caller)
llama_memory_i * memory = model.create_memory(mem_params, cparams);

// Get a specific tensor by name; check the result before dereferencing it
const ggml_tensor * tensor = model.get_tensor("blk.0.attn_q.weight");
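Name-based lookup deserves a defensive pattern, since a given name may not exist in every model. The sketch below is a self-contained approximation: `named_tensor`, `tensor_index`, and `tensor_bytes_or_zero` are hypothetical, and the linear search over name/pointer pairs with a null result on a miss is an assumption about how such a lookup could behave, not a copy of the library's implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for ggml_tensor.
struct named_tensor {
    std::string name;
    std::size_t n_bytes;
};

// Hypothetical index mirroring a get_tensor(name) style accessor.
struct tensor_index {
    std::vector<std::pair<std::string, named_tensor *>> tensors_by_name;

    // Returns the matching tensor, or nullptr when no entry has that name.
    const named_tensor * get_tensor(const char * name) const {
        auto it = std::find_if(
            tensors_by_name.begin(), tensors_by_name.end(),
            [name](const std::pair<std::string, named_tensor *> & entry) {
                return entry.first == name;
            });
        return it == tensors_by_name.end() ? nullptr : it->second;
    }
};

// Caller-side pattern: treat a missing tensor as a normal case.
inline std::size_t tensor_bytes_or_zero(const tensor_index & index, const char * name) {
    const named_tensor * tensor = index.get_tensor(name);
    return tensor ? tensor->n_bytes : 0;
}
```

Applied to the example above, the same idea means testing `tensor` against `nullptr` before reading any of its fields.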