Implementation:Ggml org Llama cpp Hparams Header
| Knowledge Sources | |
|---|---|
| Domains | Model_Architecture, Configuration |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Declares the `llama_hparams` struct containing all fixed model hyperparameters read from GGUF model files.
Description
This header defines the model architecture parameters, grouped in the `llama_hparams` struct. It covers:
- basic dimensions (`n_embd`, `n_layer`, and the per-layer `n_head`, `n_head_kv`, and `n_ff` arrays)
- MoE expert configuration
- normalization epsilon values
- RoPE parameters and scaling
- sliding window attention (SWA) configuration, with standard, chunked, and symmetric window types
- SSM/Mamba state parameters
- RWKV-specific parameters
- MLA (Multi-head Latent Attention) dimensions
- per-layer arrays that classify each layer as recurrent or attention
- architecture-specific parameters for Granite, Gemma3n, and Qwen3

It also defines enums for expert gating functions and SWA types. Per-layer and per-expert data is stored in fixed-size arrays bounded by `LLAMA_MAX_LAYERS` (512) and `LLAMA_MAX_EXPERTS` (512).
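The fixed-size per-layer arrays described above can be illustrated with a reduced stand-in. This is a minimal sketch, not the actual header: `hparams_sketch`, `SKETCH_MAX_LAYERS`, and `make_sketch` are hypothetical names, and the bounds check inside the accessors is an assumption about how such accessors would reasonably behave.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical constant; same value the page gives for LLAMA_MAX_LAYERS.
constexpr uint32_t SKETCH_MAX_LAYERS = 512;

// Hypothetical, heavily reduced stand-in for llama_hparams.
// Only the "fixed-size array indexed by layer" layout mirrors the page.
struct hparams_sketch {
    uint32_t n_layer = 0;

    // Zero-initialized; only the first n_layer entries are meaningful.
    std::array<uint32_t, SKETCH_MAX_LAYERS> n_head_arr{};
    std::array<uint32_t, SKETCH_MAX_LAYERS> n_head_kv_arr{};

    // Per-layer accessors with an assumed bounds check.
    uint32_t n_head(uint32_t il) const {
        assert(il < n_layer);
        return n_head_arr[il];
    }

    uint32_t n_head_kv(uint32_t il) const {
        assert(il < n_layer);
        return n_head_kv_arr[il];
    }

    // Query heads per KV head in layer il (grouped-query attention ratio).
    uint32_t n_gqa(uint32_t il) const {
        const uint32_t kv = n_head_kv(il);
        return kv == 0 ? 0 : n_head(il) / kv;
    }
};

// Hypothetical helper: fills every layer with the same head counts.
inline hparams_sketch make_sketch(uint32_t n_layer, uint32_t n_head, uint32_t n_head_kv) {
    assert(n_layer <= SKETCH_MAX_LAYERS);
    hparams_sketch hp;
    hp.n_layer = n_layer;
    for (uint32_t il = 0; il < n_layer; ++il) {
        hp.n_head_arr[il]    = n_head;
        hp.n_head_kv_arr[il] = n_head_kv;
    }
    return hp;
}
```

Storing per-layer values in `std::array` keeps the struct trivially copyable and free of heap allocation, at the cost of reserving space for the maximum layer count regardless of the model's actual depth.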
Usage
Include this header whenever you need access to model hyperparameters. Nearly every component in the inference pipeline depends on these parameters for tensor allocation, graph construction, and memory management.
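As one illustration of how memory management depends on these parameters, the sketch below estimates the size of a plain key-value cache from the head dimensions and per-layer KV head counts. It assumes every layer is a full-attention layer with an unquantized cache, so it ignores SWA, recurrent layers, and MLA. `kv_dims_sketch` and `kv_cache_bytes` are hypothetical names, not part of the header.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical subset of the hyperparameters needed for the estimate.
struct kv_dims_sketch {
    uint32_t n_embd_head_k;           // per-head key dimension
    uint32_t n_embd_head_v;           // per-head value dimension
    std::vector<uint32_t> n_head_kv;  // KV head count, one entry per layer
};

// Estimated bytes for a cache holding n_ctx tokens, assuming each layer
// stores one K row and one V row per token at bytes_per_elem per element.
inline uint64_t kv_cache_bytes(const kv_dims_sketch & hp, uint32_t n_ctx, uint32_t bytes_per_elem) {
    uint64_t elems_per_token = 0;
    for (const uint32_t n_kv : hp.n_head_kv) {
        elems_per_token += uint64_t(hp.n_embd_head_k) * n_kv; // K row width
        elems_per_token += uint64_t(hp.n_embd_head_v) * n_kv; // V row width
    }
    return elems_per_token * n_ctx * bytes_per_elem;
}
```

For a hypothetical 32-layer model with 8 KV heads of dimension 128, a 4096-token context, and 2-byte elements, this gives 32 × (128 × 8 + 128 × 8) × 4096 × 2 = 536,870,912 bytes, or 512 MiB.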
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: src/llama-hparams.h
- Lines: 1-339
Signature
    #define LLAMA_MAX_LAYERS  512
    #define LLAMA_MAX_EXPERTS 512

    enum llama_expert_gating_func_type {
        LLAMA_EXPERT_GATING_FUNC_TYPE_NONE,
        LLAMA_EXPERT_GATING_FUNC_TYPE_SOFTMAX,
        LLAMA_EXPERT_GATING_FUNC_TYPE_SIGMOID,
        LLAMA_EXPERT_GATING_FUNC_TYPE_SOFTMAX_WEIGHT,
    };

    enum llama_swa_type {
        LLAMA_SWA_TYPE_NONE,
        LLAMA_SWA_TYPE_STANDARD,
        LLAMA_SWA_TYPE_CHUNKED,
        LLAMA_SWA_TYPE_SYMMETRIC,
    };

    struct llama_hparams_posnet   { uint32_t n_embd; uint32_t n_layer; };
    struct llama_hparams_convnext { uint32_t n_embd; uint32_t n_layer; };

    struct llama_hparams {
        bool vocab_only;
        bool rope_finetuned;

        uint32_t n_ctx_train;
        uint32_t n_embd;
        uint32_t n_layer;
        uint32_t n_embd_head_k;
        uint32_t n_embd_head_v;
        uint32_t n_expert;
        uint32_t n_expert_used;

        std::array<uint32_t, LLAMA_MAX_LAYERS> n_head_arr;
        std::array<uint32_t, LLAMA_MAX_LAYERS> n_head_kv_arr;
        std::array<uint32_t, LLAMA_MAX_LAYERS> n_ff_arr;

        // ... (many more fields)
    };
Import
#pragma once
#include "llama.h"
#include <array>
#include <cassert>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| GGUF metadata | key-value pairs | Yes | Model hyperparameters read from GGUF file during model loading |
Outputs
| Name | Type | Description |
|---|---|---|
| hparams | llama_hparams | Fully populated hyperparameters struct used throughout the inference pipeline |
| n_head(il) | uint32_t | Per-layer attention head count accessor |
| n_head_kv(il) | uint32_t | Per-layer key-value head count accessor |
| n_ff(il) | uint32_t | Per-layer feed-forward dimension accessor |
| is_swa(il) | bool | Whether a given layer uses sliding window attention |
Usage Examples
    // Access basic hyperparameters
    const auto & hparams = model.hparams;
    const uint32_t n_embd  = hparams.n_embd;
    const uint32_t n_layer = hparams.n_layer;

    // Per-layer parameters, read through the accessors listed under Outputs.
    // The index is unsigned to match n_layer and avoid a signed/unsigned comparison.
    for (uint32_t il = 0; il < n_layer; il++) {
        const uint32_t n_head    = hparams.n_head(il);
        const uint32_t n_head_kv = hparams.n_head_kv(il);
        const bool     swa_layer = hparams.is_swa(il);
    }

    // Check SWA type
    if (hparams.swa_type == LLAMA_SWA_TYPE_CHUNKED) {
        // handle chunked sliding window attention
    }
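To make the difference between the SWA types concrete, the following sketch decides whether a key at position `p0` is hidden from a query at position `p1` for a window of `n_swa` tokens. The masking rules are an assumption about what "standard", "chunked", and "symmetric" commonly mean, not a copy of the library's logic. `swa_type_sketch` and `is_masked_sketch` are hypothetical names, and causal masking is assumed to be handled elsewhere.

```cpp
#include <cstdint>

// Hypothetical mirror of the llama_swa_type enum values.
enum swa_type_sketch { SWA_NONE, SWA_STANDARD, SWA_CHUNKED, SWA_SYMMETRIC };

// Returns true if the key at position p0 is outside the window of the
// query at position p1. Assumed semantics, for illustration only.
inline bool is_masked_sketch(swa_type_sketch type, uint32_t n_swa, int32_t p0, int32_t p1) {
    if (n_swa == 0) {
        return false; // no window configured: nothing is masked
    }
    const int32_t w = static_cast<int32_t>(n_swa);
    switch (type) {
        case SWA_NONE:
            return false;
        case SWA_STANDARD:
            // Keep only the last n_swa positions up to and including p1.
            return p1 - p0 >= w;
        case SWA_CHUNKED: {
            // Keep only keys that fall in the same n_swa-sized chunk as p1.
            const int32_t chunk_start = (p1 / w) * w;
            return p0 < chunk_start;
        }
        case SWA_SYMMETRIC: {
            // Keep keys within n_swa/2 positions on either side of p1.
            const int32_t half = w / 2;
            const int32_t diff = p1 - p0;
            return diff < -half || diff > half;
        }
    }
    return false;
}
```

With `n_swa = 4` and a query at position 5, the standard window keeps keys 2 through 5, while the chunked window keeps only keys 4 and 5, because position 5 belongs to the chunk that starts at 4.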