Implementation:Ggml org Llama cpp Hparams Header

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Model_Architecture, Configuration
Last Updated	2026-02-15 00:00 GMT

Overview

Declares the `llama_hparams` struct containing all fixed model hyperparameters read from GGUF model files.

Description

This header defines the `llama_hparams` struct, which collects the model architecture parameters in one place. It covers basic dimensions (n_embd, n_layer, and the per-layer n_head and n_ff arrays), MoE expert configuration, normalization epsilon values, RoPE parameters and scaling, sliding window attention (SWA) configuration with multiple types (standard, chunked, symmetric), SSM/Mamba state parameters, RWKV-specific parameters, MLA (Multi-head Latent Attention) dimensions, per-layer recurrent/attention classification arrays, and architecture-specific parameters for Granite, Gemma3n, and Qwen3. The header also defines enums for expert gating functions and SWA types. It uses fixed-size arrays, with `LLAMA_MAX_LAYERS` (512) and `LLAMA_MAX_EXPERTS` (512) as the limits.

Usage

Include this header whenever you need access to model hyperparameters. Nearly every component in the inference pipeline depends on these parameters for tensor allocation, graph construction, and memory management.
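
To illustrate how memory management can depend on these parameters, the following is a minimal sketch that estimates the size of a dense key-value cache from a handful of hyperparameters. `sketch_hparams`, `SKETCH_MAX_LAYERS`, and `sketch_kv_cache_bytes` are hypothetical names introduced for this example; they are not part of the header, and the sizing formula is an assumption (one K row and one V row per token per layer), not the allocator llama.cpp actually uses.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for LLAMA_MAX_LAYERS (same value as in the signature).
constexpr std::size_t SKETCH_MAX_LAYERS = 512;

// Hypothetical, trimmed-down stand-in for llama_hparams (not the real struct).
struct sketch_hparams {
    uint32_t n_layer       = 0;
    uint32_t n_embd_head_k = 0;
    uint32_t n_embd_head_v = 0;
    std::array<uint32_t, SKETCH_MAX_LAYERS> n_head_kv_arr{};
};

// Bytes needed to cache K and V for n_ctx tokens.
// Assumption: a dense cache holding one K row and one V row per token per layer.
inline std::size_t sketch_kv_cache_bytes(const sketch_hparams & hp,
                                         uint32_t               n_ctx,
                                         std::size_t            bytes_per_elem) {
    assert(hp.n_layer <= SKETCH_MAX_LAYERS);

    std::size_t total = 0;
    for (uint32_t il = 0; il < hp.n_layer; ++il) {
        // Per-layer K/V row widths: head size times the number of KV heads.
        const std::size_t n_embd_k = std::size_t(hp.n_embd_head_k) * hp.n_head_kv_arr[il];
        const std::size_t n_embd_v = std::size_t(hp.n_embd_head_v) * hp.n_head_kv_arr[il];
        total += (n_embd_k + n_embd_v) * n_ctx * bytes_per_elem;
    }
    return total;
}
```

Because the KV head count is stored per layer, the loop sums each layer separately rather than multiplying a single width by n_layer.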

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: src/llama-hparams.h
Lines: 1-339

Signature

#define LLAMA_MAX_LAYERS  512
#define LLAMA_MAX_EXPERTS 512

enum llama_expert_gating_func_type {
    LLAMA_EXPERT_GATING_FUNC_TYPE_NONE,
    LLAMA_EXPERT_GATING_FUNC_TYPE_SOFTMAX,
    LLAMA_EXPERT_GATING_FUNC_TYPE_SIGMOID,
    LLAMA_EXPERT_GATING_FUNC_TYPE_SOFTMAX_WEIGHT,
};

enum llama_swa_type {
    LLAMA_SWA_TYPE_NONE,
    LLAMA_SWA_TYPE_STANDARD,
    LLAMA_SWA_TYPE_CHUNKED,
    LLAMA_SWA_TYPE_SYMMETRIC,
};

struct llama_hparams_posnet { uint32_t n_embd; uint32_t n_layer; };
struct llama_hparams_convnext { uint32_t n_embd; uint32_t n_layer; };

struct llama_hparams {
    bool vocab_only;
    bool rope_finetuned;
    uint32_t n_ctx_train;
    uint32_t n_embd;
    uint32_t n_layer;
    uint32_t n_embd_head_k;
    uint32_t n_embd_head_v;
    uint32_t n_expert;
    uint32_t n_expert_used;
    std::array<uint32_t, LLAMA_MAX_LAYERS> n_head_arr;
    std::array<uint32_t, LLAMA_MAX_LAYERS> n_head_kv_arr;
    std::array<uint32_t, LLAMA_MAX_LAYERS> n_ff_arr;
    // ... (many more fields)
};

Import

#pragma once
#include "llama.h"
#include <array>
#include <cassert>

I/O Contract

Inputs

Name	Type	Required	Description
GGUF metadata	key-value pairs	Yes	Model hyperparameters read from GGUF file during model loading

Outputs

Name	Type	Description
hparams	llama_hparams	Fully populated hyperparameters struct used throughout the inference pipeline
n_head(il)	uint32_t	Per-layer attention head count accessor
n_head_kv(il)	uint32_t	Per-layer key-value head count accessor
n_ff(il)	uint32_t	Per-layer feed-forward dimension accessor
is_swa(il)	bool	Whether a given layer uses sliding window attention
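
The per-layer accessors listed above can be pictured as thin, bounds-checked reads over fixed-size arrays. The sketch below shows that pattern under stated assumptions: `sketch_layer_params`, `swa_flags`, and `sketch_set_swa_period` are hypothetical names for illustration, and the real accessors in `llama_hparams` may validate or compute their results differently.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for LLAMA_MAX_LAYERS (same value as in the signature).
constexpr std::size_t SKETCH_MAX_LAYERS = 512;

// Hypothetical miniature of the per-layer storage plus accessor pattern.
struct sketch_layer_params {
    uint32_t n_layer = 0;
    std::array<uint32_t, SKETCH_MAX_LAYERS> n_head_arr{};
    std::array<bool,     SKETCH_MAX_LAYERS> swa_flags{};  // hypothetical field name

    // Per-layer head count; only the first n_layer entries are meaningful.
    uint32_t n_head(uint32_t il) const {
        assert(il < n_layer);
        return n_head_arr[il];
    }

    // Whether layer il uses sliding window attention.
    bool is_swa(uint32_t il) const {
        assert(il < n_layer);
        return swa_flags[il];
    }
};

// Hypothetical helper: within every group of `period` layers, the last layer
// is dense and the others use SWA. A period of 0 disables SWA everywhere.
inline void sketch_set_swa_period(sketch_layer_params & p, uint32_t period) {
    for (uint32_t il = 0; il < p.n_layer; ++il) {
        p.swa_flags[il] = period != 0 && (il % period) < (period - 1);
    }
}
```

Storing the values in arrays sized for the maximum layer count keeps the struct free of heap allocations, at the cost of reserving space for layers a given model does not have.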

Usage Examples

// Access basic hyperparameters
const auto & hparams = model.hparams;
uint32_t n_embd  = hparams.n_embd;
uint32_t n_layer = hparams.n_layer;

// Per-layer parameters (use an unsigned index to match n_layer,
// and go through the accessors rather than the raw arrays)
for (uint32_t il = 0; il < n_layer; il++) {
    uint32_t n_head    = hparams.n_head(il);
    uint32_t n_head_kv = hparams.n_head_kv(il);
    bool swa_layer     = hparams.is_swa(il);
}

// Check SWA type
if (hparams.swa_type == LLAMA_SWA_TYPE_CHUNKED) {
    // handle chunked sliding window attention
}
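
To make the difference between the SWA types concrete, here is a sketch of how each type could decide whether a key position is hidden from a query position. This is one plausible reading of "standard", "chunked", and "symmetric" windows, not code taken from the header; `sketch_swa_type` and `sketch_is_masked` are hypothetical names, and causal masking (keys that lie after the query) is assumed to be handled elsewhere.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of llama_swa_type, kept separate so the sketch is self-contained.
enum sketch_swa_type {
    SKETCH_SWA_NONE,
    SKETCH_SWA_STANDARD,
    SKETCH_SWA_CHUNKED,
    SKETCH_SWA_SYMMETRIC,
};

// Returns true when key position p0 is hidden from query position p1.
// Assumption: n_swa is the window size in tokens.
inline bool sketch_is_masked(sketch_swa_type type, int32_t n_swa, int32_t p0, int32_t p1) {
    switch (type) {
        case SKETCH_SWA_NONE:
            return false;  // no window: nothing is masked here
        case SKETCH_SWA_STANDARD:
            // Window trailing the query: keys n_swa or more tokens behind are hidden.
            return p1 - p0 >= n_swa;
        case SKETCH_SWA_CHUNKED: {
            // Fixed chunks of n_swa tokens: keys before the query's chunk are hidden.
            assert(n_swa > 0);
            const int32_t chunk_start = (p1 / n_swa) * n_swa;
            return p0 < chunk_start;
        }
        case SKETCH_SWA_SYMMETRIC: {
            // Window centered on the query: keys farther than n_swa/2 on either side are hidden.
            const int32_t half = n_swa / 2;
            const int32_t diff = p1 - p0;
            return diff < -half || diff > half;
        }
    }
    return false;
}
```

With a window of 4, for example, the standard type hides key 0 from query 4, while the chunked type hides key 3 from query 5 because they fall in different chunks.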

Related Pages

Principle:Ggml_org_Llama_cpp_ModelArchitecture
