# Principle: HF to GGUF Conversion (ggml-org/llama.cpp)
| Field | Value |
|---|---|
| Principle Name | HF to GGUF Conversion |
| Category | Model Format Transformation |
| Scope | Transforming HuggingFace transformer weights to GGUF format |
| Status | Active |
## Overview

### Description
HF to GGUF conversion is the process of transforming a model stored in HuggingFace format (SafeTensors or PyTorch checkpoint files plus JSON configuration) into the GGUF (GGML Universal Format) binary format used by llama.cpp for inference. This conversion is the central step in making HuggingFace models runnable with llama.cpp.
The conversion process encompasses four major operations:
- Tensor remapping: HuggingFace models use framework-specific tensor naming conventions (e.g., `model.layers.0.self_attn.q_proj.weight`). GGUF uses its own naming scheme (e.g., `blk.0.attn_q.weight`). The conversion must map every source tensor to its correct GGUF name, handling architecture-specific variations across hundreds of model types.
- Data type conversion: Source tensors may be stored in float32, float16, or bfloat16. The conversion can optionally convert or quantize tensors to a different output type: f32, f16, bf16, q8_0, tq1_0, or tq2_0. Certain tensor categories (1D tensors, normalization weights, embedding matrices) have special dtype rules that override the global output type.
- Metadata embedding: GGUF files contain key-value metadata that describes the model architecture, hyperparameters, quantization version, and authorship information. This metadata is extracted from the HuggingFace `config.json`, model card, and other configuration files, then written as structured KV pairs in the GGUF header.
- Vocabulary extraction: Tokenizer information (vocabulary, merge rules, special tokens) is extracted from HuggingFace tokenizer files and embedded directly in the GGUF file, making the output file self-contained for inference.
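For a BPE tokenizer, the vocabulary and merge rules live under `model.vocab` and `model.merges` in HuggingFace's `tokenizer.json`. The sketch below illustrates that extraction step only; it is not the converter's actual code, which also classifies token types and handles special tokens and other tokenizer families.

```python
import json

def extract_bpe_vocab(tokenizer_json: str):
    """Pull the token list and merge rules out of a HuggingFace
    tokenizer.json payload (BPE tokenizers only). Simplified sketch."""
    data = json.loads(tokenizer_json)
    vocab = data["model"]["vocab"]            # token -> id
    merges = data["model"].get("merges", [])
    # Order tokens by id so that index i in the GGUF token list is token id i.
    tokens = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
    return tokens, merges

# Tiny inline example standing in for a real tokenizer.json:
example = json.dumps({
    "model": {"vocab": {"<s>": 0, "he": 1, "llo": 2}, "merges": ["h e"]}
})
tokens, merges = extract_bpe_vocab(example)
print(tokens)   # ['<s>', 'he', 'llo']
```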
### Usage
The conversion is invoked via the command-line interface of `convert_hf_to_gguf.py`:

```sh
python convert_hf_to_gguf.py /path/to/hf-model --outtype f16
```
The conversion follows this execution order:

1. Architecture detection: Read `config.json` to determine the model architecture and select the corresponding model class
2. Model instantiation: Create a `ModelBase` subclass instance that loads hyperparameters and indexes all tensors
3. Tensor preparation: Iterate over all source tensors, remap names, convert dtypes, and apply quantization
4. Metadata preparation: Extract and write model metadata, architecture parameters, and quantization version
5. File writing: Write the GGUF header, KV data, and tensor data to the output file
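The execution order above can be sketched as a small driver. This is illustrative only: the registry and function names here are hypothetical, not the converter's real API.

```python
# Illustrative sketch of the conversion driver; names are hypothetical.

def select_model_class(config: dict) -> str:
    """Step 1: architecture detection via config.json's 'architectures' list."""
    registry = {"LlamaForCausalLM": "LlamaModel",
                "Qwen2ForCausalLM": "Qwen2Model"}
    arch = config["architectures"][0]
    if arch not in registry:
        raise ValueError(f"unsupported architecture: {arch}")
    return registry[arch]

def convert(config: dict, tensors: dict) -> list:
    """Steps 2-5 compressed into a trace showing what happens in which order."""
    trace = [f"instantiate {select_model_class(config)}"]   # step 2
    for name in tensors:                        # step 3: per-tensor remap/convert
        trace.append(f"prepare {name}")
    trace.append("write metadata KV pairs")     # step 4
    trace.append("write header + tensor data")  # step 5
    return trace

steps = convert({"architectures": ["LlamaForCausalLM"]},
                {"model.embed_tokens.weight": None})
print(steps[0])  # instantiate LlamaModel
```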
## Theoretical Basis

### Tensor Remapping
Model conversion requires a mapping from source tensor names to target tensor names. This is complicated by the diversity of HuggingFace model architectures: the llama.cpp project supports hundreds of distinct architectures, each with its own naming conventions.

The mapping is handled by `gguf.TensorNameMap`, which provides a declarative mapping from architecture-neutral tensor categories (defined in `gguf.MODEL_TENSOR`) to architecture-specific source names. Each `ModelBase` subclass can further customize the mapping by overriding `modify_tensors()`.
For example, a standard transformer layer maps as:
| Source (HuggingFace) | Target (GGUF) |
|---|---|
| `model.layers.N.self_attn.q_proj.weight` | `blk.N.attn_q.weight` |
| `model.layers.N.self_attn.k_proj.weight` | `blk.N.attn_k.weight` |
| `model.layers.N.self_attn.v_proj.weight` | `blk.N.attn_v.weight` |
| `model.layers.N.self_attn.o_proj.weight` | `blk.N.attn_output.weight` |
| `model.layers.N.mlp.gate_proj.weight` | `blk.N.ffn_gate.weight` |
| `model.layers.N.mlp.up_proj.weight` | `blk.N.ffn_up.weight` |
| `model.layers.N.mlp.down_proj.weight` | `blk.N.ffn_down.weight` |
| `model.embed_tokens.weight` | `token_embd.weight` |
| `lm_head.weight` | `output.weight` |
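The layer mappings above can be expressed as a lookup table plus a regex that factors out the layer index. This is a reduced sketch, not the real `gguf.TensorNameMap`, which covers many more tensor categories and architectures.

```python
import re

# Reduced per-layer suffix map; the real table is far larger.
_SUFFIX_MAP = {
    "self_attn.q_proj": "attn_q",
    "self_attn.k_proj": "attn_k",
    "self_attn.v_proj": "attn_v",
    "self_attn.o_proj": "attn_output",
    "mlp.gate_proj": "ffn_gate",
    "mlp.up_proj": "ffn_up",
    "mlp.down_proj": "ffn_down",
}
_GLOBAL_MAP = {
    "model.embed_tokens.weight": "token_embd.weight",
    "lm_head.weight": "output.weight",
}

def remap_name(hf_name: str) -> str:
    """Translate one HuggingFace tensor name to its GGUF equivalent."""
    if hf_name in _GLOBAL_MAP:
        return _GLOBAL_MAP[hf_name]
    m = re.fullmatch(r"model\.layers\.(\d+)\.(.+)\.weight", hf_name)
    if m and m.group(2) in _SUFFIX_MAP:
        return f"blk.{m.group(1)}.{_SUFFIX_MAP[m.group(2)]}.weight"
    raise KeyError(f"no mapping for {hf_name}")

print(remap_name("model.layers.0.self_attn.q_proj.weight"))  # blk.0.attn_q.weight
```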
### Data Type Conversion and Quantization
The output data type is controlled by the `--outtype` parameter. The conversion applies dtype rules in a priority cascade:
1. Force rules: Certain tensor categories are always stored as F32 regardless of the output type (1D tensors, normalization weights, positional embeddings, SSM convolution weights, MoE gate inputs).
2. Embedding rules: Token embeddings and output projections use F16 when the global type is a ternary quantization (TQ1_0, TQ2_0).
3. Global type: All remaining tensors use the type specified by `--outtype`.
4. Auto detection: When `--outtype auto` is specified, the script infers the dtype from the first multi-dimensional tensor encountered in the model.
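A minimal sketch of this cascade follows. It keys the force rule off dimensionality and a name substring for brevity, whereas the real converter decides by tensor category; auto detection is omitted.

```python
# Sketch of the dtype priority cascade (simplified; not the converter's code).

def choose_dtype(name: str, n_dims: int, outtype: str) -> str:
    # 1. Force rules: always F32 for 1D tensors and norm-like weights.
    if n_dims == 1 or "norm" in name:
        return "f32"
    # 2. Embedding rules: F16 for embeddings/output under ternary quantization.
    if outtype in ("tq1_0", "tq2_0") and name in ("token_embd.weight",
                                                  "output.weight"):
        return "f16"
    # 3. Global type: everything else follows --outtype.
    return outtype

print(choose_dtype("blk.0.attn_norm.weight", 1, "q8_0"))  # f32
print(choose_dtype("token_embd.weight", 2, "tq1_0"))      # f16
print(choose_dtype("blk.0.attn_q.weight", 2, "q8_0"))     # q8_0
```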
### Metadata Embedding
GGUF metadata serves as a self-describing header that enables inference engines to configure themselves without external configuration files. The metadata includes:
- Architecture identifier: Which model architecture to use for inference
- Hyperparameters: Hidden dimension, number of layers, number of attention heads, vocabulary size, context length, RoPE parameters, etc.
- Quantization version: The GGML quantization format version for compatibility checking
- Model identity: Name, author, description, license, size label, parameter count
- Tokenizer data: Full vocabulary, merge rules, special token IDs, token types
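The mapping from `config.json` fields to GGUF KV keys can be sketched for a handful of hyperparameters. The key names follow the GGUF naming convention (`general.*` and `<arch>.*`); this covers only a small subset of what the converter actually writes.

```python
def build_metadata_kv(config: dict, arch: str = "llama") -> dict:
    """Map a few HuggingFace config.json fields to GGUF KV keys.
    Small illustrative subset of the metadata the converter emits."""
    return {
        "general.architecture": arch,
        f"{arch}.context_length": config["max_position_embeddings"],
        f"{arch}.embedding_length": config["hidden_size"],
        f"{arch}.block_count": config["num_hidden_layers"],
        f"{arch}.attention.head_count": config["num_attention_heads"],
        # GGML quantization format version, for compatibility checking.
        "general.quantization_version": 2,
    }

kv = build_metadata_kv({
    "max_position_embeddings": 4096,
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
})
print(kv["llama.block_count"])  # 32
```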
### File Structure
The GGUF file is written in three sequential phases:
- Header: Magic number, version, tensor count, and metadata KV count
- KV data: All metadata key-value pairs
- Tensor data: Raw tensor bytes, aligned to allow memory-mapped access
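The fixed-size portion of the header can be packed with the standard `struct` module. Per the GGUF specification, it is a 4-byte magic (`GGUF`), a uint32 version (currently 3), and two uint64 counts, all little-endian; a minimal sketch:

```python
import struct

def write_gguf_header(n_tensors: int, n_kv: int, version: int = 3) -> bytes:
    """Pack the fixed GGUF header: 4-byte magic, uint32 version,
    uint64 tensor count, uint64 metadata KV count (little-endian)."""
    return struct.pack("<4sIQQ", b"GGUF", version, n_tensors, n_kv)

header = write_gguf_header(n_tensors=291, n_kv=24)
print(len(header))   # 24
print(header[:4])    # b'GGUF'
```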
For large models, the output can be split across multiple files using `--split-max-tensors` or `--split-max-size`, with each shard containing a subset of tensors and shared metadata.