# Principle: HF to GGUF Conversion (ggml-org/llama.cpp)
| Field | Value |
|---|---|
| Principle Name | HF to GGUF Conversion |
| Category | Model Format Transformation |
| Scope | Transforming HuggingFace transformer weights to GGUF format |
| Status | Active |
## Overview

### Description
HF to GGUF conversion is the process of transforming a model stored in HuggingFace format (SafeTensors or PyTorch checkpoint files plus JSON configuration) into the GGUF (GGML Universal Format) binary format used by llama.cpp for inference. This conversion is the central step in making HuggingFace models runnable with llama.cpp.
The conversion process encompasses four major operations:
- Tensor remapping: HuggingFace models use framework-specific tensor naming conventions (e.g., `model.layers.0.self_attn.q_proj.weight`). GGUF uses its own naming scheme (e.g., `blk.0.attn_q.weight`). The conversion must map every source tensor to its correct GGUF name, handling architecture-specific variations across hundreds of model types.
- Data type conversion: Source tensors may be stored in float32, float16, or bfloat16. The conversion can optionally convert or quantize tensors to a different output type: f32, f16, bf16, q8_0, tq1_0, or tq2_0. Certain tensor categories (1D tensors, normalization weights, embedding matrices) have special dtype rules that override the global output type.
- Metadata embedding: GGUF files contain key-value metadata that describes the model architecture, hyperparameters, quantization version, and authorship information. This metadata is extracted from the HuggingFace `config.json`, model card, and other configuration files, then written as structured KV pairs in the GGUF header.
- Vocabulary extraction: Tokenizer information (vocabulary, merge rules, special tokens) is extracted from HuggingFace tokenizer files and embedded directly in the GGUF file, making the output file self-contained for inference.
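For a BPE tokenizer, the vocabulary and merge rules live under `model.vocab` and `model.merges` in HuggingFace's `tokenizer.json`. The sketch below illustrates that extraction step only; it is not the converter's actual code, which also classifies token types and handles special tokens and other tokenizer families.

```python
import json

def extract_bpe_vocab(tokenizer_json: str):
    """Pull the token list and merge rules out of a HuggingFace
    tokenizer.json payload (BPE tokenizers only). Simplified sketch."""
    data = json.loads(tokenizer_json)
    vocab = data["model"]["vocab"]            # token -> id
    merges = data["model"].get("merges", [])
    # Order tokens by id so that index i in the GGUF token list is token id i.
    tokens = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
    return tokens, merges

# Tiny inline example standing in for a real tokenizer.json:
example = json.dumps({
    "model": {"vocab": {"<s>": 0, "he": 1, "llo": 2}, "merges": ["h e"]}
})
tokens, merges = extract_bpe_vocab(example)
print(tokens)   # ['<s>', 'he', 'llo']
```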
### Usage
The conversion is invoked via the command-line interface of `convert_hf_to_gguf.py`:

```sh
python convert_hf_to_gguf.py /path/to/hf-model --outtype f16
```
The conversion follows this execution order:

1. Architecture detection: Read `config.json` to determine the model architecture and select the corresponding model class
2. Model instantiation: Create a `ModelBase` subclass instance that loads hyperparameters and indexes all tensors
3. Tensor preparation: Iterate over all source tensors, remap names, convert dtypes, and apply quantization
4. Metadata preparation: Extract and write model metadata, architecture parameters, and quantization version
5. File writing: Write the GGUF header, KV data, and tensor data to the output file
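The execution order above can be sketched as a small driver. This is illustrative only: the registry and function names here are hypothetical, not the converter's real API.

```python
# Illustrative sketch of the conversion driver; names are hypothetical.

def select_model_class(config: dict) -> str:
    """Step 1: architecture detection via config.json's 'architectures' list."""
    registry = {"LlamaForCausalLM": "LlamaModel",
                "Qwen2ForCausalLM": "Qwen2Model"}
    arch = config["architectures"][0]
    if arch not in registry:
        raise ValueError(f"unsupported architecture: {arch}")
    return registry[arch]

def convert(config: dict, tensors: dict) -> list:
    """Steps 2-5 compressed into a trace showing what happens in which order."""
    trace = [f"instantiate {select_model_class(config)}"]   # step 2
    for name in tensors:                        # step 3: per-tensor remap/convert
        trace.append(f"prepare {name}")
    trace.append("write metadata KV pairs")     # step 4
    trace.append("write header + tensor data")  # step 5
    return trace

steps = convert({"architectures": ["LlamaForCausalLM"]},
                {"model.embed_tokens.weight": None})
print(steps[0])  # instantiate LlamaModel
```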
## Theoretical Basis

### Tensor Remapping
Model conversion requires a mapping from source tensor names to target tensor names. This is complicated by the diversity of HuggingFace model architectures: the llama.cpp project supports hundreds of distinct architectures, each with its own naming conventions.

The mapping is handled by `gguf.TensorNameMap`, which provides a declarative mapping from architecture-neutral tensor categories (defined in `gguf.MODEL_TENSOR`) to architecture-specific source names. Each `ModelBase` subclass can further customize the mapping by overriding `modify_tensors()`.
For example, a standard transformer layer maps as:
| Source (HuggingFace) | Target (GGUF) |
|---|---|
| `model.layers.N.self_attn.q_proj.weight` | `blk.N.attn_q.weight` |
| `model.layers.N.self_attn.k_proj.weight` | `blk.N.attn_k.weight` |
| `model.layers.N.self_attn.v_proj.weight` | `blk.N.attn_v.weight` |
| `model.layers.N.self_attn.o_proj.weight` | `blk.N.attn_output.weight` |
| `model.layers.N.mlp.gate_proj.weight` | `blk.N.ffn_gate.weight` |
| `model.layers.N.mlp.up_proj.weight` | `blk.N.ffn_up.weight` |
| `model.layers.N.mlp.down_proj.weight` | `blk.N.ffn_down.weight` |
| `model.embed_tokens.weight` | `token_embd.weight` |
| `lm_head.weight` | `output.weight` |
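The layer mappings above can be expressed as a lookup table plus a regex that factors out the layer index. This is a reduced sketch, not the real `gguf.TensorNameMap`, which covers many more tensor categories and architectures.

```python
import re

# Reduced per-layer suffix map; the real table is far larger.
_SUFFIX_MAP = {
    "self_attn.q_proj": "attn_q",
    "self_attn.k_proj": "attn_k",
    "self_attn.v_proj": "attn_v",
    "self_attn.o_proj": "attn_output",
    "mlp.gate_proj": "ffn_gate",
    "mlp.up_proj": "ffn_up",
    "mlp.down_proj": "ffn_down",
}
_GLOBAL_MAP = {
    "model.embed_tokens.weight": "token_embd.weight",
    "lm_head.weight": "output.weight",
}

def remap_name(hf_name: str) -> str:
    """Translate one HuggingFace tensor name to its GGUF equivalent."""
    if hf_name in _GLOBAL_MAP:
        return _GLOBAL_MAP[hf_name]
    m = re.fullmatch(r"model\.layers\.(\d+)\.(.+)\.weight", hf_name)
    if m and m.group(2) in _SUFFIX_MAP:
        return f"blk.{m.group(1)}.{_SUFFIX_MAP[m.group(2)]}.weight"
    raise KeyError(f"no mapping for {hf_name}")

print(remap_name("model.layers.0.self_attn.q_proj.weight"))  # blk.0.attn_q.weight
```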
### Data Type Conversion and Quantization
The output data type is controlled by the `--outtype` parameter. The conversion applies dtype rules in a priority cascade:
1. Force rules: Certain tensor categories are always stored as F32 regardless of the output type (1D tensors, normalization weights, positional embeddings, SSM convolution weights, MoE gate inputs).
2. Embedding rules: Token embeddings and output projections use F16 when the global type is a ternary quantization (TQ1_0, TQ2_0).
3. Global type: All remaining tensors use the type specified by `--outtype`.
4. Auto detection: When `--outtype auto` is specified, the script infers the dtype from the first multi-dimensional tensor encountered in the model.
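A minimal sketch of this cascade follows. It keys the force rule off dimensionality and a name substring for brevity, whereas the real converter decides by tensor category; auto detection is omitted.

```python
# Sketch of the dtype priority cascade (simplified; not the converter's code).

def choose_dtype(name: str, n_dims: int, outtype: str) -> str:
    # 1. Force rules: always F32 for 1D tensors and norm-like weights.
    if n_dims == 1 or "norm" in name:
        return "f32"
    # 2. Embedding rules: F16 for embeddings/output under ternary quantization.
    if outtype in ("tq1_0", "tq2_0") and name in ("token_embd.weight",
                                                  "output.weight"):
        return "f16"
    # 3. Global type: everything else follows --outtype.
    return outtype

print(choose_dtype("blk.0.attn_norm.weight", 1, "q8_0"))  # f32
print(choose_dtype("token_embd.weight", 2, "tq1_0"))      # f16
print(choose_dtype("blk.0.attn_q.weight", 2, "q8_0"))     # q8_0
```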
### Metadata Embedding
GGUF metadata serves as a self-describing header that enables inference engines to configure themselves without external configuration files. The metadata includes:
- Architecture identifier: Which model architecture to use for inference
- Hyperparameters: Hidden dimension, number of layers, number of attention heads, vocabulary size, context length, RoPE parameters, etc.
- Quantization version: The GGML quantization format version for compatibility checking
- Model identity: Name, author, description, license, size label, parameter count
- Tokenizer data: Full vocabulary, merge rules, special token IDs, token types
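The mapping from `config.json` fields to GGUF KV keys can be sketched for a handful of hyperparameters. The key names follow the GGUF naming convention (`general.*` and `<arch>.*`); this covers only a small subset of what the converter actually writes.

```python
def build_metadata_kv(config: dict, arch: str = "llama") -> dict:
    """Map a few HuggingFace config.json fields to GGUF KV keys.
    Small illustrative subset of the metadata the converter emits."""
    return {
        "general.architecture": arch,
        f"{arch}.context_length": config["max_position_embeddings"],
        f"{arch}.embedding_length": config["hidden_size"],
        f"{arch}.block_count": config["num_hidden_layers"],
        f"{arch}.attention.head_count": config["num_attention_heads"],
        # GGML quantization format version, for compatibility checking.
        "general.quantization_version": 2,
    }

kv = build_metadata_kv({
    "max_position_embeddings": 4096,
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
})
print(kv["llama.block_count"])  # 32
```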
### File Structure
The GGUF file is written in three sequential phases:
- Header: Magic number, version, tensor count, and metadata KV count
- KV data: All metadata key-value pairs
- Tensor data: Raw tensor bytes, aligned to allow memory-mapped access
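The fixed-size portion of the header can be packed with the standard `struct` module. Per the GGUF specification, it is a 4-byte magic (`GGUF`), a uint32 version (currently 3), and two uint64 counts, all little-endian; a minimal sketch:

```python
import struct

def write_gguf_header(n_tensors: int, n_kv: int, version: int = 3) -> bytes:
    """Pack the fixed GGUF header: 4-byte magic, uint32 version,
    uint64 tensor count, uint64 metadata KV count (little-endian)."""
    return struct.pack("<4sIQQ", b"GGUF", version, n_tensors, n_kv)

header = write_gguf_header(n_tensors=291, n_kv=24)
print(len(header))   # 24
print(header[:4])    # b'GGUF'
```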
For large models, the output can be split across multiple files using `--split-max-tensors` or `--split-max-size`, with each shard containing a subset of tensors and shared metadata.