Implementation: ggml-org/llama.cpp ModelBase Write
| Field | Value |
|---|---|
| Implementation Name | ModelBase Write |
| Type | API Doc |
| Component | convert_hf_to_gguf.py -- ModelBase class |
| Status | Active |
Overview
Description
The ModelBase class is the core abstraction in llama.cpp's HuggingFace-to-GGUF conversion pipeline. It provides the __init__() constructor for loading model data and the write() method for executing the full conversion. All architecture-specific model classes (e.g., LlamaModel, MistralModel, Qwen2Model) inherit from ModelBase (via TextModel or MmprojModel) and override hooks like set_gguf_parameters(), modify_tensors(), and set_vocab().
The write() method orchestrates the three-phase output process: tensor preparation, metadata preparation, and sequential file writing (header, KV data, tensor data).
The entry point is the main() function (lines 11828-11930), which parses CLI arguments, determines the model class, instantiates it, and calls either write() or write_vocab().
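The subclassing contract can be illustrated with a minimal sketch of the template-method pattern. SimpleModelBase and TinyLlamaModel below are hypothetical stand-ins, not the real llama.cpp classes; only the hook names (set_gguf_parameters, set_vocab, modify_tensors) mirror convert_hf_to_gguf.py.

```python
# Hypothetical stand-in for ModelBase: a fixed write() orchestration
# that calls hooks which subclasses override.
class SimpleModelBase:
    def __init__(self):
        self.kv = {}       # stands in for the GGUF writer's KV store
        self.steps = []    # records the phases write() runs through

    # Hooks overridden by architecture-specific subclasses
    def set_gguf_parameters(self):
        raise NotImplementedError

    def set_vocab(self):
        raise NotImplementedError

    def modify_tensors(self, name, tensor):
        return [(name, tensor)]  # default: pass tensors through unchanged

    def write(self):
        # Fixed orchestration; the real converter prepares tensors,
        # prepares metadata, then writes the file sections in order
        self.steps.append("prepare_tensors")
        self.set_vocab()
        self.set_gguf_parameters()
        self.steps.append("write_file")


# Hypothetical architecture-specific subclass
class TinyLlamaModel(SimpleModelBase):
    def set_gguf_parameters(self):
        self.kv["llama.block_count"] = 32

    def set_vocab(self):
        self.kv["tokenizer.ggml.model"] = "gpt2"


m = TinyLlamaModel()
m.write()
```

In the real converter the hooks drive a GGUFWriter instance; here they record into a dict so the control flow stays visible.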
Usage
Command-line invocation:
```sh
python convert_hf_to_gguf.py <model_dir_or_repo_id> [options]
```
Key CLI parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | positional | required | Directory containing model files, or HuggingFace repo ID (with --remote) |
| --outtype | choice | auto | Output format: f32, f16, bf16, q8_0, tq1_0, tq2_0, auto |
| --outfile | path | auto-generated | Output file path; {ftype} is replaced by the output type |
| --model-name | string | None | Custom model name for GGUF metadata |
| --vocab-only | flag | False | Extract only the vocabulary, skip tensor conversion |
| --split-max-tensors | int | 0 | Maximum tensors per output shard (0 = no splitting) |
| --split-max-size | string | "0" | Maximum size per shard, e.g., 2G, 500M (0 = no splitting) |
| --remote | flag | False | Read tensors remotely from HuggingFace Hub via HTTP |
| --mmproj | flag | False | Export multimodal projector for vision models |
| --bigendian | flag | False | Target big-endian byte order |
| --use-temp-file | flag | False | Use temp files to reduce memory usage |
| --no-lazy | flag | False | Disable lazy tensor evaluation (uses more RAM) |
| --dry-run | flag | False | Print split plan without writing files |
| --no-tensor-first-split | flag | False | Do not add tensors to the first split shard |
| --metadata | path | None | Path to metadata override file |
| --mistral-format | flag | False | Model uses Mistral native format |
| --print-supported-models | flag | False | Print all supported model architectures and exit |
| --sentence-transformers-dense-modules | flag | False | Include sentence-transformer dense modules |
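The {ftype} placeholder accepted by --outfile can be sketched as a simple string substitution. fill_ftype below is a hypothetical helper for illustration, not the converter's actual implementation (which delegates template filling to the gguf package).

```python
from pathlib import Path

def fill_ftype(outfile: str, ftype_name: str) -> Path:
    # Hypothetical sketch: substitute the placeholder with the chosen
    # output type, lower- or upper-cased to match the placeholder's case.
    return Path(outfile.replace("{ftype}", ftype_name.lower())
                       .replace("{FTYPE}", ftype_name.upper()))

print(fill_ftype("llama-3.1-8b-{ftype}.gguf", "F16"))  # llama-3.1-8b-f16.gguf
```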
Code Reference
Source Location
| File | Lines | Description |
|---|---|---|
| convert_hf_to_gguf.py | 79 | ModelBase class definition |
| convert_hf_to_gguf.py | 113-168 | ModelBase.__init__() constructor |
| convert_hf_to_gguf.py | 527-650 | ModelBase.prepare_tensors() method |
| convert_hf_to_gguf.py | 655-682 | ModelBase.prepare_metadata() method |
| convert_hf_to_gguf.py | 687-693 | ModelBase.write() method |
| convert_hf_to_gguf.py | 11691-11785 | parse_args() function |
| convert_hf_to_gguf.py | 11828-11930 | main() entry point |
Signature
ModelBase.__init__():
```python
def __init__(self, dir_model: Path, ftype: gguf.LlamaFileType, fname_out: Path, *,
             is_big_endian: bool = False,
             use_temp_file: bool = False, eager: bool = False,
             metadata_override: Path | None = None, model_name: str | None = None,
             split_max_tensors: int = 0, split_max_size: int = 0, dry_run: bool = False,
             small_first_shard: bool = False, hparams: dict[str, Any] | None = None,
             remote_hf_model_id: str | None = None,
             disable_mistral_community_chat_template: bool = False,
             sentence_transformers_dense_modules: bool = False):
```
ModelBase.write():
```python
def write(self):
    self.prepare_tensors()
    self.prepare_metadata(vocab_only=False)
    self.gguf_writer.write_header_to_file(path=self.fname_out)
    self.gguf_writer.write_kv_data_to_file()
    self.gguf_writer.write_tensors_to_file(progress=True)
    self.gguf_writer.close()
```
ModelBase.prepare_metadata():
```python
def prepare_metadata(self, vocab_only: bool):
    total_params, shared_params, expert_params, expert_count = self.gguf_writer.get_total_parameter_count()
    self.metadata = gguf.Metadata.load(self.metadata_override, self.dir_model_card, self.model_name, total_params)
    if self.remote_hf_model_id:
        self.metadata.name = self.remote_hf_model_id
    if self.metadata.name is None:
        self.metadata.name = self.dir_model.name
    if self.metadata.size_label is None and total_params > 0:
        self.metadata.size_label = gguf.size_label(total_params, shared_params, expert_params, expert_count)
    self.set_type()
    self.metadata.set_gguf_meta_model(self.gguf_writer)
    self.set_gguf_parameters()
    self.gguf_writer.add_quantization_version(gguf.GGML_QUANT_VERSION)
```
Import
```python
import gguf
from pathlib import Path
import torch
import numpy as np
from transformers import AutoConfig
```
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | Directory (Path) | HuggingFace model directory containing weight files (.safetensors or .bin), config.json, and tokenizer files |
| Input | gguf.LlamaFileType | Target output type (ALL_F32, MOSTLY_F16, MOSTLY_BF16, MOSTLY_Q8_0, MOSTLY_TQ1_0, MOSTLY_TQ2_0, GUESSED) |
| Input | Path | Output file path |
| Output | GGUF file(s) | Binary file(s) in GGUF format containing header, metadata KV pairs, and tensor data |
| Side Effects | File system | Creates one or more .gguf files at the specified output path |
| Side Effects | stdout | Logs conversion progress, tensor mappings, and dtype conversions |
Output type mapping (from main(), lines 11861-11868):
| CLI Value | Internal Type | Description |
|---|---|---|
| f32 | gguf.LlamaFileType.ALL_F32 | Full 32-bit floating point |
| f16 | gguf.LlamaFileType.MOSTLY_F16 | 16-bit float (IEEE 754 half) |
| bf16 | gguf.LlamaFileType.MOSTLY_BF16 | Brain floating point 16-bit |
| q8_0 | gguf.LlamaFileType.MOSTLY_Q8_0 | 8-bit quantization (block size 32) |
| tq1_0 | gguf.LlamaFileType.MOSTLY_TQ1_0 | Ternary quantization variant 1 |
| tq2_0 | gguf.LlamaFileType.MOSTLY_TQ2_0 | Ternary quantization variant 2 |
| auto | gguf.LlamaFileType.GUESSED | Auto-detect from source tensor dtype |
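In main(), the CLI string is translated to the enum through a plain dictionary lookup. The sketch below mirrors that mapping with a local stand-in enum; the member values are illustrative and only the names match gguf.LlamaFileType.

```python
from enum import IntEnum

# Stand-in for gguf.LlamaFileType; names mirror the real enum,
# numeric values here are illustrative only.
class LlamaFileType(IntEnum):
    ALL_F32 = 0
    MOSTLY_F16 = 1
    MOSTLY_Q8_0 = 7
    MOSTLY_BF16 = 32
    MOSTLY_TQ1_0 = 36
    MOSTLY_TQ2_0 = 37
    GUESSED = 1024

# Mapping from the --outtype CLI value to the internal file type
ftype_map = {
    "f32":   LlamaFileType.ALL_F32,
    "f16":   LlamaFileType.MOSTLY_F16,
    "bf16":  LlamaFileType.MOSTLY_BF16,
    "q8_0":  LlamaFileType.MOSTLY_Q8_0,
    "tq1_0": LlamaFileType.MOSTLY_TQ1_0,
    "tq2_0": LlamaFileType.MOSTLY_TQ2_0,
    "auto":  LlamaFileType.GUESSED,
}
```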
Usage Examples
Basic conversion with auto type detection:
```sh
python convert_hf_to_gguf.py ./models/Llama-3.1-8B-Instruct --outtype auto
```
Conversion to float16 with custom output path:
```sh
python convert_hf_to_gguf.py ./models/Llama-3.1-8B-Instruct \
    --outtype f16 \
    --outfile ./output/llama-3.1-8b-f16.gguf
```
Remote conversion (tensors streamed from HuggingFace Hub):
```sh
python convert_hf_to_gguf.py --remote --outtype bf16 meta-llama/Llama-3.1-8B-Instruct
```
Split output into multiple shards:
```sh
python convert_hf_to_gguf.py ./models/Llama-3.1-70B \
    --outtype q8_0 \
    --split-max-size 5G
```
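Size strings such as 5G or 500M are parsed into a byte budget before sharding. parse_split_size below is a hypothetical illustration of that parsing (decimal units assumed), not the converter's actual code.

```python
def parse_split_size(size: str) -> int:
    # Hypothetical sketch: accept a bare byte count or a K/M/G suffix
    # (decimal units assumed); "0" means no splitting.
    units = {"K": 1000, "M": 1000 ** 2, "G": 1000 ** 3}
    suffix = size[-1].upper()
    if suffix in units:
        return int(size[:-1]) * units[suffix]
    return int(size)

print(parse_split_size("5G"))  # 5000000000
```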
Extract vocabulary only (no tensor conversion):
```sh
python convert_hf_to_gguf.py ./models/Llama-3.1-8B-Instruct --vocab-only
```
Export multimodal projector for a vision model:
```sh
python convert_hf_to_gguf.py ./models/llava-v1.6 --mmproj --outtype f16
```
Dry run to preview split plan:
```sh
python convert_hf_to_gguf.py ./models/Llama-3.1-70B \
    --outtype f16 \
    --split-max-tensors 100 \
    --dry-run
```
List all supported model architectures:
```sh
python convert_hf_to_gguf.py --print-supported-models
```