Implementation: ggml-org/llama.cpp ModelBase Write
| Field | Value |
|---|---|
| Implementation Name | ModelBase Write |
| Type | API Doc |
| Component | convert_hf_to_gguf.py -- ModelBase class |
| Status | Active |
Overview
Description
The ModelBase class is the core abstraction in llama.cpp's HuggingFace-to-GGUF conversion pipeline. It provides the __init__() constructor for loading model data and the write() method for executing the full conversion. All architecture-specific model classes (e.g., LlamaModel, MistralModel, Qwen2Model) inherit from ModelBase (via TextModel or MmprojModel) and override hooks like set_gguf_parameters(), modify_tensors(), and set_vocab().
The write() method orchestrates the three-phase output process: tensor preparation, metadata preparation, and sequential file writing (header, KV data, tensor data).
The entry point is the main() function (lines 11828-11930), which parses CLI arguments, determines the model class, instantiates it, and calls either write() or write_vocab().
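The subclassing contract can be illustrated with a minimal sketch of the template-method pattern. SimpleModelBase and TinyLlamaModel below are hypothetical stand-ins, not the real llama.cpp classes; only the hook names (set_gguf_parameters, set_vocab, modify_tensors) mirror convert_hf_to_gguf.py.

```python
# Hypothetical stand-in for ModelBase: a fixed write() orchestration
# that calls hooks which subclasses override.
class SimpleModelBase:
    def __init__(self):
        self.kv = {}       # stands in for the GGUF writer's KV store
        self.steps = []    # records the phases write() runs through

    # Hooks overridden by architecture-specific subclasses
    def set_gguf_parameters(self):
        raise NotImplementedError

    def set_vocab(self):
        raise NotImplementedError

    def modify_tensors(self, name, tensor):
        return [(name, tensor)]  # default: pass tensors through unchanged

    def write(self):
        # Fixed orchestration; the real converter prepares tensors,
        # prepares metadata, then writes the file sections in order
        self.steps.append("prepare_tensors")
        self.set_vocab()
        self.set_gguf_parameters()
        self.steps.append("write_file")


# Hypothetical architecture-specific subclass
class TinyLlamaModel(SimpleModelBase):
    def set_gguf_parameters(self):
        self.kv["llama.block_count"] = 32

    def set_vocab(self):
        self.kv["tokenizer.ggml.model"] = "gpt2"


m = TinyLlamaModel()
m.write()
```

In the real converter the hooks drive a GGUFWriter instance; here they record into a dict so the control flow stays visible.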
Usage
Command-line invocation:
```sh
python convert_hf_to_gguf.py <model_dir_or_repo_id> [options]
```
Key CLI parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | positional | required | Directory containing model files, or HuggingFace repo ID (with --remote) |
| --outtype | choice | auto | Output format: f32, f16, bf16, q8_0, tq1_0, tq2_0, auto |
| --outfile | path | auto-generated | Output file path; {ftype} is replaced by the output type |
| --model-name | string | None | Custom model name for GGUF metadata |
| --vocab-only | flag | False | Extract only the vocabulary, skip tensor conversion |
| --split-max-tensors | int | 0 | Maximum tensors per output shard (0 = no splitting) |
| --split-max-size | string | "0" | Maximum size per shard, e.g., 2G, 500M (0 = no splitting) |
| --remote | flag | False | Read tensors remotely from HuggingFace Hub via HTTP |
| --mmproj | flag | False | Export multimodal projector for vision models |
| --bigendian | flag | False | Target big-endian byte order |
| --use-temp-file | flag | False | Use temp files to reduce memory usage |
| --no-lazy | flag | False | Disable lazy tensor evaluation (uses more RAM) |
| --dry-run | flag | False | Print split plan without writing files |
| --no-tensor-first-split | flag | False | Do not add tensors to the first split shard |
| --metadata | path | None | Path to metadata override file |
| --mistral-format | flag | False | Model uses Mistral native format |
| --print-supported-models | flag | False | Print all supported model architectures and exit |
| --sentence-transformers-dense-modules | flag | False | Include sentence-transformer dense modules |
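The {ftype} placeholder accepted by --outfile can be sketched as a simple string substitution. fill_ftype below is a hypothetical helper for illustration, not the converter's actual implementation (which delegates template filling to the gguf package).

```python
from pathlib import Path

def fill_ftype(outfile: str, ftype_name: str) -> Path:
    # Hypothetical sketch: substitute the placeholder with the chosen
    # output type, lower- or upper-cased to match the placeholder's case.
    return Path(outfile.replace("{ftype}", ftype_name.lower())
                       .replace("{FTYPE}", ftype_name.upper()))

print(fill_ftype("llama-3.1-8b-{ftype}.gguf", "F16"))  # llama-3.1-8b-f16.gguf
```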
Code Reference
Source Location
| File | Lines | Description |
|---|---|---|
| convert_hf_to_gguf.py | 79 | ModelBase class definition |
| convert_hf_to_gguf.py | 113-168 | ModelBase.__init__() constructor |
| convert_hf_to_gguf.py | 527-650 | ModelBase.prepare_tensors() method |
| convert_hf_to_gguf.py | 655-682 | ModelBase.prepare_metadata() method |
| convert_hf_to_gguf.py | 687-693 | ModelBase.write() method |
| convert_hf_to_gguf.py | 11691-11785 | parse_args() function |
| convert_hf_to_gguf.py | 11828-11930 | main() entry point |
Signature
ModelBase.__init__():
```python
def __init__(self, dir_model: Path, ftype: gguf.LlamaFileType, fname_out: Path, *,
             is_big_endian: bool = False,
             use_temp_file: bool = False, eager: bool = False,
             metadata_override: Path | None = None, model_name: str | None = None,
             split_max_tensors: int = 0, split_max_size: int = 0, dry_run: bool = False,
             small_first_shard: bool = False, hparams: dict[str, Any] | None = None,
             remote_hf_model_id: str | None = None,
             disable_mistral_community_chat_template: bool = False,
             sentence_transformers_dense_modules: bool = False):
```
ModelBase.write():
```python
def write(self):
    self.prepare_tensors()
    self.prepare_metadata(vocab_only=False)
    self.gguf_writer.write_header_to_file(path=self.fname_out)
    self.gguf_writer.write_kv_data_to_file()
    self.gguf_writer.write_tensors_to_file(progress=True)
    self.gguf_writer.close()
```
ModelBase.prepare_metadata():
```python
def prepare_metadata(self, vocab_only: bool):
    total_params, shared_params, expert_params, expert_count = self.gguf_writer.get_total_parameter_count()
    self.metadata = gguf.Metadata.load(self.metadata_override, self.dir_model_card, self.model_name, total_params)
    if self.remote_hf_model_id:
        self.metadata.name = self.remote_hf_model_id
    if self.metadata.name is None:
        self.metadata.name = self.dir_model.name
    if self.metadata.size_label is None and total_params > 0:
        self.metadata.size_label = gguf.size_label(total_params, shared_params, expert_params, expert_count)
    self.set_type()
    self.metadata.set_gguf_meta_model(self.gguf_writer)
    self.set_gguf_parameters()
    self.gguf_writer.add_quantization_version(gguf.GGML_QUANT_VERSION)
```
Import
```python
import gguf
from pathlib import Path
import torch
import numpy as np
from transformers import AutoConfig
```
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | Directory (Path) | HuggingFace model directory containing weight files (.safetensors or .bin), config.json, and tokenizer files |
| Input | gguf.LlamaFileType | Target output type (ALL_F32, MOSTLY_F16, MOSTLY_BF16, MOSTLY_Q8_0, MOSTLY_TQ1_0, MOSTLY_TQ2_0, GUESSED) |
| Input | Path | Output file path |
| Output | GGUF file(s) | Binary file(s) in GGUF format containing header, metadata KV pairs, and tensor data |
| Side Effects | File system | Creates one or more .gguf files at the specified output path |
| Side Effects | stdout | Logs conversion progress, tensor mappings, and dtype conversions |
Output type mapping (from main(), lines 11861-11868):
| CLI Value | Internal Type | Description |
|---|---|---|
| f32 | gguf.LlamaFileType.ALL_F32 | Full 32-bit floating point |
| f16 | gguf.LlamaFileType.MOSTLY_F16 | 16-bit float (IEEE 754 half) |
| bf16 | gguf.LlamaFileType.MOSTLY_BF16 | Brain floating point 16-bit |
| q8_0 | gguf.LlamaFileType.MOSTLY_Q8_0 | 8-bit quantization (block size 32) |
| tq1_0 | gguf.LlamaFileType.MOSTLY_TQ1_0 | Ternary quantization variant 1 |
| tq2_0 | gguf.LlamaFileType.MOSTLY_TQ2_0 | Ternary quantization variant 2 |
| auto | gguf.LlamaFileType.GUESSED | Auto-detect from source tensor dtype |
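In main(), the CLI string is translated to the enum through a plain dictionary lookup. The sketch below mirrors that mapping with a local stand-in enum; the member values are illustrative and only the names match gguf.LlamaFileType.

```python
from enum import IntEnum

# Stand-in for gguf.LlamaFileType; names mirror the real enum,
# numeric values here are illustrative only.
class LlamaFileType(IntEnum):
    ALL_F32 = 0
    MOSTLY_F16 = 1
    MOSTLY_Q8_0 = 7
    MOSTLY_BF16 = 32
    MOSTLY_TQ1_0 = 36
    MOSTLY_TQ2_0 = 37
    GUESSED = 1024

# Mapping from the --outtype CLI value to the internal file type
ftype_map = {
    "f32":   LlamaFileType.ALL_F32,
    "f16":   LlamaFileType.MOSTLY_F16,
    "bf16":  LlamaFileType.MOSTLY_BF16,
    "q8_0":  LlamaFileType.MOSTLY_Q8_0,
    "tq1_0": LlamaFileType.MOSTLY_TQ1_0,
    "tq2_0": LlamaFileType.MOSTLY_TQ2_0,
    "auto":  LlamaFileType.GUESSED,
}
```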
Usage Examples
Basic conversion with auto type detection:
```sh
python convert_hf_to_gguf.py ./models/Llama-3.1-8B-Instruct --outtype auto
```
Conversion to float16 with custom output path:
```sh
python convert_hf_to_gguf.py ./models/Llama-3.1-8B-Instruct \
    --outtype f16 \
    --outfile ./output/llama-3.1-8b-f16.gguf
```
Remote conversion (tensors streamed from HuggingFace Hub):
```sh
python convert_hf_to_gguf.py --remote --outtype bf16 meta-llama/Llama-3.1-8B-Instruct
```
Split output into multiple shards:
```sh
python convert_hf_to_gguf.py ./models/Llama-3.1-70B \
    --outtype q8_0 \
    --split-max-size 5G
```
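Size strings such as 5G or 500M are parsed into a byte budget before sharding. parse_split_size below is a hypothetical illustration of that parsing (decimal units assumed), not the converter's actual code.

```python
def parse_split_size(size: str) -> int:
    # Hypothetical sketch: accept a bare byte count or a K/M/G suffix
    # (decimal units assumed); "0" means no splitting.
    units = {"K": 1000, "M": 1000 ** 2, "G": 1000 ** 3}
    suffix = size[-1].upper()
    if suffix in units:
        return int(size[:-1]) * units[suffix]
    return int(size)

print(parse_split_size("5G"))  # 5000000000
```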
Extract vocabulary only (no tensor conversion):
```sh
python convert_hf_to_gguf.py ./models/Llama-3.1-8B-Instruct --vocab-only
```
Export multimodal projector for a vision model:
```sh
python convert_hf_to_gguf.py ./models/llava-v1.6 --mmproj --outtype f16
```
Dry run to preview split plan:
```sh
python convert_hf_to_gguf.py ./models/Llama-3.1-70B \
    --outtype f16 \
    --split-max-tensors 100 \
    --dry-run
```
List all supported model architectures:
```sh
python convert_hf_to_gguf.py --print-supported-models
```