Implementation:Ggml org Llama cpp ModelBase Write

From Leeroopedia
| Field | Value |
|---|---|
| Implementation Name | ModelBase Write |
| Type | API Doc |
| Component | convert_hf_to_gguf.py -- ModelBase class |
| Status | Active |

Overview

Description

The ModelBase class is the core abstraction in llama.cpp's HuggingFace-to-GGUF conversion pipeline. It provides the __init__() constructor for loading model data and the write() method for executing the full conversion. All architecture-specific model classes (e.g., LlamaModel, MistralModel, Qwen2Model) inherit from ModelBase (via TextModel or MmprojModel) and override hooks like set_gguf_parameters(), modify_tensors(), and set_vocab().

The write() method orchestrates the three-phase output process: tensor preparation, metadata preparation, and sequential file writing (header, KV data, tensor data).

The entry point is the main() function (lines 11828-11930), which parses CLI arguments, determines the model class, instantiates it, and calls either write() or write_vocab().
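The dispatch described above can be sketched as follows. This is an illustrative stand-in, not the actual convert_hf_to_gguf.py code: the registry name, stub class, and return values here are invented for the example; the real main() selects a registered ModelBase subclass by architecture and calls write() or write_vocab() depending on --vocab-only.

```python
# Illustrative sketch of main()'s dispatch logic (simplified stand-ins,
# not the real convert_hf_to_gguf.py implementation).

class StubModel:
    """Stand-in for an architecture-specific ModelBase subclass."""
    def __init__(self):
        self.written = None

    def write(self):
        self.written = "full"        # tensors + metadata + vocab

    def write_vocab(self):
        self.written = "vocab-only"  # metadata + vocab, no tensor data

# main() resolves the model's architecture string to a registered class.
MODEL_REGISTRY = {"LlamaForCausalLM": StubModel}

def run_conversion(architecture: str, vocab_only: bool) -> str:
    model = MODEL_REGISTRY[architecture]()
    if vocab_only:
        model.write_vocab()
    else:
        model.write()
    return model.written

print(run_conversion("LlamaForCausalLM", vocab_only=False))  # full
```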

Usage

Command-line invocation:

python convert_hf_to_gguf.py <model_dir_or_repo_id> [options]

Key CLI parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | positional | required | Directory containing model files, or HuggingFace repo ID (with --remote) |
| --outtype | choice | auto | Output format: f32, f16, bf16, q8_0, tq1_0, tq2_0, auto |
| --outfile | path | auto-generated | Output file path; {ftype} is replaced by the output type |
| --model-name | string | None | Custom model name for GGUF metadata |
| --vocab-only | flag | False | Extract only the vocabulary, skip tensor conversion |
| --split-max-tensors | int | 0 | Maximum tensors per output shard (0 = no splitting) |
| --split-max-size | string | "0" | Maximum size per shard, e.g., 2G, 500M (0 = no splitting) |
| --remote | flag | False | Read tensors remotely from HuggingFace Hub via HTTP |
| --mmproj | flag | False | Export multimodal projector for vision models |
| --bigendian | flag | False | Target big-endian byte order |
| --use-temp-file | flag | False | Use temp files to reduce memory usage |
| --no-lazy | flag | False | Disable lazy tensor evaluation (uses more RAM) |
| --dry-run | flag | False | Print split plan without writing files |
| --no-tensor-first-split | flag | False | Do not add tensors to the first split shard |
| --metadata | path | None | Path to metadata override file |
| --mistral-format | flag | False | Model uses Mistral native format |
| --print-supported-models | flag | False | Print all supported model architectures and exit |
| --sentence-transformers-dense-modules | flag | False | Include sentence-transformer dense modules |
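A minimal argparse sketch of a subset of these flags, assuming the types and defaults shown in the table (the real parse_args() in convert_hf_to_gguf.py defines the full option set and additional validation):

```python
import argparse

# Minimal sketch of part of the CLI surface above (illustrative only;
# not the real parse_args() from convert_hf_to_gguf.py).
def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="HF -> GGUF conversion (sketch)")
    p.add_argument("model",
                   help="model directory, or HF repo ID when --remote is set")
    p.add_argument("--outtype", default="auto",
                   choices=["f32", "f16", "bf16", "q8_0", "tq1_0", "tq2_0", "auto"])
    p.add_argument("--outfile", default=None,
                   help="output path; {ftype} is replaced by the output type")
    p.add_argument("--vocab-only", action="store_true")
    p.add_argument("--split-max-tensors", type=int, default=0)
    p.add_argument("--remote", action="store_true")
    return p

args = build_parser().parse_args(
    ["./models/Llama-3.1-8B-Instruct", "--outtype", "f16"])
print(args.outtype)  # f16
```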

Code Reference

Source Location

| File | Lines | Description |
|---|---|---|
| convert_hf_to_gguf.py | 79 | ModelBase class definition |
| convert_hf_to_gguf.py | 113-168 | ModelBase.__init__() constructor |
| convert_hf_to_gguf.py | 527-650 | ModelBase.prepare_tensors() method |
| convert_hf_to_gguf.py | 655-682 | ModelBase.prepare_metadata() method |
| convert_hf_to_gguf.py | 687-693 | ModelBase.write() method |
| convert_hf_to_gguf.py | 11691-11785 | parse_args() function |
| convert_hf_to_gguf.py | 11828-11930 | main() entry point |

Signature

ModelBase.__init__():

def __init__(self, dir_model: Path, ftype: gguf.LlamaFileType, fname_out: Path, *,
             is_big_endian: bool = False,
             use_temp_file: bool = False, eager: bool = False,
             metadata_override: Path | None = None, model_name: str | None = None,
             split_max_tensors: int = 0, split_max_size: int = 0, dry_run: bool = False,
             small_first_shard: bool = False, hparams: dict[str, Any] | None = None,
             remote_hf_model_id: str | None = None,
             disable_mistral_community_chat_template: bool = False,
             sentence_transformers_dense_modules: bool = False):

ModelBase.write():

def write(self):
    self.prepare_tensors()
    self.prepare_metadata(vocab_only=False)
    self.gguf_writer.write_header_to_file(path=self.fname_out)
    self.gguf_writer.write_kv_data_to_file()
    self.gguf_writer.write_tensors_to_file(progress=True)
    self.gguf_writer.close()
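The sequencing in write() can be exercised with a stub writer that records the call order. The RecordingWriter and SketchModel classes below are illustrative stand-ins, not the real gguf.GGUFWriter or ModelBase; only the method names and call order mirror the snippet above.

```python
# Stub writer that records call order, to illustrate write()'s sequencing
# (illustrative only -- not the real gguf.GGUFWriter API).
class RecordingWriter:
    def __init__(self):
        self.calls = []
    def write_header_to_file(self, path=None):
        self.calls.append("header")
    def write_kv_data_to_file(self):
        self.calls.append("kv")
    def write_tensors_to_file(self, progress=False):
        self.calls.append("tensors")
    def close(self):
        self.calls.append("close")

class SketchModel:
    def __init__(self):
        self.gguf_writer = RecordingWriter()
        self.fname_out = "out.gguf"
        self.calls = self.gguf_writer.calls
    def prepare_tensors(self):
        self.calls.append("prepare_tensors")
    def prepare_metadata(self, vocab_only):
        self.calls.append("prepare_metadata")
    def write(self):  # mirrors the sequence in ModelBase.write()
        self.prepare_tensors()
        self.prepare_metadata(vocab_only=False)
        self.gguf_writer.write_header_to_file(path=self.fname_out)
        self.gguf_writer.write_kv_data_to_file()
        self.gguf_writer.write_tensors_to_file(progress=True)
        self.gguf_writer.close()

m = SketchModel()
m.write()
print(m.calls)
# ['prepare_tensors', 'prepare_metadata', 'header', 'kv', 'tensors', 'close']
```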

ModelBase.prepare_metadata():

def prepare_metadata(self, vocab_only: bool):
    total_params, shared_params, expert_params, expert_count = self.gguf_writer.get_total_parameter_count()
    self.metadata = gguf.Metadata.load(self.metadata_override, self.dir_model_card, self.model_name, total_params)
    if self.remote_hf_model_id:
        self.metadata.name = self.remote_hf_model_id
    if self.metadata.name is None:
        self.metadata.name = self.dir_model.name
    if self.metadata.size_label is None and total_params > 0:
        self.metadata.size_label = gguf.size_label(total_params, shared_params, expert_params, expert_count)
    self.set_type()
    self.metadata.set_gguf_meta_model(self.gguf_writer)
    self.set_gguf_parameters()
    self.gguf_writer.add_quantization_version(gguf.GGML_QUANT_VERSION)

Import

import gguf
from pathlib import Path
import torch
import numpy as np
from transformers import AutoConfig

I/O Contract

| Direction | Type | Description |
|---|---|---|
| Input | Directory (Path) | HuggingFace model directory containing weight files (.safetensors or .bin), config.json, and tokenizer files |
| Input | gguf.LlamaFileType | Target output type (ALL_F32, MOSTLY_F16, MOSTLY_BF16, MOSTLY_Q8_0, MOSTLY_TQ1_0, MOSTLY_TQ2_0, GUESSED) |
| Input | Path | Output file path |
| Output | GGUF file(s) | Binary file(s) in GGUF format containing header, metadata KV pairs, and tensor data |
| Side effect | File system | Creates one or more .gguf files at the specified output path |
| Side effect | stdout | Logs conversion progress, tensor mappings, and dtype conversions |

Output type mapping (from main(), lines 11861-11868):

| CLI Value | Internal Type | Description |
|---|---|---|
| f32 | gguf.LlamaFileType.ALL_F32 | Full 32-bit floating point |
| f16 | gguf.LlamaFileType.MOSTLY_F16 | 16-bit float (IEEE 754 half) |
| bf16 | gguf.LlamaFileType.MOSTLY_BF16 | Brain floating point 16-bit |
| q8_0 | gguf.LlamaFileType.MOSTLY_Q8_0 | 8-bit quantization (block size 32) |
| tq1_0 | gguf.LlamaFileType.MOSTLY_TQ1_0 | Ternary quantization variant 1 |
| tq2_0 | gguf.LlamaFileType.MOSTLY_TQ2_0 | Ternary quantization variant 2 |
| auto | gguf.LlamaFileType.GUESSED | Auto-detect from source tensor dtype |
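The mapping above can be sketched as a plain dict. Since the gguf package is not imported here, the enum members are represented by their names as strings; resolve_outtype() is an illustrative helper, not a function from the real script.

```python
# Sketch of the --outtype -> LlamaFileType mapping (enum members shown as
# name strings; the real code maps to gguf.LlamaFileType members).
FTYPE_MAP = {
    "f32":   "ALL_F32",
    "f16":   "MOSTLY_F16",
    "bf16":  "MOSTLY_BF16",
    "q8_0":  "MOSTLY_Q8_0",
    "tq1_0": "MOSTLY_TQ1_0",
    "tq2_0": "MOSTLY_TQ2_0",
    "auto":  "GUESSED",  # resolved later from the source tensors' dtype
}

def resolve_outtype(cli_value: str) -> str:
    try:
        return FTYPE_MAP[cli_value]
    except KeyError:
        raise SystemExit(f"unknown --outtype: {cli_value}")

print(resolve_outtype("bf16"))  # MOSTLY_BF16
```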

Usage Examples

Basic conversion with auto type detection:

python convert_hf_to_gguf.py ./models/Llama-3.1-8B-Instruct --outtype auto

Conversion to float16 with custom output path:

python convert_hf_to_gguf.py ./models/Llama-3.1-8B-Instruct \
    --outtype f16 \
    --outfile ./output/llama-3.1-8b-f16.gguf

Remote conversion (tensors streamed from HuggingFace Hub):

python convert_hf_to_gguf.py --remote --outtype bf16 meta-llama/Llama-3.1-8B-Instruct

Split output into multiple shards:

python convert_hf_to_gguf.py ./models/Llama-3.1-70B \
    --outtype q8_0 \
    --split-max-size 5G

Extract vocabulary only (no tensor conversion):

python convert_hf_to_gguf.py ./models/Llama-3.1-8B-Instruct --vocab-only

Export multimodal projector for a vision model:

python convert_hf_to_gguf.py ./models/llava-v1.6 --mmproj --outtype f16

Dry run to preview split plan:

python convert_hf_to_gguf.py ./models/Llama-3.1-70B \
    --outtype f16 \
    --split-max-tensors 100 \
    --dry-run

List all supported model architectures:

python convert_hf_to_gguf.py --print-supported-models
