Implementation: unslothai/unsloth save_pretrained_gguf (GGUF Export)
| Knowledge Sources | |
|---|---|
| Domains | Model_Deployment, Quantization, Serialization |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A concrete tool, provided by the Unsloth library, for exporting merged models to GGUF format with configurable quantization.
Description
model.save_pretrained_gguf orchestrates the full GGUF export pipeline:
- Merges LoRA adapters if not already merged (calls unsloth_save_model internally)
- Clones and builds llama.cpp if not available (automated CMake build)
- Runs convert_hf_to_gguf.py to convert SafeTensors to GGUF format
- Runs llama-quantize to apply the specified quantization scheme
- Generates an Ollama Modelfile with the correct chat template
The function supports exporting multiple quantization levels at once by passing a list of quantization methods.
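The conversion and quantization steps above boil down to two llama.cpp invocations. The sketch below builds those commands for illustration only; the script name (convert_hf_to_gguf.py) and binary name (llama-quantize) follow current llama.cpp conventions, while the exact flags and paths Unsloth uses internally may differ.

```python
import os

def build_gguf_commands(model_dir, out_dir, quant="q4_k_m", llama_cpp="llama.cpp"):
    """Sketch of the two conversion steps Unsloth automates (illustrative)."""
    f16_gguf = os.path.join(out_dir, "model-F16.gguf")  # unquantized intermediate
    out_gguf = os.path.join(out_dir, f"model-{quant.upper()}.gguf")
    # Step 1: SafeTensors -> GGUF at 16-bit precision
    convert_cmd = [
        "python", os.path.join(llama_cpp, "convert_hf_to_gguf.py"),
        model_dir, "--outfile", f16_gguf, "--outtype", "f16",
    ]
    # Step 2: apply the requested quantization scheme
    quantize_cmd = [os.path.join(llama_cpp, "llama-quantize"),
                    f16_gguf, out_gguf, quant]
    return convert_cmd, quantize_cmd

convert_cmd, quantize_cmd = build_gguf_commands("./merged_model", "./gguf_output")
```

Running these two steps manually is only needed when bypassing the automated pipeline, e.g. on a machine without a compiler toolchain.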
Usage
Call this method on a trained PeftModel or an already-merged model. The function handles the entire pipeline automatically, including llama.cpp installation, and requires sufficient free disk space for intermediate conversion files.
Code Reference
Source Location
- Repository: unsloth
- File: unsloth/save.py
- Lines: L1785-2058 (unsloth_save_pretrained_gguf), L1070-1336 (save_to_gguf orchestrator)
Signature
def unsloth_save_pretrained_gguf(
self,
save_directory: Union[str, os.PathLike],
tokenizer = None,
quantization_method = "fast_quantized",
first_conversion: str = None,
push_to_hub: bool = False,
token: Optional[Union[str, bool]] = None,
private: Optional[bool] = None,
is_main_process: bool = True,
state_dict: Optional[dict] = None,
save_function: Callable = torch.save,
max_shard_size: Union[int, str] = "5GB",
safe_serialization: bool = True,
variant: Optional[str] = None,
save_peft_format: bool = True,
tags: List[str] = None,
temporary_location: str = "_unsloth_temporary_saved_buffers",
maximum_memory_usage: float = 0.85,
) -> None:
"""
Exports model to GGUF format with quantization.
quantization_method options:
"not_quantized" — Fast conversion, no quantization. Large files.
"fast_quantized" — Fast conversion, default quantization. Recommended.
"quantized" — Slow conversion, aggressive quantization.
"f16" — Float16, highest quality.
"q8_0" — 8-bit, fast and high quality.
"q4_k_m" — 4-bit mixed, recommended for deployment.
"q5_k_m" — 5-bit mixed, good quality.
list[str] — Multiple quantization levels exported at once.
"""
Import
# Called as a method on the model instance:
model.save_pretrained_gguf("./gguf_output", tokenizer=tokenizer, quantization_method="q4_k_m")
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| save_directory | str or os.PathLike | Yes | Output directory for GGUF files |
| tokenizer | PreTrainedTokenizer | No | Tokenizer (needed for template detection) |
| quantization_method | str or list[str] | No | GGUF quantization type(s) (default: "fast_quantized") |
| first_conversion | str | No | Initial conversion dtype (f16/bf16) |
| maximum_memory_usage | float | No | GPU memory threshold (default: 0.85) |
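maximum_memory_usage caps the fraction of GPU memory the merge step may occupy. A hedged sketch of how such a fractional threshold translates into a byte budget (illustrative arithmetic only, not Unsloth's internal accounting):

```python
def memory_budget_bytes(total_bytes, maximum_memory_usage=0.85):
    """Return the byte budget implied by a fractional GPU memory cap."""
    if not 0.0 < maximum_memory_usage <= 1.0:
        raise ValueError("maximum_memory_usage must be in (0, 1]")
    return int(total_bytes * maximum_memory_usage)

# A 24 GB card at the 0.85 default leaves roughly 20.4 GiB for the merge
budget = memory_budget_bytes(24 * 1024**3)
```

Lowering the value leaves more headroom for other processes at the cost of slower, more incremental merging.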
Outputs
| Name | Type | Description |
|---|---|---|
| GGUF file | File | .gguf file in save_directory with specified quantization |
| Modelfile | File | Ollama Modelfile with correct chat template |
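A sketch of the files to expect in save_directory after an export. The `<model>-<QUANT>.gguf` naming pattern is an assumption based on common llama.cpp/Unsloth conventions; exact filenames vary by version.

```python
def expected_outputs(model_name, quant_methods):
    """List the GGUF files plus the Ollama Modelfile an export should produce."""
    files = [f"{model_name}-{q.upper()}.gguf" for q in quant_methods]
    files.append("Modelfile")  # Ollama template generated alongside the weights
    return files

print(expected_outputs("model", ["q4_k_m", "q8_0"]))
# → ['model-Q4_K_M.gguf', 'model-Q8_0.gguf', 'Modelfile']
```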
Usage Examples
Export Single Quantization
model.save_pretrained_gguf(
"./model-q4_k_m",
tokenizer=tokenizer,
quantization_method="q4_k_m",
)
Export Multiple Quantizations
model.save_pretrained_gguf(
"./model-gguf",
tokenizer=tokenizer,
quantization_method=["q4_k_m", "q5_k_m", "q8_0"],
)
# Produces three .gguf files in the output directory
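Once exported, the generated Modelfile can be registered with Ollama via `ollama create`. A hedged sketch of composing that command from Python (assumes the `ollama` CLI is installed; the command is built but not executed here):

```python
import os
import shlex

def ollama_create_command(model_name, gguf_dir):
    """Build the `ollama create` invocation for the generated Modelfile."""
    modelfile = os.path.join(gguf_dir, "Modelfile")
    return ["ollama", "create", model_name, "-f", modelfile]

cmd = ollama_create_command("my-model", "./model-gguf")
print(shlex.join(cmd))
# → ollama create my-model -f ./model-gguf/Modelfile
```

Passing the list to subprocess.run would register the model locally, after which it is served with `ollama run my-model`.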