
Implementation:Unslothai Unsloth Save Pretrained GGUF

From Leeroopedia


Knowledge Sources
Domains Model_Deployment, Quantization, Serialization
Last Updated 2026-02-07 00:00 GMT

Overview

A concrete tool from the Unsloth library for exporting merged models to GGUF format with configurable quantization.

Description

model.save_pretrained_gguf orchestrates the full GGUF export pipeline:

  1. Merges LoRA adapters if not already merged (calls unsloth_save_model internally)
  2. Clones and builds llama.cpp if not available (automated CMake build)
  3. Runs convert_hf_to_gguf.py to convert SafeTensors to GGUF format
  4. Runs llama-quantize to apply the specified quantization scheme
  5. Generates an Ollama Modelfile with the correct chat template

The function supports exporting multiple quantization levels at once by passing a list of quantization methods.
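The str-or-list handling of quantization_method can be sketched as a small normalization step. The helper below is illustrative only; normalize_quant_methods and KNOWN_METHODS are hypothetical names, not part of the Unsloth API:

```python
# Hypothetical helper (not an Unsloth API) showing how a str-or-list
# quantization_method argument can be normalized and validated before export.
KNOWN_METHODS = {
    "not_quantized", "fast_quantized", "quantized",
    "f16", "q8_0", "q4_k_m", "q5_k_m",
}

def normalize_quant_methods(quantization_method):
    """Accept a single method name or a list of names; return a validated list."""
    if isinstance(quantization_method, str):
        methods = [quantization_method]
    else:
        methods = list(quantization_method)
    unknown = [m for m in methods if m not in KNOWN_METHODS]
    if unknown:
        raise ValueError(f"Unknown quantization method(s): {unknown}")
    return methods
```

With this shape, the export loop can simply iterate over the returned list, producing one .gguf file per method.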

Usage

Call on a trained PeftModel or already-merged model. The function handles the entire pipeline automatically, including llama.cpp installation. Requires sufficient disk space for intermediate files.
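Because the pipeline writes a full-precision intermediate before quantizing, a rough pre-flight disk check can avoid a failure partway through. The sketch below is not an Unsloth API, and the bytes-per-parameter figures are ballpark assumptions (2 bytes/param for the f16 intermediate, sub-byte estimates for k-quants), not numbers from the library:

```python
import shutil

# Assumed rough sizes in bytes per parameter; illustrative only.
BYTES_PER_PARAM = {"f16": 2.0, "q8_0": 1.0, "q5_k_m": 0.75, "q4_k_m": 0.6}

def has_space_for_export(path, n_params, method="q4_k_m", headroom=1.2):
    """Return True if the disk at `path` likely has room for the f16
    intermediate plus the quantized output, with a safety headroom factor."""
    needed = n_params * (BYTES_PER_PARAM["f16"] + BYTES_PER_PARAM[method]) * headroom
    free = shutil.disk_usage(path).free
    return free >= needed
```

For example, a 7B-parameter model would need roughly 18 GB free under these assumptions before a q4_k_m export.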

Code Reference

Source Location

  • Repository: unsloth
  • File: unsloth/save.py
  • Lines: L1785-2058 (unsloth_save_pretrained_gguf), L1070-1336 (save_to_gguf orchestrator)

Signature

def unsloth_save_pretrained_gguf(
    self,
    save_directory: Union[str, os.PathLike],
    tokenizer = None,
    quantization_method = "fast_quantized",
    first_conversion: Optional[str] = None,
    push_to_hub: bool = False,
    token: Optional[Union[str, bool]] = None,
    private: Optional[bool] = None,
    is_main_process: bool = True,
    state_dict: Optional[dict] = None,
    save_function: Callable = torch.save,
    max_shard_size: Union[int, str] = "5GB",
    safe_serialization: bool = True,
    variant: Optional[str] = None,
    save_peft_format: bool = True,
    tags: Optional[List[str]] = None,
    temporary_location: str = "_unsloth_temporary_saved_buffers",
    maximum_memory_usage: float = 0.85,
) -> None:
    """
    Exports model to GGUF format with quantization.

    quantization_method options:
        "not_quantized"  - Fast conversion, no quantization. Large files.
        "fast_quantized" - Fast conversion, default quantization. Recommended.
        "quantized"      - Slow conversion, aggressive quantization.
        "f16"            - Float16, highest quality.
        "q8_0"           - 8-bit, fast and high quality.
        "q4_k_m"         - 4-bit mixed, recommended for deployment.
        "q5_k_m"         - 5-bit mixed, good quality.
        list[str]        - Multiple quantization levels exported at once.
    """

Import

# Called as a method on the model instance:
model.save_pretrained_gguf("./gguf_output", tokenizer=tokenizer, quantization_method="q4_k_m")

I/O Contract

Inputs

Name                  Type                 Required  Description
save_directory        str                  Yes       Output directory for GGUF files
tokenizer             PreTrainedTokenizer  No        Tokenizer (needed for chat template detection)
quantization_method   str or list[str]     No        GGUF quantization type(s) (default: "fast_quantized")
first_conversion      str                  No        Initial conversion dtype ("f16" or "bf16")
maximum_memory_usage  float                No        GPU memory usage threshold (default: 0.85)
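The input contract above can be mirrored as a small config object with the same defaults. GGUFExportConfig below is a hypothetical illustration, not a class from the library; the bounds check on maximum_memory_usage is an assumption of this sketch:

```python
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class GGUFExportConfig:
    """Illustrative mirror of the export inputs; not an Unsloth class."""
    save_directory: str
    quantization_method: Union[str, List[str]] = "fast_quantized"
    first_conversion: Optional[str] = None  # "f16" or "bf16"
    maximum_memory_usage: float = 0.85

    def __post_init__(self):
        # Assumed sanity check: the GPU memory threshold is a fraction.
        if not (0.0 < self.maximum_memory_usage <= 1.0):
            raise ValueError("maximum_memory_usage must be in (0, 1]")
```

Centralizing the defaults this way makes it easy to reuse one validated configuration across several export calls.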

Outputs

Name       Type  Description
GGUF file  File  .gguf file(s) in save_directory with the specified quantization
Modelfile  File  Ollama Modelfile with the correct chat template

Usage Examples

Export Single Quantization

model.save_pretrained_gguf(
    "./model-q4_k_m",
    tokenizer=tokenizer,
    quantization_method="q4_k_m",
)

Export Multiple Quantizations

model.save_pretrained_gguf(
    "./model-gguf",
    tokenizer=tokenizer,
    quantization_method=["q4_k_m", "q5_k_m", "q8_0"],
)
# Produces three .gguf files in the output directory
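After a multi-quantization export, a simple check can confirm the promised outputs (one .gguf per method plus the Modelfile) actually landed. verify_gguf_outputs and the simulated directory below are illustrative only; real Unsloth output filenames may differ:

```python
import pathlib
import tempfile

def verify_gguf_outputs(out_dir, methods):
    """Return True if the directory holds at least one .gguf per requested
    method and the Ollama Modelfile. Illustrative helper, not an Unsloth API."""
    out = pathlib.Path(out_dir)
    ggufs = list(out.glob("*.gguf"))
    has_modelfile = (out / "Modelfile").exists()
    return len(ggufs) >= len(methods) and has_modelfile

# Simulated output directory for demonstration only (empty placeholder files):
with tempfile.TemporaryDirectory() as d:
    for m in ["q4_k_m", "q5_k_m", "q8_0"]:
        (pathlib.Path(d) / f"model-{m}.gguf").touch()
    (pathlib.Path(d) / "Modelfile").touch()
    ok = verify_gguf_outputs(d, ["q4_k_m", "q5_k_m", "q8_0"])
```

In practice this kind of check is most useful in CI, where a silent llama-quantize failure would otherwise surface only at deployment time.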

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
