Implementation: unslothai/unsloth save_pretrained_gguf (GGUF Export)
| Knowledge Sources | |
|---|---|
| Domains | Model_Deployment, Quantization, Serialization |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A concrete tool, provided by the Unsloth library, for exporting merged models to GGUF format with configurable quantization.
Description
model.save_pretrained_gguf orchestrates the full GGUF export pipeline:
- Merges LoRA adapters if not already merged (calls unsloth_save_model internally)
- Clones and builds llama.cpp if not available (automated CMake build)
- Runs convert_hf_to_gguf.py to convert SafeTensors to GGUF format
- Runs llama-quantize to apply the specified quantization scheme
- Generates an Ollama Modelfile with the correct chat template
The function supports exporting multiple quantization levels at once by passing a list of quantization methods.
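The conversion and quantization steps above boil down to two llama.cpp invocations. The sketch below builds those commands for illustration only; the script name (convert_hf_to_gguf.py) and binary name (llama-quantize) follow current llama.cpp conventions, while the exact flags and paths Unsloth uses internally may differ.

```python
import os

def build_gguf_commands(model_dir, out_dir, quant="q4_k_m", llama_cpp="llama.cpp"):
    """Sketch of the two conversion steps Unsloth automates (illustrative)."""
    f16_gguf = os.path.join(out_dir, "model-F16.gguf")  # unquantized intermediate
    out_gguf = os.path.join(out_dir, f"model-{quant.upper()}.gguf")
    # Step 1: SafeTensors -> GGUF at 16-bit precision
    convert_cmd = [
        "python", os.path.join(llama_cpp, "convert_hf_to_gguf.py"),
        model_dir, "--outfile", f16_gguf, "--outtype", "f16",
    ]
    # Step 2: apply the requested quantization scheme
    quantize_cmd = [os.path.join(llama_cpp, "llama-quantize"),
                    f16_gguf, out_gguf, quant]
    return convert_cmd, quantize_cmd

convert_cmd, quantize_cmd = build_gguf_commands("./merged_model", "./gguf_output")
```

Running these two steps manually is only needed when bypassing the automated pipeline, e.g. on a machine without a compiler toolchain.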
Usage
Call this method on a trained PeftModel or an already-merged model. The function handles the entire pipeline automatically, including llama.cpp installation, and requires sufficient free disk space for intermediate conversion files.
Code Reference
Source Location
- Repository: unsloth
- File: unsloth/save.py
- Lines: L1785-2058 (unsloth_save_pretrained_gguf), L1070-1336 (save_to_gguf orchestrator)
Signature
def unsloth_save_pretrained_gguf(
self,
save_directory: Union[str, os.PathLike],
tokenizer = None,
quantization_method = "fast_quantized",
first_conversion: str = None,
push_to_hub: bool = False,
token: Optional[Union[str, bool]] = None,
private: Optional[bool] = None,
is_main_process: bool = True,
state_dict: Optional[dict] = None,
save_function: Callable = torch.save,
max_shard_size: Union[int, str] = "5GB",
safe_serialization: bool = True,
variant: Optional[str] = None,
save_peft_format: bool = True,
tags: List[str] = None,
temporary_location: str = "_unsloth_temporary_saved_buffers",
maximum_memory_usage: float = 0.85,
) -> None:
"""
Exports model to GGUF format with quantization.
quantization_method options:
"not_quantized" — Fast conversion, no quantization. Large files.
"fast_quantized" — Fast conversion, default quantization. Recommended.
"quantized" — Slow conversion, aggressive quantization.
"f16" — Float16, highest quality.
"q8_0" — 8-bit, fast and high quality.
"q4_k_m" — 4-bit mixed, recommended for deployment.
"q5_k_m" — 5-bit mixed, good quality.
list[str] — Multiple quantization levels exported at once.
"""
Import
# Called as a method on the model instance:
model.save_pretrained_gguf("./gguf_output", tokenizer=tokenizer, quantization_method="q4_k_m")
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| save_directory | str or os.PathLike | Yes | Output directory for GGUF files |
| tokenizer | PreTrainedTokenizer | No | Tokenizer (needed for template detection) |
| quantization_method | str or list[str] | No | GGUF quantization type(s) (default: "fast_quantized") |
| first_conversion | str | No | Initial conversion dtype (f16/bf16) |
| maximum_memory_usage | float | No | GPU memory threshold (default: 0.85) |
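maximum_memory_usage caps the fraction of GPU memory the merge step may occupy. A hedged sketch of how such a fractional threshold translates into a byte budget (illustrative arithmetic only, not Unsloth's internal accounting):

```python
def memory_budget_bytes(total_bytes, maximum_memory_usage=0.85):
    """Return the byte budget implied by a fractional GPU memory cap."""
    if not 0.0 < maximum_memory_usage <= 1.0:
        raise ValueError("maximum_memory_usage must be in (0, 1]")
    return int(total_bytes * maximum_memory_usage)

# A 24 GB card at the 0.85 default leaves roughly 20.4 GiB for the merge
budget = memory_budget_bytes(24 * 1024**3)
```

Lowering the value leaves more headroom for other processes at the cost of slower, more incremental merging.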
Outputs
| Name | Type | Description |
|---|---|---|
| GGUF file | File | .gguf file in save_directory with specified quantization |
| Modelfile | File | Ollama Modelfile with correct chat template |
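A sketch of the files to expect in save_directory after an export. The `<model>-<QUANT>.gguf` naming pattern is an assumption based on common llama.cpp/Unsloth conventions; exact filenames vary by version.

```python
def expected_outputs(model_name, quant_methods):
    """List the GGUF files plus the Ollama Modelfile an export should produce."""
    files = [f"{model_name}-{q.upper()}.gguf" for q in quant_methods]
    files.append("Modelfile")  # Ollama template generated alongside the weights
    return files

print(expected_outputs("model", ["q4_k_m", "q8_0"]))
# → ['model-Q4_K_M.gguf', 'model-Q8_0.gguf', 'Modelfile']
```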
Usage Examples
Export Single Quantization
model.save_pretrained_gguf(
"./model-q4_k_m",
tokenizer=tokenizer,
quantization_method="q4_k_m",
)
Export Multiple Quantizations
model.save_pretrained_gguf(
"./model-gguf",
tokenizer=tokenizer,
quantization_method=["q4_k_m", "q5_k_m", "q8_0"],
)
# Produces three .gguf files in the output directory
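Once exported, the generated Modelfile can be registered with Ollama via `ollama create`. A hedged sketch of composing that command from Python (assumes the `ollama` CLI is installed; the command is built but not executed here):

```python
import os
import shlex

def ollama_create_command(model_name, gguf_dir):
    """Build the `ollama create` invocation for the generated Modelfile."""
    modelfile = os.path.join(gguf_dir, "Modelfile")
    return ["ollama", "create", model_name, "-f", modelfile]

cmd = ollama_create_command("my-model", "./model-gguf")
print(shlex.join(cmd))
# → ollama create my-model -f ./model-gguf/Modelfile
```

Passing the list to subprocess.run would register the model locally, after which it is served with `ollama run my-model`.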