Principle: Unslothai Unsloth GGUF Export
| Knowledge Sources | |
|---|---|
| Domains | Model_Deployment, Quantization, Serialization |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A model export technique that converts merged SafeTensors weights into the GGUF binary format with optional quantization for efficient CPU and edge-device inference via llama.cpp.
Description
GGUF (the successor to the earlier GGML format) is the standard file format for llama.cpp, the leading C++ inference engine for running LLMs on CPUs and consumer hardware. GGUF export takes a merged HuggingFace model and converts it in three steps:
- SafeTensors to GGUF Conversion: Using llama.cpp's convert_hf_to_gguf.py script to convert the HF model format to GGUF's tensor layout.
- Quantization: Applying GGUF-specific quantization schemes (q4_k_m, q5_k_m, q8_0, etc.) that use mixed-precision strategies for optimal quality-size tradeoffs.
- Ollama Modelfile Generation: Automatically generating an Ollama Modelfile with the correct chat template for local deployment.
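Assuming a local llama.cpp checkout and a merged model directory at `./merged_model` (all paths, file names, and the `my-model` tag below are illustrative), the three steps above map onto commands roughly like the following:

```shell
# 1. Convert the merged HF model (SafeTensors) to an F16 GGUF file.
python llama.cpp/convert_hf_to_gguf.py ./merged_model \
    --outfile model-f16.gguf --outtype f16

# 2. Quantize the F16 GGUF down to Q4_K_M with llama.cpp's quantize tool.
llama.cpp/build/bin/llama-quantize model-f16.gguf model-q4_k_m.gguf q4_k_m

# 3. Write a minimal Ollama Modelfile pointing at the quantized file
#    (the TEMPLATE here is a placeholder; use your model's chat template).
cat > Modelfile <<'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .Prompt }}"""
PARAMETER temperature 0.7
EOF
ollama create my-model -f Modelfile
```

The build path of `llama-quantize` depends on how llama.cpp was compiled; older checkouts name the binary `quantize`.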
GGUF quantization differs from training-time quantization (BitsAndBytes): it uses block-wise quantization with importance-based mixed precision, where attention and feed-forward layers can have different quantization levels.
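To make the block-wise idea concrete, here is a minimal sketch of q8_0-style quantization: each block of 32 values stores one fp16 scale plus 32 int8 codes. This is a simplified illustration, not llama.cpp's actual implementation (the k-quants add importance-based mixed precision on top of this).

```python
import numpy as np

def quantize_q8_0(weights, block_size=32):
    """Block-wise 8-bit quantization in the spirit of GGUF's q8_0:
    each block stores one fp16 scale plus block_size int8 codes."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.round(blocks / scales).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_q8_0(q, scales):
    """Reconstruct float weights from int8 codes and per-block scales."""
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_q8_0(w)
w_hat = dequantize_q8_0(q, s)
max_err = np.abs(w - w_hat).max()  # bounded by half the per-block step size
```

Because the scale is computed per 32-value block rather than per tensor, a single outlier only degrades precision within its own block.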
Usage
Use this principle after saving a merged model when the deployment target is llama.cpp, Ollama, LM Studio, or other GGUF-compatible inference engines. Not needed for HuggingFace-native deployment.
Theoretical Basis
GGUF quantization applies block-wise quantization. In pseudocode:

```python
# Abstract GGUF quantization pipeline (illustrative function names)
merged_model = load_safetensors("./merged_model")
gguf_f16 = convert_hf_to_gguf(merged_model)         # HF format -> GGUF F16
gguf_quantized = quantize_gguf(gguf_f16, "q4_k_m")  # F16 -> Q4_K_M
# Q4_K_M uses Q6_K for half of the attention.wv and ffn.w2 tensors,
# Q4_K for everything else, balancing quality and size
```
Common quantization types, ordered from highest to lowest quality:
- f16: Full float16, largest file, highest quality
- q8_0: 8-bit, ~50% size reduction, minimal quality loss
- q5_k_m: 5-bit mixed, good quality-size balance
- q4_k_m: 4-bit mixed, recommended default
- q2_k: 2-bit, smallest file, noticeable quality loss
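The size differences can be estimated from approximate bits-per-weight figures. The values below are ballpark numbers for illustration (exact file sizes vary slightly by model architecture and tensor shapes):

```python
# Approximate bits-per-weight for common GGUF quantization types.
# These are rough figures for illustration, not exact specification values.
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 8.5,    # 32 int8 codes + one fp16 scale per 32-value block
    "q5_k_m": 5.7,
    "q4_k_m": 4.85,
    "q2_k": 2.6,
}

def estimated_file_gb(n_params, quant_type):
    """Rough GGUF file size in gigabytes for a given parameter count."""
    return n_params * BITS_PER_WEIGHT[quant_type] / 8 / 1e9

# Estimated sizes for a 7B-parameter model under each quantization type.
sizes = {q: round(estimated_file_gb(7e9, q), 1) for q in BITS_PER_WEIGHT}
```

For a 7B model this gives roughly 14 GB at f16 versus around 4 GB at q4_k_m, which is why q4_k_m is the usual default for consumer hardware.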