Principle: Unslothai Unsloth GGUF Export
| Knowledge Sources | |
|---|---|
| Domains | Model_Deployment, Quantization, Serialization |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A model export technique that converts merged SafeTensors weights into the GGUF binary format with optional quantization for efficient CPU and edge-device inference via llama.cpp.
Description
GGUF (the successor to the earlier GGML format) is the standard file format for llama.cpp, the leading C++ inference engine for running LLMs on CPUs and consumer hardware. GGUF export takes a merged HuggingFace model and converts it in three steps:
- SafeTensors to GGUF Conversion: Using llama.cpp's convert_hf_to_gguf.py script to convert the HF model format to GGUF's tensor layout.
- Quantization: Applying GGUF-specific quantization schemes (q4_k_m, q5_k_m, q8_0, etc.) that use mixed-precision strategies for optimal quality-size tradeoffs.
- Ollama Modelfile Generation: Automatically generating an Ollama Modelfile with the correct chat template for local deployment.
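Assuming a local llama.cpp checkout and a merged model directory at `./merged_model` (all paths, file names, and the `my-model` tag below are illustrative), the three steps above map onto commands roughly like the following:

```shell
# 1. Convert the merged HF model (SafeTensors) to an F16 GGUF file.
python llama.cpp/convert_hf_to_gguf.py ./merged_model \
    --outfile model-f16.gguf --outtype f16

# 2. Quantize the F16 GGUF down to Q4_K_M with llama.cpp's quantize tool.
llama.cpp/build/bin/llama-quantize model-f16.gguf model-q4_k_m.gguf q4_k_m

# 3. Write a minimal Ollama Modelfile pointing at the quantized file
#    (the TEMPLATE here is a placeholder; use your model's chat template).
cat > Modelfile <<'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .Prompt }}"""
PARAMETER temperature 0.7
EOF
ollama create my-model -f Modelfile
```

The build path of `llama-quantize` depends on how llama.cpp was compiled; older checkouts name the binary `quantize`.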
GGUF quantization differs from training-time quantization (BitsAndBytes): it uses block-wise quantization with importance-based mixed precision, where attention and feed-forward layers can have different quantization levels.
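To make the block-wise idea concrete, here is a minimal sketch of q8_0-style quantization: each block of 32 values stores one fp16 scale plus 32 int8 codes. This is a simplified illustration, not llama.cpp's actual implementation (the k-quants add importance-based mixed precision on top of this).

```python
import numpy as np

def quantize_q8_0(weights, block_size=32):
    """Block-wise 8-bit quantization in the spirit of GGUF's q8_0:
    each block stores one fp16 scale plus block_size int8 codes."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.round(blocks / scales).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_q8_0(q, scales):
    """Reconstruct float weights from int8 codes and per-block scales."""
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_q8_0(w)
w_hat = dequantize_q8_0(q, s)
max_err = np.abs(w - w_hat).max()  # bounded by half the per-block step size
```

Because the scale is computed per 32-value block rather than per tensor, a single outlier only degrades precision within its own block.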
Usage
Use this principle after saving a merged model when the deployment target is llama.cpp, Ollama, LM Studio, or other GGUF-compatible inference engines. Not needed for HuggingFace-native deployment.
Theoretical Basis
GGUF quantization applies block-wise quantization. In pseudocode:

```python
# Abstract GGUF quantization pipeline (illustrative function names)
merged_model = load_safetensors("./merged_model")
gguf_f16 = convert_hf_to_gguf(merged_model)         # HF format -> GGUF F16
gguf_quantized = quantize_gguf(gguf_f16, "q4_k_m")  # F16 -> Q4_K_M
# Q4_K_M uses Q6_K for half of the attention.wv and ffn.w2 tensors,
# Q4_K for everything else, balancing quality and size
```
Common quantization types, ordered from highest to lowest quality:
- f16: Full float16, largest file, highest quality
- q8_0: 8-bit, ~50% size reduction, minimal quality loss
- q5_k_m: 5-bit mixed, good quality-size balance
- q4_k_m: 4-bit mixed, recommended default
- q2_k: 2-bit, smallest file, noticeable quality loss
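The size differences can be estimated from approximate bits-per-weight figures. The values below are ballpark numbers for illustration (exact file sizes vary slightly by model architecture and tensor shapes):

```python
# Approximate bits-per-weight for common GGUF quantization types.
# These are rough figures for illustration, not exact specification values.
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 8.5,    # 32 int8 codes + one fp16 scale per 32-value block
    "q5_k_m": 5.7,
    "q4_k_m": 4.85,
    "q2_k": 2.6,
}

def estimated_file_gb(n_params, quant_type):
    """Rough GGUF file size in gigabytes for a given parameter count."""
    return n_params * BITS_PER_WEIGHT[quant_type] / 8 / 1e9

# Estimated sizes for a 7B-parameter model under each quantization type.
sizes = {q: round(estimated_file_gb(7e9, q), 1) for q in BITS_PER_WEIGHT}
```

For a 7B model this gives roughly 14 GB at f16 versus around 4 GB at q4_k_m, which is why q4_k_m is the usual default for consumer hardware.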