
Principle:Unslothai Unsloth GGUF Export



Knowledge Sources
Domains Model_Deployment, Quantization, Serialization
Last Updated 2026-02-07 00:00 GMT

Overview

A model export technique that converts merged SafeTensors weights into the GGUF binary format, with optional quantization, for efficient CPU and edge-device inference via llama.cpp.

Description

GGUF (GGML Unified Format) is the standard file format for llama.cpp, the leading C++ inference engine for running LLMs on CPUs and consumer hardware. GGUF export takes a merged HuggingFace model and converts it in three steps:

  1. SafeTensors to GGUF Conversion: Using llama.cpp's convert_hf_to_gguf.py script to convert the HF model format to GGUF's tensor layout.
  2. Quantization: Applying GGUF-specific quantization schemes (q4_k_m, q5_k_m, q8_0, etc.) that use mixed-precision strategies for optimal quality-size tradeoffs.
  3. Ollama Modelfile Generation: Automatically generating an Ollama Modelfile with the correct chat template for local deployment.
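
Steps 1 and 2 map directly onto llama.cpp's tooling. Below is a minimal sketch of all three steps, assuming a local llama.cpp checkout, placeholder file names, and a deliberately simplified chat template:

# Sketch of the three-step pipeline; paths and file names are assumptions.
import subprocess

# Step 1: SafeTensors -> GGUF at f16 precision
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", "./merged_model",
     "--outfile", "model-f16.gguf", "--outtype", "f16"],
    check=True,
)

# Step 2: quantize the f16 GGUF down to Q4_K_M
# (the binary is named llama-quantize in recent llama.cpp builds)
subprocess.run(
    ["llama.cpp/llama-quantize", "model-f16.gguf", "model-q4_k_m.gguf", "Q4_K_M"],
    check=True,
)

# Step 3: minimal Ollama Modelfile (the TEMPLATE here is a placeholder;
# a real export must use the model's own chat template)
with open("Modelfile", "w") as f:
    f.write('FROM ./model-q4_k_m.gguf\nTEMPLATE """{{ .Prompt }}"""\n')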

GGUF quantization differs from training-time quantization (BitsAndBytes): it uses block-wise quantization with importance-based mixed precision, where attention and feed-forward layers can have different quantization levels.
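
To make the block-wise idea concrete, the following NumPy sketch implements a Q8_0-style scheme (one scale per 32-weight block, symmetric int8 values). It is illustrative only, not llama.cpp's actual code, and assumes the weight count divides evenly by the block size:

import numpy as np

def quantize_q8_0_like(weights, block_size=32):
    # Split the flat weight vector into fixed-size blocks
    blocks = weights.reshape(-1, block_size)
    # One scale per block: the largest magnitude maps to the int8 range [-127, 127]
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    safe = np.where(scales == 0, 1.0, scales)  # avoid division by zero for all-zero blocks
    q = np.round(blocks / safe).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_q8_0_like(q, scales):
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

Each block stores 32 int8 values plus one fp16 scale, which is where Q8_0's effective 8.5 bits per weight comes from.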

Usage

Use this principle after saving a merged model when the deployment target is llama.cpp, Ollama, LM Studio, or other GGUF-compatible inference engines. Not needed for HuggingFace-native deployment.
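
If the model was fine-tuned with Unsloth, the library's GGUF helpers can run the whole pipeline in one call. A minimal sketch, assuming model and tokenizer come from a prior FastLanguageModel fine-tuning session and that the output names are placeholders:

# model and tokenizer come from an earlier Unsloth fine-tuning run
model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method="q4_k_m")

# Or push the quantized GGUF straight to the Hugging Face Hub
model.push_to_hub_gguf("your-username/model-gguf", tokenizer, quantization_method="q4_k_m")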

Theoretical Basis

GGUF quantization is applied block-wise, as the abstract pipeline below illustrates:

# Abstract GGUF quantization pipeline (function names are illustrative, not a real API)
merged_model = load_safetensors("./merged_model")
gguf_f16 = convert_hf_to_gguf(merged_model)      # HF format -> GGUF F16
gguf_quantized = quantize_gguf(gguf_f16, "q4_k_m")  # F16 -> Q4_K_M

# Q4_K_M uses Q6_K for half of the attention.wv and ffn.w2 tensors,
# and Q4_K for everything else, balancing quality and size

Common quantization types, ordered from highest to lowest quality:

  • f16: Full float16, largest file, highest quality
  • q8_0: 8-bit, ~50% size reduction, minimal quality loss
  • q5_k_m: 5-bit mixed, good quality-size balance
  • q4_k_m: 4-bit mixed, recommended default
  • q2_k: 2-bit, smallest file, noticeable quality loss
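
A back-of-envelope size estimate follows from bytes ≈ parameters × bits-per-weight / 8. In the sketch below, the bits-per-weight figures are approximations inferred from typical llama.cpp file sizes for 7B models, not exact format constants:

# Approximate effective bits per weight (includes per-block scale overhead)
BPW = {"f16": 16.0, "q8_0": 8.5, "q5_k_m": 5.7, "q4_k_m": 4.85, "q2_k": 3.2}

def estimate_gguf_gb(n_params, quant):
    return n_params * BPW[quant] / 8 / 1e9

for quant in BPW:
    print(f"{quant}: ~{estimate_gguf_gb(7e9, quant):.1f} GB for a 7B model")

For a 7B model this gives roughly 14 GB at f16 versus about 4 GB at q4_k_m, which is why q4_k_m is the usual default.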

Related Pages

Implemented By

Uses Heuristic
