Heuristic: Unsloth GGUF Quantization Selection (unslothai/unsloth)
| Knowledge Sources | Details |
|---|---|
| Domains | Quantization, Model_Export, Optimization |
| Last Updated | 2026-02-07 09:00 GMT |
Overview
Unsloth maps user-friendly aliases (`"quantized"`, `"fast_quantized"`, `"not_quantized"`) to llama.cpp quantization formats, with `q4_k_m` as the recommended balance of quality and size.
Description
When exporting models to GGUF format, users choose a quantization method that determines the trade-off between model quality, file size, and inference speed. Unsloth provides three convenience aliases and 20+ direct llama.cpp quantization methods. The system performs a two-stage process: first converting to a high-precision intermediate format (bf16 or q8_0), then re-quantizing to the target format. If only `q8_0` is requested, a direct conversion path is used to avoid unnecessary double-quantization.
Usage
Use `"quantized"` (maps to `q4_k_m`) for production deployment with the best quality-to-size ratio. Use `"fast_quantized"` (maps to `q8_0`) when you need fast conversion with good quality. Use `"not_quantized"` to preserve the model's original dtype (bf16/f16). For advanced users, specify exact llama.cpp methods like `q5_k_m`, `q3_k_l`, or `q6_k` directly.
The Insight (Rule of Thumb)
- Action: Pass `quantization_method="quantized"` for best quality/size trade-off (equivalent to `q4_k_m`).
- Value: `q4_k_m` uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, Q4_K for the rest.
- Trade-off: `q4_k_m` is roughly 4x smaller than bf16 with minimal quality loss. `q8_0` is roughly 2x smaller with negligible quality loss, but its fast direct-conversion path applies only when it is the sole export target; multi-quant exports still pass through a bf16 intermediate.
| Alias | Maps To | Best For |
|---|---|---|
| `"quantized"` | `q4_k_m` | Production deployment (fast inference, small files) |
| `"fast_quantized"` | `q8_0` | Quick export (high quality, OK size) |
| `"not_quantized"` | model dtype (bf16/f16) | Full precision preservation |
| `q5_k_m` | Q6_K + Q5_K mix | Higher quality than q4_k_m, larger files |
| `q3_k_m` | Q4_K + Q3_K mix | Aggressive compression for small models |
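The trade-offs above can be made concrete with back-of-the-envelope file-size arithmetic. The bits-per-weight figures below are approximate averages (k-quants mix tensor types, so effective width varies by model), not exact format widths:

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate GGUF file size in GB for a given quantization width."""
    return n_params * bits_per_weight / 8 / 1e9

# Rough effective bits per weight for common formats (approximate averages).
BITS = {"bf16": 16.0, "q8_0": 8.5, "q5_k_m": 5.7, "q4_k_m": 4.8, "q3_k_m": 3.9}

n = 7e9  # a 7B-parameter model
for method, bpw in BITS.items():
    print(f"{method:>7}: ~{gguf_size_gb(n, bpw):.1f} GB")
```

For a 7B model this gives roughly 14 GB at bf16, ~7.4 GB at `q8_0`, and ~4.2 GB at `q4_k_m`, consistent with the "~2x" and "~4x smaller" figures above.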
Reasoning
The `q4_k_m` method uses mixed quantization: half of the attention value projection (`attention.wv`) and feed-forward down-projection (`feed_forward.w2`) tensors get the higher-precision Q6_K treatment (these tensors have an outsized impact on output quality), while all other tensors use the more compressed Q4_K. This selective approach preserves quality where it matters most. The alias system (`"quantized"` -> `q4_k_m`) keeps the user-facing interface simple while still letting advanced users specify exact methods.
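The mixed policy can be sketched as a per-tensor type assignment. This is an illustrative simplification: the tensor names follow the old GGML naming used in the method descriptions, and llama.cpp's actual rule for picking which half of the layers get Q6_K is more intricate than the simple first-half test below.

```python
def q4_k_m_type(tensor_name: str, layer_idx: int, n_layers: int) -> str:
    """Illustrative tensor-type assignment mimicking q4_k_m's mixed policy:
    half of the attention.wv and feed_forward.w2 tensors get the
    higher-precision Q6_K; everything else gets Q4_K."""
    sensitive = tensor_name.endswith(("attention.wv.weight", "feed_forward.w2.weight"))
    if sensitive and layer_idx < n_layers // 2:
        return "Q6_K"
    return "Q4_K"
```

For example, `q4_k_m_type("blk.0.attention.wv.weight", 0, 32)` assigns Q6_K, while the same tensor in layer 20 of 32, or any attention query/key tensor, gets Q4_K.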
Alias mapping from `save.py:1128-1148`:
```python
for quant_method in quantization_method:
    if quant_method == "not_quantized":
        quant_method = model_dtype
    elif quant_method == "fast_quantized":
        quant_method = "q8_0"
    elif quant_method == "quantized":
        quant_method = "q4_k_m"
    elif quant_method is None:
        quant_method = "q8_0"
```
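The excerpt above rebinds the loop variable on each iteration without collecting the results. A self-contained sketch of the same mapping that returns the resolved list (function name and shape are my own, not Unsloth's API) might look like:

```python
def resolve_quant_methods(methods, model_dtype="bf16"):
    """Map Unsloth's convenience aliases to concrete llama.cpp quant names;
    anything that is not an alias passes through unchanged."""
    aliases = {
        "not_quantized": model_dtype,  # keep the model's original dtype
        "fast_quantized": "q8_0",
        "quantized": "q4_k_m",
        None: "q8_0",
    }
    return [aliases.get(m, m) for m in methods]
```

For example, `resolve_quant_methods(["quantized", "q5_k_m"])` yields `["q4_k_m", "q5_k_m"]`.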
Method descriptions from `save.py:104-131`:
```python
ALLOWED_QUANTS = {
    "q4_k_m": "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
    "q5_k_m": "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
    "q8_0": "Fast conversion. High resource use, but generally acceptable.",
    "q2_k": "Uses Q4_K for the attention.vw and feed_forward.w2, Q2_K for the other tensors.",
    # ... (remaining methods elided)
}
```
Intermediate format selection from `save.py:1160-1170`:
```python
if first_conversion is None:
    if len(quantization_method) == 1 and quantization_method[0] == "q8_0":
        first_conversion = "None"  # Direct conversion
    else:
        first_conversion = "bf16"  # Re-quantizing from q8_0 disallowed in new llama.cpp
```
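The selection above can be sketched as a standalone function. This is a simplified reading of the excerpt, returning Python `None` where the source uses the sentinel string `"None"`:

```python
def pick_intermediate_format(quant_methods, first_conversion=None):
    """Choose the intermediate format for the two-stage GGUF export.

    If q8_0 is the only requested target, skip the intermediate step and
    convert directly; otherwise go through bf16, since newer llama.cpp
    versions refuse to re-quantize from a q8_0 intermediate."""
    if first_conversion is not None:
        return first_conversion  # caller forced a specific intermediate
    if len(quant_methods) == 1 and quant_methods[0] == "q8_0":
        return None  # direct conversion, no intermediate file
    return "bf16"
```

So `pick_intermediate_format(["q8_0"])` takes the direct path, while `pick_intermediate_format(["q8_0", "q4_k_m"])` falls back to a bf16 intermediate.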