Heuristic: Unsloth GGUF Quantization Selection (unslothai/unsloth)
| Knowledge Sources | Details |
|---|---|
| Domains | Quantization, Model_Export, Optimization |
| Last Updated | 2026-02-07 09:00 GMT |
Overview
Unsloth maps user-friendly aliases (`"quantized"`, `"fast_quantized"`, `"not_quantized"`) to llama.cpp quantization formats, with `q4_k_m` as the recommended balance of quality and size.
Description
When exporting models to GGUF format, users choose a quantization method that determines the trade-off between model quality, file size, and inference speed. Unsloth provides three convenience aliases and 20+ direct llama.cpp quantization methods. The system performs a two-stage process: first converting to a high-precision intermediate format (bf16 or q8_0), then re-quantizing to the target format. If only `q8_0` is requested, a direct conversion path is used to avoid unnecessary double-quantization.
Usage
Use `"quantized"` (maps to `q4_k_m`) for production deployment with the best quality-to-size ratio. Use `"fast_quantized"` (maps to `q8_0`) when you need fast conversion with good quality. Use `"not_quantized"` to preserve the model's original dtype (bf16/f16). For advanced users, specify exact llama.cpp methods like `q5_k_m`, `q3_k_l`, or `q6_k` directly.
The Insight (Rule of Thumb)
- Action: Pass `quantization_method="quantized"` for best quality/size trade-off (equivalent to `q4_k_m`).
- Value: `q4_k_m` uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, Q4_K for the rest.
- Trade-off: `q4_k_m` is roughly 4x smaller than bf16 with minimal quality loss. `q8_0` is roughly 2x smaller with negligible quality loss, but its fast direct-conversion path applies only when it is the sole export target; multi-quant exports still pass through a bf16 intermediate.
| Alias | Maps To | Best For |
|---|---|---|
| `"quantized"` | `q4_k_m` | Production deployment (fast inference, small files) |
| `"fast_quantized"` | `q8_0` | Quick export (high quality, OK size) |
| `"not_quantized"` | model dtype (bf16/f16) | Full precision preservation |
| `q5_k_m` | Q6_K + Q5_K mix | Higher quality than q4_k_m, larger files |
| `q3_k_m` | Q4_K + Q3_K mix | Aggressive compression for small models |
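The trade-offs above can be made concrete with back-of-the-envelope file-size arithmetic. The bits-per-weight figures below are approximate averages (k-quants mix tensor types, so effective width varies by model), not exact format widths:

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate GGUF file size in GB for a given quantization width."""
    return n_params * bits_per_weight / 8 / 1e9

# Rough effective bits per weight for common formats (approximate averages).
BITS = {"bf16": 16.0, "q8_0": 8.5, "q5_k_m": 5.7, "q4_k_m": 4.8, "q3_k_m": 3.9}

n = 7e9  # a 7B-parameter model
for method, bpw in BITS.items():
    print(f"{method:>7}: ~{gguf_size_gb(n, bpw):.1f} GB")
```

For a 7B model this gives roughly 14 GB at bf16, ~7.4 GB at `q8_0`, and ~4.2 GB at `q4_k_m`, consistent with the "~2x" and "~4x smaller" figures above.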
Reasoning
The `q4_k_m` method uses mixed quantization: half of the attention value projection (`attention.wv`) and feed-forward down-projection (`feed_forward.w2`) tensors get the higher-precision Q6_K treatment (these tensors have an outsized impact on output quality), while all other tensors use the more compressed Q4_K. This selective approach preserves quality where it matters most. The alias system (`"quantized"` -> `q4_k_m`) keeps the user-facing interface simple while still letting advanced users specify exact methods.
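The mixed policy can be sketched as a per-tensor type assignment. This is an illustrative simplification: the tensor names follow the old GGML naming used in the method descriptions, and llama.cpp's actual rule for picking which half of the layers get Q6_K is more intricate than the simple first-half test below.

```python
def q4_k_m_type(tensor_name: str, layer_idx: int, n_layers: int) -> str:
    """Illustrative tensor-type assignment mimicking q4_k_m's mixed policy:
    half of the attention.wv and feed_forward.w2 tensors get the
    higher-precision Q6_K; everything else gets Q4_K."""
    sensitive = tensor_name.endswith(("attention.wv.weight", "feed_forward.w2.weight"))
    if sensitive and layer_idx < n_layers // 2:
        return "Q6_K"
    return "Q4_K"
```

For example, `q4_k_m_type("blk.0.attention.wv.weight", 0, 32)` assigns Q6_K, while the same tensor in layer 20 of 32, or any attention query/key tensor, gets Q4_K.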
Alias mapping from `save.py:1128-1148`:
```python
for quant_method in quantization_method:
    if quant_method == "not_quantized":
        quant_method = model_dtype
    elif quant_method == "fast_quantized":
        quant_method = "q8_0"
    elif quant_method == "quantized":
        quant_method = "q4_k_m"
    elif quant_method is None:
        quant_method = "q8_0"
```
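The excerpt above rebinds the loop variable on each iteration without collecting the results. A self-contained sketch of the same mapping that returns the resolved list (function name and shape are my own, not Unsloth's API) might look like:

```python
def resolve_quant_methods(methods, model_dtype="bf16"):
    """Map Unsloth's convenience aliases to concrete llama.cpp quant names;
    anything that is not an alias passes through unchanged."""
    aliases = {
        "not_quantized": model_dtype,  # keep the model's original dtype
        "fast_quantized": "q8_0",
        "quantized": "q4_k_m",
        None: "q8_0",
    }
    return [aliases.get(m, m) for m in methods]
```

For example, `resolve_quant_methods(["quantized", "q5_k_m"])` yields `["q4_k_m", "q5_k_m"]`.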
Method descriptions from `save.py:104-131`:
```python
ALLOWED_QUANTS = {
    "q4_k_m": "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
    "q5_k_m": "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
    "q8_0": "Fast conversion. High resource use, but generally acceptable.",
    "q2_k": "Uses Q4_K for the attention.vw and feed_forward.w2, Q2_K for the other tensors.",
    # ... (remaining methods elided)
}
```
Intermediate format selection from `save.py:1160-1170`:
```python
if first_conversion is None:
    if len(quantization_method) == 1 and quantization_method[0] == "q8_0":
        first_conversion = "None"  # Direct conversion
    else:
        first_conversion = "bf16"  # Re-quantizing from q8_0 disallowed in new llama.cpp
```
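The selection above can be sketched as a standalone function. This is a simplified reading of the excerpt, returning Python `None` where the source uses the sentinel string `"None"`:

```python
def pick_intermediate_format(quant_methods, first_conversion=None):
    """Choose the intermediate format for the two-stage GGUF export.

    If q8_0 is the only requested target, skip the intermediate step and
    convert directly; otherwise go through bf16, since newer llama.cpp
    versions refuse to re-quantize from a q8_0 intermediate."""
    if first_conversion is not None:
        return first_conversion  # caller forced a specific intermediate
    if len(quant_methods) == 1 and quant_methods[0] == "q8_0":
        return None  # direct conversion, no intermediate file
    return "bf16"
```

So `pick_intermediate_format(["q8_0"])` takes the direct path, while `pick_intermediate_format(["q8_0", "q4_k_m"])` falls back to a bf16 intermediate.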