Heuristic:Ollama Ollama Quantization Layer Selection
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Optimization, LLMs |
| Last Updated | 2026-02-14 22:00 GMT |
Overview
Layer-aware quantization strategy that keeps higher precision (Q6_K/Q8_0) for quality-critical tensors (the first 1/8 and last 1/8 of layers, every 3rd layer in between, and attention V weights) while applying more aggressive quantization (Q4_K) to the remaining middle layers.
Description
Not all layers in a transformer model are equally sensitive to quantization. The Ollama quantization pipeline implements a non-uniform quantization strategy in which critical layers receive higher bit allocations. The `useMoreBits` function identifies which layers should retain higher precision: the first 1/8 and last 1/8 of layers (which handle input embedding context and output prediction), plus every 3rd layer in the middle section. Attention V (value) weights also receive special treatment: Q6_K or Q5_K depending on the target file type and layer index, and Q8_0 for 8-expert models.
Additionally, certain tensor types are never quantized: norm weights, vision encoder tensors, multimodal projection layers, 1D tensors, expert gating weights, positional embeddings, Mamba SSM convolution weights, LFM2 `shortconv` weights, RWKV time-mix tensors, and T5 `attn_rel_b` tensors.
Usage
Apply this heuristic when quantizing models from full precision to Q4_K_M or Q4_K_S formats. Understanding which layers get higher precision helps predict the quality-size tradeoff and diagnose quantization artifacts that appear as degraded output quality.
The Insight (Rule of Thumb)
- Action: Use the `useMoreBits` function to allocate Q6_K to first/last 1/8 of layers and every 3rd middle layer; apply Q4_K to the rest.
- Value: Under Q4_K_M, attention V weights get Q6_K in the layers selected by `useMoreBits`. Under Q4_K_S, they get Q5_K in the first 4 layers only. For 8-expert MoE models, attention V weights get Q8_0 regardless of file type (only ~128MB additional cost).
- Trade-off: Higher precision on critical layers preserves output quality with minimal size increase (~5-10% larger than uniform Q4_K).
- Never quantize: norm weights, vision encoder (`v.*`), multimodal (`mm.*`), 1D tensors, expert gating, positional embeddings, Mamba `ssm_conv1d`, LFM2 `shortconv`, RWKV `time_mix_*`, T5 `attn_rel_b`.
Reasoning
The first and last layers of a transformer are the most sensitive to quantization because they directly interface with the token embedding space. Quantization errors in these layers propagate through the entire forward pass. The attention V (value) weights carry the actual content information that gets aggregated by the attention mechanism, making them more quality-critical than Q (query) or K (key) weights.
For MoE (Mixture of Experts) models with 8 experts, bumping attention V to Q8_0 adds only ~128MB but meaningfully improves output coherence because these weights are shared across expert routing paths.
Layer selection heuristic from `server/quantization.go:57-59`:
```go
func useMoreBits(iLayer, nLayers int) bool {
	return iLayer < (nLayers/8) || iLayer >= 7*nLayers/8 ||
		(iLayer-nLayers/8)%3 == 2
}
```
Attention V weight tuning from `server/quantization.go:114-134`:
```go
if (ftype == fsggml.FileTypeQ4_K_M) && useMoreBits(qs.iAttnV, qs.nAttnV) {
	newType = fsggml.TensorTypeQ6_K
} else if ftype == fsggml.FileTypeQ4_K_S && qs.iAttnV < 4 {
	newType = fsggml.TensorTypeQ5_K
}
if nExperts == 8 {
	// for the 8-expert model, bumping this to Q8_0 trades just ~128MB
	newType = fsggml.TensorTypeQ8_0
}
```
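The branch order in the excerpt above matters: the 8-expert check runs last and overrides whichever type was chosen before it. The following self-contained sketch restates that logic with plain strings in place of the `fsggml` constants. `attnVType` and its parameters are hypothetical, and the `"Q4_K"` default is an assumption about the base type for both file types rather than something the excerpt shows.

```go
package main

import "fmt"

// useMoreBits mirrors the selection rule quoted from the source.
func useMoreBits(iLayer, nLayers int) bool {
	return iLayer < (nLayers/8) || iLayer >= 7*nLayers/8 ||
		(iLayer-nLayers/8)%3 == 2
}

// attnVType is a hypothetical, simplified stand-in for the excerpt.
// File and tensor types are plain strings instead of fsggml constants.
func attnVType(ftype string, iAttnV, nAttnV, nExperts int) string {
	newType := "Q4_K" // assumed default for Q4_K_M and Q4_K_S
	if ftype == "Q4_K_M" && useMoreBits(iAttnV, nAttnV) {
		newType = "Q6_K"
	} else if ftype == "Q4_K_S" && iAttnV < 4 {
		newType = "Q5_K"
	}
	if nExperts == 8 {
		// Runs last, so it overrides the choices above.
		newType = "Q8_0"
	}
	return newType
}

func main() {
	fmt.Println(attnVType("Q4_K_M", 0, 32, 0)) // Q6_K: layer 0 is in the first eighth
	fmt.Println(attnVType("Q4_K_M", 5, 32, 0)) // Q4_K: layer 5 is not selected
	fmt.Println(attnVType("Q4_K_S", 3, 32, 0)) // Q5_K: within the first 4 layers
	fmt.Println(attnVType("Q4_K_S", 4, 32, 0)) // Q4_K: past the first 4 layers
	fmt.Println(attnVType("Q4_K_M", 5, 32, 8)) // Q8_0: 8-expert override
}
```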
Skip-quantization rules from `server/quantization.go:250-286`:
```go
// don't quantize vision encoder tensors (named with "v." prefix)
quantize = quantize && !strings.HasPrefix(name, "v.")
quantize = quantize && !strings.Contains(name, "mm.")
// do not quantize norm tensors
quantize = quantize && !strings.Contains(name, "_norm.weight")
// do not quantize expert gating tensors
quantize = quantize && !strings.Contains(name, "ffn_gate_inp.weight")
// do not quantize Mamba's small yet 2D weights
quantize = quantize && !strings.Contains(name, "ssm_conv1d.weight")
```