Principle:Ollama Ollama Model Quantization
| Knowledge Sources | |
|---|---|
| Domains | Model_Optimization, Compression |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
A post-training quantization mechanism that converts model weights from floating-point to lower-bit integer representations, reducing model size and memory requirements while keeping inference quality close to that of the original model.
Description
Model quantization compresses neural network weights from full precision (FP32/FP16) to lower-bit representations (Q4_0, Q4_K_M, Q5_K_M, Q8_0, etc.). Depending on the source precision and the target type, this reduces model file size by roughly 2-8x, with a comparable reduction in the memory needed to hold the weights, enabling larger models to run on consumer hardware.
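As a rough back-of-the-envelope illustration of the size reduction, the sketch below estimates weight storage for a hypothetical 7-billion-parameter model. The bits-per-weight figures are assumptions: 32 and 16 for FP32 and FP16, and 8.5 and 4.5 for Q8_0 and Q4_0, which follow from storing one 16-bit scale per 32-element block. Real files will differ somewhat, since some tensors may be kept at higher precision and file metadata adds overhead.

```python
# Rough estimate of weight storage at different precisions.
# All figures are illustrative assumptions, not measurements of a real model file.

BITS_PER_WEIGHT = {
    "FP32": 32.0,
    "FP16": 16.0,
    "Q8_0": 8.5,  # 8-bit values plus one 16-bit scale per 32 weights (assumed)
    "Q4_0": 4.5,  # 4-bit values plus one 16-bit scale per 32 weights (assumed)
}


def estimated_size_gb(num_params: int, bits_per_weight: float) -> float:
    """Return the approximate weight storage in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical 7B-parameter model
    fp16_size = estimated_size_gb(params, BITS_PER_WEIGHT["FP16"])
    for name, bpw in BITS_PER_WEIGHT.items():
        size = estimated_size_gb(params, bpw)
        print(f"{name}: {size:.1f} GB ({fp16_size / size:.2f}x smaller than FP16)")
```

Under these assumptions, going from FP16 to Q4_0 shrinks the weights by a factor of about 3.6, and from FP32 to Q4_0 by about 7.1.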
Quantization is performed per tensor, with the type selected according to the tensor's role: attention weights, feed-forward layers, embeddings, and output heads may use different quantization types to balance size reduction against quality preservation. Critical layers (embeddings, output) are often kept at higher precision.
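Role-based selection can be pictured as a lookup from tensor name to quantization type. The sketch below is purely illustrative: the `choose_type` helper, the tensor name patterns, and the specific rules are hypothetical and are not taken from Ollama's actual selection logic.

```python
# Hypothetical sketch of role-based quantization type selection.
# The name patterns and rules are illustrative assumptions, not Ollama's real logic.


def choose_type(tensor_name: str, default_type: str = "Q4_0") -> str:
    """Pick a quantization type for a tensor based on its (assumed) role."""
    # Assumption: small normalization tensors are left unquantized.
    # This check comes first so that "output_norm.weight" is not treated as the output head.
    if tensor_name.endswith("_norm.weight"):
        return "F32"
    # Assumption: embeddings and the output head are quality-critical, so keep more bits.
    if tensor_name.startswith("token_embd") or tensor_name.startswith("output"):
        return "Q8_0"
    # Everything else (attention and feed-forward matrices) uses the default type.
    return default_type


if __name__ == "__main__":
    names = [
        "token_embd.weight",
        "blk.0.attn_q.weight",
        "blk.0.ffn_down.weight",
        "blk.0.attn_norm.weight",
        "output_norm.weight",
        "output.weight",
    ]
    for name in names:
        print(f"{name:26s} -> {choose_type(name)}")
```

The ordering of the checks matters in a scheme like this, because a tensor name can match more than one pattern.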
Usage
Use this principle when deploying models on hardware with limited memory (RAM or VRAM). Quantization is the standard technique for making models with 7B or more parameters practical on consumer GPUs and CPUs.
Theoretical Basis
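Quantization maps floating-point values to integer representations using block-wise scaling. In its simplest symmetric form:

$$q = \mathrm{round}\left(\frac{w}{s}\right), \qquad s = \frac{\max_i |w_i|}{2^{b-1} - 1}$$

Where:
- $w$ is the original floating-point weight
- $q$ is the quantized integer
- $s$ is the per-block scale factor, with the maximum taken over the weights $w_i$ in one block
- $b$ is the bit width

The weight is recovered approximately at inference time as $\hat{w} = s \cdot q$.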
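Block-wise scaling can be sketched in a few lines of plain Python. This is a minimal illustration of symmetric absmax quantization with one scale per block, assuming a block size of 32. The function names are made up for this example, and real formats additionally pack the integers into bytes and store the scale in reduced precision, which this sketch does not attempt.

```python
# Minimal sketch of symmetric block-wise quantization (one scale per block).
# Illustrative only: real formats pack the integers and store scales compactly.
import math


def quantize_block(weights: list[float], bits: int) -> tuple[list[int], float]:
    """Quantize one block of floats to signed integers plus a shared scale."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4 bits, 127 for 8 bits
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0  # avoid dividing by zero
    # Round to the nearest integer and clamp to the symmetric range [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_block(q: list[int], scale: float) -> list[float]:
    """Reconstruct approximate floats from the integers and the scale."""
    return [qi * scale for qi in q]


def quantize(weights: list[float], bits: int, block_size: int = 32):
    """Split a flat weight list into blocks and quantize each independently."""
    return [
        quantize_block(weights[i:i + block_size], bits)
        for i in range(0, len(weights), block_size)
    ]


if __name__ == "__main__":
    # Synthetic weights whose magnitude grows, so the two blocks get different scales.
    weights = [math.sin(0.37 * i) * (1 + i / 64) for i in range(64)]
    for bits in (4, 8):
        blocks = quantize(weights, bits)
        restored = [x for q, s in blocks for x in dequantize_block(q, s)]
        max_err = max(abs(a - b) for a, b in zip(weights, restored))
        print(f"{bits}-bit: {len(blocks)} blocks, max abs error {max_err:.4f}")
```

Because each block has its own scale, a single large weight only degrades the resolution of the 32 values that share its block rather than the whole tensor.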
Common quantization types:
- Q4_0: 4-bit quantization with one scale per 32 elements
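- Q4_K_M: 4-bit k-quant, medium variant (better quality than Q4_0 at a slightly larger size)
- Q5_K_M: 5-bit k-quant, medium variant
- Q8_0: 8-bit quantization with one scale per 32 elements (highest quality of the types listed here, largest size)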
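The effective storage cost of the simple block formats follows directly from their layout. The sketch below assumes each block stores 32 quantized values plus one 16-bit scale; the scale width and the helper names are assumptions made for illustration. The k-quant types use a more elaborate layout that is not modelled here.

```python
# Bits per weight for a simple block format: N quantized values plus one scale.
# The 16-bit scale width is an assumption made for this illustration.


def bits_per_weight(value_bits: int, block_size: int = 32, scale_bits: int = 16) -> float:
    """Average storage per weight, including the per-block scale overhead."""
    return (value_bits * block_size + scale_bits) / block_size


def block_bytes(value_bits: int, block_size: int = 32, scale_bits: int = 16) -> int:
    """Total bytes occupied by one block."""
    return (value_bits * block_size + scale_bits) // 8


if __name__ == "__main__":
    for name, bits in [("Q4_0", 4), ("Q8_0", 8)]:
        print(
            f"{name}: {block_bytes(bits)} bytes per block, "
            f"{bits_per_weight(bits)} bits per weight"
        )
```

The per-block scale adds half a bit per weight in both cases, which is why a nominally 4-bit format costs about 4.5 bits per weight in practice.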