Implementation:Ollama Ollama Quantize
| Knowledge Sources | |
|---|---|
| Domains | Model_Optimization, Compression |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Concrete tool for quantizing GGUF model files to lower precision, provided by Ollama's server package and the bundled llama.cpp library.
Description
The Go quantize function reads a full-precision GGUF file, selects a quantization type for each tensor via getTensorNewType, and writes the quantized output. The tensor compression itself is delegated to the llama.cpp C library through cgo bindings.
getTensorNewType implements the tensor-level quantization policy: it chooses each tensor's quantization type from its name (attention vs. feed-forward), its shape (1D tensors stay in float), and the target file type.
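A minimal sketch of what such a per-tensor policy can look like is shown here. It is not Ollama's getTensorNewType: the TensorType stand-in, the pickTensorType function, and the specific rule that promotes attention value projections are all assumptions made for illustration. Only the three inputs (name, shape, default type from the target file type) mirror the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// TensorType is a hypothetical stand-in for fsggml.TensorType.
type TensorType string

const (
	F32 TensorType = "F32"
	Q4K TensorType = "Q4_K"
	Q6K TensorType = "Q6_K"
)

// pickTensorType is an illustrative policy, not the real getTensorNewType.
// def is the default type implied by the target file type.
func pickTensorType(name string, shape []uint64, def TensorType) TensorType {
	// 1D tensors (norms, biases) are small, so they stay in float.
	if len(shape) < 2 {
		return F32
	}
	// Assumption for this sketch: attention value projections get a
	// higher-precision type than the default.
	if strings.Contains(name, "attn_v.weight") {
		return Q6K
	}
	// Everything else, including feed-forward weights, uses the default.
	return def
}

func main() {
	fmt.Println(pickTensorType("blk.0.attn_norm.weight", []uint64{4096}, Q4K))
	fmt.Println(pickTensorType("blk.0.attn_v.weight", []uint64{4096, 4096}, Q4K))
	fmt.Println(pickTensorType("blk.0.ffn_up.weight", []uint64{4096, 11008}, Q4K))
}
```

Keeping the policy a pure function of name, shape, and default type makes it easy to test without touching any tensor data.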
The C-level llama_model_quantize function in llama.cpp performs the heavy computation of quantizing tensor data blocks.
Usage
Used when creating quantized model variants via the ollama create command with the --quantize flag, or when the server auto-quantizes imported SafeTensors models.
Code Reference
Source Location
- Repository: ollama
- File: server/quantization.go (quantize, getTensorNewType), llama/llama.cpp/src/llama-quant.cpp (llama_model_quantize)
- Lines: quantization.go:L201-244 (quantize), quantization.go:L103-200 (getTensorNewType), llama-quant.cpp:L1-1072
Signature
```go
func quantize(in, out *os.File, orig *fsggml.GGML, newFileType fsggml.FileType, progressFn func(n uint64)) error
func getTensorNewType(kv fsggml.KV, qs *quantizeState, newType fsggml.TensorType, name string, shape []uint64, ftype fsggml.FileType) fsggml.TensorType
```
Import
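```go
import "github.com/ollama/ollama/server"
```
Both quantize and getTensorNewType are unexported (lowercase) identifiers, so they can only be called from code inside the server package. Importing the package from elsewhere does not expose them.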
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| in | *os.File | Yes | Full-precision GGUF input file (F16 or F32) |
| out | *os.File | Yes | Output file for quantized GGUF |
| orig | *fsggml.GGML | Yes | Parsed GGUF metadata from input file |
| newFileType | fsggml.FileType | Yes | Target quantization type (Q4_0, Q4_K_M, Q5_K_M, Q8_0, etc.) |
| progressFn | func(n uint64) | No | Progress callback (bytes processed) |
Outputs
| Name | Type | Description |
|---|---|---|
| error | error | Non-nil if quantization fails |
| Side effect | Quantized GGUF | Compressed model file written to output |
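The progress callback in the contract above can be consumed as in the following sketch. Because quantize is unexported, the example uses a hypothetical stand-in, fakeQuantize. It also assumes that each call to progressFn reports the bytes just processed (a delta rather than a running total); the source does not state which. The newTracker helper is likewise an illustration only.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// fakeQuantize is a hypothetical stand-in for the unexported quantize function.
// Assumption: the callback receives the size of each tensor as it completes.
func fakeQuantize(tensorSizes []uint64, progressFn func(n uint64)) error {
	for _, size := range tensorSizes {
		// A real implementation would quantize and write the tensor here.
		if progressFn != nil {
			progressFn(size)
		}
	}
	return nil
}

// newTracker returns a callback that accumulates bytes and a function that
// reports the completed percentage of total.
func newTracker(total uint64) (func(n uint64), func() float64) {
	var done atomic.Uint64
	add := func(n uint64) { done.Add(n) }
	percent := func() float64 {
		if total == 0 {
			return 0
		}
		return 100 * float64(done.Load()) / float64(total)
	}
	return add, percent
}

func main() {
	sizes := []uint64{1 << 20, 3 << 20, 4 << 20}
	var total uint64
	for _, s := range sizes {
		total += s
	}
	add, percent := newTracker(total)
	if err := fakeQuantize(sizes, add); err != nil {
		fmt.Println("quantize failed:", err)
		return
	}
	fmt.Printf("progress: %.0f%%\n", percent())
}
```

The atomic counter keeps the tracker safe if the callback is ever invoked from more than one goroutine.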
Usage Examples
Quantize via CLI
```shell
# Create a quantized model from a Modelfile
ollama create my-model -f Modelfile --quantize q4_0
# Supported quantization types:
# q4_0, q4_1, q5_0, q5_1, q8_0
# q4_K_S, q4_K_M, q5_K_S, q5_K_M, q6_K
```
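When choosing among these types, a rough size estimate can help. The sketch below covers only the non-K types, whose block layouts are fixed: each block holds 32 weights and occupies 18, 20, 22, 24, or 34 bytes for q4_0, q4_1, q5_0, q5_1, and q8_0 respectively. The estimateBytes helper is hypothetical, and the result is an approximation under the assumption that every weight uses the same type. It ignores GGUF metadata and any tensors that stay in float. The K-quant file types mix several tensor types, so they are left out.

```go
package main

import "fmt"

// bytesPerBlock lists the storage for one 32-weight block in each format.
var bytesPerBlock = map[string]uint64{
	"q4_0": 18,
	"q4_1": 20,
	"q5_0": 22,
	"q5_1": 24,
	"q8_0": 34,
}

// estimateBytes approximates the tensor data size for a parameter count,
// assuming every weight is stored in the given type.
func estimateBytes(params uint64, qtype string) (uint64, error) {
	bpb, ok := bytesPerBlock[qtype]
	if !ok {
		return 0, fmt.Errorf("no estimate available for %q", qtype)
	}
	blocks := (params + 31) / 32 // round up to whole blocks
	return blocks * bpb, nil
}

func main() {
	const params = 7_000_000_000 // example parameter count
	for _, t := range []string{"q4_0", "q5_1", "q8_0"} {
		n, err := estimateBytes(params, t)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s: about %.2f GB\n", t, float64(n)/1e9)
	}
}
```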