Implementation:Ollama Ollama XCreate Quantize
| Knowledge Sources | |
|---|---|
| Domains | Model Creation, Quantization |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Provides MLX-based quantization for safetensors model weights during model creation, supporting int4, nvfp4, int8, and mxfp8 quantization modes.
Description
loadAndQuantizeArray writes a safetensors tensor to a temporary file, loads it with MLX's native reader, optionally quantizes it (producing weight, scale, and optional bias arrays), and returns the arrays for evaluation. quantizeTensor wraps this for single-tensor blobs: it attaches metadata (quant_type, group_size) and saves the result as a combined safetensors blob. quantizePackedGroup handles multi-tensor expert groups for MoE models, where each tensor may use a different quantization type. If needed, float type conversion is applied before quantization.
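To make the "weight, scale, and optional bias" output more concrete, the following is a minimal pure-Go sketch of affine group quantization on a single group of weights. It is only an illustration of the general technique, not MLX's kernel or the code in quantize.go: the names quantizeGroupAffine and dequantize are hypothetical, and the sketch skips the bit-packing that a real implementation performs.

```go
package main

import (
	"fmt"
	"math"
)

// quantizeGroupAffine (hypothetical) maps one group of weights to unsigned
// integers in [0, 2^bits-1] and returns the per-group scale and bias, so that
// w ≈ float32(q)*scale + bias. It assumes a non-empty group and bits <= 8.
func quantizeGroupAffine(group []float32, bits int) (q []uint8, scale, bias float32) {
	lo, hi := group[0], group[0]
	for _, w := range group[1:] {
		if w < lo {
			lo = w
		}
		if w > hi {
			hi = w
		}
	}
	levels := 1<<bits - 1 // highest integer code, e.g. 15 for 4 bits
	scale = (hi - lo) / float32(levels)
	if scale == 0 {
		scale = 1 // constant group: every code is 0 and the bias restores the value
	}
	bias = lo
	q = make([]uint8, len(group))
	for i, w := range group {
		q[i] = uint8(math.Round(float64((w - bias) / scale)))
	}
	return q, scale, bias
}

// dequantize (hypothetical) reverses the mapping for one group.
func dequantize(q []uint8, scale, bias float32) []float32 {
	out := make([]float32, len(q))
	for i, v := range q {
		out[i] = float32(v)*scale + bias
	}
	return out
}

func main() {
	group := []float32{-1, 0, 0.5, 2}
	q, scale, bias := quantizeGroupAffine(group, 4)
	fmt.Println("codes:", q, "scale:", scale, "bias:", bias)
	fmt.Println("reconstructed:", dequantize(q, scale, bias))
}
```

Each group carries its own scale and bias, which is why a smaller group size tracks the weights more closely at the cost of storing more per-group values.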
Usage
Called during ollama create when the --quantize flag is specified, converting full-precision model weights to quantized format for reduced memory usage and faster inference.
Code Reference
Source Location
- Repository: Ollama
- File: x/create/client/quantize.go
- Lines: 1-237
Signature
var quantizeParams = map[string]struct {
	groupSize int
	bits      int
	mode      string
}{
	"int4":  {32, 4, "affine"},
	"nvfp4": {16, 4, "nvfp4"},
	"int8":  {64, 8, "affine"},
	"mxfp8": {32, 8, "mxfp8"},
}
func loadAndQuantizeArray(r io.Reader, name, quantize string, arrays map[string]*mlx.Array) (string, []*mlx.Array, *mlx.SafetensorsFile, error)
func quantizeTensor(r io.Reader, tensorName, dtype string, shape []int32, quantize string) ([]byte, error)
func quantizePackedGroup(inputs []create.PackedTensorInput) ([]byte, error)
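As a rough illustration of how the quantizeParams table above might be consulted, the sketch below resolves a quantization name and derives the number of scale groups per row. The table values are copied from the signature; the quantParams type and the lookupQuantize and groupsPerRow helpers are hypothetical, and the divisibility check assumes the quantized dimension must split evenly into groups, which this page does not state.

```go
package main

import (
	"fmt"
	"sort"
)

// quantParams names the anonymous struct used by the table above.
type quantParams struct {
	groupSize int
	bits      int
	mode      string
}

// Values copied from the quantizeParams table in the signature section.
var quantizeParams = map[string]quantParams{
	"int4":  {32, 4, "affine"},
	"nvfp4": {16, 4, "nvfp4"},
	"int8":  {64, 8, "affine"},
	"mxfp8": {32, 8, "mxfp8"},
}

// lookupQuantize (hypothetical) resolves a quantization name and lists the
// supported names in the error when the name is unknown.
func lookupQuantize(name string) (quantParams, error) {
	p, ok := quantizeParams[name]
	if !ok {
		names := make([]string, 0, len(quantizeParams))
		for k := range quantizeParams {
			names = append(names, k)
		}
		sort.Strings(names)
		return quantParams{}, fmt.Errorf("unsupported quantization %q (supported: %v)", name, names)
	}
	return p, nil
}

// groupsPerRow (hypothetical) returns how many groups a row of cols weights
// splits into. Assumption: cols must be a multiple of the group size.
func groupsPerRow(p quantParams, cols int) (int, error) {
	if cols%p.groupSize != 0 {
		return 0, fmt.Errorf("row width %d is not a multiple of group size %d", cols, p.groupSize)
	}
	return cols / p.groupSize, nil
}

func main() {
	p, err := lookupQuantize("int4")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	n, err := groupsPerRow(p, 4096)
	fmt.Println("mode:", p.mode, "bits:", p.bits, "groups per row:", n, "err:", err)
}
```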
Import
import "github.com/ollama/ollama/x/create/client"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
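| r | io.Reader | Yes | Reader providing safetensors tensor data |
| tensorName | string | Yes | Name of the tensor being quantized |
| dtype | string | Yes | Data type of the source tensor, e.g. "float16" |
| shape | []int32 | Yes | Dimensions of the source tensor, e.g. {4096, 4096} |
| quantize | string | Yes | Quantization type: "int4", "nvfp4", "int8", "mxfp8" |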
Outputs
| Name | Type | Description |
|---|---|---|
| blobData | []byte | Combined safetensors blob with quantized tensors |
| error | error | Non-nil on quantization or I/O failure |
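Because blobData is a safetensors blob, its metadata can be inspected without MLX. The sketch below relies only on the published safetensors layout (an 8-byte little-endian header length, followed by a JSON header whose optional `__metadata__` entry is a string-to-string map). The helper names readSafetensorsMetadata and makeBlob are hypothetical, and the sample header is made up for the demonstration; the exact values quantizeTensor writes for quant_type and group_size are not shown on this page.

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"errors"
	"fmt"
)

// readSafetensorsMetadata (hypothetical) returns the __metadata__ map from a
// safetensors blob, or an empty map if the header has none.
func readSafetensorsMetadata(blob []byte) (map[string]string, error) {
	if len(blob) < 8 {
		return nil, errors.New("blob is shorter than the 8-byte header length")
	}
	n := binary.LittleEndian.Uint64(blob[:8])
	if n > uint64(len(blob)-8) {
		return nil, fmt.Errorf("header length %d exceeds blob size", n)
	}
	// Tensor entries are left undecoded; only __metadata__ is needed here.
	var header map[string]json.RawMessage
	if err := json.Unmarshal(blob[8:8+n], &header); err != nil {
		return nil, fmt.Errorf("parse header: %w", err)
	}
	meta := map[string]string{}
	if raw, ok := header["__metadata__"]; ok {
		if err := json.Unmarshal(raw, &meta); err != nil {
			return nil, fmt.Errorf("parse __metadata__: %w", err)
		}
	}
	return meta, nil
}

// makeBlob (hypothetical) builds a header-only blob for the demonstration.
func makeBlob(headerJSON string) []byte {
	blob := make([]byte, 8, 8+len(headerJSON))
	binary.LittleEndian.PutUint64(blob, uint64(len(headerJSON)))
	return append(blob, headerJSON...)
}

func main() {
	// Made-up header; a real blob also lists tensors and carries their data.
	blob := makeBlob(`{"__metadata__":{"quant_type":"int4","group_size":"32"}}`)
	meta, err := readSafetensorsMetadata(blob)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("quant_type:", meta["quant_type"], "group_size:", meta["group_size"])
}
```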
Usage Examples
// Quantize a single tensor. These functions are unexported, so this code
// must live inside package client.
blobData, err := quantizeTensor(reader, "model.layers.0.weight", "float16", []int32{4096, 4096}, "int4")
if err != nil {
	return err
}
// Quantize a packed expert group
inputs := []create.PackedTensorInput{
	{Reader: r1, Name: "expert.0.weight", Quantize: "int4"},
	{Reader: r2, Name: "expert.1.weight", Quantize: "int4"},
}
groupBlob, err := quantizePackedGroup(inputs)
if err != nil {
	return err
}