Implementation:Ollama Ollama XCreate Quantize

Knowledge Sources	Ollama
Domains	Model Creation, Quantization
Last Updated	2025-02-15 00:00 GMT

Overview

Provides MLX-based quantization for safetensors model weights during model creation, supporting int4, nvfp4, int8, and mxfp8 quantization modes.

Description

loadAndQuantizeArray writes a safetensors tensor to a temporary file, loads it with MLX's native reader, optionally quantizes it (producing weight and scale arrays, plus a bias array where the mode uses one), and returns the arrays for evaluation. If the source dtype requires it, the tensor is converted to a different float type before quantization.

quantizeTensor wraps loadAndQuantizeArray for single-tensor blobs: it adds metadata (quant_type, group_size) and saves the result as a combined safetensors blob.

quantizePackedGroup handles multi-tensor expert groups for MoE models, where each tensor in the group may have a different quantization type.
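To make the weight, scale, and bias outputs concrete, here is a minimal pure-Go sketch of group-wise affine quantization, the scheme the "affine" mode name refers to. It is an illustration only: the real code delegates to MLX, and the function names quantizeAffineGroup and quantizeAffine are hypothetical, not part of quantize.go.

```go
package main

import (
	"fmt"
	"math"
)

// quantizeAffineGroup maps one non-empty group of weights onto integer codes
// in [0, 2^bits-1] (bits must be <= 8 here). Approximate weights are
// recovered as code*scale + bias. Hypothetical helper, not the MLX kernel.
func quantizeAffineGroup(w []float32, bits int) (codes []uint8, scale, bias float32) {
	lo, hi := w[0], w[0]
	for _, v := range w[1:] {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	maxCode := 1<<bits - 1
	scale = (hi - lo) / float32(maxCode)
	if scale == 0 {
		scale = 1 // constant group: all codes are 0 and bias carries the value
	}
	bias = lo
	codes = make([]uint8, len(w))
	for i, v := range w {
		c := math.Round(float64((v - bias) / scale))
		c = math.Max(0, math.Min(c, float64(maxCode))) // guard against float drift
		codes[i] = uint8(c)
	}
	return codes, scale, bias
}

// quantizeAffine splits w into consecutive groups of groupSize elements and
// quantizes each group independently, yielding one scale and bias per group.
func quantizeAffine(w []float32, groupSize, bits int) ([]uint8, []float32, []float32, error) {
	if groupSize <= 0 || len(w)%groupSize != 0 {
		return nil, nil, nil, fmt.Errorf("length %d is not a multiple of group size %d", len(w), groupSize)
	}
	var codes []uint8
	var scales, biases []float32
	for start := 0; start < len(w); start += groupSize {
		c, s, b := quantizeAffineGroup(w[start:start+groupSize], bits)
		codes = append(codes, c...)
		scales = append(scales, s)
		biases = append(biases, b)
	}
	return codes, scales, biases, nil
}

func main() {
	w := make([]float32, 64)
	for i := range w {
		w[i] = float32(i) / 10
	}
	// Group size 32 and 4 bits match the "int4" entry in quantizeParams.
	codes, scales, biases, err := quantizeAffine(w, 32, 4)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(codes), len(scales), len(biases))
}
```

Each group stores its own scale and bias, which is why a smaller group size tracks the weights more closely at the cost of more per-group overhead.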

Usage

Called during ollama create when the --quantize flag is specified, converting full-precision model weights to quantized format for reduced memory usage and faster inference.

Code Reference

Source Location

Repository: Ollama
File: x/create/client/quantize.go
Lines: 1-237

Signature

var quantizeParams = map[string]struct {
    groupSize int
    bits      int
    mode      string
}{
    "int4":  {32, 4, "affine"},
    "nvfp4": {16, 4, "nvfp4"},
    "int8":  {64, 8, "affine"},
    "mxfp8": {32, 8, "mxfp8"},
}

func loadAndQuantizeArray(r io.Reader, name, quantize string, arrays map[string]*mlx.Array) (string, []*mlx.Array, *mlx.SafetensorsFile, error)
func quantizeTensor(r io.Reader, tensorName, dtype string, shape []int32, quantize string) ([]byte, error)
func quantizePackedGroup(inputs []create.PackedTensorInput) ([]byte, error)
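
The parameter table above can be exercised on its own. The following sketch copies the map into a standalone program and adds two hypothetical helpers, lookupQuantParams and groupsPerRow, that are not part of quantize.go. It assumes, as MLX's quantization documentation states, that a tensor's last dimension must be divisible by the group size.

```go
package main

import (
	"fmt"
	"sort"
)

// quantParams mirrors the anonymous struct used by quantizeParams.
type quantParams struct {
	groupSize int
	bits      int
	mode      string
}

// Copy of the table from the Signature section.
var quantizeParams = map[string]quantParams{
	"int4":  {32, 4, "affine"},
	"nvfp4": {16, 4, "nvfp4"},
	"int8":  {64, 8, "affine"},
	"mxfp8": {32, 8, "mxfp8"},
}

// lookupQuantParams is a hypothetical helper that rejects unknown type names.
func lookupQuantParams(name string) (quantParams, error) {
	p, ok := quantizeParams[name]
	if !ok {
		return quantParams{}, fmt.Errorf("unsupported quantization type %q", name)
	}
	return p, nil
}

// groupsPerRow is a hypothetical helper reporting how many quantization
// groups cover one row of the given width.
func groupsPerRow(cols int, p quantParams) (int, error) {
	if cols%p.groupSize != 0 {
		return 0, fmt.Errorf("row width %d is not divisible by group size %d", cols, p.groupSize)
	}
	return cols / p.groupSize, nil
}

func main() {
	names := make([]string, 0, len(quantizeParams))
	for name := range quantizeParams {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is random; sort for stable output
	for _, name := range names {
		p, _ := lookupQuantParams(name)
		groups, err := groupsPerRow(4096, p)
		if err != nil {
			fmt.Println(name, "error:", err)
			continue
		}
		fmt.Printf("%s: mode=%s bits=%d groups per 4096-wide row=%d\n", name, p.mode, p.bits, groups)
	}
}
```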

Import

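import "github.com/ollama/ollama/x/create/client"

Note: the functions listed under Signature are unexported (lowercase names), so they can be called only from code inside this package; the import path identifies where they live.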

I/O Contract

Inputs

Name	Type	Required	Description
r	io.Reader	Yes	Reader providing safetensors tensor data
tensorName	string	Yes	Name of the tensor being quantized
dtype	string	Yes	Data type of the source tensor
shape	[]int32	Yes	Shape of the source tensor
quantize	string	Yes	Quantization type: "int4", "nvfp4", "int8", "mxfp8"

Outputs

Name	Type	Description
blobData	[]byte	Combined safetensors blob with quantized tensors
error	error	Non-nil on quantization or I/O failure
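
Because blobData is a safetensors blob, its contents can be inspected with the standard safetensors layout: an 8-byte little-endian header length, a JSON header, then raw tensor data. The sketch below lists the tensor names in such a blob. The helpers tensorNames and buildBlob and the tensor name "w" are hypothetical; the exact tensor names the quantizer emits are not stated here, so the example builds its own tiny blob.

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"errors"
	"fmt"
	"sort"
)

// tensorNames parses the safetensors header of blob and returns the tensor
// names in sorted order, skipping the reserved "__metadata__" entry.
func tensorNames(blob []byte) ([]string, error) {
	if len(blob) < 8 {
		return nil, errors.New("blob too short for safetensors header length")
	}
	n := binary.LittleEndian.Uint64(blob[:8])
	if n > uint64(len(blob)-8) {
		return nil, errors.New("header length exceeds blob size")
	}
	var header map[string]json.RawMessage
	if err := json.Unmarshal(blob[8:8+n], &header); err != nil {
		return nil, fmt.Errorf("parse header: %w", err)
	}
	names := make([]string, 0, len(header))
	for name := range header {
		if name != "__metadata__" {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names, nil
}

// buildBlob assembles a safetensors blob from a JSON header and raw data.
// It exists only so the example has something to parse.
func buildBlob(header string, data []byte) []byte {
	blob := make([]byte, 8, 8+len(header)+len(data))
	binary.LittleEndian.PutUint64(blob, uint64(len(header)))
	blob = append(blob, header...)
	return append(blob, data...)
}

func main() {
	// Metadata keys quant_type and group_size follow the Description section;
	// the tensor entry "w" is a made-up placeholder.
	header := `{"__metadata__":{"quant_type":"int4","group_size":"32"},` +
		`"w":{"dtype":"U32","shape":[1],"data_offsets":[0,4]}}`
	names, err := tensorNames(buildBlob(header, make([]byte, 4)))
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```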

Usage Examples

// Quantize a single tensor.
blobData, err := quantizeTensor(reader, "model.layers.0.weight", "float16", []int32{4096, 4096}, "int4")
if err != nil {
    return err
}

// Quantize a packed expert group; each input carries its own quantization type.
inputs := []create.PackedTensorInput{
    {Reader: r1, Name: "expert.0.weight", Quantize: "int4"},
    {Reader: r2, Name: "expert.1.weight", Quantize: "int4"},
}
packedData, err := quantizePackedGroup(inputs)
if err != nil {
    return err
}

Related Pages

Principle:Ollama_Ollama_ModelCreation
