Implementation:Ollama Ollama MLXRunner Ops Extra

Knowledge Sources	Ollama
Domains	MLX Runtime, Tensor Operations
Last Updated	2025-02-15 00:00 GMT

Overview

Extended MLX operations including quantization, convenience wrappers, attention primitives, and additional tensor operations not covered in the core ops.go file.

Description

Provides quantization operations (Quantize, Dequantize, QuantizedMatmul, GatherQMM) supporting multiple modes (affine, nvfp4, mxfp8) with configurable group size and bits. Includes function-style wrappers (Add, Sub, Mul, Div, Matmul, Reshape, Transpose), neural network primitives (SiLU, RoPEWithBase, ScaledDotProductAttentionCausal, RMSNormFn), scalar helpers, array constructors, and a reflection-based Collect utility for gathering all Array pointers from nested structures.
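To make the neural network primitives more concrete, here is a minimal pure-Go sketch of the element-wise math that SiLU and RMS normalization conventionally compute. It works on plain float32 values and slices rather than on *Array, and the names silu and rmsNorm are hypothetical stand-ins, not the package's functions. The actual SiLU and RMSNormFn operate on MLX arrays and may differ in precision and broadcasting behavior.

```go
package main

import (
	"fmt"
	"math"
)

// silu is a hypothetical scalar reference: x * sigmoid(x).
func silu(x float32) float32 {
	return x / (1 + float32(math.Exp(float64(-x))))
}

// rmsNorm is a hypothetical reference for RMS normalization over one vector:
// out[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i].
// It assumes len(x) == len(weight) and len(x) > 0.
func rmsNorm(x, weight []float32, eps float32) []float32 {
	var sumSq float64
	for _, v := range x {
		sumSq += float64(v) * float64(v)
	}
	inv := 1 / math.Sqrt(sumSq/float64(len(x))+float64(eps))
	out := make([]float32, len(x))
	for i, v := range x {
		out[i] = float32(float64(v)*inv) * weight[i]
	}
	return out
}

func main() {
	fmt.Println(silu(-1), silu(0), silu(1))
	fmt.Println(rmsNorm([]float32{3, 4}, []float32{1, 1}, 1e-6))
}
```

With a weight of all ones and a negligible eps, the output of rmsNorm has a mean square of roughly 1, which is a quick sanity check for any implementation.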

Usage

Model implementations call these functions for quantized inference, attention computation, and tensor manipulation. The file provides the higher-level API that model code uses on top of the core operations.

Code Reference

Source Location

Repository: Ollama
File: x/mlxrunner/mlx/ops_extra.go
Lines: 1-450

Signature

func Quantize(w *Array, groupSize, bits int, mode string) (weights, scales, biases *Array)
func Dequantize(w, scales, biases *Array, groupSize, bits int, mode string) *Array
func QuantizedMatmul(x, w, scales, biases *Array, transpose bool, groupSize, bits int, mode string) *Array
func GatherQMM(x, w, scales *Array, biases, lhsIndices, rhsIndices *Array, transpose bool, groupSize, bits int, mode string, sortedIndices bool) *Array
func SiLU(a *Array) *Array
func RoPEWithBase(x *Array, dims int, traditional bool, base, scale float32, offset int) *Array
func ScaledDotProductAttentionCausal(q, k, v *Array, scale float32, causalMask bool) *Array
func RMSNormFn(x, weight *Array, eps float32) *Array
func Collect(v any) []*Array
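As a rough illustration of what causal scaled dot-product attention computes, the following pure-Go sketch handles a single head on nested float32 slices. The name causalAttention and the slice layout are assumptions made for this example. Unlike the signature above, it always applies the causal mask and assumes that queries and keys cover the same positions (no cache offset), so it is a reference for the math only, not the package's implementation.

```go
package main

import (
	"fmt"
	"math"
)

// causalAttention is a hypothetical single-head reference.
// q, k, v have shape [seqLen][dim]; row i of the result is
// softmax(scale * q[i]·k[0..i]) applied as weights over v[0..i].
func causalAttention(q, k, v [][]float32, scale float32) [][]float32 {
	out := make([][]float32, len(q))
	for i := range q {
		// Causal mask: query i only sees keys 0..i.
		scores := make([]float64, i+1)
		maxScore := math.Inf(-1)
		for j := 0; j <= i; j++ {
			var dot float64
			for d := range q[i] {
				dot += float64(q[i][d]) * float64(k[j][d])
			}
			scores[j] = dot * float64(scale)
			if scores[j] > maxScore {
				maxScore = scores[j]
			}
		}
		// Numerically stable softmax over the visible keys.
		var sum float64
		for j := range scores {
			scores[j] = math.Exp(scores[j] - maxScore)
			sum += scores[j]
		}
		// Weighted sum of the visible value rows.
		out[i] = make([]float32, len(v[0]))
		for j, s := range scores {
			for d := range out[i] {
				out[i][d] += float32(s / sum * float64(v[j][d]))
			}
		}
	}
	return out
}

func main() {
	q := [][]float32{{1, 0}, {0, 1}}
	k := [][]float32{{1, 0}, {0, 1}}
	v := [][]float32{{1, 2}, {3, 4}}
	scale := float32(1 / math.Sqrt(2)) // 1/sqrt(dim) is the usual choice
	fmt.Println(causalAttention(q, k, v, scale))
}
```

Because the first query can only attend to the first key, the first output row always equals the first value row, regardless of the scores.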

Import

import "github.com/ollama/ollama/x/mlxrunner/mlx"

I/O Contract

Inputs (Quantize)

Name	Type	Required	Description
w	*Array	Yes	Weight tensor to quantize
groupSize	int	Yes	Group size for quantization (e.g. 32, 64)
bits	int	Yes	Bits per weight (4 or 8)
mode	string	Yes	Quantization mode: "affine", "nvfp4", "mxfp8"

Outputs (Quantize)

Name	Type	Description
weights	*Array	Quantized weight data
scales	*Array	Scale factors for dequantization
biases	*Array	Quantization biases (nil for nvfp4)
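The relationship between weights, scales, and biases in the "affine" mode can be sketched with a plain min/max scheme: each group of groupSize values gets its own scale and bias, and every value is stored as an unsigned integer of the given bit width. The functions quantizeAffine and dequantizeAffine below are hypothetical pure-Go stand-ins. They assume the input length is a multiple of groupSize and bits is at most 8, and they keep one quantized value per byte, whereas MLX packs quantized values into 32-bit words and may choose scale and bias slightly differently.

```go
package main

import (
	"fmt"
	"math"
)

// quantizeAffine is a hypothetical reference for group-wise affine quantization.
// For each group: bias = min, scale = (max - min) / (2^bits - 1),
// q = round((w - bias) / scale).
func quantizeAffine(w []float32, groupSize, bits int) (q []uint8, scales, biases []float32) {
	maxQ := (1 << bits) - 1
	q = make([]uint8, len(w))
	for start := 0; start < len(w); start += groupSize {
		group := w[start : start+groupSize]
		lo, hi := group[0], group[0]
		for _, v := range group {
			if v < lo {
				lo = v
			}
			if v > hi {
				hi = v
			}
		}
		scale := (hi - lo) / float32(maxQ)
		if scale == 0 {
			scale = 1 // constant group: any non-zero scale reproduces it
		}
		for i, v := range group {
			q[start+i] = uint8(math.Round(float64((v - lo) / scale)))
		}
		scales = append(scales, scale)
		biases = append(biases, lo)
	}
	return q, scales, biases
}

// dequantizeAffine reverses the mapping: w ≈ q*scale + bias per group.
func dequantizeAffine(q []uint8, scales, biases []float32, groupSize int) []float32 {
	out := make([]float32, len(q))
	for i, v := range q {
		g := i / groupSize
		out[i] = float32(v)*scales[g] + biases[g]
	}
	return out
}

func main() {
	w := []float32{-1, 0, 0.5, 1, 10, 20, 30, 40}
	q, scales, biases := quantizeAffine(w, 4, 4)
	fmt.Println(q, scales, biases)
	fmt.Println(dequantizeAffine(q, scales, biases, 4))
}
```

Smaller groups track the local range of the weights more closely at the cost of storing more scales and biases, which is the trade-off the groupSize parameter controls.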

Usage Examples

// Quantize a weight tensor
qw, scales, biases := mlx.Quantize(weight, 32, 4, "affine")

// Quantized matrix multiplication
out := mlx.QuantizedMatmul(input, qw, scales, biases, true, 32, 4, "affine")

// Attention with causal mask
attn := mlx.ScaledDotProductAttentionCausal(q, k, v, scale, true)

// Collect all arrays from a model struct for evaluation
arrays := mlx.Collect(model)
mlx.Eval(arrays...)
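The idea behind a reflection-based collector can be sketched as follows. This is an illustrative stand-in, not the package's code: the Array type, the Layer and Model structs, and the lowercase collect function are all hypothetical, and the real Collect may differ in which container kinds it visits, whether it deduplicates, and how it treats unexported fields.

```go
package main

import (
	"fmt"
	"reflect"
)

// Array is a hypothetical stand-in for the package's tensor handle.
type Array struct{ Name string }

// Layer and Model are hypothetical nested structures holding arrays.
type Layer struct{ Weight, Bias *Array }
type Model struct {
	Embed  *Array
	Layers []Layer
	Norm   *Array
}

var arrayPtrType = reflect.TypeOf((*Array)(nil))

// collect returns every non-nil *Array reachable from v through exported
// struct fields, pointers, interfaces, slices, arrays, and maps.
func collect(v any) []*Array {
	var out []*Array
	walk(reflect.ValueOf(v), &out, map[uintptr]bool{})
	return out
}

func walk(rv reflect.Value, out *[]*Array, seen map[uintptr]bool) {
	if !rv.IsValid() {
		return
	}
	// Check for *Array before the generic pointer case.
	if rv.Type() == arrayPtrType {
		if !rv.IsNil() && rv.CanInterface() {
			*out = append(*out, rv.Interface().(*Array))
		}
		return
	}
	switch rv.Kind() {
	case reflect.Pointer:
		// Track visited pointers so cyclic structures terminate.
		if rv.IsNil() || seen[rv.Pointer()] {
			return
		}
		seen[rv.Pointer()] = true
		walk(rv.Elem(), out, seen)
	case reflect.Interface:
		if !rv.IsNil() {
			walk(rv.Elem(), out, seen)
		}
	case reflect.Struct:
		for i := 0; i < rv.NumField(); i++ {
			walk(rv.Field(i), out, seen)
		}
	case reflect.Slice, reflect.Array:
		for i := 0; i < rv.Len(); i++ {
			walk(rv.Index(i), out, seen)
		}
	case reflect.Map:
		iter := rv.MapRange()
		for iter.Next() {
			walk(iter.Value(), out, seen)
		}
	}
}

func main() {
	m := &Model{
		Embed:  &Array{Name: "embed"},
		Layers: []Layer{{Weight: &Array{Name: "w0"}}}, // Bias left nil
		Norm:   &Array{Name: "norm"},
	}
	for _, a := range collect(m) {
		fmt.Println(a.Name)
	}
}
```

In this sketch nil pointers are skipped and struct fields are visited in declaration order, so the caller gets a flat slice it can pass on in one call.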

Related Pages

Principle:Ollama_Ollama_MLXRunner_Architecture
