Implementation:Ollama Ollama MLXRunner Ops Extra
| Knowledge Sources | |
|---|---|
| Domains | MLX Runtime, Tensor Operations |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
This file provides extended MLX operations, including quantization, convenience wrappers, attention primitives, and additional tensor operations not covered in the core ops.go file.
Description
Provides quantization operations (Quantize, Dequantize, QuantizedMatmul, GatherQMM) that support multiple modes (affine, nvfp4, mxfp8) with configurable group size and bit width. It also includes function-style wrappers (Add, Sub, Mul, Div, Matmul, Reshape, Transpose), neural network primitives (SiLU, RoPEWithBase, ScaledDotProductAttentionCausal, RMSNormFn), scalar helpers, array constructors, and Collect, a reflection-based utility that gathers all Array pointers from nested structures.
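As a point of reference for the neural network primitives named above, the math that SiLU and RMSNorm conventionally denote can be sketched in plain Go on float32 slices. This is not the MLX-backed code in ops_extra.go: `siluRef` and `rmsNormRef` are hypothetical helpers, and the sketch assumes the usual definitions, silu(x) = x * sigmoid(x) and RMSNorm y_i = w_i * x_i / sqrt(mean(x^2) + eps).

```go
package main

import (
	"fmt"
	"math"
)

// siluRef applies x * sigmoid(x) element-wise.
// Hypothetical reference helper, not part of the mlx package.
func siluRef(xs []float32) []float32 {
	out := make([]float32, len(xs))
	for i, x := range xs {
		out[i] = float32(float64(x) / (1 + math.Exp(-float64(x))))
	}
	return out
}

// rmsNormRef divides one row by its root mean square and scales by weight.
// Hypothetical reference helper; assumes len(weight) == len(x).
func rmsNormRef(x, weight []float32, eps float32) []float32 {
	var sumSq float64
	for _, v := range x {
		sumSq += float64(v) * float64(v)
	}
	inv := 1 / math.Sqrt(sumSq/float64(len(x))+float64(eps))
	out := make([]float32, len(x))
	for i, v := range x {
		out[i] = float32(float64(v) * inv * float64(weight[i]))
	}
	return out
}

func main() {
	fmt.Println(siluRef([]float32{-1, 0, 1}))
	fmt.Println(rmsNormRef([]float32{3, 4}, []float32{1, 1}, 1e-6))
}
```

A slice-based reference like this can be handy for spot-checking values produced by the tensor versions, assuming they follow the same definitions.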
Usage
Model implementations use these functions for quantized inference, attention computation, and tensor manipulation. They form the higher-level API that model code calls.
Code Reference
Source Location
- Repository: Ollama
- File: x/mlxrunner/mlx/ops_extra.go
- Lines: 1-450
Signature
func Quantize(w *Array, groupSize, bits int, mode string) (weights, scales, biases *Array)
func Dequantize(w, scales, biases *Array, groupSize, bits int, mode string) *Array
func QuantizedMatmul(x, w, scales, biases *Array, transpose bool, groupSize, bits int, mode string) *Array
func GatherQMM(x, w, scales *Array, biases, lhsIndices, rhsIndices *Array, transpose bool, groupSize, bits int, mode string, sortedIndices bool) *Array
func SiLU(a *Array) *Array
func RoPEWithBase(x *Array, dims int, traditional bool, base, scale float32, offset int) *Array
func ScaledDotProductAttentionCausal(q, k, v *Array, scale float32, causalMask bool) *Array
func RMSNormFn(x, weight *Array, eps float32) *Array
func Collect(v any) []*Array
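The ScaledDotProductAttentionCausal signature takes q, k, v, a scale, and a causal-mask flag. The computation such a call conventionally denotes, softmax(scale * q k^T) v with each position restricted to itself and earlier positions, can be sketched for a single head in plain Go. `causalAttentionRef` is a hypothetical helper, not the mlx implementation, and it assumes q, k, and v all have the same sequence length (no cache offset).

```go
package main

import (
	"fmt"
	"math"
)

// causalAttentionRef computes single-head attention over [seqLen][dim]
// matrices. Position i attends only to keys j <= i (the causal mask).
// Hypothetical reference helper; assumes len(q) == len(k) == len(v).
func causalAttentionRef(q, k, v [][]float32, scale float32) [][]float32 {
	out := make([][]float32, len(q))
	for i := range q {
		// Scores against keys 0..i; later keys are masked out entirely.
		scores := make([]float64, i+1)
		maxScore := math.Inf(-1)
		for j := 0; j <= i; j++ {
			var dot float64
			for d := range q[i] {
				dot += float64(q[i][d]) * float64(k[j][d])
			}
			scores[j] = dot * float64(scale)
			maxScore = math.Max(maxScore, scores[j])
		}
		// Numerically stable softmax over the unmasked scores.
		var sum float64
		for j := range scores {
			scores[j] = math.Exp(scores[j] - maxScore)
			sum += scores[j]
		}
		// Weighted sum of the value rows.
		out[i] = make([]float32, len(v[0]))
		for j, s := range scores {
			for d := range out[i] {
				out[i][d] += float32(s / sum * float64(v[j][d]))
			}
		}
	}
	return out
}

func main() {
	eye := [][]float32{{1, 0}, {0, 1}}
	fmt.Println(causalAttentionRef(eye, eye, eye, 1))
}
```

Because the first position can only see itself, its output row equals the first value row, which makes a convenient sanity check.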
Import
import "github.com/ollama/ollama/x/mlxrunner/mlx"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| w | *Array | Yes | Weight tensor to quantize |
| groupSize | int | Yes | Group size for quantization (e.g. 32, 64) |
| bits | int | Yes | Bits per weight (4 or 8) |
| mode | string | Yes | Quantization mode: "affine", "nvfp4", "mxfp8" |
Outputs
| Name | Type | Description |
|---|---|---|
| weights | *Array | Quantized weight data |
| scales | *Array | Scale factors for dequantization |
| biases | *Array | Quantization biases (nil for nvfp4) |
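To illustrate how the three outputs relate in the affine mode, here is a minimal sketch of group-wise affine quantization in plain Go. It assumes the common scheme in which each group stores scale = (max - min) / (2^bits - 1) and bias = min, so that w ≈ scale * code + bias. `quantizeAffineRef` and `dequantizeAffineRef` are hypothetical helpers; they leave codes unpacked (one per byte) and are not the kernels behind Quantize and Dequantize.

```go
package main

import (
	"fmt"
	"math"
)

// quantizeAffineRef quantizes w in groups of groupSize values to `bits` bits.
// Hypothetical reference helper: codes are left unpacked (one per uint8).
// Assumes len(w) is a multiple of groupSize and 1 <= bits <= 8.
func quantizeAffineRef(w []float32, groupSize, bits int) (codes []uint8, scales, biases []float32) {
	levels := float32((int(1) << bits) - 1)
	codes = make([]uint8, len(w))
	for start := 0; start < len(w); start += groupSize {
		group := w[start : start+groupSize]
		lo, hi := group[0], group[0]
		for _, v := range group {
			if v < lo {
				lo = v
			}
			if v > hi {
				hi = v
			}
		}
		scale := (hi - lo) / levels
		if scale == 0 {
			scale = 1 // constant group: every code is 0 and the bias carries the value
		}
		for i, v := range group {
			codes[start+i] = uint8(math.Round(float64((v - lo) / scale)))
		}
		scales = append(scales, scale)
		biases = append(biases, lo)
	}
	return codes, scales, biases
}

// dequantizeAffineRef reconstructs approximate weights as scale*code + bias.
func dequantizeAffineRef(codes []uint8, scales, biases []float32, groupSize int) []float32 {
	out := make([]float32, len(codes))
	for i, c := range codes {
		g := i / groupSize
		out[i] = scales[g]*float32(c) + biases[g]
	}
	return out
}

func main() {
	w := []float32{-1, 0.5, 0.25, 1, 2, 2, 2, 2}
	codes, scales, biases := quantizeAffineRef(w, 4, 4)
	fmt.Println(codes, scales, biases)
	fmt.Println(dequantizeAffineRef(codes, scales, biases, 4))
}
```

Under this scheme the round-trip error of each value is at most half of its group's scale, which is why smaller groups or more bits give a closer reconstruction.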
Usage Examples
// Quantize a weight tensor
qw, scales, biases := mlx.Quantize(weight, 32, 4, "affine")
// Quantized matrix multiplication
out := mlx.QuantizedMatmul(input, qw, scales, biases, true, 32, 4, "affine")
// Attention with causal mask
attn := mlx.ScaledDotProductAttentionCausal(q, k, v, scale, true)
// Collect all arrays from a model struct for evaluation
arrays := mlx.Collect(model)
mlx.Eval(arrays...)
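The reflection-based traversal that a utility like Collect performs can be sketched as follows. This is an assumption-laden illustration, not the code in ops_extra.go: `Array` here is a stand-in type, `collectRef`, `layer`, and `model` are hypothetical names, and the choices to skip unexported fields and to return each pointer only once are this sketch's own.

```go
package main

import (
	"fmt"
	"reflect"
)

// Array is a stand-in for the mlx Array type, used only in this sketch.
type Array struct{ name string }

// layer and model are hypothetical nested structures to traverse.
type layer struct{ W, B *Array }

type model struct {
	Embed   *Array
	Layers  []layer
	scratch *Array // unexported: skipped by this sketch
}

var arrayPtrType = reflect.TypeOf((*Array)(nil))

// visitKey identifies a visited pointer by type and address, since a struct
// and its first field can share an address.
type visitKey struct {
	t reflect.Type
	p uintptr
}

// collectRef returns every non-nil *Array reachable through exported fields,
// pointers, interfaces, slices, arrays, and map values.
func collectRef(v any) []*Array {
	var out []*Array
	walk(reflect.ValueOf(v), &out, map[visitKey]bool{})
	return out
}

func walk(v reflect.Value, out *[]*Array, seen map[visitKey]bool) {
	if !v.IsValid() {
		return
	}
	switch v.Kind() {
	case reflect.Ptr:
		if v.IsNil() {
			return
		}
		key := visitKey{v.Type(), v.Pointer()}
		if seen[key] {
			return // already visited; also guards against pointer cycles
		}
		seen[key] = true
		if v.Type() == arrayPtrType {
			*out = append(*out, v.Interface().(*Array))
			return
		}
		walk(v.Elem(), out, seen)
	case reflect.Interface:
		if !v.IsNil() {
			walk(v.Elem(), out, seen)
		}
	case reflect.Struct:
		for i := 0; i < v.NumField(); i++ {
			if v.Type().Field(i).IsExported() {
				walk(v.Field(i), out, seen)
			}
		}
	case reflect.Slice, reflect.Array:
		for i := 0; i < v.Len(); i++ {
			walk(v.Index(i), out, seen)
		}
	case reflect.Map:
		for it := v.MapRange(); it.Next(); {
			walk(it.Value(), out, seen)
		}
	}
}

func main() {
	shared := &Array{name: "shared"}
	m := &model{
		Embed:  &Array{name: "embed"},
		Layers: []layer{{W: shared}, {W: &Array{name: "w1"}, B: shared}},
	}
	fmt.Println(len(collectRef(m)))
}
```

Gathering arrays this way lets a caller pass a whole model struct to a single evaluation call instead of listing each tensor by hand.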