Principle:Ggml org Ggml Quantization Block Formats

From Leeroopedia


| Attribute | Value |
|---|---|
| Page Type | Principle |
| Full Name | Ggml_org_Ggml_Quantization_Block_Formats |
| Short Name | Quantization_Block_Formats |
| Domain Tags | Quantization, Data_Format |
| Knowledge Source | GGML |
| Last Updated | 2026-02-10 |

Overview

Defining portable binary data structures for quantized tensor storage that are shared across all GGML backends -- CPU, GPU, and accelerator alike.

Description

Quantization Block Formats is the principle of establishing a universal set of packed binary structures that represent quantized tensor data in a backend-agnostic manner. Every quantization type in GGML is defined as a fixed-size C structure in the shared header ggml-common.h, which is compiled into CPU code (C/C++) and GPU code (Metal, CUDA, HIP, SYCL), as listed in the platform typedefs below. This shared definition ensures that quantized tensors can be serialized to disk, loaded into any backend's memory, and processed by any backend's kernels without format conversion.

Each block structure groups a fixed number of quantized values (the block size, typically 32 or 256 elements) together with per-block calibration parameters:

| Format Family | Block Size | Structure Fields | Key Characteristic |
|---|---|---|---|
| Q4_0 | 32 | ggml_half d (delta/scale) + uint8_t qs[16] (nibbles) | Simplest 4-bit: symmetric, single scale |
| Q4_1 | 32 | ggml_half d + ggml_half m (min) + uint8_t qs[16] | 4-bit with zero-point offset |
| Q5_0, Q5_1 | 32 | Scale(s) + uint8_t qh[4] (high bits) + uint8_t qs[16] | 5-bit: extra bit stored separately |
| Q8_0, Q8_1 | 32 | Scale(s) + int8_t qs[32] | 8-bit: full byte per value |
| Q2_K to Q6_K | 256 | Super-block scale + quantized sub-scales + quants | K-quant: hierarchical scaling |
| IQ types | 256 (32 for IQ4_NL) | Importance-weighted encoding | Non-uniform bit allocation |
| MXFP4 | 32 | uint8_t e (E8M0 scale) + uint8_t qs[16] | Microscaling 4-bit float |

Floating-point scale factors use ggml_half, a typedef that maps to the appropriate 16-bit float representation on each platform: uint16_t on CPU (raw bits), half on Metal, CUDA, and HIP, and sycl::half on SYCL. MXFP4 is the exception in the table above: it stores its shared scale as a single E8M0 exponent byte rather than a ggml_half. A static_assert after each structure definition checks that the structure's size equals the sum of its field sizes, so any compiler-inserted padding fails the build. This is a critical requirement for cross-platform binary compatibility.
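As a rough sketch of the layouts in the table, the following C11 fragment mirrors three of the 32-element blocks for a CPU build, where ggml_half is taken to be uint16_t. The field names follow the table; the QK macro name and the assertion messages are illustrative assumptions, and block_q4_1 is written without the union that a real header might use, so this is not a copy of ggml-common.h.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* CPU-side stand-in for the 16-bit float type: raw IEEE half bits. */
typedef uint16_t ggml_half;

#define QK 32  /* elements per block for the formats below (illustrative name) */

typedef struct {
    ggml_half d;           /* scale (delta) */
    uint8_t   qs[QK / 2];  /* 32 four-bit values, two per byte */
} block_q4_0;

typedef struct {
    ggml_half d;           /* scale */
    ggml_half m;           /* minimum (offset) */
    uint8_t   qs[QK / 2];  /* 32 four-bit values, two per byte */
} block_q4_1;

typedef struct {
    ggml_half d;           /* scale */
    int8_t    qs[QK];      /* one signed byte per value */
} block_q8_0;

/* Size must equal the sum of the fields; any padding breaks the build. */
static_assert(sizeof(block_q4_0) == sizeof(ggml_half) + QK / 2,
              "unexpected padding in block_q4_0");
static_assert(sizeof(block_q4_1) == 2 * sizeof(ggml_half) + QK / 2,
              "unexpected padding in block_q4_1");
static_assert(sizeof(block_q8_0) == sizeof(ggml_half) + QK,
              "unexpected padding in block_q8_0");
```

Under these assumptions the three blocks occupy 18, 20, and 34 bytes. Because every field is at most 2-byte aligned and each total is even, no padding is needed on common compilers.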

Usage

Quantization block formats are foundational to GGML's quantization ecosystem:

  • Model file serialization: GGUF model files store weight tensors directly in these block formats. A model quantized on one platform can be loaded and executed on any other without conversion.
  • Cross-backend compatibility: A tensor in Q4_K format has identical binary representation whether it resides in CPU memory, CUDA device memory, or Metal GPU memory. Backends can directly read quantized blocks without format translation.
  • Kernel development: Every backend that implements quantized operations (dequantization, dot products, matrix multiplication) operates on these shared structure definitions, ensuring consistent semantics.
  • New format extension: Adding a new quantization type starts with defining its block structure in ggml-common.h, which makes the layout visible to every backend that includes this header. Each backend still has to supply its own kernels for the new type before it can operate on it.

Theoretical Basis

Block Quantization

Block quantization partitions a tensor's elements into fixed-size groups (blocks) and computes per-block quantization parameters (a scale and, for asymmetric formats, a minimum or zero-point). This is a middle ground between per-tensor quantization (one set of parameters for the entire tensor, which cannot track local variation in magnitude) and per-element parameters (which would cost more storage than the values themselves). With block sizes of 32 or 256, the cost of the parameters is amortized across many quantized values: a 16-bit scale shared by 32 elements adds 0.5 bits per element, and a scale plus a minimum adds 1 bit per element.
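The amortization argument can be checked with a little arithmetic. The helpers below are hypothetical names written for this page, not GGML functions; the block byte counts used in the usage note are derived from the field lists in the table above (for example, 2 + 16 = 18 bytes for Q4_0).

```c
#include <stddef.h>

/* Average storage cost, in bits per element, of a block that occupies
 * `block_bytes` bytes and encodes `block_elems` values. */
static double bits_per_element(size_t block_bytes, size_t block_elems) {
    return (double)(block_bytes * 8) / (double)block_elems;
}

/* Bits per element spent on per-block parameters (scale, minimum)
 * rather than on the quantized values themselves. */
static double overhead_bits(size_t block_bytes, size_t block_elems,
                            double quant_bits) {
    return bits_per_element(block_bytes, block_elems) - quant_bits;
}
```

For example, bits_per_element(18, 32) gives 4.5 for Q4_0 (0.5 bits of overhead on top of 4-bit values), and overhead_bits(20, 32, 4.0) gives 1.0 for Q4_1, which carries both a scale and a minimum.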

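For symmetric formats the dequantization formula is x_i = d * q_i, where d is the block scale and q_i is a signed quantized integer. Q8_0 stores q_i directly as an int8_t; Q4_0 stores an unsigned nibble n_i in the range 0 to 15 and recovers q_i = n_i - 8. For asymmetric formats (Q4_1, Q5_1) the formula is x_i = d * q_i + m, where q_i is the unsigned stored value and m is the minimum value offset.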
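A minimal sketch of both formulas for the 4-bit formats follows. It assumes a CPU build where ggml_half holds raw IEEE binary16 bits, that Q4_0 nibbles are stored unsigned with an offset of 8, and that the low nibble of qs[j] holds element j while the high nibble holds element j + 16. The function names and the half-to-float helper are written for this example and are not GGML's API.

```c
#include <stdint.h>
#include <string.h>

typedef uint16_t ggml_half;  /* raw IEEE binary16 bits (CPU assumption) */

typedef struct { ggml_half d; uint8_t qs[16]; } block_q4_0;
typedef struct { ggml_half d; ggml_half m; uint8_t qs[16]; } block_q4_1;

/* Convert binary16 bits to float by rebuilding the binary32 bit pattern. */
static float half_bits_to_float(ggml_half h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0 && man == 0) {
        bits = sign;                                  /* signed zero */
    } else if (exp == 0) {
        int e = -1;                                   /* subnormal: normalize */
        do { e++; man <<= 1; } while ((man & 0x400u) == 0);
        bits = sign | ((uint32_t)(112 - e) << 23) | ((man & 0x3FFu) << 13);
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);      /* infinity or NaN */
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Symmetric: x = d * (nibble - 8). */
static void dequant_q4_0(const block_q4_0 *b, float out[32]) {
    const float d = half_bits_to_float(b->d);
    for (int j = 0; j < 16; ++j) {
        out[j]      = d * (float)((b->qs[j] & 0x0F) - 8);
        out[j + 16] = d * (float)((b->qs[j] >> 4) - 8);
    }
}

/* Asymmetric: x = d * nibble + m. */
static void dequant_q4_1(const block_q4_1 *b, float out[32]) {
    const float d = half_bits_to_float(b->d);
    const float m = half_bits_to_float(b->m);
    for (int j = 0; j < 16; ++j) {
        out[j]      = d * (float)(b->qs[j] & 0x0F) + m;
        out[j + 16] = d * (float)(b->qs[j] >> 4) + m;
    }
}
```

A GPU backend would typically use its native half type instead of the bit-level conversion shown here; the per-element arithmetic is the same.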

Hierarchical (K-Quant) Scaling

The K-quant formats (Q2_K through Q6_K) use a two-level scaling hierarchy. A super-block of 256 elements is divided into sub-blocks of 16 or 32 elements. Each sub-block has its own quantized scale factor (stored as an integer of 4, 6, or 8 bits, depending on the format), and the super-block provides a floating-point super-scale that converts those integers into real sub-block scales. Some K-quant formats store per-sub-block minimums in the same way, scaled by a second super-block value. This hierarchical approach captures local variations in weight distributions more accurately than a single flat scale, improving quantization quality at the same average bit width.
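The two-level idea can be illustrated with a deliberately simplified layout. The struct below is a toy, not a GGML block: it assumes 8 sub-blocks of 32 elements, keeps the super-scale as a plain float, and spends a whole byte on each sub-scale and each quantized value, whereas the real K-quant structures bit-pack these fields. All names are hypothetical.

```c
#include <stdint.h>

#define SUPER_BLOCK 256                        /* elements per super-block */
#define SUB_BLOCK   32                         /* elements per sub-block (assumed) */
#define N_SUB       (SUPER_BLOCK / SUB_BLOCK)  /* 8 sub-blocks */

/* Toy layout for illustration only; real K-quants pack these fields tightly. */
typedef struct {
    float   d;               /* super-block scale (a 16-bit float in GGML) */
    uint8_t sc[N_SUB];       /* integer sub-block scales, e.g. 0..63 for 6 bits */
    int8_t  q[SUPER_BLOCK];  /* quantized values, one per byte for clarity */
} toy_kquant_block;

/* x = (d * sc[s]) * q: the super-scale turns each integer sub-scale
 * into the real scale of its sub-block. */
static void toy_kquant_dequant(const toy_kquant_block *b,
                               float out[SUPER_BLOCK]) {
    for (int s = 0; s < N_SUB; ++s) {
        const float scale = b->d * (float)b->sc[s];
        for (int i = 0; i < SUB_BLOCK; ++i) {
            const int idx = s * SUB_BLOCK + i;
            out[idx] = scale * (float)b->q[idx];
        }
    }
}
```

The design point is that only one floating-point value is stored per 256 elements, while each sub-block still gets its own effective scale at the cost of a few bits.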

Cross-Platform Binary Compatibility

Achieving identical binary layout across C, C++, CUDA, Metal Shading Language, and SYCL requires careful management of type sizes, alignment, and padding. GGML addresses this through:

  • Platform-specific ggml_half typedefs (uint16_t for CPU, native half for GPU).
  • Explicit static_assert checks on structure sizes.
  • The GGML_COMMON_AGGR_U / GGML_COMMON_AGGR_S macros that handle anonymous union/struct differences across language standards.
  • Conditional compilation blocks (GGML_COMMON_DECL_C, GGML_COMMON_DECL_CUDA, GGML_COMMON_DECL_METAL, etc.) that select the appropriate type definitions for each target.
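The mechanisms in the list above can be sketched in one small header-style fragment. The DEMO_ and demo_ names are hypothetical stand-ins chosen for this example; the real header uses the GGML_COMMON_DECL_* selectors and the GGML_COMMON_AGGR_* macros named above, and its exact contents may differ. Only the plain C branch is exercised here.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Hypothetical selector for this sketch: pick the 16-bit float type per target. */
#if defined(DEMO_DECL_CUDA)
#include <cuda_fp16.h>
typedef half  demo_half;    /* native half on the GPU */
typedef half2 demo_half2;   /* packed pair of halves */
#else
typedef uint16_t demo_half;   /* plain C: raw 16-bit pattern */
typedef uint32_t demo_half2;  /* two halves viewed as one 32-bit word */
#endif

typedef struct {
    union {
        struct {             /* anonymous struct: fields accessed as b.d, b.m */
            demo_half d;     /* scale */
            demo_half m;     /* minimum */
        };
        demo_half2 dm;       /* the same 4 bytes as one packed value */
    };
    uint8_t qs[16];          /* 32 four-bit values */
} demo_block_q4_1;

/* Layout checks: identical answers are required on every target. */
static_assert(sizeof(demo_half) == 2, "half type must be 16 bits");
static_assert(sizeof(demo_block_q4_1) == 2 * sizeof(demo_half) + 16,
              "unexpected padding in demo_block_q4_1");
static_assert(offsetof(demo_block_q4_1, qs) == 4,
              "quants must start right after the two halves");
```

Anonymous structs and unions are standard in C11 but are handled differently by some other language dialects, which is the kind of difference the aggregate macros exist to paper over.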

Related Pages

Implemented By
