Principle:Ggml org Ggml Quantization Block Formats
| Attribute | Value |
|---|---|
| Page Type | Principle |
| Full Name | Ggml_org_Ggml_Quantization_Block_Formats |
| Short Name | Quantization_Block_Formats |
| Domain Tags | Quantization, Data_Format |
| Knowledge Source | GGML |
| Last Updated | 2026-02-10 |
Overview
Defining portable binary data structures for quantized tensor storage that are shared across all GGML backends -- CPU, GPU, and accelerator alike.
Description
Quantization Block Formats is the principle of establishing a universal set of packed binary structures that represent quantized tensor data in a backend-agnostic manner. Every quantization type in GGML is defined as a fixed-size C structure in the shared header ggml-common.h, which is compiled into CPU code (C/C++), GPU code (Metal, CUDA, HIP, SYCL, Vulkan), and other backend implementations. Because every backend sees the same definition, quantized tensors can be serialized to disk, loaded into any backend's memory, and processed by any backend's kernels without format conversion.
Each block structure groups a fixed number of quantized values (the block size, typically 32 or 256 elements) together with per-block calibration parameters:
| Format Family | Block Size | Structure Fields | Key Characteristic |
|---|---|---|---|
| Q4_0 | 32 | ggml_half d (delta/scale) + uint8_t qs[16] (nibbles) | Simplest 4-bit: symmetric, single scale |
| Q4_1 | 32 | ggml_half d + ggml_half m (min) + uint8_t qs[16] | 4-bit with zero-point offset |
| Q5_0, Q5_1 | 32 | Scale(s) + uint8_t qh[4] (high bits) + uint8_t qs[16] | 5-bit: extra bit stored separately |
| Q8_0, Q8_1 | 32 | Scale(s) + int8_t qs[32] | 8-bit: full byte per value |
| Q2_K - Q6_K | 256 | Super-block scale + quantized sub-scales + quants | K-quant: hierarchical scaling |
| IQ types | 256 (32 for IQ4_NL) | Importance-weighted encoding | Non-uniform bit allocation |
| MXFP4 | 32 | uint8_t e (E8M0 scale) + uint8_t qs[16] | Microscaling 4-bit float |
Structures that carry floating-point scale factors store them as ggml_half, a typedef that maps to the appropriate 16-bit float type on each platform: uint16_t on CPU, half on Metal, CUDA, and HIP, and sycl::half on SYCL. Every variant is exactly two bytes wide, which keeps the layout binary compatible. MXFP4 is the exception in the table above: its scale is a single-byte E8M0 exponent rather than a ggml_half. The static_assert statements after each structure definition verify that the compiler produces the expected byte layout with no padding, a critical requirement for cross-platform binary compatibility.
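As a concrete illustration of the layout and the size check described above, here is a minimal CPU-side sketch of the Q4_0 block. The field layout follows the table; the helper q4_0_row_size is a hypothetical name introduced only for this example, and ggml_half is fixed to uint16_t as it would be on the CPU path.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* size_t, offsetof */
#include <stdint.h>

#define QK4_0 32                 /* elements per Q4_0 block */
typedef uint16_t ggml_half;      /* CPU-side stand-in: raw fp16 bits */

typedef struct {
    ggml_half d;                 /* block scale (delta) */
    uint8_t   qs[QK4_0 / 2];     /* 32 four-bit values, two per byte */
} block_q4_0;

/* 2 bytes of scale + 16 bytes of nibbles; any padding would break this. */
static_assert(sizeof(block_q4_0) == sizeof(ggml_half) + QK4_0 / 2,
              "wrong q4_0 block size/padding");

/* Hypothetical helper: bytes needed to store n elements.
 * Assumes n is a multiple of QK4_0. */
static size_t q4_0_row_size(size_t n) {
    return (n / QK4_0) * sizeof(block_q4_0);
}
```

Because the struct's strictest member alignment is 2 bytes and its payload is 18 bytes, no compiler padding is expected; the static_assert turns any deviation into a build error instead of a silent file-format mismatch.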
Usage
Quantization block formats are foundational to GGML's quantization ecosystem:
- Model file serialization: GGUF model files store weight tensors directly in these block formats. A model quantized on one platform can be loaded and executed on any other without conversion.
- Cross-backend compatibility: A tensor in Q4_K format has identical binary representation whether it resides in CPU memory, CUDA device memory, or Metal GPU memory. Backends can directly read quantized blocks without format translation.
- Kernel development: Every backend that implements quantized operations (dequantization, dot products, matrix multiplication) operates on these shared structure definitions, ensuring consistent semantics.
- New format extension: Adding a new quantization type starts with defining its block structure in ggml-common.h, which makes the layout available to every backend that includes this header. Each backend still needs kernels that operate on the new structure.
Theoretical Basis
Block Quantization
Block quantization partitions a tensor's elements into fixed-size groups (blocks) and computes per-block quantization parameters (scale, zero-point). This provides a middle ground between per-tensor quantization (one set of parameters for the entire tensor, which loses fine-grained information) and per-element quantization (too much overhead). With block sizes of 32 or 256, the overhead of storing scale factors is amortized across many quantized values, typically adding only 0.5-2 bits per element to the storage cost.
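The overhead figure can be checked with simple arithmetic: the effective bits per element are the total block size in bits divided by the number of elements in the block. The sketch below uses block byte sizes derived from the structure fields listed in the table (for example, 2 + 16 = 18 bytes for Q4_0); bits_per_element is a hypothetical helper name for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Effective storage cost: total block bits divided by elements per block. */
static double bits_per_element(size_t block_bytes, size_t block_elems) {
    return (double)(block_bytes * 8) / (double)block_elems;
}

/* Block sizes implied by the field lists (32 elements each):
 *   Q4_0: 2 (d)         + 16 (qs) = 18 bytes
 *   Q4_1: 2 (d) + 2 (m) + 16 (qs) = 20 bytes
 *   Q8_0: 2 (d)         + 32 (qs) = 34 bytes
 */
```

Under these sizes, Q4_0 costs 4.5 bits per element (0.5 bits of scale overhead), Q4_1 costs 5.0 (1 bit of overhead for scale plus minimum), and Q8_0 costs 8.5.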
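The dequantization formula for symmetric quantization (Q4_0, Q8_0) is: x_i = d * q_i, where d is the block scale and q_i is the signed quantized integer. Q8_0 stores q_i directly as an int8_t; Q4_0 stores an unsigned nibble and subtracts 8 to recover it. For asymmetric quantization (Q4_1, Q5_1): x_i = d * q_i + m, where q_i is the unsigned stored value and m is the minimum value offset.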
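The two formulas can be sketched for the 4-bit case as follows. This is an illustrative sketch, not GGML's kernel: the function names are hypothetical, the scale and minimum are passed as already-decoded float values (the real blocks store them as ggml_half), and the nibble order assumed here places element j in the low nibble of qs[j] and element j + 16 in the high nibble. For the symmetric case the stored nibble is unsigned, so 8 is subtracted to obtain the signed q_i.

```c
#include <assert.h>
#include <stdint.h>

#define QK 32   /* elements per block */

/* Symmetric (Q4_0-style): x_i = d * q_i, with q_i = stored nibble - 8. */
static void dequant_sym4(float d, const uint8_t qs[QK / 2], float out[QK]) {
    for (int j = 0; j < QK / 2; ++j) {
        out[j]          = d * (float)((qs[j] & 0x0F) - 8);  /* low nibble  */
        out[j + QK / 2] = d * (float)((qs[j] >> 4) - 8);    /* high nibble */
    }
}

/* Asymmetric (Q4_1-style): x_i = d * q_i + m, with q_i the unsigned nibble. */
static void dequant_asym4(float d, float m, const uint8_t qs[QK / 2],
                          float out[QK]) {
    for (int j = 0; j < QK / 2; ++j) {
        out[j]          = d * (float)(qs[j] & 0x0F) + m;
        out[j + QK / 2] = d * (float)(qs[j] >> 4) + m;
    }
}
```

The symmetric form spends all 16 codes around zero, while the asymmetric form shifts the representable range to start at m, which suits blocks whose values are not centered on zero at the cost of one extra stored parameter.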
Hierarchical (K-Quant) Scaling
The K-quant formats (Q2_K through Q6_K) use a two-level scaling hierarchy. A super-block of 256 elements is divided into sub-blocks of 16 or 32 elements. Each sub-block has its own quantized scale factor, stored as a small integer (4 or 6 bits in most K-quants, 8 bits in Q6_K), and the super-block provides a floating-point super-scale that calibrates all sub-block scales. The effective scale of a sub-block is therefore the product of the super-scale and its integer sub-scale. This hierarchical approach captures local variations in weight distributions more accurately than a single flat scale, improving quantization quality at the same average bit width.
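The two-level idea can be sketched with a deliberately simplified, unpacked super-block. Everything here is an assumption made for illustration: toy_superblock and toy_dequant are hypothetical names, the quants and sub-scales are held in whole bytes rather than bit-packed, the super-scale is a float rather than a ggml_half, and no per-sub-block minimum is modeled. The real K-quant structures differ in all of these details.

```c
#include <assert.h>
#include <stdint.h>

#define QK_K      256   /* elements per super-block */
#define SUB_BLOCK 16    /* elements per sub-block (assumed for this sketch) */

/* Simplified, unpacked stand-in for a K-quant super-block (hypothetical). */
typedef struct {
    float  d;                          /* super-block scale */
    int8_t scales[QK_K / SUB_BLOCK];   /* one integer scale per sub-block */
    int8_t q[QK_K];                    /* quantized values, one per byte */
} toy_superblock;

/* x_i = (d * scales[i / SUB_BLOCK]) * q_i */
static void toy_dequant(const toy_superblock *b, float out[QK_K]) {
    for (int i = 0; i < QK_K; ++i) {
        float sub_scale = b->d * (float)b->scales[i / SUB_BLOCK];
        out[i] = sub_scale * (float)b->q[i];
    }
}
```

Storing one floating-point value per 256 elements and only a few bits per sub-block keeps the metadata small while still letting each group of 16 elements have its own effective scale.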
Cross-Platform Binary Compatibility
Achieving identical binary layout across C, C++, CUDA, Metal Shading Language, and SYCL requires careful management of type sizes, alignment, and padding. GGML addresses this through:
- Platform-specific ggml_half typedefs (uint16_t for CPU, native half for GPU).
- Explicit static_assert checks on structure sizes.
- The GGML_COMMON_AGGR_U / GGML_COMMON_AGGR_S macros that handle anonymous union/struct differences across language standards.
- Conditional compilation blocks (GGML_COMMON_DECL_C, GGML_COMMON_DECL_CUDA, GGML_COMMON_DECL_METAL, etc.) that select the appropriate type definitions for each target.
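The typedef selection can be sketched as a small conditional-compilation block. This is a simplified illustration, not a copy of ggml-common.h: the macro names are the ones listed above, but the branch structure, and in particular the fallback to the C typedef when no macro is defined, is an assumption made so the fragment compiles on its own.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

#if defined(GGML_COMMON_DECL_CUDA)
#include <cuda_fp16.h>
typedef half ggml_half;          /* native fp16 type on CUDA */
#elif defined(GGML_COMMON_DECL_METAL)
typedef half ggml_half;          /* native fp16 type in Metal Shading Language */
#else
/* GGML_COMMON_DECL_C path; also used as the fallback in this sketch. */
typedef uint16_t ggml_half;      /* raw fp16 bits, converted in software */
#endif

/* Whatever the target, the scale must occupy exactly two bytes. */
static_assert(sizeof(ggml_half) == 2, "ggml_half must be 16 bits wide");
```

Only the type name changes per target; the width and position of the field inside each block stay fixed, which is what lets the same bytes be interpreted by every backend.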