Implementation: ggml-org/ggml common quantization types (ggml-common.h)
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (Shared Header) |
| Knowledge Sources | GGML |
| Domains | ML_Infrastructure, Tensor_Computing, Quantization |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
Shared header defining quantization block data structures and constants used identically across all GGML backends (CPU, CUDA, Metal, HIP, SYCL, CANN).
Description
ggml-common.h (1,878 lines) ensures binary-compatible quantized data layouts across every compute backend. This is essential for model portability -- a GGUF model file with quantized weights can be loaded and processed correctly by any backend without data conversion because they all share these exact same struct definitions.
The header uses a preprocessor-driven multi-target inclusion pattern. Before including this file, callers define one of:
- `GGML_COMMON_DECL_C` -- for C source files
- `GGML_COMMON_DECL_CPP` -- for C++ source files
- `GGML_COMMON_DECL_METAL` -- for Metal shader files
- `GGML_COMMON_DECL_CUDA` -- for CUDA kernels
- `GGML_COMMON_DECL_HIP` -- for HIP (AMD ROCm) kernels
- `GGML_COMMON_DECL_SYCL` -- for SYCL (Intel oneAPI) kernels
This sets ggml_half and ggml_half2 to the appropriate platform-specific half-precision type and configures union/struct aggregation macros.
Quantization block types defined:
- Basic types (QK=32): `block_q4_0`, `block_q4_1`, `block_q5_0`, `block_q5_1`, `block_q8_0`, `block_q8_1`, `block_mxfp4`
- Ternary types (QK=256): `block_tq1_0` (1.6875 bpw), `block_tq2_0` (2.0625 bpw)
- K-quant super-block types (QK_K=256): `block_q2_K` through `block_q8_K`, with varying bits-per-weight (2.625 to 8)
- IQ (importance matrix) types: `block_iq1_s`, `block_iq1_m`, `block_iq2_xxs`, `block_iq2_xs`, `block_iq2_s`, `block_iq3_xxs`, `block_iq3_s`, `block_iq4_nl`, `block_iq4_xs`
Each block type includes static_assert checks for correct size and padding. For GPU backends (CUDA, HIP, SYCL), the header additionally defines QR and QI constants used in dequantization kernels.
Key constants:
- `QK_K = 256` -- super-block size for K-quant types
- `K_SCALE_SIZE = 12` -- size of scale data in K-quant blocks
Usage
Include this header in any backend implementation that needs to read or write quantized tensor data. Define the appropriate GGML_COMMON_DECL_* macro before inclusion to select the correct half-precision type for the target platform.
Code Reference
Source Location
GGML repo, file: src/ggml-common.h, 1878 lines.
Signature
```c
// Platform-specific type definitions
typedef uint16_t ggml_half;   // C/C++ (half / sycl::half on GPU targets)
typedef uint32_t ggml_half2;  // two packed halves

// Super-block size constants
#define QK_K         256
#define K_SCALE_SIZE 12

// Example quantization block structures
typedef struct {
    ggml_half d;            // delta (scale factor)
    uint8_t   qs[QK4_0/2];  // nibbles / quants
} block_q4_0;

typedef struct {
    ggml_half d;          // delta
    int8_t    qs[QK8_0];  // quants
} block_q8_0;

typedef struct {
    uint8_t    scales[QK_K/16]; // scales and mins, 4 bits each
    uint8_t    qs[QK_K/4];      // 2-bit quants
    ggml_half2 dm;              // d and dmin packed
} block_q2_K;
```
Import
```c
#define GGML_COMMON_DECL_C // or _CPP, _METAL, _CUDA, _HIP, _SYCL
#include "ggml-common.h"
```
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| Preprocessor macro | define | Yes | One of GGML_COMMON_DECL_C, GGML_COMMON_DECL_CPP, GGML_COMMON_DECL_METAL, GGML_COMMON_DECL_CUDA, GGML_COMMON_DECL_HIP, or GGML_COMMON_DECL_SYCL, defined before including the header. |
Outputs
| Output | Type | Description |
|---|---|---|
| Type definitions | C/C++ types | Platform-appropriate ggml_half, ggml_half2, and all block_* quantization structures. |
| Constants | preprocessor macros | QK_K, K_SCALE_SIZE, QK4_0, QK8_0, and GPU-specific QR_*/QI_* values. |
Usage Examples
Including in a CUDA Backend
```cuda
#define GGML_COMMON_DECL_CUDA
#include "ggml-common.h"

// Now block_q4_0, block_q8_0, etc. use the CUDA half type
__global__ void dequantize_q4_0(const block_q4_0 * src, float * dst) {
    // access src->d (ggml_half = half) and src->qs here
}
```
Including in a C Source File
```c
#define GGML_COMMON_DECL_C
#include "ggml-common.h"

// ggml_half is uint16_t; block types use standard C types
void process_q8_block(const block_q8_0 * block) {
    float scale = ggml_fp16_to_fp32(block->d);
    for (int i = 0; i < QK8_0; i++) {
        float val = block->qs[i] * scale; // dequantized weight i
    }
}
```