Implementation:Ggml org Ggml Common quantization types

Metadata

Field	Value
Page Type	Implementation (Shared Header)
Knowledge Sources	GGML
Domains	ML_Infrastructure, Tensor_Computing, Quantization
Last Updated	2026-02-10 12:00 GMT

Overview

Shared header defining quantization block data structures and constants used identically across all GGML backends (CPU, CUDA, Metal, HIP, SYCL, CANN).

Description

ggml-common.h (1,878 lines) keeps quantized data layouts binary-compatible across every compute backend. This is essential for model portability: because every backend compiles the same struct definitions, a GGUF model file with quantized weights can be loaded and processed by any of them without data conversion.

The header uses a preprocessor-driven multi-target inclusion pattern. Before including this file, callers define one of:

GGML_COMMON_DECL_C -- for C source files
GGML_COMMON_DECL_CPP -- for C++ source files
GGML_COMMON_DECL_METAL -- for Metal shader files
GGML_COMMON_DECL_CUDA -- for CUDA kernels
GGML_COMMON_DECL_HIP -- for HIP (AMD ROCm) kernels
GGML_COMMON_DECL_SYCL -- for SYCL (Intel oneAPI) kernels

This sets ggml_half and ggml_half2 to the appropriate platform-specific half-precision type and configures union/struct aggregation macros.
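
The selection pattern described above can be sketched roughly as follows. This is a minimal illustration rather than the header's actual contents: every DEMO_* and demo_* name is a hypothetical stand-in, and only the C branch is active so the snippet builds with a plain C11 compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the macro a caller defines before the include. */
#define DEMO_DECL_C

#if defined(DEMO_DECL_C) || defined(DEMO_DECL_CPP)
/* Host code: keep half-precision values as raw 16-bit patterns. */
typedef uint16_t demo_half;
typedef uint32_t demo_half2;
#elif defined(DEMO_DECL_CUDA)
/* Device code: use the native CUDA half types instead. */
#include <cuda_fp16.h>
typedef half  demo_half;
typedef half2 demo_half2;
#else
#error "define one DEMO_DECL_* macro before this point"
#endif

/* Declared once in terms of demo_half, so the byte layout stays the same
 * on every target whose half type is 2 bytes wide. */
typedef struct {
    demo_half d;      /* per-block scale */
    uint8_t   qs[16]; /* packed quants */
} demo_block;

static_assert(sizeof(demo_half)  == 2, "half must be 16 bits");
static_assert(sizeof(demo_half2) == 4, "half2 must be 32 bits");

/* Helper so callers can inspect the resulting block size at run time. */
static size_t demo_block_size(void) { return sizeof(demo_block); }
```

Switching DEMO_DECL_C for another target macro changes only the typedefs; the struct declaration itself is untouched, which is what keeps the on-disk layout shared.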

Quantization block types defined:

Basic types (QK=32): block_q4_0, block_q4_1, block_q5_0, block_q5_1, block_q8_0, block_q8_1, block_mxfp4
Ternary types (QK=256): block_tq1_0 (1.6875 bpw), block_tq2_0 (2.0625 bpw)
K-quant super-block types (QK_K=256): block_q2_K through block_q8_K, from 2.625 bits per weight for block_q2_K up to 8-bit quants in block_q8_K
IQ (importance matrix) types: block_iq1_s, block_iq1_m, block_iq2_xxs, block_iq2_xs, block_iq2_s, block_iq3_xxs, block_iq3_s, block_iq4_nl, block_iq4_xs

Each block type includes static_assert checks for correct size and padding. For GPU backends (CUDA, HIP, SYCL), the header additionally defines QR and QI constants used in dequantization kernels.
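
To make the size checks and bits-per-weight figures concrete, the sketch below mirrors three of the layouts locally and derives their bits per weight from sizeof. The demo_* names are hypothetical copies made for illustration, and the block lengths of 32 and 256 are taken from the QK values listed above; the real header's assertions may be worded differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirrors for illustration only; all demo_* names are hypothetical. */
typedef uint16_t demo_half;
typedef uint32_t demo_half2;

#define DEMO_QK4_0 32
#define DEMO_QK8_0 32
#define DEMO_QK_K  256

typedef struct { demo_half d; uint8_t qs[DEMO_QK4_0 / 2]; } demo_block_q4_0;
typedef struct { demo_half d; int8_t  qs[DEMO_QK8_0];     } demo_block_q8_0;
typedef struct {
    uint8_t    scales[DEMO_QK_K / 16];
    uint8_t    qs[DEMO_QK_K / 4];
    demo_half2 dm;
} demo_block_q2_K;

/* Compile-time checks: the struct must be exactly the sum of its fields,
 * i.e. the compiler inserted no padding. */
static_assert(sizeof(demo_block_q4_0) == sizeof(demo_half) + DEMO_QK4_0 / 2,
              "wrong q4_0 block size/padding");
static_assert(sizeof(demo_block_q8_0) == sizeof(demo_half) + DEMO_QK8_0,
              "wrong q8_0 block size/padding");
static_assert(sizeof(demo_block_q2_K) ==
              DEMO_QK_K / 16 + DEMO_QK_K / 4 + sizeof(demo_half2),
              "wrong q2_K block size/padding");

/* Bits per weight = block size in bits / number of weights in the block. */
static double demo_bpw(size_t block_bytes, int weights_per_block) {
    return (double)(block_bytes * 8) / weights_per_block;
}
```

With these mirrors, demo_bpw gives 4.5 for the q4_0 layout (18 bytes per 32 weights), 8.5 for q8_0 (34 bytes per 32 weights), and 2.625 for q2_K (84 bytes per 256 weights).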

Key constants:

QK_K = 256 -- super-block size for K-quant types
K_SCALE_SIZE = 12 -- size of scale data in K-quant blocks
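
As a rough illustration of how these two constants size a super-block, the sketch below assumes a q4_K-style layout: two half-precision scales, K_SCALE_SIZE bytes of packed sub-block scales and mins, and 4-bit quants. That layout is an assumption not shown on this page, and the demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_QK_K         256
#define DEMO_K_SCALE_SIZE 12

typedef uint16_t demo_half;

/* Assumed q4_K-style layout (hypothetical mirror, see lead-in). */
typedef struct {
    demo_half d;                         /* super-block scale for the scales */
    demo_half dmin;                      /* super-block scale for the mins */
    uint8_t   scales[DEMO_K_SCALE_SIZE]; /* packed sub-block scales and mins */
    uint8_t   qs[DEMO_QK_K / 2];         /* 4-bit quants, two per byte */
} demo_block_q4_K;

static_assert(sizeof(demo_block_q4_K) ==
              2 * sizeof(demo_half) + DEMO_K_SCALE_SIZE + DEMO_QK_K / 2,
              "wrong q4_K-style block size/padding");

/* Bytes needed for one row of n_weights values.
 * Returns 0 when n_weights is not a whole number of super-blocks. */
static size_t demo_row_bytes(size_t n_weights) {
    if (n_weights % DEMO_QK_K != 0) {
        return 0;
    }
    return (n_weights / DEMO_QK_K) * sizeof(demo_block_q4_K);
}
```

Under these assumptions one super-block takes 144 bytes, so a row of 4096 weights splits into 16 super-blocks and occupies 2304 bytes.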

Usage

Include this header in any backend implementation that needs to read or write quantized tensor data. Define the appropriate GGML_COMMON_DECL_* macro before inclusion to select the correct half-precision type for the target platform.

Code Reference

Source Location

GGML repo, file: src/ggml-common.h, 1878 lines.

Signature

// Platform-specific type definitions
typedef uint16_t ggml_half;   // C/C++ (or half/sycl::half on GPU)
typedef uint32_t ggml_half2;

// Super-block size constants
#define QK_K 256
#define K_SCALE_SIZE 12

// Example quantization block structures
typedef struct {
    ggml_half d;            // delta (scale factor)
    uint8_t qs[QK4_0 / 2]; // nibbles / quants
} block_q4_0;

typedef struct {
    ggml_half d;    // delta
    int8_t qs[QK8_0]; // quants
} block_q8_0;

typedef struct {
    uint8_t scales[QK_K/16]; // per-sub-block scales and mins
    uint8_t qs[QK_K/4];      // quants, 2 bits each
    ggml_half2 dm;           // d and dmin packed
} block_q2_K;

Import

#define GGML_COMMON_DECL_C   // or _CPP, _METAL, _CUDA, _HIP, _SYCL
#include "ggml-common.h"

I/O Contract

Inputs

Parameter	Type	Required	Description
Preprocessor macro	define	Yes	One of `GGML_COMMON_DECL_C`, `GGML_COMMON_DECL_CPP`, `GGML_COMMON_DECL_METAL`, `GGML_COMMON_DECL_CUDA`, `GGML_COMMON_DECL_HIP`, or `GGML_COMMON_DECL_SYCL` must be defined before including the header.

Outputs

Output	Type	Description
Type definitions	C/C++ types	Platform-appropriate `ggml_half`, `ggml_half2`, and all `block_*` quantization structures.
Constants	preprocessor macros	`QK_K`, `K_SCALE_SIZE`, `QK4_0`, `QK8_0`, and GPU-specific `QR_`/`QI_` values.

Usage Examples

Including in a CUDA Backend

#define GGML_COMMON_DECL_CUDA
#include "ggml-common.h"

// block_q4_0, block_q8_0, etc. now use the CUDA half type
__global__ void dequantize_q4_0(const block_q4_0 * src, float * dst) {
    // Access src->d (ggml_half = half) and src->qs
}

Including in a C Source File

#define GGML_COMMON_DECL_C
#include "ggml-common.h"
#include "ggml.h" // declares ggml_fp16_to_fp32()

// ggml_half is uint16_t, block types use standard C types
// Dequantize one block into dst, which must hold QK8_0 floats
void process_q8_block(const block_q8_0 * block, float * dst) {
    float scale = ggml_fp16_to_fp32(block->d);
    for (int i = 0; i < QK8_0; i++) {
        dst[i] = block->qs[i] * scale;
    }
}
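
For the 4-bit case, a self-contained sketch of unpacking one block_q4_0-style block might look like the following. It is an illustration under stated assumptions, not the library's kernel: the demo_* names and the half-to-float helper are hypothetical, and the nibble ordering (low nibbles fill the first half of the output, high nibbles the second) and the offset of 8 reflect my understanding of the reference CPU path.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_QK4_0 32

typedef uint16_t demo_half;

/* Hypothetical local mirror of the block_q4_0 layout. */
typedef struct {
    demo_half d;                  /* scale, stored as IEEE 754 binary16 bits */
    uint8_t   qs[DEMO_QK4_0 / 2]; /* two 4-bit quants per byte */
} demo_block_q4_0;

static_assert(sizeof(demo_block_q4_0) == 18, "unexpected padding");

/* Portable binary16 -> float conversion (hypothetical helper). */
static float demo_half_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0 && man == 0) {
        bits = sign;                              /* signed zero */
    } else if (exp == 0) {
        exp = 113;                                /* subnormal: renormalize */
        while ((man & 0x400u) == 0) { man <<= 1; exp--; }
        bits = sign | (exp << 23) | ((man & 0x3FFu) << 13);
    } else if (exp == 0x1F) {
        bits = sign | 0x7F800000u | (man << 13);  /* infinity or NaN */
    } else {
        bits = sign | ((exp + 112) << 23) | (man << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Expand one block into DEMO_QK4_0 floats. */
static void demo_dequantize_q4_0(const demo_block_q4_0 * b, float * dst) {
    const float d = demo_half_to_float(b->d);
    for (int j = 0; j < DEMO_QK4_0 / 2; j++) {
        const int lo = (b->qs[j] & 0x0F) - 8; /* low nibble, recentred */
        const int hi = (b->qs[j] >> 4) - 8;   /* high nibble, recentred */
        dst[j]                  = lo * d;
        dst[j + DEMO_QK4_0 / 2] = hi * d;
    }
}
```

The caller supplies a destination buffer of DEMO_QK4_0 floats per block. A real backend would typically use the platform's native half conversion instead of the bit-level helper shown here.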
