Implementation:Ggml org Ggml Common quantization types

From Leeroopedia


Metadata

Field | Value
Page Type | Implementation (Shared Header)
Knowledge Sources | GGML
Domains | ML_Infrastructure, Tensor_Computing, Quantization
Last Updated | 2026-02-10 12:00 GMT

Overview

Shared header defining quantization block data structures and constants used identically across all GGML backends (CPU, CUDA, Metal, HIP, SYCL, CANN).

Description

ggml-common.h (1,878 lines) defines each quantized block layout once, so every compute backend reads and writes byte-identical data. This is essential for model portability -- a GGUF model file with quantized weights can be loaded and processed by any backend without data conversion, because all backends compile the same struct definitions.

The header uses a preprocessor-driven multi-target inclusion pattern. Before including this file, callers define one of:

  • GGML_COMMON_DECL_C -- for C source files
  • GGML_COMMON_DECL_CPP -- for C++ source files
  • GGML_COMMON_DECL_METAL -- for Metal shader files
  • GGML_COMMON_DECL_CUDA -- for CUDA kernels
  • GGML_COMMON_DECL_HIP -- for HIP (AMD ROCm) kernels
  • GGML_COMMON_DECL_SYCL -- for SYCL (Intel oneAPI) kernels

The selected macro maps ggml_half and ggml_half2 to the platform's half-precision types (for example uint16_t and uint32_t in C, half and half2 in CUDA) and configures the union/struct aggregation macros used inside the block definitions.
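The selection pattern described above can be sketched in a few lines of C. This is a minimal illustration, not the header's actual text: every DEMO_* and demo_* name is hypothetical, and the #error fallback is an assumption made for the sketch rather than documented behavior of ggml-common.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Caller side: choose one target before the declarations.
   DEMO_DECL_C is a hypothetical stand-in for GGML_COMMON_DECL_C. */
#define DEMO_DECL_C

/* Header side: map the chosen target to a 16-bit storage type. */
#if defined(DEMO_DECL_C) || defined(DEMO_DECL_CPP)
typedef uint16_t demo_half;   /* raw fp16 bits on the host */
typedef uint32_t demo_half2;  /* two packed fp16 values */
#elif defined(DEMO_DECL_CUDA)
#include <cuda_fp16.h>
typedef half  demo_half;
typedef half2 demo_half2;
#else
#error "define one DEMO_DECL_* macro first (assumption made for this sketch)"
#endif

#define DEMO_QK 32

/* Field order and sizes are identical for every target, so the bytes match. */
typedef struct {
    demo_half d;               /* scale */
    uint8_t   qs[DEMO_QK / 2]; /* two 4-bit quants per byte */
} demo_block_q4;

/* Compile-time layout checks, in the spirit of the header's own assertions. */
static_assert(sizeof(demo_half) == 2, "half storage must be 2 bytes");
static_assert(sizeof(demo_block_q4) == sizeof(demo_half) + DEMO_QK / 2,
              "unexpected padding in demo_block_q4");

/* Returns the block size in bytes (18 for this layout). */
size_t demo_block_size(void) { return sizeof(demo_block_q4); }
```

Because only the typedefs change between targets while each type keeps the same width, the struct size and field offsets stay fixed, which is what lets one file of quantized weights serve every backend.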

Quantization block types defined:

  • Basic types (QK=32): block_q4_0, block_q4_1, block_q5_0, block_q5_1, block_q8_0, block_q8_1, block_mxfp4
  • Ternary types (QK=256): block_tq1_0 (1.6875 bpw), block_tq2_0 (2.0625 bpw)
  • K-quant super-block types (QK_K=256): block_q2_K, block_q3_K, block_q4_K, block_q5_K, block_q6_K, and block_q8_K, ranging from 2.625 bits per weight (block_q2_K) up to 8-bit quants (block_q8_K)
  • IQ (importance matrix) types: block_iq1_s, block_iq1_m, block_iq2_xxs, block_iq2_xs, block_iq2_s, block_iq3_xxs, block_iq3_s, block_iq4_nl, block_iq4_xs

Each block type includes static_assert checks for correct size and padding. For GPU backends (CUDA, HIP, SYCL), the header additionally defines QR and QI constants used in dequantization kernels.
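The size checks and bits-per-weight figures can be reproduced with a short host-side sketch. The structs below are simplified local copies of the definitions shown in the Signature section (host typedefs only, and block_q2_K without its union), and bits_per_weight is a hypothetical helper introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side stand-ins for the platform-specific half types. */
typedef uint16_t ggml_half;
typedef uint32_t ggml_half2;

#define QK4_0 32
#define QK8_0 32
#define QK_K  256

typedef struct { ggml_half d; uint8_t qs[QK4_0 / 2]; } block_q4_0;
typedef struct { ggml_half d; int8_t  qs[QK8_0];     } block_q8_0;
typedef struct {
    uint8_t    scales[QK_K / 16];
    uint8_t    qs[QK_K / 4];
    ggml_half2 dm;
} block_q2_K;

/* Each check fails at compile time if the compiler inserts padding. */
static_assert(sizeof(block_q4_0) == sizeof(ggml_half) + QK4_0 / 2,
              "wrong q4_0 block size/padding");
static_assert(sizeof(block_q8_0) == sizeof(ggml_half) + QK8_0,
              "wrong q8_0 block size/padding");
static_assert(sizeof(block_q2_K) == QK_K / 16 + QK_K / 4 + sizeof(ggml_half2),
              "wrong q2_K block size/padding");

/* Hypothetical helper: storage cost of one weight, in bits. */
double bits_per_weight(size_t block_bytes, size_t weights_per_block) {
    return 8.0 * (double) block_bytes / (double) weights_per_block;
}
```

With these layouts, block_q4_0 occupies 18 bytes for 32 weights (4.5 bits per weight), block_q8_0 occupies 34 bytes (8.5 bits per weight), and block_q2_K occupies 84 bytes for 256 weights, which matches the 2.625 figure quoted above.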

Key constants:

  • QK_K = 256 -- number of weights in one K-quant super-block
  • K_SCALE_SIZE = 12 -- size in bytes of the packed scales array used by K-quant blocks such as block_q4_K

Usage

Include this header in any backend implementation that needs to read or write quantized tensor data. Define the appropriate GGML_COMMON_DECL_* macro before inclusion to select the correct half-precision type for the target platform.

Code Reference

Source Location

GGML repo, file: src/ggml-common.h, 1878 lines.

Signature

// Platform-specific type definitions
typedef uint16_t ggml_half;   // C/C++ (or half/sycl::half on GPU)
typedef uint32_t ggml_half2;

// Block size constants (weights per block)
#define QK4_0 32
#define QK8_0 32
#define QK_K 256        // K-quant super-block
#define K_SCALE_SIZE 12

// Example quantization block structures
typedef struct {
    ggml_half d;            // delta (scale factor)
    uint8_t qs[QK4_0 / 2]; // nibbles / quants
} block_q4_0;

typedef struct {
    ggml_half d;    // delta
    int8_t qs[QK8_0]; // quants
} block_q8_0;

typedef struct {
    uint8_t scales[QK_K/16]; // scales and mins, 4 bits each
    uint8_t qs[QK_K/4];      // 2-bit quants
    ggml_half2 dm;           // d and dmin packed (simplified view of the field)
} block_q2_K;
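To make the "nibbles / quants" comment on block_q4_0 concrete, the sketch below packs and unpacks one block on the host. It is an illustration under stated assumptions: the demo_* names are hypothetical, the scale is kept as a plain float instead of ggml_half to avoid an fp16 conversion routine, and the nibble order (element j in the low nibble of qs[j], element j + 16 in the high nibble, with an offset of 8) reflects my understanding of the reference CPU code rather than anything stated on this page.

```c
#include <assert.h>
#include <stdint.h>

#define QK4_0 32

/* Simplified stand-in: the real block stores d as ggml_half, not float. */
typedef struct {
    float   d;              /* scale */
    uint8_t qs[QK4_0 / 2];  /* two 4-bit quants per byte */
} demo_block_q4_0;

/* Pack 32 quant levels, each in [0, 15], into 16 bytes. */
void demo_pack_q4_0(const uint8_t q[QK4_0], float d, demo_block_q4_0 * out) {
    out->d = d;
    for (int j = 0; j < QK4_0 / 2; ++j) {
        assert(q[j] < 16 && q[j + QK4_0 / 2] < 16);
        out->qs[j] = (uint8_t) (q[j] | (q[j + QK4_0 / 2] << 4));
    }
}

/* Unpack to floats: y = (level - 8) * d. */
void demo_unpack_q4_0(const demo_block_q4_0 * b, float y[QK4_0]) {
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int lo = (b->qs[j] & 0x0F) - 8;  /* element j      */
        const int hi = (b->qs[j] >>   4) - 8;  /* element j + 16 */
        y[j]             = (float) lo * b->d;
        y[j + QK4_0 / 2] = (float) hi * b->d;
    }
}
```

Storing two quants per byte is what brings the payload down to QK4_0 / 2 bytes; the single scale d is shared by all 32 weights in the block.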

Import

#define GGML_COMMON_DECL_C   // or _CPP, _METAL, _CUDA, _HIP, _SYCL
#include "ggml-common.h"

I/O Contract

Inputs

Parameter | Type | Required | Description
Preprocessor macro | define | Yes | One of GGML_COMMON_DECL_C, GGML_COMMON_DECL_CPP, GGML_COMMON_DECL_METAL, GGML_COMMON_DECL_CUDA, GGML_COMMON_DECL_HIP, or GGML_COMMON_DECL_SYCL must be defined before including the header.

Outputs

Output | Type | Description
Type definitions | C/C++ types | Platform-appropriate ggml_half, ggml_half2, and all block_* quantization structures.
Constants | preprocessor macros | QK_K, K_SCALE_SIZE, QK4_0, QK8_0, and the GPU-specific QR*/QI* values (for example QR4_0 and QI4_0).

Usage Examples

Including in a CUDA Backend

#define GGML_COMMON_DECL_CUDA
#include "ggml-common.h"

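// block_q4_0, block_q8_0, etc. now store their scales as the CUDA half type.
// Sketch: launch with one CUDA block per quantized block and QK4_0/2 threads.
__global__ void dequantize_q4_0(const block_q4_0 * src, float * dst) {
    const block_q4_0 * b = src + blockIdx.x;
    const int   j = threadIdx.x;          // byte index within b->qs
    const float d = __half2float(b->d);   // ggml_half is half here
    float * out = dst + blockIdx.x * QK4_0;
    out[j]             = ((b->qs[j] & 0x0F) - 8) * d; // low nibble
    out[j + QK4_0 / 2] = ((b->qs[j] >>   4) - 8) * d; // high nibble
}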

Including in a C Source File

#define GGML_COMMON_DECL_C
#include "ggml-common.h"
#include "ggml.h"   // declares ggml_fp16_to_fp32

// ggml_half is uint16_t here, so the scale is converted to float before use.
// out must have room for QK8_0 floats.
void process_q8_block(const block_q8_0 * block, float * out) {
    const float scale = ggml_fp16_to_fp32(block->d);
    for (int i = 0; i < QK8_0; i++) {
        out[i] = block->qs[i] * scale;
    }
}
