
Implementation:Vllm project Vllm GGML Common

From Leeroopedia


Knowledge Sources
Domains: Quantization, GGUF
Last Updated: 2026-02-08 00:00 GMT

Overview

Defines GGUF/GGML quantization block data structures for 2-bit through 8-bit quantization formats, enabling compatibility with llama.cpp quantized models.

Description

This header file, originally adapted from llama.cpp, declares the quantization block structures used by the GGML/GGUF formats: the basic 32-element blocks (block_q4_0, block_q4_1, block_q5_0, block_q5_1, block_q8_0, block_q8_1), the K-quantization super-blocks (block_q2_K through block_q6_K), and various block_iq types. Each structure holds one or more scale values (typically a half delta, or a half2 that pairs a delta with a min) alongside the packed quantized weights, stored as nibbles, bit fields, or bytes. The file also defines constants such as QK_K (256 values per super-block), WARP_SIZE_GGUF (32), and the CUDA block sizes for the dequantization and quantization kernels (both 256).
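To make the layout concrete, the following host-side sketch mirrors four of the structures and checks their sizes and storage cost. It is not the header itself: half_bits and half2_bits are hypothetical stand-ins for CUDA's half (2 bytes) and half2 (4 bytes), bits_per_weight is a helper invented for illustration, and QK4_0 = QK8_0 = 32 is assumed from llama.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-ins (assumption): same sizes as CUDA's half and half2, no arithmetic.
typedef uint16_t half_bits;
typedef struct { uint16_t x, y; } half2_bits;

#define QK_K  256  // values per K-quant super-block
#define QK4_0 32   // values per block (llama.cpp value, assumed here)
#define QK8_0 32

typedef struct { half_bits d; uint8_t qs[QK4_0 / 2]; } block_q4_0;
typedef struct { half_bits d; int8_t  qs[QK8_0];     } block_q8_0;
typedef struct {
    half2_bits dm;
    uint8_t    scales[3 * QK_K / 64];
    uint8_t    qs[QK_K / 2];
} block_q4_K;
typedef struct {
    uint8_t   ql[QK_K / 2];
    uint8_t   qh[QK_K / 4];
    int8_t    scales[QK_K / 16];
    half_bits d;
} block_q6_K;

// Each block is exactly the sum of its members; no padding is expected.
static_assert(sizeof(block_q4_0) == 2 + 16, "unexpected padding in block_q4_0");
static_assert(sizeof(block_q8_0) == 2 + 32, "unexpected padding in block_q8_0");
static_assert(sizeof(block_q4_K) == 4 + 12 + 128, "unexpected padding in block_q4_K");
static_assert(sizeof(block_q6_K) == 128 + 64 + 16 + 2, "unexpected padding in block_q6_K");

// Storage cost of a format, including its scale overhead.
inline double bits_per_weight(std::size_t block_bytes, int weights_per_block) {
    assert(weights_per_block > 0);
    return 8.0 * static_cast<double>(block_bytes) / weights_per_block;
}
```

Under these assumptions, bits_per_weight(sizeof(block_q4_0), QK4_0) and bits_per_weight(sizeof(block_q4_K), QK_K) both come to 4.5, while block_q8_0 costs 8.5 and block_q6_K costs 6.5625 bits per weight. The fractional part is the per-block scale overhead.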

Usage

This header is included at compile time by GGUF dequantization CUDA kernels in the vLLM C++ source tree. It is used whenever vLLM loads and processes models stored in the GGUF quantized format, providing the memory layout definitions needed to correctly interpret quantized weight data.
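As an illustration of what "interpreting quantized weight data" means, the sketch below treats a raw byte buffer as a row of block_q8_0 structures on the host. The helper names (q8_0_row_bytes, load_q8_0_row, q8_0_quant_at) are hypothetical and do not appear in vLLM; the scale is kept as raw fp16 bits in a uint16_t because CUDA's half type is not available in plain C++, and QK8_0 = 32 is assumed from llama.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

#define QK8_0 32  // values per block (llama.cpp value, assumed here)

typedef struct {
    uint16_t d;          // raw fp16 bits of the delta (stand-in for half)
    int8_t   qs[QK8_0];  // quants
} block_q8_0;

// Bytes occupied by a row of n_weights values stored as Q8_0 blocks.
inline std::size_t q8_0_row_bytes(std::size_t n_weights) {
    assert(n_weights % QK8_0 == 0);  // rows must be a whole number of blocks
    return n_weights / QK8_0 * sizeof(block_q8_0);
}

// Copy raw bytes into typed blocks (memcpy avoids alignment and aliasing issues).
inline std::vector<block_q8_0> load_q8_0_row(const uint8_t* raw, std::size_t n_weights) {
    std::vector<block_q8_0> blocks(n_weights / QK8_0);
    std::memcpy(blocks.data(), raw, q8_0_row_bytes(n_weights));
    return blocks;
}

// Quantized (not yet scaled) value of weight i within the row.
inline int8_t q8_0_quant_at(const std::vector<block_q8_0>& blocks, std::size_t i) {
    return blocks[i / QK8_0].qs[i % QK8_0];
}
```

Weight i lives in block i / QK8_0 at position i % QK8_0. A CUDA kernel would typically do the same index arithmetic on a device pointer instead of copying.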

Code Reference

Source Location

Signature

#define QK_K 256
#define K_QUANTS_PER_ITERATION 2
#define WARP_SIZE_GGUF 32
#define CUDA_DEQUANTIZE_BLOCK_SIZE 256
#define CUDA_QUANTIZE_BLOCK_SIZE 256

typedef struct {
    half    d;              // delta
    uint8_t qs[QK4_0 / 2]; // nibbles / quants
} block_q4_0;

typedef struct {
    half2   dm;             // dm.x = delta, dm.y = min
    uint8_t qs[QK4_1 / 2]; // nibbles / quants
} block_q4_1;

typedef struct {
    half    d;              // delta
    int8_t  qs[QK8_0];     // quants
} block_q8_0;

typedef struct {
    uint8_t scales[QK_K/16]; // scales and mins, quantized with 4 bits
    uint8_t qs[QK_K/4];      // quants
    half2 dm;                // super-block scale for quantized scales/mins
} block_q2_K;

typedef struct {
    half2 dm;                  // super-block scale
    uint8_t scales[3*QK_K/64]; // scales, quantized with 6 bits
    uint8_t qs[QK_K/2];        // 4-bit quants
} block_q4_K;

typedef struct {
    uint8_t ql[QK_K/2];   // quants, lower 4 bits
    uint8_t qh[QK_K/4];   // quants, upper 2 bits
    int8_t  scales[QK_K/16]; // scales
    half    d;             // delta
} block_q6_K;
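The field comments above imply slightly different reconstruction rules per format. The sketch below spells out three of them as scalar helpers, following the llama.cpp reference formulas as far as I know them. The function names are hypothetical, scales are passed as float on the assumption that the caller has already converted them from half, and the byte-to-element index mapping inside a block_q6_K is deliberately not shown.

```cpp
#include <cassert>
#include <cstdint>

// Q8_0: symmetric, weight = quant * delta.
inline float dequant_q8_0(int8_t q, float d) {
    return q * d;
}

// Q4_1: affine, weight = quant * delta + min (dm.x = delta, dm.y = min).
inline float dequant_q4_1(uint8_t q, float d, float m) {
    assert(q < 16);  // a 4-bit quant
    return q * d + m;
}

// Q6_K: rebuild a 6-bit quant from its lower 4 bits (from ql) and upper
// 2 bits (from qh), then recenter it from [0, 63] to [-32, 31].
inline int combine_q6(uint8_t low4, uint8_t high2) {
    assert(low4 < 16 && high2 < 4);
    return static_cast<int>(low4 | (high2 << 4)) - 32;
}

// Q6_K: weight = super-block delta * sub-block scale * recentered quant.
inline float dequant_q6_K(int q, int8_t scale, float d) {
    return d * scale * q;
}
```

The Q4_1 form spends an extra half per block on the min so that the 16 quantization levels can cover an asymmetric range, which is why block_q4_1 stores a half2 where block_q4_0 stores a single half.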

Import

#include "ggml-common.h"

I/O Contract

Inputs

Name    Type    Required    Description
N/A     N/A     N/A         This is a header-only file defining data structures; it does not accept runtime inputs directly.

Outputs

Name             Type      Description
block_q4_0       struct    4-bit quantization block with 32-element groups (delta + nibbles)
block_q4_1       struct    4-bit quantization block with delta and min values
block_q5_0       struct    5-bit quantization block with 32-element groups
block_q8_0       struct    8-bit quantization block with delta scaling
block_q2_K       struct    2-bit K-quantization super-block (256 elements)
block_q3_K       struct    3-bit K-quantization super-block
block_q4_K       struct    4-bit K-quantization super-block
block_q5_K       struct    5-bit K-quantization super-block
block_q6_K       struct    6-bit K-quantization super-block
block_iq2_xxs    struct    IQ 2-bit extra-extra-small quantization block
block_iq3_s      struct    IQ 3-bit small quantization block

Usage Examples

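// Dequantize two weights from a block_q4_0 structure (CUDA device code).
// `blocks` is a hypothetical pointer to block_q4_0 data already resident on the GPU.
const block_q4_0& block = blocks[0];
const float delta = __half2float(block.d);  // widen the fp16 scale to float
// Each byte in qs holds two 4-bit quantized values
const uint8_t nibble_pair = block.qs[0];
const uint8_t lo = nibble_pair & 0x0F;
const uint8_t hi = nibble_pair >> 4;
// Dequantize: value = (quant - 8) * delta
const float w_lo = (lo - 8) * delta;  // weight 0 of the block
const float w_hi = (hi - 8) * delta;  // weight QK4_0 / 2 of the block (llama.cpp layout)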
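For readers without a CUDA toolchain, the same rule can be applied to a whole block on the host. This is a minimal sketch rather than the vLLM kernel: the scale is stored as raw fp16 bits in a uint16_t, fp16_bits_to_float and dequantize_block_q4_0 are hypothetical helper names, QK4_0 = 32 is assumed, and the output ordering (low nibbles fill the first half of the block, high nibbles the second) follows the llama.cpp reference layout.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

#define QK4_0 32  // values per block (llama.cpp value, assumed here)

typedef struct {
    uint16_t d;              // raw fp16 bits of the delta (stand-in for half)
    uint8_t  qs[QK4_0 / 2];  // nibbles / quants
} block_q4_0;

// Decode IEEE 754 half-precision bits into a float.
inline float fp16_bits_to_float(uint16_t h) {
    const int sign = (h >> 15) & 0x1;
    const int exp  = (h >> 10) & 0x1F;
    const int mant = h & 0x3FF;
    float value;
    if (exp == 0) {
        value = std::ldexp(static_cast<float>(mant), -24);             // zero or subnormal
    } else if (exp == 31) {
        value = mant ? NAN : INFINITY;                                 // NaN or infinity
    } else {
        value = std::ldexp(static_cast<float>(mant + 1024), exp - 25); // normal number
    }
    return sign ? -value : value;
}

// Expand one block into QK4_0 floats: value = (quant - 8) * delta.
inline void dequantize_block_q4_0(const block_q4_0& block, float* out) {
    assert(out != nullptr);
    const float delta = fp16_bits_to_float(block.d);
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int lo = block.qs[j] & 0x0F;  // weight j
        const int hi = block.qs[j] >> 4;    // weight j + QK4_0 / 2
        out[j]             = (lo - 8) * delta;
        out[j + QK4_0 / 2] = (hi - 8) * delta;
    }
}
```

Subtracting 8 recenters the unsigned nibble range [0, 15] to [-8, 7], so a single scale per block covers both positive and negative weights.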
