Implementation:Vllm project Vllm GGML Common

Knowledge Sources	vllm
Domains	Quantization, GGUF
Last Updated	2026-02-08 00:00 GMT

Overview

Defines GGUF/GGML quantization block data structures for 2-bit through 8-bit quantization formats, enabling compatibility with llama.cpp quantized models.

Description

This header file, adapted from llama.cpp, declares the quantization block structures used by GGUF models: the legacy formats (block_q4_0, block_q4_1, block_q5_0, block_q5_1, block_q8_0, block_q8_1), the K-quantization super-blocks (block_q2_K through block_q6_K), and various block_iq types. Each structure holds one or more scale values (typically a half delta, or a half2 that pairs the delta with a second value such as a minimum) alongside the packed quantized weights, stored as nibbles, bit fields, or whole bytes. The file also defines constants such as QK_K (256 values per super-block), WARP_SIZE_GGUF (32), and the CUDA block sizes used by the dequantization and quantization kernels.
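
The difference between a half scale and a half2 scale pair can be illustrated with a small host-side sketch. It contrasts the symmetric rule (one delta, as in block_q4_0) with the affine rule (delta plus minimum, as the `dm.x = delta, dm.y = min` comment on block_q4_1 suggests). The function names are hypothetical and do not appear in ggml-common.h, and the scales are passed as float rather than half to keep the example runnable without CUDA.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: symmetric 4-bit reconstruction (block_q4_0 style).
// The stored quant q is in [0, 15] and is re-centred by 8 before scaling.
inline float dequant_symmetric_4bit(uint8_t q, float d) {
    assert(q < 16);
    return (static_cast<int>(q) - 8) * d;
}

// Hypothetical helper: affine 4-bit reconstruction (block_q4_1 style).
// d plays the role of dm.x (delta) and m the role of dm.y (block minimum).
inline float dequant_affine_4bit(uint8_t q, float d, float m) {
    assert(q < 16);
    return static_cast<int>(q) * d + m;
}
```

With d = 0.5, the symmetric rule spans -4.0 to 3.5, while the affine rule with m = -1.0 spans -1.0 to 6.5, which is why the affine formats can represent blocks whose values are not centred on zero.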

Usage

This header is included at compile time by GGUF dequantization CUDA kernels in the vLLM C++ source tree. It is used whenever vLLM loads and processes models stored in the GGUF quantized format, providing the memory layout definitions needed to correctly interpret quantized weight data.
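
As a rough illustration of what "interpreting the memory layout" means, the sketch below computes how many bytes one row of weights occupies when it is stored as consecutive fixed-size blocks, and where the block holding a given column starts. Both helpers are hypothetical and are not taken from vLLM. The byte counts used in the usage note (18 for block_q4_0, 144 for block_q4_K) are obtained by summing the field sizes in the signatures, assuming QK4_0 is 32 and that the compiler adds no padding.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: bytes occupied by one row of n_cols weights stored as
// back-to-back quantization blocks of a single type.
inline std::size_t quantized_row_bytes(std::size_t n_cols,
                                       std::size_t weights_per_block,
                                       std::size_t bytes_per_block) {
    assert(weights_per_block > 0);
    assert(n_cols % weights_per_block == 0);  // a row must be a whole number of blocks
    return (n_cols / weights_per_block) * bytes_per_block;
}

// Hypothetical helper: byte offset, within a row, of the block containing `col`.
inline std::size_t block_byte_offset(std::size_t col,
                                     std::size_t weights_per_block,
                                     std::size_t bytes_per_block) {
    assert(weights_per_block > 0);
    return (col / weights_per_block) * bytes_per_block;
}
```

Under these assumptions, a 4096-column row takes 128 blocks of 18 bytes in Q4_0 and 16 super-blocks of 144 bytes in Q4_K, which is 2304 bytes in both cases.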

Code Reference

Source Location

Repository: vllm
File: csrc/quantization/gguf/ggml-common.h
Lines: 1-1150

Signature

#define QK_K 256
#define K_QUANTS_PER_ITERATION 2
#define WARP_SIZE_GGUF 32
#define CUDA_DEQUANTIZE_BLOCK_SIZE 256
#define CUDA_QUANTIZE_BLOCK_SIZE 256

typedef struct {
    half    d;              // delta
    uint8_t qs[QK4_0 / 2]; // nibbles / quants
} block_q4_0;

typedef struct {
    half2   dm;             // dm.x = delta, dm.y = min
    uint8_t qs[QK4_1 / 2]; // nibbles / quants
} block_q4_1;

typedef struct {
    half    d;              // delta
    int8_t  qs[QK8_0];     // quants
} block_q8_0;

typedef struct {
    uint8_t scales[QK_K/16]; // scales and mins, quantized with 4 bits
    uint8_t qs[QK_K/4];      // quants
    half2 dm;                // super-block scale for quantized scales/mins
} block_q2_K;

typedef struct {
    half2 dm;                  // super-block scale
    uint8_t scales[3*QK_K/64]; // scales, quantized with 6 bits
    uint8_t qs[QK_K/2];        // 4-bit quants
} block_q4_K;

typedef struct {
    uint8_t ql[QK_K/2];   // quants, lower 4 bits
    uint8_t qh[QK_K/4];   // quants, upper 2 bits
    int8_t  scales[QK_K/16]; // scales
    half    d;             // delta
} block_q6_K;
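
The super-block sizes implied by these declarations can be checked on the host. The sketch below mirrors three of the K-quant structures with plain integer stand-ins for CUDA's half and half2 (an assumption: 2 and 4 bytes respectively; the real half2 is 4-byte aligned, which does not change these totals). The `_host` type names and the helper function are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int QK_K = 256;  // values per super-block, as defined in the header

// Host stand-ins for CUDA half / half2 (assumed 2 and 4 bytes).
typedef uint16_t half_bits;
typedef struct { half_bits x, y; } half2_bits;

typedef struct {
    uint8_t    scales[QK_K / 16];  // 16 bytes
    uint8_t    qs[QK_K / 4];       // 64 bytes
    half2_bits dm;                 // 4 bytes
} block_q2_K_host;

typedef struct {
    half2_bits dm;                     // 4 bytes
    uint8_t    scales[3 * QK_K / 64];  // 12 bytes
    uint8_t    qs[QK_K / 2];           // 128 bytes
} block_q4_K_host;

typedef struct {
    uint8_t   ql[QK_K / 2];       // 128 bytes
    uint8_t   qh[QK_K / 4];       // 64 bytes
    int8_t    scales[QK_K / 16];  // 16 bytes
    half_bits d;                  // 2 bytes
} block_q6_K_host;

static_assert(sizeof(block_q2_K_host) == 84,  "unexpected padding in block_q2_K mirror");
static_assert(sizeof(block_q4_K_host) == 144, "unexpected padding in block_q4_K mirror");
static_assert(sizeof(block_q6_K_host) == 210, "unexpected padding in block_q6_K mirror");

// Hypothetical helper: effective storage cost per weight, scales included.
inline double superblock_bits_per_weight(std::size_t block_bytes) {
    assert(block_bytes > 0);
    return 8.0 * static_cast<double>(block_bytes) / QK_K;
}
```

Under these assumptions the effective cost is 2.625 bits per weight for block_q2_K, 4.5 for block_q4_K, and 6.5625 for block_q6_K; the overhead above the nominal bit width comes from the per-sub-block scales and the super-block scale.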

Import

#include "ggml-common.h"

I/O Contract

Inputs

Name	Type	Required	Description
N/A	N/A	N/A	This is a header-only file defining data structures; it does not accept runtime inputs directly.

Outputs

Name	Type	Description
block_q4_0	struct	4-bit quantization block with 32-element groups (delta + nibbles)
block_q4_1	struct	4-bit quantization block with delta and min values
block_q5_0	struct	5-bit quantization block with 32-element groups
block_q8_0	struct	8-bit quantization block with delta scaling
block_q2_K	struct	2-bit K-quantization super-block (256 elements)
block_q3_K	struct	3-bit K-quantization super-block
block_q4_K	struct	4-bit K-quantization super-block
block_q5_K	struct	5-bit K-quantization super-block
block_q6_K	struct	6-bit K-quantization super-block
block_iq2_xxs	struct	IQ 2-bit extra-extra-small quantization block
block_iq3_s	struct	IQ 3-bit small quantization block

Usage Examples

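// Decode the two weights packed into byte j of a block_q4_0 (device code).
// decode_q4_0_byte is a hypothetical helper name, not a function from the header.
__device__ void decode_q4_0_byte(const block_q4_0* block, int j, float* out) {
    const float delta = __half2float(block->d);
    // Each byte in qs holds two 4-bit quantized values
    const uint8_t nibble_pair = block->qs[j];
    const int lo = nibble_pair & 0x0F;
    const int hi = nibble_pair >> 4;
    // Dequantize: value = (quant - 8) * delta
    // The low nibble is element j; the high nibble is element j + QK4_0 / 2
    out[j]             = (lo - 8) * delta;
    out[j + QK4_0 / 2] = (hi - 8) * delta;
}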
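
For experimentation without a GPU, the same decoding can be sketched on the host for a whole block. This is a minimal sketch, not vLLM's kernel: the `block_q4_0_host` type, `fp16_bits_to_float`, and `dequantize_block_q4_0` are hypothetical names, the delta is stored as a raw 16-bit pattern in place of CUDA's half, QK4_0 is assumed to be 32, and the element ordering (low nibbles first, high nibbles in the second half) follows the llama.cpp reference dequantizer as commonly described.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

constexpr int QK4_0 = 32;  // assumed group size for block_q4_0

// Host mirror of block_q4_0 with the fp16 delta kept as raw bits.
typedef struct {
    uint16_t d;               // fp16 bit pattern of the delta
    uint8_t  qs[QK4_0 / 2];   // two 4-bit quants per byte
} block_q4_0_host;

// Convert an IEEE 754 binary16 bit pattern to float.
inline float fp16_bits_to_float(uint16_t h) {
    const uint32_t sign = (h >> 15) & 0x1u;
    const uint32_t exp  = (h >> 10) & 0x1Fu;
    const uint32_t frac = h & 0x3FFu;
    float value;
    if (exp == 0) {
        // Zero or subnormal: frac * 2^-24
        value = std::ldexp(static_cast<float>(frac), -24);
    } else if (exp == 31) {
        value = frac ? std::numeric_limits<float>::quiet_NaN()
                     : std::numeric_limits<float>::infinity();
    } else {
        // Normal: (1024 + frac) * 2^(exp - 25) == (1 + frac/1024) * 2^(exp - 15)
        value = std::ldexp(static_cast<float>(frac + 1024u), static_cast<int>(exp) - 25);
    }
    return sign ? -value : value;
}

// Expand one block into QK4_0 floats; out must have room for QK4_0 values.
inline void dequantize_block_q4_0(const block_q4_0_host& block, float* out) {
    assert(out != nullptr);
    const float d = fp16_bits_to_float(block.d);
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int lo = block.qs[j] & 0x0F;
        const int hi = block.qs[j] >> 4;
        out[j]             = (lo - 8) * d;  // low nibbles fill elements 0..15
        out[j + QK4_0 / 2] = (hi - 8) * d;  // high nibbles fill elements 16..31
    }
}
```

For example, a block whose delta has the bit pattern 0x3800 (0.5) and whose first byte is 0xF0 decodes element 0 to -4.0 and element 16 to 3.5.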

Related Pages

Environment:Vllm_project_Vllm_CUDA_GPU_Runtime
