Implementation:Vllm project Vllm GGML Common
| Knowledge Sources | |
|---|---|
| Domains | Quantization, GGUF |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Defines GGUF/GGML quantization block data structures for 2-bit through 8-bit quantization formats, enabling compatibility with llama.cpp quantized models.
Description
This header file, adapted from llama.cpp, declares the quantization block structures used by GGUF models: the basic 32-element block formats (block_q4_0, block_q4_1, block_q5_0, block_q5_1, block_q8_0, block_q8_1), the K-quantization super-blocks (block_q2_K through block_q6_K), and various block_iq types. Each structure contains a delta/scale value (typically a half or half2) and the quantized weights, packed as nibbles or bytes. The file also defines constants such as QK_K (256 values per super-block), WARP_SIZE_GGUF (32), and the CUDA block sizes for the dequantization and quantization kernels (256 each).
Usage
This header is included at compile time by GGUF dequantization CUDA kernels in the vLLM C++ source tree. It is used whenever vLLM loads and processes models stored in the GGUF quantized format, providing the memory layout definitions needed to correctly interpret quantized weight data.
Code Reference
Source Location
- Repository: vllm
- File: csrc/quantization/gguf/ggml-common.h
- Lines: 1-1150
Signature
#define QK_K 256
#define K_QUANTS_PER_ITERATION 2
#define WARP_SIZE_GGUF 32
#define CUDA_DEQUANTIZE_BLOCK_SIZE 256
#define CUDA_QUANTIZE_BLOCK_SIZE 256
typedef struct {
half d; // delta
uint8_t qs[QK4_0 / 2]; // nibbles / quants
} block_q4_0;
typedef struct {
half2 dm; // dm.x = delta, dm.y = min
uint8_t qs[QK4_1 / 2]; // nibbles / quants
} block_q4_1;
typedef struct {
half d; // delta
int8_t qs[QK8_0]; // quants
} block_q8_0;
typedef struct {
uint8_t scales[QK_K/16]; // scales and mins, quantized with 4 bits
uint8_t qs[QK_K/4]; // quants
half2 dm; // super-block scale for quantized scales/mins
} block_q2_K;
typedef struct {
half2 dm; // super-block scale
uint8_t scales[3*QK_K/64]; // scales, quantized with 6 bits
uint8_t qs[QK_K/2]; // 4-bit quants
} block_q4_K;
typedef struct {
uint8_t ql[QK_K/2]; // quants, lower 4 bits
uint8_t qh[QK_K/4]; // quants, upper 2 bits
int8_t scales[QK_K/16]; // scales
half d; // delta
} block_q6_K;
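The storage cost implied by these layouts can be checked with a small host-side sketch. It is not part of the header: the `host_block_*` structs and `bits_per_weight` are hypothetical names, `uint16_t` stands in for the 16-bit CUDA `half` (two of them for `half2`), and the 32-element values for `QK4_0`, `QK4_1`, and `QK8_0` are taken from upstream llama.cpp rather than from the excerpt above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Elements per block. QK_K comes from the excerpt above; the three 32-element
// constants are assumed to match upstream llama.cpp.
constexpr int QK4_0 = 32;
constexpr int QK4_1 = 32;
constexpr int QK8_0 = 32;
constexpr int QK_K  = 256;

// Hypothetical host-side mirrors: uint16_t replaces `half`, uint16_t[2] replaces `half2`.
struct host_block_q4_0 { uint16_t d;     uint8_t qs[QK4_0 / 2]; };
struct host_block_q4_1 { uint16_t dm[2]; uint8_t qs[QK4_1 / 2]; };
struct host_block_q8_0 { uint16_t d;     int8_t  qs[QK8_0]; };
struct host_block_q2_K { uint8_t scales[QK_K / 16]; uint8_t qs[QK_K / 4]; uint16_t dm[2]; };
struct host_block_q4_K { uint16_t dm[2]; uint8_t scales[3 * QK_K / 64]; uint8_t qs[QK_K / 2]; };
struct host_block_q6_K { uint8_t ql[QK_K / 2]; uint8_t qh[QK_K / 4]; int8_t scales[QK_K / 16]; uint16_t d; };

// The layouts contain no padding, so each size is the sum of its fields.
static_assert(sizeof(host_block_q4_0) == 18,  "2 + 16 bytes");
static_assert(sizeof(host_block_q4_1) == 20,  "4 + 16 bytes");
static_assert(sizeof(host_block_q8_0) == 34,  "2 + 32 bytes");
static_assert(sizeof(host_block_q2_K) == 84,  "16 + 64 + 4 bytes");
static_assert(sizeof(host_block_q4_K) == 144, "4 + 12 + 128 bytes");
static_assert(sizeof(host_block_q6_K) == 210, "128 + 64 + 16 + 2 bytes");

// Effective storage cost of a format, including its scale metadata.
double bits_per_weight(std::size_t block_bytes, int weights_per_block) {
    assert(weights_per_block > 0);
    return 8.0 * static_cast<double>(block_bytes) / weights_per_block;
}
```

Under these assumptions, `bits_per_weight(sizeof(host_block_q4_0), QK4_0)` evaluates to 4.5, which shows that a nominally 4-bit format costs an extra half bit per weight for its per-block delta.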
Import
#include "ggml-common.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| N/A | N/A | N/A | This is a header-only file defining data structures; it does not accept runtime inputs directly. |
Outputs
| Name | Type | Description |
|---|---|---|
| block_q4_0 | struct | 4-bit quantization block with 32-element groups (delta + nibbles) |
| block_q4_1 | struct | 4-bit quantization block with delta and min values |
| block_q5_0 | struct | 5-bit quantization block with 32-element groups |
| block_q8_0 | struct | 8-bit quantization block with delta scaling |
| block_q2_K | struct | 2-bit K-quantization super-block (256 elements) |
| block_q3_K | struct | 3-bit K-quantization super-block |
| block_q4_K | struct | 4-bit K-quantization super-block |
| block_q5_K | struct | 5-bit K-quantization super-block |
| block_q6_K | struct | 6-bit K-quantization super-block |
| block_iq2_xxs | struct | IQ 2-bit extra-extra-small quantization block |
| block_iq3_s | struct | IQ 3-bit small quantization block |
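The difference between the delta-only rows and the delta-plus-min row in the table can be sketched per value. The function names below are hypothetical and operate on scalars that have already been unpacked and converted to `float`; they illustrate the arithmetic of the basic formats, not the kernels vLLM actually ships.

```cpp
#include <cassert>
#include <cstdint>

// block_q4_0: the stored quant q lies in [0, 15]; subtracting 8 re-centres it
// around zero before scaling by the block delta.
inline float dequant_q4_0_value(uint8_t q, float delta) {
    assert(q < 16);
    return static_cast<float>(static_cast<int>(q) - 8) * delta;
}

// block_q4_1: the stored quant is scaled by delta and shifted by the block
// minimum (dm.x = delta, dm.y = min), so the range need not be symmetric.
inline float dequant_q4_1_value(uint8_t q, float delta, float min) {
    assert(q < 16);
    return static_cast<float>(q) * delta + min;
}

// block_q8_0: quants are already signed 8-bit values, so only the delta applies.
inline float dequant_q8_0_value(int8_t q, float delta) {
    return static_cast<float>(q) * delta;
}
```

Storing a minimum costs two extra bytes per block in block_q4_1, but lets a block whose weights are all positive (or all negative) use the full 16 quantization levels.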
Usage Examples
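// Dequantize the two values packed into byte j of a block_q4_0.
// `block` must point to weight data already loaded from a GGUF file;
// dequantize_q4_0_pair is an illustrative helper, not a function declared in the header.
__device__ void dequantize_q4_0_pair(const block_q4_0* block, int j, float& v_lo, float& v_hi) {
    const float delta = __half2float(block->d);
    // Each byte in qs holds two 4-bit quantized values
    const uint8_t nibble_pair = block->qs[j];
    const int lo = nibble_pair & 0x0F;        // element j
    const int hi = (nibble_pair >> 4) & 0x0F; // element j + QK4_0 / 2
    // Dequantize: value = (quant - 8) * delta
    v_lo = (lo - 8) * delta;
    v_hi = (hi - 8) * delta;
}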
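Extending the per-byte steps above to a whole block gives the following host-side sketch. It is an illustration under stated assumptions, not vLLM's kernel: `host_block_q4_0`, `fp16_bits_to_float`, and `dequantize_block_q4_0_host` are hypothetical names, `uint16_t` holds the raw bits of the CUDA `half`, and the element ordering (low nibbles fill the first half of the block, high nibbles the second) follows the reference dequantization in llama.cpp.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int QK4_0 = 32;  // elements per block, as in upstream llama.cpp

// Hypothetical host mirror of block_q4_0: `d` holds the raw IEEE fp16 bits.
struct host_block_q4_0 {
    uint16_t d;
    uint8_t  qs[QK4_0 / 2];
};

// Decode IEEE 754 binary16 bits into a float (stand-in for __half2float).
float fp16_bits_to_float(uint16_t h) {
    const uint32_t sign = (h >> 15) & 0x1u;
    const uint32_t exp  = (h >> 10) & 0x1Fu;
    const uint32_t man  = h & 0x3FFu;
    float v;
    if (exp == 0) {
        v = std::ldexp(static_cast<float>(man), -24);  // zero or subnormal
    } else if (exp == 31) {
        v = man ? NAN : INFINITY;                      // NaN or infinity
    } else {
        v = std::ldexp(static_cast<float>(man | 0x400u), static_cast<int>(exp) - 25);
    }
    return sign ? -v : v;
}

// Writes QK4_0 floats to `out`. Byte j supplies element j (low nibble) and
// element j + QK4_0 / 2 (high nibble).
void dequantize_block_q4_0_host(const host_block_q4_0& block, float* out) {
    assert(out != nullptr);
    const float delta = fp16_bits_to_float(block.d);
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int lo = block.qs[j] & 0x0F;
        const int hi = block.qs[j] >> 4;
        out[j]             = static_cast<float>(lo - 8) * delta;
        out[j + QK4_0 / 2] = static_cast<float>(hi - 8) * delta;
    }
}
```

A caller would pass a block read from a GGUF tensor and a 32-element output buffer. A device kernel would typically spread the loop across threads instead of running it serially.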