Implementation:Sgl project Sglang GGML Common
| Knowledge Sources | |
|---|---|
| Domains | Quantization, GGUF_Format, Data_Structures |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Foundation header defining data structures and constants for GGUF/GGML quantization formats, enabling SGLang to load and dequantize llama.cpp-compatible quantized models.
Description
The ggml-common.h header, adapted from vLLM and originally from llama.cpp, defines the block structures and constants used across all GGML quantization types. The file establishes:
Global constants:
- QK_K = 256 -- super-block size for K-quant types
- K_QUANTS_PER_ITERATION = 2 -- quants processed per iteration
- WARP_SIZE_GGUF = 32 -- warp size for CUDA kernels
- CUDA_DEQUANTIZE_BLOCK_SIZE = 256 and CUDA_QUANTIZE_BLOCK_SIZE = 256 -- CUDA thread block sizes
Basic quantization block structures:
- block_q4_0 / block_q4_1 -- 4-bit quantization with 32-element blocks. q4_0 stores a delta (half) and nibble-packed quants; q4_1 stores both a delta and a min in a single half2 (dm) alongside the nibble-packed quants
- block_q5_0 / block_q5_1 -- 5-bit quantization adding a high-bit array
- block_q8_0 / block_q8_1 -- 8-bit quantization with per-block scale factors
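To make the role of the high-bit array concrete, the host-side C++ sketch below rebuilds the 32 signed quants of a single q5_0 block. It assumes the upstream llama.cpp layout as commonly documented: a 4-byte `qh` field with one high bit per element, element `j` in the low nibble of `qs[j]`, element `j + 16` in the high nibble, and an implicit offset of 16. The names `block_q5_0_host` and `unpack_q5_0`, and the `uint16_t` stand-in for `half`, are illustrative and not part of the header.

```cpp
#include <cassert>
#include <cstdint>

constexpr int QK5_0 = 32;  // elements per block

// Host-only mirror of the block layout. Assumption: CUDA `half` is replaced by
// its raw 16-bit pattern so the sketch compiles without CUDA headers.
struct block_q5_0_host {
    uint16_t d;              // fp16 bits of the per-block scale (delta)
    uint8_t  qh[4];          // one high (5th) bit per element, 32 bits in total
    uint8_t  qs[QK5_0 / 2];  // low 4 bits, two elements per byte
};
static_assert(sizeof(block_q5_0_host) == 22, "layout should have no padding");

// Hypothetical helper: writes the signed quants (range -16..15) to out[0..31].
// The dequantized value of element k would be out[k] times the fp16 scale in b.d.
inline void unpack_q5_0(const block_q5_0_host& b, int8_t out[QK5_0]) {
    assert(out != nullptr);

    // Assemble the 32 high bits in little-endian byte order, independent of host endianness.
    const uint32_t qh = static_cast<uint32_t>(b.qh[0])
                      | (static_cast<uint32_t>(b.qh[1]) << 8)
                      | (static_cast<uint32_t>(b.qh[2]) << 16)
                      | (static_cast<uint32_t>(b.qh[3]) << 24);

    for (int j = 0; j < QK5_0 / 2; ++j) {
        // Bit j is the high bit of element j; bit j + 16 belongs to element j + 16.
        const int hi0 = static_cast<int>((qh >> j) & 1u) << 4;
        const int hi1 = static_cast<int>((qh >> (j + 16)) & 1u) << 4;
        out[j]             = static_cast<int8_t>(((b.qs[j] & 0x0F) | hi0) - 16);
        out[j + QK5_0 / 2] = static_cast<int8_t>(((b.qs[j] >> 4)   | hi1) - 16);
    }
}
```

The same pattern applies to q4_0 if the high bit is dropped and the offset becomes 8.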
K-quant structures (QK_K=256 super-blocks):
- block_q2_K -- 2-bit quantization with 4-bit quantized scales/mins and half2 super-block scale
- block_q3_K -- 3-bit quantization with high-bit mask, low 2-bit quants, and packed scales
- block_q4_K -- 4-bit K-quant with a half2 super-block scale/min and 6-bit packed sub-block scales and mins
- block_q5_K -- 5-bit K-quant with high-bit masks
- block_q6_K -- 6-bit quantization with separate high/low bit fields and int8 scales
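As an illustration of how the separate low-bit and high-bit fields recombine, the sketch below unpacks one q6_K super-block on the host. It assumes the interleaving used by the upstream llama.cpp reference dequantizer (two halves of 128 elements, each drawing on 64 bytes of `ql`, 32 bytes of `qh`, and 8 scales) and an implicit offset of 32. `block_q6_K_host` and `unpack_q6_K` are hypothetical names, and `half` is replaced by raw fp16 bits.

```cpp
#include <cassert>
#include <cstdint>

constexpr int QK_K = 256;  // super-block size, as defined in the header

// Host-only mirror of block_q6_K. Assumption: `half` is stored as raw fp16 bits.
struct block_q6_K_host {
    uint8_t  ql[QK_K / 2];       // low 4 bits of each quant
    uint8_t  qh[QK_K / 4];       // high 2 bits of each quant
    int8_t   scales[QK_K / 16];  // one int8 scale per 16 elements
    uint16_t d;                  // fp16 bits of the super-block scale
};
static_assert(sizeof(block_q6_K_host) == 210, "layout should have no padding");

// Hypothetical helper: out[k] = scale_of_k * (quant_k - 32) for all 256 elements.
// Multiplying out[k] by the fp16 super-block scale `d` would give the final weight.
inline void unpack_q6_K(const block_q6_K_host& b, int out[QK_K]) {
    assert(out != nullptr);
    const uint8_t* ql = b.ql;
    const uint8_t* qh = b.qh;
    const int8_t*  sc = b.scales;

    for (int n = 0; n < QK_K; n += 128) {  // two halves of 128 elements
        for (int l = 0; l < 32; ++l) {
            const int is = l / 16;  // selects the 16-element group inside each run of 32
            // Each qh byte carries the high 2 bits of four quants, 32 elements apart.
            const int q1 = ((ql[l]      & 0x0F) | (((qh[l] >> 0) & 3) << 4)) - 32;
            const int q2 = ((ql[l + 32] & 0x0F) | (((qh[l] >> 2) & 3) << 4)) - 32;
            const int q3 = ((ql[l]      >> 4)   | (((qh[l] >> 4) & 3) << 4)) - 32;
            const int q4 = ((ql[l + 32] >> 4)   | (((qh[l] >> 6) & 3) << 4)) - 32;
            out[n + l]      = sc[is + 0] * q1;
            out[n + l + 32] = sc[is + 2] * q2;
            out[n + l + 64] = sc[is + 4] * q3;
            out[n + l + 96] = sc[is + 6] * q4;
        }
        ql += 64;
        qh += 32;
        sc += 8;
    }
}
```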
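Each block type has three associated constants: QK* (the number of values in a block after dequantization), QR* (QK divided by the number of packed values before dequantization), and QI* (the number of 32-bit integers occupied by the packed quants before dequantization).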
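The relationship between the three constants can be sketched in a few lines. The formula `QI = QK / (4 * QR)` and the QR values below follow the upstream llama.cpp convention as commonly documented; treat them as assumptions and check them against the header. The helper names `packed_quant_bytes` and `quant_int32s` are hypothetical.

```cpp
#include <cassert>

// Assumed values for two basic types (from upstream llama.cpp, not quoted in this page).
constexpr int QK4_0 = 32, QR4_0 = 2;  // two 4-bit quants per byte
constexpr int QK8_0 = 32, QR8_0 = 1;  // one 8-bit quant per byte

// Hypothetical helper: bytes in the main packed quant array of one block.
inline int packed_quant_bytes(int qk, int qr) {
    assert(qr > 0 && qk % qr == 0);
    return qk / qr;
}

// Hypothetical helper: the same storage counted in 32-bit integers, i.e. the QI value.
constexpr int quant_int32s(int qk, int qr) {
    return qk / (4 * qr);
}

static_assert(quant_int32s(QK4_0, QR4_0) == 4, "q4_0: 16 bytes of nibbles = 4 int32s");
static_assert(quant_int32s(QK8_0, QR8_0) == 8, "q8_0: 32 bytes of int8 = 8 int32s");
```

Counting in 32-bit integers is convenient for kernels that load packed quants one `int` at a time.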
Usage
Include this header in any CUDA kernel or C++ code that needs to work with GGUF-format quantized model weights. It provides the struct definitions needed for loading, indexing, and dequantizing quantized data blocks.
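A minimal host-side sketch of the indexing step, assuming a row-major tensor in which every row is a contiguous run of q4_0 blocks and the column count is a multiple of the block size. `block_q4_0_host`, `row_bytes_q4_0`, and `block_at` are illustrative names, and `half` is replaced by raw fp16 bits so the code builds without CUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int QK4_0 = 32;  // elements per q4_0 block

// Host-only mirror of block_q4_0 (assumption: `half` stored as raw fp16 bits).
struct block_q4_0_host {
    uint16_t d;              // fp16 bits of the scale
    uint8_t  qs[QK4_0 / 2];  // nibble-packed quants
};
static_assert(sizeof(block_q4_0_host) == 18, "18 bytes per 32 weights, 4.5 bits per weight");

// Bytes occupied by one row of ncols weights.
inline std::size_t row_bytes_q4_0(std::size_t ncols) {
    assert(ncols % QK4_0 == 0);
    return ncols / QK4_0 * sizeof(block_q4_0_host);
}

// Pointer to the block that holds element (row, col) of a row-major tensor.
inline const block_q4_0_host* block_at(const void* data, std::size_t ncols,
                                       std::size_t row, std::size_t col) {
    assert(ncols % QK4_0 == 0 && col < ncols);
    const auto* blocks = static_cast<const block_q4_0_host*>(data);
    const std::size_t blocks_per_row = ncols / QK4_0;
    return blocks + row * blocks_per_row + col / QK4_0;
}
```

Within the returned block, the element's position is `col % QK4_0`.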
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/quantization/gguf/ggml-common.h
- Lines: 1-1029
Signature
// Global constants
#define QK_K 256
#define K_QUANTS_PER_ITERATION 2
#define WARP_SIZE_GGUF 32
#define CUDA_DEQUANTIZE_BLOCK_SIZE 256
#define CUDA_QUANTIZE_BLOCK_SIZE 256
// 4-bit quantization block (32 elements)
#define QK4_0 32
typedef struct {
half d; // delta (scale factor)
uint8_t qs[QK4_0 / 2]; // nibbles / quants
} block_q4_0;
// 4-bit with min value (32 elements)
#define QK4_1 32
typedef struct {
half2 dm; // dm.x = delta, dm.y = min
uint8_t qs[QK4_1 / 2]; // nibbles / quants
} block_q4_1;
// 8-bit quantization block (32 elements)
#define QK8_0 32
typedef struct {
half d; // delta
int8_t qs[QK8_0]; // quants
} block_q8_0;
// K-quant: 2-bit with 256-element super-blocks
typedef struct {
uint8_t scales[QK_K / 16]; // scales and mins, quantized with 4 bits each
uint8_t qs[QK_K / 4]; // quants, 2 bits each
half2 dm; // super-block scales for the quantized scales (dm.x) and mins (dm.y)
} block_q2_K;
// K-quant: 4-bit with 256-element super-blocks
#define K_SCALE_SIZE 12
typedef struct {
half2 dm; // super-block scale and min
uint8_t scales[K_SCALE_SIZE]; // packed scales
uint8_t qs[QK_K / 2]; // nibble quants
} block_q4_K;
// K-quant: 6-bit
typedef struct {
uint8_t ql[QK_K / 2]; // low 4 bits of quants
uint8_t qh[QK_K / 4]; // high 2 bits of quants
int8_t scales[QK_K / 16]; // scales (int8)
half d; // super-block scale
} block_q6_K;
Import
#include "ggml-common.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (header only) | N/A | N/A | This is a definitions-only header; no runtime inputs |
Outputs
| Name | Type | Description |
|---|---|---|
| block_q4_0 | struct | 4-bit quantization block with delta and nibble-packed quants |
| block_q4_1 | struct | 4-bit block with delta, min, and nibble-packed quants |
| block_q8_0 | struct | 8-bit quantization block with delta and int8 quants |
| block_q2_K | struct | 2-bit K-quant super-block (256 elements) |
| block_q4_K | struct | 4-bit K-quant super-block (256 elements) |
| block_q6_K | struct | 6-bit K-quant super-block (256 elements) |
Usage Examples
// Accessing a q4_0 quantized block
const block_q4_0* block = (const block_q4_0*)data_ptr + block_idx;
float scale = __half2float(block->d);
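// Dequantize the nibble pair in qs[i], for 0 <= i < QK4_0 / 2.
// q4_0 nibbles are unsigned with an implicit offset of 8, so subtract 8 before scaling.
uint8_t packed = block->qs[i];
float val_lo = (float)((int)(packed & 0x0F) - 8) * scale; // element i
float val_hi = (float)((int)(packed >> 4) - 8) * scale; // element i + QK4_0 / 2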
// Accessing a K-quant q4_K super-block
const block_q4_K* kb = (const block_q4_K*)data_ptr + block_idx;
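float d = __half2float(kb->dm.x); // super-block scale applied to the 6-bit sub-block scales
float m = __half2float(kb->dm.y); // super-block scale applied to the 6-bit sub-block mins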
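For completeness, here is a host-side sketch of how `d` and `m` might be combined with the packed 6-bit scales to dequantize a whole q4_K super-block. It follows the upstream llama.cpp reference dequantizer as commonly documented (eight sub-blocks of 32 elements, scale/min unpacking as in `get_scale_min_k4`), and assumes `K_SCALE_SIZE` is 12. `block_q4_K_host`, `half2_bits`, `fp16_bits_to_float`, and `dequantize_block_q4_K` are illustrative names; device code would use `__half2float` rather than the manual fp16 conversion.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

constexpr int QK_K = 256;
constexpr int K_SCALE_SIZE = 12;  // assumption: value used upstream for QK_K = 256

struct half2_bits { uint16_t x, y; };  // stand-in for CUDA half2 (raw fp16 bits)

struct block_q4_K_host {
    half2_bits dm;                    // dm.x = super-block scale, dm.y = super-block min
    uint8_t    scales[K_SCALE_SIZE];  // 8 scales and 8 mins, 6 bits each
    uint8_t    qs[QK_K / 2];          // nibble quants
};
static_assert(sizeof(block_q4_K_host) == 144, "layout should have no padding");

// Converts raw IEEE 754 binary16 bits to float.
inline float fp16_bits_to_float(uint16_t h) {
    const int sign = (h >> 15) & 0x1;
    const int exp  = (h >> 10) & 0x1F;
    const int mant = h & 0x3FF;
    float v;
    if (exp == 0) {
        v = std::ldexp(static_cast<float>(mant), -24);               // zero or subnormal
    } else if (exp == 31) {
        v = mant ? std::numeric_limits<float>::quiet_NaN()
                 : std::numeric_limits<float>::infinity();
    } else {
        v = std::ldexp(static_cast<float>(mant | 0x400), exp - 25);  // normal
    }
    return sign ? -v : v;
}

// Unpacks the 6-bit scale and min of sub-block j (0..7) from the 12-byte array.
inline void get_scale_min_k4(int j, const uint8_t* q, uint8_t* sc, uint8_t* mn) {
    if (j < 4) {
        *sc = static_cast<uint8_t>(q[j] & 63);
        *mn = static_cast<uint8_t>(q[j + 4] & 63);
    } else {
        *sc = static_cast<uint8_t>((q[j + 4] & 0x0F) | ((q[j - 4] >> 6) << 4));
        *mn = static_cast<uint8_t>((q[j + 4] >> 4)   | ((q[j]     >> 6) << 4));
    }
}

// Writes QK_K floats to y: value = d * scale * quant - m * min.
inline void dequantize_block_q4_K(const block_q4_K_host& b, float* y) {
    assert(y != nullptr);
    const float d = fp16_bits_to_float(b.dm.x);
    const float m = fp16_bits_to_float(b.dm.y);
    const uint8_t* q = b.qs;
    int is = 0;
    for (int j = 0; j < QK_K; j += 64) {  // 32 bytes of qs cover two sub-blocks
        uint8_t sc, mn;
        get_scale_min_k4(is, b.scales, &sc, &mn);
        const float d1 = d * sc, m1 = m * mn;
        get_scale_min_k4(is + 1, b.scales, &sc, &mn);
        const float d2 = d * sc, m2 = m * mn;
        for (int l = 0; l < 32; ++l) *y++ = d1 * (q[l] & 0x0F) - m1;  // low nibbles
        for (int l = 0; l < 32; ++l) *y++ = d2 * (q[l] >> 4) - m2;    // high nibbles
        q += 32;
        is += 2;
    }
}
```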