Implementation:Sgl project Sglang GGML Common

Knowledge Sources	Sgl_project_Sglang
Domains	Quantization, GGUF_Format, Data_Structures
Last Updated	2026-02-10 00:00 GMT

Overview

Foundation header defining data structures and constants for GGUF/GGML quantization formats, enabling SGLang to load and dequantize llama.cpp-compatible quantized models.

Description

The ggml-common.h header, adapted from vLLM and originally from llama.cpp, defines the block structures and constants used across all GGML quantization types. The file establishes:

Global constants:

QK_K = 256 -- super-block size for K-quant types
K_QUANTS_PER_ITERATION = 2 -- quants processed per iteration
WARP_SIZE_GGUF = 32 -- warp size for CUDA kernels
CUDA_DEQUANTIZE_BLOCK_SIZE = 256 and CUDA_QUANTIZE_BLOCK_SIZE = 256 -- CUDA thread block sizes

Basic quantization block structures:

block_q4_0 / block_q4_1 -- 4-bit quantization with 32-element blocks. q4_0 stores delta (half) and nibble-packed quants; q4_1 adds a min value via half2
block_q5_0 / block_q5_1 -- 5-bit quantization adding a high-bit array
block_q8_0 / block_q8_1 -- 8-bit quantization with per-block scale factors
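
To make the storage cost of these layouts concrete, the following host-side sketch mirrors three of the basic blocks and checks their sizes. It is an illustration under assumptions, not the header itself: `fp16_bits` and `fp16x2_bits` are hypothetical stand-ins for CUDA's `half` and `half2`, the `host_` struct names and `bits_per_weight` are invented here, and the block size of 32 is taken from the description above.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 16-bit integers stand in for CUDA's `half` so this builds on the host.
using fp16_bits = std::uint16_t;
struct fp16x2_bits { fp16_bits x, y; };  // stand-in for `half2`

constexpr int kBlock = 32;  // elements per basic block

struct host_block_q4_0 { fp16_bits d;    std::uint8_t qs[kBlock / 2]; };
struct host_block_q4_1 { fp16x2_bits dm; std::uint8_t qs[kBlock / 2]; };
struct host_block_q8_0 { fp16_bits d;    std::int8_t  qs[kBlock];     };

// Scale (and min) bytes plus packed quants, with no padding expected.
static_assert(sizeof(host_block_q4_0) == 18, "2-byte scale + 16 bytes of nibbles");
static_assert(sizeof(host_block_q4_1) == 20, "4-byte scale/min + 16 bytes of nibbles");
static_assert(sizeof(host_block_q8_0) == 34, "2-byte scale + 32 int8 quants");

// Effective storage cost per weight, including the per-block metadata.
template <typename Block>
constexpr double bits_per_weight() {
    return sizeof(Block) * 8.0 / kBlock;
}
```

Under these assumptions the per-block scale adds half a bit per weight to q4_0 and q8_0, and the extra min in q4_1 adds a full bit.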

K-quant structures (QK_K=256 super-blocks):

block_q2_K -- 2-bit quantization with 4-bit quantized scales/mins and a half2 holding the super-block scales for those scales and mins
block_q3_K -- 3-bit quantization with high-bit mask, low 2-bit quants, and packed scales
block_q4_K -- 4-bit K-quant with a half2 holding the super-block scale and min, plus 6-bit packed sub-block scales and mins
block_q5_K -- 5-bit K-quant with high-bit masks
block_q6_K -- 6-bit quantization with separate high/low bit fields and int8 scales
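
The "6-bit packed scales" of block_q4_K can be hard to picture, so here is a minimal sketch of how eight 6-bit scales and eight 6-bit mins fit into 12 bytes. The bit layout follows the reference K-quant code in upstream llama.cpp as commonly described; the helper name `unpack_scale_min_k4` and the value `K_SCALE_SIZE = 12` are assumptions for this example and should be checked against the header and the kernels that consume it.

```cpp
#include <cassert>
#include <cstdint>

constexpr int K_SCALE_SIZE = 12;  // assumed: 8 scales + 8 mins, 6 bits each = 96 bits

// Extract the 6-bit scale and min of sub-block j (0..7) from the packed array q.
// Sub-blocks 0..3 keep their 6 bits in the low bits of q[j] and q[j + 4].
// Sub-blocks 4..7 take their low 4 bits from a nibble of q[j + 4] and their
// high 2 bits from the top 2 bits of the bytes used by sub-blocks 0..3.
inline void unpack_scale_min_k4(int j, const std::uint8_t* q,
                                std::uint8_t* scale, std::uint8_t* min) {
    assert(j >= 0 && j < 8);
    if (j < 4) {
        *scale = static_cast<std::uint8_t>(q[j] & 63);
        *min   = static_cast<std::uint8_t>(q[j + 4] & 63);
    } else {
        *scale = static_cast<std::uint8_t>((q[j + 4] & 0x0F) | ((q[j - 4] >> 6) << 4));
        *min   = static_cast<std::uint8_t>((q[j + 4] >> 4)   | ((q[j] >> 6) << 4));
    }
}
```

A dequantization routine would multiply each unpacked scale and min by the super-block values stored in `dm` before applying them to the 32 quants of that sub-block.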

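Each block type has associated constants: QK* (number of values in a block after dequantization), QR* (QK* divided by the number of stored quant values before dequantization), and QI* (number of 32-bit integers that hold the block's quants before dequantization).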
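
The relationship between the three constants can be sketched as a single formula, QI = QK / (4 * QR). The (QK, QR) pairs below follow upstream llama.cpp conventions and are assumptions here rather than values quoted from this header; `quant_int32s` is a hypothetical helper name.

```cpp
#include <cassert>

// QK / QR is the byte count of the block's primary quant array
// (qs, or ql for q6_K); dividing by 4 converts bytes to 32-bit integers.
constexpr int quant_int32s(int qk, int qr) {
    return qk / (4 * qr);
}

// Assumed (QK, QR) pairs for a few block types.
constexpr int QK4_0 = 32,  QR4_0 = 2;  // two 4-bit values per byte
constexpr int QK8_0 = 32,  QR8_0 = 1;  // one 8-bit value per byte
constexpr int QK_K  = 256;
constexpr int QR2_K = 4;               // four 2-bit values per byte
constexpr int QR4_K = 2;               // two 4-bit values per byte

static_assert(quant_int32s(QK4_0, QR4_0) == 4,  "16 bytes of nibbles");
static_assert(quant_int32s(QK8_0, QR8_0) == 8,  "32 bytes of int8 quants");
static_assert(quant_int32s(QK_K,  QR2_K) == 16, "64 bytes of 2-bit quants");
static_assert(quant_int32s(QK_K,  QR4_K) == 32, "128 bytes of nibbles");
```

Kernels that read quants as packed 32-bit words can use QI to decide how many threads cooperate on one block.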

Usage

Include this header in any CUDA kernel or C++ code that needs to work with GGUF-format quantized model weights. It provides the struct definitions needed for loading, indexing, and dequantizing quantized data blocks.
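
As a rough illustration of the indexing mentioned above, the sketch below maps a flat element index to a block index and an in-block offset, and computes the byte length of one quantized row. `BlockPos`, `locate_element`, and `row_bytes` are hypothetical names, and the sketch assumes the row length is a multiple of the block size.

```cpp
#include <cassert>
#include <cstddef>

struct BlockPos {
    std::size_t block;   // index of the block that holds the element
    std::size_t offset;  // position of the element inside that block
};

// Map a flat element index to (block, offset) for a block size of qk elements.
constexpr BlockPos locate_element(std::size_t element, std::size_t qk) {
    return BlockPos{element / qk, element % qk};
}

// Bytes needed for one row of ncols elements.
// Assumption: ncols is an exact multiple of qk.
constexpr std::size_t row_bytes(std::size_t ncols, std::size_t qk,
                                std::size_t block_bytes) {
    return (ncols / qk) * block_bytes;
}
```

For example, with 18-byte q4_0 blocks of 32 elements and 144-byte q4_K super-blocks of 256 elements (sizes derived from the field layouts, so treat them as assumptions), a 4096-column row occupies the same number of bytes in both formats.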

Code Reference

Source Location

Repository: Sgl_project_Sglang
File: sgl-kernel/csrc/quantization/gguf/ggml-common.h
Lines: 1-1029

Signature

// Global constants
#define QK_K 256
#define K_QUANTS_PER_ITERATION 2
#define WARP_SIZE_GGUF 32
#define CUDA_DEQUANTIZE_BLOCK_SIZE 256

// 4-bit quantization block (32 elements)
#define QK4_0 32
typedef struct {
  half d;                   // delta (scale factor)
  uint8_t qs[QK4_0 / 2];   // nibbles / quants
} block_q4_0;

// 4-bit with min value (32 elements)
#define QK4_1 32
typedef struct {
  half2 dm;                 // dm.x = delta, dm.y = min
  uint8_t qs[QK4_1 / 2];   // nibbles / quants
} block_q4_1;

// 8-bit quantization block (32 elements)
#define QK8_0 32
typedef struct {
  half d;                   // delta
  int8_t qs[QK8_0];        // quants
} block_q8_0;

// K-quant: 2-bit with 256-element super-blocks
typedef struct {
  uint8_t scales[QK_K / 16];  // quantized scales and mins
  uint8_t qs[QK_K / 4];       // quants
  half2 dm;                    // dm.x = super-block scale for scales, dm.y = super-block scale for mins
} block_q2_K;

// K-quant: 4-bit with 256-element super-blocks
#define K_SCALE_SIZE 12
typedef struct {
  half2 dm;                    // dm.x = super-block scale for scales, dm.y = super-block scale for mins
  uint8_t scales[K_SCALE_SIZE]; // packed scales
  uint8_t qs[QK_K / 2];       // nibble quants
} block_q4_K;

// K-quant: 6-bit
typedef struct {
  uint8_t ql[QK_K / 2];       // low 4 bits of quants
  uint8_t qh[QK_K / 4];       // high 2 bits of quants
  int8_t scales[QK_K / 16];   // scales (int8)
  half d;                      // super-block scale
} block_q6_K;

Import

#include "ggml-common.h"

I/O Contract

Inputs

Name	Type	Required	Description
(header only)	N/A	N/A	This is a definitions-only header; no runtime inputs

Outputs

Name	Type	Description
block_q4_0	struct	4-bit quantization block with delta and nibble-packed quants
block_q4_1	struct	4-bit block with delta, min, and nibble-packed quants
block_q8_0	struct	8-bit quantization block with delta and int8 quants
block_q2_K	struct	2-bit K-quant super-block (256 elements)
block_q4_K	struct	4-bit K-quant super-block (256 elements)
block_q6_K	struct	6-bit K-quant super-block (256 elements)

Usage Examples

// Accessing a q4_0 quantized block
const block_q4_0* block = (const block_q4_0*)data_ptr + block_idx;
float scale = __half2float(block->d);

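// Dequantize a nibble pair from q4_0 (4-bit values are stored with an offset of 8)
uint8_t packed = block->qs[i];                        // i in [0, QK4_0 / 2)
float val_lo = (float)((packed & 0x0F) - 8) * scale;  // element i
float val_hi = (float)((packed >> 4) - 8) * scale;    // element i + QK4_0 / 2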

// Accessing a K-quant q4_K super-block
const block_q4_K* kb = (const block_q4_K*)data_ptr + block_idx;
float d = __half2float(kb->dm.x);
float m = __half2float(kb->dm.y);
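
For readers without a CUDA toolchain, the q4_0 access pattern above can be sketched on the host for a whole block. This is an illustration under assumptions rather than the kernel SGLang ships: `host_block_q4_0` stores the scale as raw FP16 bits, `fp16_bits_to_float` is a hand-written stand-in for `__half2float`, and the element ordering (low nibbles fill the first half of the block, high nibbles the second half) follows the llama.cpp reference dequantizer.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

constexpr int QK4_0 = 32;

// Host mirror of block_q4_0 with the scale kept as raw IEEE 754 half bits.
struct host_block_q4_0 {
    std::uint16_t d;
    std::uint8_t qs[QK4_0 / 2];
};

// Convert raw binary16 bits to float (stand-in for __half2float).
inline float fp16_bits_to_float(std::uint16_t h) {
    const int sign = (h >> 15) & 0x1;
    const int exp  = (h >> 10) & 0x1F;
    const int mant = h & 0x3FF;
    float value;
    if (exp == 0) {
        value = std::ldexp(static_cast<float>(mant), -24);  // zero or subnormal
    } else if (exp == 31) {
        value = mant ? std::numeric_limits<float>::quiet_NaN()
                     : std::numeric_limits<float>::infinity();
    } else {
        value = std::ldexp(static_cast<float>(mant | 0x400), exp - 25);  // normal
    }
    return sign ? -value : value;
}

// Dequantize one block into out[0..31].
inline void dequantize_block_q4_0(const host_block_q4_0& b, float* out) {
    const float d = fp16_bits_to_float(b.d);
    for (int i = 0; i < QK4_0 / 2; ++i) {
        const int lo = (b.qs[i] & 0x0F) - 8;  // stored nibble 0..15 maps to -8..7
        const int hi = (b.qs[i] >> 4) - 8;
        out[i]             = static_cast<float>(lo) * d;
        out[i + QK4_0 / 2] = static_cast<float>(hi) * d;
    }
}
```

A caller would fill a `host_block_q4_0` from the raw tensor bytes and pass a 32-float output buffer; a stored nibble of 8 therefore decodes to zero regardless of the scale.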

Related Pages

Environment:Sgl_project_Sglang_CUDA_GPU_Runtime
