Implementation:Sgl project Sglang GGML Common


Knowledge Sources
Domains: Quantization, GGUF_Format, Data_Structures
Last Updated: 2026-02-10 00:00 GMT

Overview

Foundation header defining data structures and constants for GGUF/GGML quantization formats, enabling SGLang to load and dequantize llama.cpp-compatible quantized models.

Description

The ggml-common.h header, adapted from vLLM and originally from llama.cpp, defines the block structures and constants used across all GGML quantization types. The file establishes:

Global constants:

  • QK_K = 256 -- super-block size for K-quant types
  • K_QUANTS_PER_ITERATION = 2 -- quants processed per iteration
  • WARP_SIZE_GGUF = 32 -- warp size for CUDA kernels
  • CUDA_DEQUANTIZE_BLOCK_SIZE = 256 and CUDA_QUANTIZE_BLOCK_SIZE = 256 -- CUDA thread block sizes

Basic quantization block structures:

  • block_q4_0 / block_q4_1 -- 4-bit quantization with 32-element blocks. q4_0 stores delta (half) and nibble-packed quants; q4_1 adds a min value via half2
  • block_q5_0 / block_q5_1 -- 5-bit quantization adding a high-bit array
  • block_q8_0 / block_q8_1 -- 8-bit quantization with per-block scale factors

K-quant structures (QK_K=256 super-blocks):

  • block_q2_K -- 2-bit quantization with 4-bit quantized scales/mins and half2 super-block scale
  • block_q3_K -- 3-bit quantization with high-bit mask, low 2-bit quants, and packed scales
  • block_q4_K -- 4-bit K-quant with half2 scale/min and 6-bit packed scales
  • block_q5_K -- 5-bit K-quant with high-bit masks
  • block_q6_K -- 6-bit quantization with separate high/low bit fields and int8 scales
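The same size check can be sketched for the K-quant super-blocks whose fields are listed under Signature. This is an illustration under stated assumptions: `half_bits` and `half2_bits` are hypothetical stand-ins for CUDA `half` and `half2`, the `_host` struct names are not part of the header, and `K_SCALE_SIZE = 12` is the value used by upstream llama.cpp, assumed unchanged here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: integer storage stands in for CUDA `half` / `half2`.
typedef uint16_t half_bits;
struct half2_bits { half_bits x, y; };

constexpr int QK_K = 256;
constexpr int K_SCALE_SIZE = 12;  // assumed: value from upstream llama.cpp

struct block_q2_K_host {
  uint8_t scales[QK_K / 16];  // 4-bit scales and 4-bit mins, one byte per 16 values
  uint8_t qs[QK_K / 4];       // four 2-bit quants per byte
  half2_bits dm;              // super-block scale pair
};
struct block_q4_K_host {
  half2_bits dm;                 // super-block scale and min
  uint8_t scales[K_SCALE_SIZE];  // 6-bit packed scales and mins
  uint8_t qs[QK_K / 2];          // two 4-bit quants per byte
};
struct block_q6_K_host {
  uint8_t ql[QK_K / 2];      // low 4 bits of each quant
  uint8_t qh[QK_K / 4];      // high 2 bits of each quant
  int8_t scales[QK_K / 16];  // int8 scales
  half_bits d;               // super-block scale
};

static_assert(sizeof(block_q2_K_host) == 84, "16 + 64 + 4 bytes");
static_assert(sizeof(block_q4_K_host) == 144, "4 + 12 + 128 bytes");
static_assert(sizeof(block_q6_K_host) == 210, "128 + 64 + 16 + 2 bytes");

// Storage cost in bits per weight for one 256-value super-block.
inline double k_bits_per_weight(std::size_t block_bytes) {
  assert(block_bytes > 0);
  return 8.0 * static_cast<double>(block_bytes) / QK_K;
}
```

With these layouts the effective cost works out to 2.625, 4.5, and 6.5625 bits per weight for q2_K, q4_K, and q6_K respectively, which shows how much the scale metadata adds on top of the nominal bit width.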

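Each block type has three associated constants: QK* (the number of values in a block after dequantization), QR* (QK divided by the number of stored quant values before dequantization, i.e. how many values are packed into each stored quant), and QI* (the number of 32-bit integers that hold the block's quants before dequantization).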
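A short worked example may make the relationship between the three constants concrete. The values below follow the upstream llama.cpp definitions for q4_0 and q8_0, where QI is derived as QK / (4 * QR); that this header keeps the same values is an assumption, and `packed_quant_bytes` is a hypothetical helper.

```cpp
#include <cassert>

// Assumed values, following upstream llama.cpp.
constexpr int QK4_0 = 32;                     // values per block after dequantization
constexpr int QR4_0 = 2;                      // two 4-bit values per stored byte
constexpr int QI4_0 = QK4_0 / (4 * QR4_0);    // 32-bit words of packed quants

constexpr int QK8_0 = 32;
constexpr int QR8_0 = 1;                      // one 8-bit value per stored byte
constexpr int QI8_0 = QK8_0 / (4 * QR8_0);

static_assert(QI4_0 == 4, "q4_0 quants occupy four 32-bit words");
static_assert(QI8_0 == 8, "q8_0 quants occupy eight 32-bit words");

// Bytes of packed quants implied by a QI value (four bytes per 32-bit word).
inline int packed_quant_bytes(int qi) {
  assert(qi > 0);
  return qi * 4;
}
```

Here `packed_quant_bytes(QI4_0)` is 16, matching the `qs[QK4_0 / 2]` array in `block_q4_0`, and `packed_quant_bytes(QI8_0)` is 32, matching `qs[QK8_0]` in `block_q8_0`.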

Usage

Include this header in any CUDA kernel or C++ code that needs to work with GGUF-format quantized model weights. It provides the struct definitions needed for loading, indexing, and dequantizing quantized data blocks.
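Indexing into quantized weights usually means mapping an element index to a block index plus an offset inside that block. The sketch below shows that arithmetic on the host; `BlockCoord`, `locate`, and `row_bytes` are hypothetical helpers for illustration and are not provided by the header. The block byte sizes passed in are assumed to come from `sizeof` on the relevant struct.

```cpp
#include <cassert>
#include <cstddef>

// Position of one element inside a row of quantized blocks.
struct BlockCoord {
  std::size_t block;  // index of the block that holds the element
  int offset;         // position of the element inside that block
};

// Map a flat element index to (block, offset) for blocks of `qk` values.
inline BlockCoord locate(std::size_t elem_idx, int qk) {
  assert(qk > 0);
  const std::size_t n = static_cast<std::size_t>(qk);
  return BlockCoord{elem_idx / n, static_cast<int>(elem_idx % n)};
}

// Bytes needed to store one row of `n_cols` values in blocks of `qk` values.
// Rows are assumed to be a whole number of blocks.
inline std::size_t row_bytes(std::size_t n_cols, int qk, std::size_t block_bytes) {
  assert(qk > 0 && n_cols % static_cast<std::size_t>(qk) == 0);
  return n_cols / static_cast<std::size_t>(qk) * block_bytes;
}
```

For example, element 70 of a q4_0 row (32 values per block) lives in block 2 at offset 6. A 4096-column row takes 128 blocks of 18 bytes in q4_0, or 16 super-blocks of 144 bytes in q4_K, assuming the struct sizes implied by the field layouts.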

Code Reference

Source Location

Signature

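// Global constants
#define QK_K 256
#define K_QUANTS_PER_ITERATION 2
#define WARP_SIZE_GGUF 32
#define K_SCALE_SIZE 12   // bytes of 6-bit packed scales/mins (used by block_q4_K)
#define CUDA_DEQUANTIZE_BLOCK_SIZE 256
#define CUDA_QUANTIZE_BLOCK_SIZE 256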

// 4-bit quantization block (32 elements)
#define QK4_0 32
typedef struct {
  half d;                   // delta (scale factor)
  uint8_t qs[QK4_0 / 2];   // nibbles / quants
} block_q4_0;

// 4-bit with min value (32 elements)
#define QK4_1 32
typedef struct {
  half2 dm;                 // dm.x = delta, dm.y = min
  uint8_t qs[QK4_1 / 2];   // nibbles / quants
} block_q4_1;

// 8-bit quantization block (32 elements)
#define QK8_0 32
typedef struct {
  half d;                   // delta
  int8_t qs[QK8_0];        // quants
} block_q8_0;

// K-quant: 2-bit with 256-element super-blocks
typedef struct {
  uint8_t scales[QK_K / 16];  // quantized scales and mins
  uint8_t qs[QK_K / 4];       // quants
  half2 dm;                    // dm.x = super-block scale for scales, dm.y = super-block scale for mins
} block_q2_K;

// K-quant: 4-bit with 256-element super-blocks
typedef struct {
  half2 dm;                    // super-block scale and min
  uint8_t scales[K_SCALE_SIZE]; // packed scales
  uint8_t qs[QK_K / 2];       // nibble quants
} block_q4_K;

// K-quant: 6-bit
typedef struct {
  uint8_t ql[QK_K / 2];       // low 4 bits of quants
  uint8_t qh[QK_K / 4];       // high 2 bits of quants
  int8_t scales[QK_K / 16];   // scales (int8)
  half d;                      // super-block scale
} block_q6_K;

Import

#include "ggml-common.h"

I/O Contract

Inputs

Name           Type  Required  Description
(header only)  N/A   N/A       This is a definitions-only header; no runtime inputs

Outputs

Name        Type    Description
block_q4_0  struct  4-bit quantization block with delta and nibble-packed quants
block_q4_1  struct  4-bit block with delta, min, and nibble-packed quants
block_q8_0  struct  8-bit quantization block with delta and int8 quants
block_q2_K  struct  2-bit K-quant super-block (256 elements)
block_q4_K  struct  4-bit K-quant super-block (256 elements)
block_q6_K  struct  6-bit K-quant super-block (256 elements)

Usage Examples

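// Accessing a q4_0 quantized block
const block_q4_0* block = (const block_q4_0*)data_ptr + block_idx;
float scale = __half2float(block->d);

// Dequantize a nibble pair from q4_0 (i in [0, QK4_0 / 2)).
// Stored nibbles are unsigned 0..15; subtract 8 to recover the signed quant.
uint8_t packed = block->qs[i];
float val_lo = (float)((int)(packed & 0x0F) - 8) * scale;  // element i
float val_hi = (float)((int)(packed >> 4) - 8) * scale;    // element i + QK4_0 / 2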

// Accessing a K-quant q4_K super-block
const block_q4_K* kb = (const block_q4_K*)data_ptr + block_idx;
float d = __half2float(kb->dm.x);
float m = __half2float(kb->dm.y);
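The q4_0 snippet above can be extended into a complete per-block dequantization routine. The following is a host-side sketch rather than the CUDA kernel used by SGLang: `block_q4_0_host`, `half_bits_to_float`, and `dequantize_block_q4_0` are hypothetical names, and a manual IEEE 754 half-precision decoder replaces `__half2float` so the code compiles without CUDA headers. The element ordering (low nibbles fill the first half of the block, high nibbles the second half) follows the llama.cpp reference layout.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int QK4_0 = 32;

// Host-side mirror of block_q4_0; uint16_t holds the raw bits of the half scale.
struct block_q4_0_host {
  uint16_t d;             // delta (scale), IEEE 754 binary16 bits
  uint8_t qs[QK4_0 / 2];  // two 4-bit quants per byte
};

// Decode IEEE 754 binary16 bits to float (stand-in for __half2float).
inline float half_bits_to_float(uint16_t h) {
  const uint32_t sign = (h >> 15) & 0x1;
  const uint32_t exp = (h >> 10) & 0x1F;
  const uint32_t mant = h & 0x3FF;
  float value;
  if (exp == 0) {
    value = std::ldexp(static_cast<float>(mant), -24);  // zero or subnormal
  } else if (exp == 31) {
    value = mant ? NAN : INFINITY;
  } else {
    // (1 + mant / 1024) * 2^(exp - 15), written as an integer times a power of two
    value = std::ldexp(static_cast<float>(mant | 0x400), static_cast<int>(exp) - 25);
  }
  return sign ? -value : value;
}

// Expand one block into QK4_0 floats: x = (q - 8) * d.
inline void dequantize_block_q4_0(const block_q4_0_host& b, float* out) {
  assert(out != nullptr);
  const float d = half_bits_to_float(b.d);
  for (int j = 0; j < QK4_0 / 2; ++j) {
    const int lo = (b.qs[j] & 0x0F) - 8;  // element j
    const int hi = (b.qs[j] >> 4) - 8;    // element j + QK4_0 / 2
    out[j] = static_cast<float>(lo) * d;
    out[j + QK4_0 / 2] = static_cast<float>(hi) * d;
  }
}
```

As a quick check of the arithmetic: with a scale of 0.5 (half bits `0x3800`) and `qs[0] = 0xF0`, the low nibble 0 decodes to (0 - 8) * 0.5 = -4.0 at index 0, and the high nibble 15 decodes to (15 - 8) * 0.5 = 3.5 at index 16.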
