Implementation:Vllm project Vllm Marlin Generate Kernels
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Marlin, Code_Generation |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Python build-time script that generates Marlin CUDA kernel instantiations for various quantization formats, thread configurations, and GPU architectures using Jinja2 templates.
Description
The script generates architecture-specific Marlin kernel specialization files (.cu) and a kernel_selector.h header used for runtime dispatch. Its QUANT_CONFIGS list covers the AWQ-INT4, GPTQ-INT4, GPTQ-INT8, FP8, NVFP4, and MXFP4 quantization schemes, paired with FP16, BF16, INT8, and FP8 activation types. For each configuration, the generator iterates over every combination of thread configuration (THREAD_CONFIGS), M-block size (THREAD_M_BLOCKS), and group block size, and writes separate kernel files for the SM75, SM80, and SM89 targets. The target GPU architectures are passed as a single comma-separated command-line argument, which determines whether FP8 kernels and the architecture-specific kernel sets are generated.
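The enumeration over thread configurations, M-block sizes, and group block sizes can be pictured as a Cartesian product. The following is a minimal sketch, not the script's actual loop: the GROUP_BLOCKS values, the field names (thread_k, thread_n, threads), and the enumerate_variants helper are assumptions made for illustration, and the real generator may skip combinations that this sketch keeps.

```python
from itertools import product

# Values copied from the Signature section of this page.
THREAD_CONFIGS = [(128, 128, 256), (64, 256, 256), (64, 128, 128), (128, 64, 128)]
THREAD_M_BLOCKS = [0.5, 1, 2, 3, 4]

# Hypothetical group block sizes, chosen only for illustration.
GROUP_BLOCKS = [-1, 2, 4, 8]


def enumerate_variants(thread_configs, m_blocks, group_blocks):
    """Return one dict per (thread config, M-block, group block) combination."""
    variants = []
    for (thread_k, thread_n, threads), m, g in product(
        thread_configs, m_blocks, group_blocks
    ):
        variants.append(
            {
                "thread_k": thread_k,  # assumed meaning of tuple element 0
                "thread_n": thread_n,  # assumed meaning of tuple element 1
                "threads": threads,    # assumed meaning of tuple element 2
                "m_blocks": m,
                "group_blocks": g,
            }
        )
    return variants


variants = enumerate_variants(THREAD_CONFIGS, THREAD_M_BLOCKS, GROUP_BLOCKS)
print(len(variants))  # 4 * 5 * 4 = 80 combinations
```

With these illustrative inputs a single quantization configuration already yields 80 candidate instantiations, which is why the kernels are generated by a script rather than written by hand.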
Usage
This script is executed during the vLLM build process, receiving the target CUDA architectures as a command-line argument. It removes previously generated kernel files and writes new ones to the Marlin source directory.
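The cleanup step can be sketched as a glob-and-delete pass over the output directory. This is an illustration under stated assumptions: the filename patterns are inferred from the output names listed on this page, and remove_generated_files is a hypothetical helper, not the body of remove_old_kernels().

```python
import tempfile
from pathlib import Path

# Hypothetical patterns, inferred from the output names listed on this page.
GENERATED_PATTERNS = ("sm*_kernel_*.cu", "kernel_selector.h")


def remove_generated_files(directory, patterns=GENERATED_PATTERNS):
    """Delete files in `directory` matching any pattern; return the removed names."""
    removed = []
    for pattern in patterns:
        for path in sorted(Path(directory).glob(pattern)):
            path.unlink()
            removed.append(path.name)
    return removed


# Demonstration in a throwaway directory so that no real sources are touched.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sm80_kernel_a.cu", "kernel_selector.h", "marlin.cu"):
        (Path(tmp) / name).write_text("// placeholder\n")
    removed = remove_generated_files(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(removed)    # ['sm80_kernel_a.cu', 'kernel_selector.h']
print(remaining)  # ['marlin.cu']
```

Deleting stale outputs before regenerating keeps kernels for a previously selected architecture from lingering in the source directory and being compiled by accident.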
Code Reference
Source Location
- Repository: vllm
- File: csrc/quantization/marlin/generate_kernels.py
- Lines: 1-307
Signature
THREAD_CONFIGS = [
    (128, 128, 256), (64, 256, 256),
    (64, 128, 128), (128, 64, 128)
]
THREAD_M_BLOCKS = [0.5, 1, 2, 3, 4]
QUANT_CONFIGS = [
    {"b_type": "kU4", ...},       # AWQ-INT4
    {"b_type": "kU4B8", ...},     # GPTQ-INT4
    {"b_type": "kU8B128", ...},   # GPTQ-INT8
    {"b_type": "kFE4M3fn", ...},  # FP8
    {"b_type": "kFE2M1f", ...},   # NVFP4
    # ... additional configs for mixed activation types
]
def remove_old_kernels() -> None: ...
def generate_new_kernels() -> None: ...
Import
# This is a build-time code generator script; it is not imported at runtime.
# It is executed via:
python csrc/quantization/marlin/generate_kernels.py "8.0,8.9"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| sys.argv[1] | str | Yes | Comma-separated list of target CUDA compute capabilities (e.g., "8.0,8.9,9.0") |
| QUANT_CONFIGS | list[dict] | Built-in | List of quantization configurations defined in the script (b_type, thread configs, m_blocks, and group_blocks); not supplied by the caller |
| TEMPLATE | str (Jinja2) | Built-in | Jinja2 template string for Marlin kernel instantiation, defined in the script; not supplied by the caller |
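Turning the comma-separated argument into comparable integers might look like the sketch below. The helper name parse_compute_capabilities is hypothetical, and the sketch only handles plain major.minor values; suffixed forms such as "9.0a" are not handled and would raise ValueError here.

```python
def parse_compute_capabilities(arg):
    """Convert a string like "8.0,8.9" into sorted unique integers like [80, 89]."""
    archs = set()
    for token in arg.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray commas
        major, minor = token.split(".")
        archs.add(int(major) * 10 + int(minor))
    return sorted(archs)


print(parse_compute_capabilities("8.0,8.9,9.0"))  # [80, 89, 90]
```

Integer codes make it straightforward to test for a specific target (for example, whether 75 is present) before deciding which kernel families to emit.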
Outputs
| Name | Type | Description |
|---|---|---|
| sm80_kernel_*.cu | CUDA source files | Generated kernel instantiation files for SM80+ architectures |
| sm75_kernel_*.cu | CUDA source files | Generated kernel instantiation files for SM75 (Turing) with 2-stage pipeline |
| sm89_kernel_*.cu | CUDA source files | Generated kernel instantiation files for SM89 (FP8-capable) architectures |
| kernel_selector.h | C++ header | Generated dispatch header with if/else chains mapping runtime parameters to kernel templates |
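The shape of an if/else dispatch chain can be illustrated with a small text generator. The script itself is described as using Jinja2 templates; this sketch swaps in plain f-strings to stay dependency-free, and the parameter names, values, and kernel names are all hypothetical.

```python
def render_dispatch_chain(entries):
    """Build C++ `if / else if` text from (conditions, kernel_name) pairs."""
    lines = []
    for index, (conditions, kernel_name) in enumerate(entries):
        keyword = "if" if index == 0 else "else if"
        test = " && ".join(f"{name} == {value}" for name, value in conditions.items())
        lines.append(f"{keyword} ({test})")
        lines.append(f"  kernel = {kernel_name};")
    return "\n".join(lines)


# Hypothetical runtime parameters and kernel names, for illustration only.
entries = [
    ({"b_type": "vllm::kU4", "thread_k_blocks": 8}, "kernel_a"),
    ({"b_type": "vllm::kU4B8", "thread_k_blocks": 4}, "kernel_b"),
]
header_text = render_dispatch_chain(entries)
print(header_text)
```

Each generated branch compares the runtime parameters against one compiled specialization, so the header only ever references kernels that were actually instantiated for the selected architectures.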
Usage Examples
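# Build-time invocation with target architectures.
# Generates kernels for Ampere (8.0) and Ada Lovelace (8.9).
import subprocess
import sys

subprocess.run(
    [
        sys.executable,  # reuse the interpreter that is running the build
        "csrc/quantization/marlin/generate_kernels.py",
        "8.0,8.9",
    ],
    check=True,  # raise CalledProcessError if generation fails
)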
# The script generates files like:
# sm80_kernel_float16_u4_float16.cu
# sm80_kernel_float16_u4b8_float16.cu
# sm89_kernel_fe4m3fn_u4b8_float16.cu
# kernel_selector.h
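The example filenames above suggest a naming scheme of target, then three type names with the leading "k" dropped and the rest lowercased. The sketch below reproduces that pattern. It is an inference from the listed names rather than the script's code: the kernel_filename helper, the a_type/b_type/c_type ordering, and the "kFloat16" type name are assumptions.

```python
def kernel_filename(sm, a_type, b_type, c_type):
    """Build a name like sm80_kernel_float16_u4_float16.cu from type identifiers."""

    def short(type_name):
        # "kU4B8" -> "u4b8": drop a leading "k" and lowercase the rest.
        stripped = type_name[1:] if type_name.startswith("k") else type_name
        return stripped.lower()

    return f"sm{sm}_kernel_{short(a_type)}_{short(b_type)}_{short(c_type)}.cu"


print(kernel_filename(80, "kFloat16", "kU4", "kFloat16"))
print(kernel_filename(89, "kFE4M3fn", "kU4B8", "kFloat16"))
```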