Implementation:Vllm project Vllm Marlin Generate Kernels

Knowledge Sources	vllm
Domains	Quantization, Marlin, Code_Generation
Last Updated	2026-02-08 00:00 GMT

Overview

Python build-time script that generates Marlin CUDA kernel instantiations for various quantization formats, thread configurations, and GPU architectures using Jinja2 templates.

Description

This script generates architecture-specific Marlin kernel specialization files (.cu) and a kernel_selector.h header for runtime dispatch. It processes a built-in list, QUANT_CONFIGS, covering the AWQ-INT4, GPTQ-INT4, GPTQ-INT8, FP8, NVFP4, and MXFP4 quantization schemes, each paired with one or more activation types (FP16, BF16, INT8, FP8). For every configuration, the generator iterates over the combinations of thread configurations (THREAD_CONFIGS), M-block sizes (THREAD_M_BLOCKS), and group block sizes, and writes separate kernel files for the SM75, SM80, and SM89 targets. The target GPU architectures are passed as a comma-separated command-line argument, which determines whether FP8 and other architecture-specific kernels are generated.
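The combination step can be pictured as a Cartesian product. The following is a minimal sketch, not the script's actual loop: THREAD_CONFIGS and THREAD_M_BLOCKS are copied from the Signature section, while example_config, its group_blocks values, and the helper enumerate_instantiations are hypothetical. The real generator may restrict or skip combinations per configuration.

```python
from itertools import product

# Values copied from the Signature section of this page.
THREAD_CONFIGS = [(128, 128, 256), (64, 256, 256), (64, 128, 128), (128, 64, 128)]
THREAD_M_BLOCKS = [0.5, 1, 2, 3, 4]

# Hypothetical config entry: the group_blocks values are illustrative only.
example_config = {"b_type": "kU4B8", "group_blocks": [-1, 2, 4, 8]}


def enumerate_instantiations(config):
    """Yield one dict per (thread config, m_blocks, group_blocks) combination."""
    for thread_config, m_blocks, group_blocks in product(
        THREAD_CONFIGS, THREAD_M_BLOCKS, config["group_blocks"]
    ):
        yield {
            "b_type": config["b_type"],
            "thread_config": thread_config,
            "m_blocks": m_blocks,
            "group_blocks": group_blocks,
        }


instantiations = list(enumerate_instantiations(example_config))
# 4 thread configs * 5 M-block sizes * 4 group block sizes
print(len(instantiations))
```

Each yielded dict corresponds to one candidate kernel instantiation that a template could render into a .cu file.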

Usage

This script is executed during the vLLM build process, receiving the target CUDA architectures as a command-line argument. It removes previously generated kernel files and writes new ones to the Marlin source directory.
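The cleanup step might look roughly like the sketch below. The helper name remove_generated and the glob patterns are assumptions derived from the output file names listed on this page; the actual remove_old_kernels() may match files differently. The demonstration runs in a temporary directory so no real sources are touched.

```python
import tempfile
from pathlib import Path


def remove_generated(directory: Path) -> list[str]:
    """Delete generated kernel sources and the selector header; return removed names."""
    removed = []
    # Assumed patterns, based on the output names listed on this page.
    for pattern in ("sm*_kernel_*.cu", "kernel_selector.h"):
        for path in sorted(directory.glob(pattern)):
            path.unlink()
            removed.append(path.name)
    return removed


# Demonstration in a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("sm80_kernel_a.cu", "kernel_selector.h", "marlin.cu"):
        (root / name).write_text("// placeholder\n")
    removed = remove_generated(root)
    remaining = sorted(p.name for p in root.iterdir())

print(removed, remaining)
```

Removing stale files before regenerating keeps kernels from a previous architecture selection out of the next build.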

Code Reference

Source Location

Repository: vllm
File: csrc/quantization/marlin/generate_kernels.py
Lines: 1-307

Signature

THREAD_CONFIGS = [
    (128, 128, 256), (64, 256, 256),
    (64, 128, 128), (128, 64, 128)
]

THREAD_M_BLOCKS = [0.5, 1, 2, 3, 4]

QUANT_CONFIGS = [
    {"b_type": "kU4", ...},      # AWQ-INT4
    {"b_type": "kU4B8", ...},    # GPTQ-INT4
    {"b_type": "kU8B128", ...},  # GPTQ-INT8
    {"b_type": "kFE4M3fn", ...}, # FP8
    {"b_type": "kFE2M1f", ...},  # NVFP4
    # ... additional configs for mixed activation types
]

def remove_old_kernels() -> None: ...
def generate_new_kernels() -> None: ...

Import

# This is a build-time code generator script; it is not imported at runtime.
# It is executed via:
python csrc/quantization/marlin/generate_kernels.py "8.0,8.9"

I/O Contract

Inputs

Name	Type	Required	Description
sys.argv[1]	str	Yes	Comma-separated list of target CUDA compute capabilities (e.g., "8.0,8.9,9.0")
QUANT_CONFIGS	list[dict]	Yes	Built-in list of quantization configurations defining b_type, thread configs, m_blocks, and group_blocks
TEMPLATE	str (Jinja2)	Yes	Jinja2 template string for Marlin kernel instantiation
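As an illustration of how the sys.argv[1] value could be interpreted, the sketch below converts the comma-separated string into integer SM numbers. The function parse_archs and the two flags are hypothetical, and the thresholds used to decide on FP8 and SM75 output are assumptions, not rules confirmed by the source.

```python
def parse_archs(arg: str) -> list[int]:
    """Turn "8.0,8.9" into [80, 89]; blank entries are ignored."""
    archs = []
    for item in arg.split(","):
        item = item.strip()
        if not item:
            continue
        major, _, minor = item.partition(".")
        archs.append(int(major) * 10 + int(minor or 0))
    return archs


archs = parse_archs("8.0,8.9")
# Assumption: FP8 kernels are generated only if some target is SM89 or newer.
wants_fp8 = any(arch >= 89 for arch in archs)
# Assumption: the 2-stage SM75 files are needed only when Turing is targeted.
wants_sm75 = 75 in archs
print(archs, wants_fp8, wants_sm75)
```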

Outputs

Name	Type	Description
sm80_kernel_*.cu	CUDA source files	Generated kernel instantiation files for SM80+ architectures
sm75_kernel_*.cu	CUDA source files	Generated kernel instantiation files for SM75 (Turing) with 2-stage pipeline
sm89_kernel_*.cu	CUDA source files	Generated kernel instantiation files for SM89 (FP8-capable) architectures
kernel_selector.h	C++ header	Generated dispatch header with if/else chains mapping runtime parameters to kernel templates
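To make the shape of the dispatch header concrete, here is a small sketch that renders an if / else-if chain from a list of entries. The parameter names (thread_m_blocks, group_blocks), the kernel names, and the helper build_selector are placeholders; the real kernel_selector.h may use different conditions and syntax.

```python
def build_selector(entries):
    """Render an if / else-if chain, one branch per generated kernel."""
    lines = []
    for index, entry in enumerate(entries):
        keyword = "if" if index == 0 else "else if"
        condition = " && ".join(f"{k} == {v}" for k, v in entry["match"].items())
        lines.append(f"{keyword} ({condition})")
        lines.append(f"  kernel = {entry['kernel']};")
    return "\n".join(lines)


# Hypothetical entries: parameter and kernel names are placeholders.
entries = [
    {"match": {"thread_m_blocks": 1, "group_blocks": 4}, "kernel": "kernel_a"},
    {"match": {"thread_m_blocks": 2, "group_blocks": 4}, "kernel": "kernel_b"},
]
selector_text = build_selector(entries)
print(selector_text)
```

Generating the chain from the same list that drives the .cu files keeps the dispatch header consistent with the kernels that were actually instantiated.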

Usage Examples

# Build-time invocation with target architectures.
# Generates kernels for Ampere (8.0) and Ada Lovelace (8.9).
import subprocess
import sys

subprocess.run(
    [
        sys.executable,  # run with the same interpreter as the build
        "csrc/quantization/marlin/generate_kernels.py",
        "8.0,8.9",
    ],
    check=True,  # raise CalledProcessError if generation fails
)

# The script generates files like:
#   sm80_kernel_float16_u4_float16.cu
#   sm80_kernel_float16_u4b8_float16.cu
#   sm89_kernel_fe4m3fn_u4b8_float16.cu
#   kernel_selector.h

Related Pages

Environment:Vllm_project_Vllm_CUDA_GPU_Runtime
