Implementation:Vllm project Vllm Scaled MM Epilogues C2X

Knowledge Sources	vllm
Domains	CUTLASS, Epilogue, Quantization, GEMM
Last Updated	2026-02-08 00:00 GMT

Overview

Defines complete CUTLASS 2.x epilogue configurations for scaled and quantized matrix multiplications on Ampere (SM80) GPUs, supporting symmetric and asymmetric quantization with optional bias and activation zero-points.

Description

This header composes broadcast load visitors from broadcast_load_epilogue_c2x.hpp with CUTLASS epilogue compute nodes (Sm80EVT) to implement dequantization formulas directly within the GEMM epilogue, avoiding separate post-processing kernels. It supports four epilogue variants: basic scaled (ScaledEpilogue), scaled with bias (ScaledEpilogueBias), scaled with bias and per-tensor activation zero-point (ScaledEpilogueBiasAzp), and scaled with bias and per-token activation zero-point (ScaledEpilogueBiasAzpToken). Each class exposes an EVTCompute type and a static prepare_args method for constructing arguments from PyTorch tensors.

Usage

This header is included during compilation of CUTLASS-based quantized GEMM kernels for Ampere GPUs. It is used by the vLLM cutlass_scaled_mm and cutlass_scaled_mm_azp operations when running on SM80 hardware.

Code Reference

Source Location

Repository: vllm
File: csrc/cutlass_extensions/epilogue/scaled_mm_epilogues_c2x.hpp
Lines: 1-321

Signature

namespace vllm::c2x {

template <typename ElementD, typename OutputTileThreadMap>
struct ScaledEpilogueBase { /* common load descriptors */ };

template <typename ElementD, typename OutputTileThreadMap>
struct ScaledEpilogue : private ScaledEpilogueBase<ElementD, OutputTileThreadMap> {
  using EVTCompute = Sm80EVT<Compute1, ScaleA, EVTCompute0>;
  static ArgumentType prepare_args(torch::Tensor const& a_scales,
                                   torch::Tensor const& b_scales);
};

template <typename ElementD, typename OutputTileThreadMap>
struct ScaledEpilogueBias : protected ScaledEpilogueBase<ElementD, OutputTileThreadMap> {
  using EVTCompute = Sm80EVT<Compute1, ScaleA, EVTCompute0, Bias>;
  static ArgumentType prepare_args(torch::Tensor const& a_scales,
                                   torch::Tensor const& b_scales,
                                   torch::Tensor const& bias);
};

template <typename ElementD, typename OutputTileThreadMap>
struct ScaledEpilogueBiasAzp : protected ScaledEpilogueBase<ElementD, OutputTileThreadMap> {
  using EVTCompute = Sm80EVT<ComputeScaleBiasA, ScaleA, EVTComputeScaleB, Bias>;
  static ArgumentType prepare_args(torch::Tensor const& a_scales,
                                   torch::Tensor const& b_scales,
                                   torch::Tensor const& azp_adj,
                                   std::optional<torch::Tensor> const& bias);
};

template <typename ElementD, typename OutputTileThreadMap>
struct ScaledEpilogueBiasAzpToken : protected ScaledEpilogueBase<ElementD, OutputTileThreadMap> {
  static ArgumentType prepare_args(torch::Tensor const& a_scales,
                                   torch::Tensor const& b_scales,
                                   torch::Tensor const& azp_adj,
                                   torch::Tensor const& azp,
                                   std::optional<torch::Tensor> const& bias);
};

} // namespace vllm::c2x

Import

#include "cutlass_extensions/epilogue/scaled_mm_epilogues_c2x.hpp"

I/O Contract

Inputs

Name	Type	Required	Description
a_scales	torch::Tensor	Yes	Quantization scales for operand A; scalar (per-tensor) or column vector (per-row)
b_scales	torch::Tensor	Yes	Quantization scales for operand B; scalar (per-tensor) or row vector (per-column)
bias	torch::Tensor / optional	No	Per-output-channel bias tensor (row vector of shape 1xN)
azp_adj	torch::Tensor	No	Activation zero-point adjustment term of shape (1,N), computed as azp * J @ B
azp	torch::Tensor	No	Per-token activation zero-points of shape (M,1), used with AzpToken variant
ElementD	template param	Yes	Output element type (e.g., half, bfloat16)
OutputTileThreadMap	template param	Yes	CUTLASS thread map defining the output tile partitioning

Outputs

Name	Type	Description
EVTCompute::Arguments	ArgumentType	Fully constructed epilogue arguments struct ready for kernel launch

Usage Examples

// ScaledEpilogue: D = (a_scales * A) @ (b_scales * B)
using Epilogue = vllm::c2x::ScaledEpilogue<cutlass::half_t, OutputTileThreadMap>;
auto args = Epilogue::prepare_args(a_scales_tensor, b_scales_tensor);

// ScaledEpilogueBias: D = (a_scales * A) @ (b_scales * B) + bias
using EpilogueBias = vllm::c2x::ScaledEpilogueBias<cutlass::half_t, OutputTileThreadMap>;
auto bias_args = EpilogueBias::prepare_args(a_scales, b_scales, bias);

// ScaledEpilogueBiasAzp: asymmetric quantization with per-tensor zero-point
using EpilogueAzp = vllm::c2x::ScaledEpilogueBiasAzp<cutlass::half_t, OutputTileThreadMap>;
auto azp_args = EpilogueAzp::prepare_args(a_scales, b_scales, azp_adj, bias);

Related Pages

Environment:Vllm_project_Vllm_CUDA_GPU_Runtime

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment