Implementation:Vllm project Vllm Scaled MM Epilogues C2X
| Knowledge Sources | |
|---|---|
| Domains | CUTLASS, Epilogue, Quantization, GEMM |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Defines complete CUTLASS 2.x epilogue configurations for scaled and quantized matrix multiplications on Ampere (SM80) GPUs, supporting symmetric and asymmetric quantization with optional bias and activation zero-points.
Description
This header composes broadcast load visitors from broadcast_load_epilogue_c2x.hpp with CUTLASS epilogue compute nodes (Sm80EVT) to implement dequantization formulas directly within the GEMM epilogue, avoiding separate post-processing kernels. It supports four epilogue variants: basic scaled (ScaledEpilogue), scaled with bias (ScaledEpilogueBias), scaled with bias and per-tensor activation zero-point (ScaledEpilogueBiasAzp), and scaled with bias and per-token activation zero-point (ScaledEpilogueBiasAzpToken). Each class exposes an EVTCompute type and a static prepare_args method for constructing arguments from PyTorch tensors.
Usage
This header is included during compilation of CUTLASS-based quantized GEMM kernels for Ampere GPUs. It is used by the vLLM cutlass_scaled_mm and cutlass_scaled_mm_azp operations when running on SM80 hardware.
Code Reference
Source Location
- Repository: vllm
- File: csrc/cutlass_extensions/epilogue/scaled_mm_epilogues_c2x.hpp
- Lines: 1-321
Signature
namespace vllm::c2x {
template <typename ElementD, typename OutputTileThreadMap>
struct ScaledEpilogueBase { /* common load descriptors */ };
template <typename ElementD, typename OutputTileThreadMap>
struct ScaledEpilogue : private ScaledEpilogueBase<ElementD, OutputTileThreadMap> {
using EVTCompute = Sm80EVT<Compute1, ScaleA, EVTCompute0>;
static ArgumentType prepare_args(torch::Tensor const& a_scales,
torch::Tensor const& b_scales);
};
template <typename ElementD, typename OutputTileThreadMap>
struct ScaledEpilogueBias : protected ScaledEpilogueBase<ElementD, OutputTileThreadMap> {
using EVTCompute = Sm80EVT<Compute1, ScaleA, EVTCompute0, Bias>;
static ArgumentType prepare_args(torch::Tensor const& a_scales,
torch::Tensor const& b_scales,
torch::Tensor const& bias);
};
template <typename ElementD, typename OutputTileThreadMap>
struct ScaledEpilogueBiasAzp : protected ScaledEpilogueBase<ElementD, OutputTileThreadMap> {
using EVTCompute = Sm80EVT<ComputeScaleBiasA, ScaleA, EVTComputeScaleB, Bias>;
static ArgumentType prepare_args(torch::Tensor const& a_scales,
torch::Tensor const& b_scales,
torch::Tensor const& azp_adj,
std::optional<torch::Tensor> const& bias);
};
template <typename ElementD, typename OutputTileThreadMap>
struct ScaledEpilogueBiasAzpToken : protected ScaledEpilogueBase<ElementD, OutputTileThreadMap> {
static ArgumentType prepare_args(torch::Tensor const& a_scales,
torch::Tensor const& b_scales,
torch::Tensor const& azp_adj,
torch::Tensor const& azp,
std::optional<torch::Tensor> const& bias);
};
} // namespace vllm::c2x
Import
#include "cutlass_extensions/epilogue/scaled_mm_epilogues_c2x.hpp"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| a_scales | torch::Tensor | Yes | Quantization scales for operand A; scalar (per-tensor) or column vector (per-row) |
| b_scales | torch::Tensor | Yes | Quantization scales for operand B; scalar (per-tensor) or row vector (per-column) |
| bias | torch::Tensor / optional | No | Per-output-channel bias tensor (row vector of shape 1xN) |
| azp_adj | torch::Tensor | No | Activation zero-point adjustment term of shape (1,N), computed as azp * J @ B |
| azp | torch::Tensor | No | Per-token activation zero-points of shape (M,1), used with AzpToken variant |
| ElementD | template param | Yes | Output element type (e.g., half, bfloat16) |
| OutputTileThreadMap | template param | Yes | CUTLASS thread map defining the output tile partitioning |
Outputs
| Name | Type | Description |
|---|---|---|
| EVTCompute::Arguments | ArgumentType | Fully constructed epilogue arguments struct ready for kernel launch |
Usage Examples
// ScaledEpilogue: D = (a_scales * A) @ (b_scales * B)
using Epilogue = vllm::c2x::ScaledEpilogue<cutlass::half_t, OutputTileThreadMap>;
auto args = Epilogue::prepare_args(a_scales_tensor, b_scales_tensor);
// ScaledEpilogueBias: D = (a_scales * A) @ (b_scales * B) + bias
using EpilogueBias = vllm::c2x::ScaledEpilogueBias<cutlass::half_t, OutputTileThreadMap>;
auto bias_args = EpilogueBias::prepare_args(a_scales, b_scales, bias);
// ScaledEpilogueBiasAzp: asymmetric quantization with per-tensor zero-point
using EpilogueAzp = vllm::c2x::ScaledEpilogueBiasAzp<cutlass::half_t, OutputTileThreadMap>;
auto azp_args = EpilogueAzp::prepare_args(a_scales, b_scales, azp_adj, bias);