Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Vllm project Vllm Scaled MM Epilogues C2X

From Leeroopedia


Knowledge Sources
Domains CUTLASS, Epilogue, Quantization, GEMM
Last Updated 2026-02-08 00:00 GMT

Overview

Defines complete CUTLASS 2.x epilogue configurations for scaled and quantized matrix multiplications on Ampere (SM80) GPUs, supporting symmetric and asymmetric quantization with optional bias and activation zero-points.

Description

This header composes broadcast load visitors from broadcast_load_epilogue_c2x.hpp with CUTLASS epilogue compute nodes (Sm80EVT) to implement dequantization formulas directly within the GEMM epilogue, avoiding separate post-processing kernels. It supports four epilogue variants: basic scaled (ScaledEpilogue), scaled with bias (ScaledEpilogueBias), scaled with bias and per-tensor activation zero-point (ScaledEpilogueBiasAzp), and scaled with bias and per-token activation zero-point (ScaledEpilogueBiasAzpToken). Each class exposes an EVTCompute type and a static prepare_args method for constructing arguments from PyTorch tensors.

Usage

This header is included during compilation of CUTLASS-based quantized GEMM kernels for Ampere GPUs. It is used by the vLLM cutlass_scaled_mm and cutlass_scaled_mm_azp operations when running on SM80 hardware.

Code Reference

Source Location

Signature

namespace vllm::c2x {

template <typename ElementD, typename OutputTileThreadMap>
struct ScaledEpilogueBase { /* common load descriptors */ };

template <typename ElementD, typename OutputTileThreadMap>
struct ScaledEpilogue : private ScaledEpilogueBase<ElementD, OutputTileThreadMap> {
  using EVTCompute = Sm80EVT<Compute1, ScaleA, EVTCompute0>;
  static ArgumentType prepare_args(torch::Tensor const& a_scales,
                                   torch::Tensor const& b_scales);
};

template <typename ElementD, typename OutputTileThreadMap>
struct ScaledEpilogueBias : protected ScaledEpilogueBase<ElementD, OutputTileThreadMap> {
  using EVTCompute = Sm80EVT<Compute1, ScaleA, EVTCompute0, Bias>;
  static ArgumentType prepare_args(torch::Tensor const& a_scales,
                                   torch::Tensor const& b_scales,
                                   torch::Tensor const& bias);
};

template <typename ElementD, typename OutputTileThreadMap>
struct ScaledEpilogueBiasAzp : protected ScaledEpilogueBase<ElementD, OutputTileThreadMap> {
  using EVTCompute = Sm80EVT<ComputeScaleBiasA, ScaleA, EVTComputeScaleB, Bias>;
  static ArgumentType prepare_args(torch::Tensor const& a_scales,
                                   torch::Tensor const& b_scales,
                                   torch::Tensor const& azp_adj,
                                   std::optional<torch::Tensor> const& bias);
};

template <typename ElementD, typename OutputTileThreadMap>
struct ScaledEpilogueBiasAzpToken : protected ScaledEpilogueBase<ElementD, OutputTileThreadMap> {
  static ArgumentType prepare_args(torch::Tensor const& a_scales,
                                   torch::Tensor const& b_scales,
                                   torch::Tensor const& azp_adj,
                                   torch::Tensor const& azp,
                                   std::optional<torch::Tensor> const& bias);
};

} // namespace vllm::c2x

Import

#include "cutlass_extensions/epilogue/scaled_mm_epilogues_c2x.hpp"

I/O Contract

Inputs

Name Type Required Description
a_scales torch::Tensor Yes Quantization scales for operand A; scalar (per-tensor) or column vector (per-row)
b_scales torch::Tensor Yes Quantization scales for operand B; scalar (per-tensor) or row vector (per-column)
bias torch::Tensor / optional No Per-output-channel bias tensor (row vector of shape 1xN)
azp_adj torch::Tensor No Activation zero-point adjustment term of shape (1,N), computed as azp * J @ B
azp torch::Tensor No Per-token activation zero-points of shape (M,1), used with AzpToken variant
ElementD template param Yes Output element type (e.g., half, bfloat16)
OutputTileThreadMap template param Yes CUTLASS thread map defining the output tile partitioning

Outputs

Name Type Description
EVTCompute::Arguments ArgumentType Fully constructed epilogue arguments struct ready for kernel launch

Usage Examples

// ScaledEpilogue: D = (a_scales * A) @ (b_scales * B)
using Epilogue = vllm::c2x::ScaledEpilogue<cutlass::half_t, OutputTileThreadMap>;
auto args = Epilogue::prepare_args(a_scales_tensor, b_scales_tensor);

// ScaledEpilogueBias: D = (a_scales * A) @ (b_scales * B) + bias
using EpilogueBias = vllm::c2x::ScaledEpilogueBias<cutlass::half_t, OutputTileThreadMap>;
auto bias_args = EpilogueBias::prepare_args(a_scales, b_scales, bias);

// ScaledEpilogueBiasAzp: asymmetric quantization with per-tensor zero-point
using EpilogueAzp = vllm::c2x::ScaledEpilogueBiasAzp<cutlass::half_t, OutputTileThreadMap>;
auto azp_args = EpilogueAzp::prepare_args(a_scales, b_scales, azp_adj, bias);

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment