Overview
FasterTransformer (FT) quantization configuration and quantized linear layer implementation for INT4/INT8 weight quantization using CUTLASS kernels in MLC LLM.
Description
The ft_quantization module implements the FasterTransformer-style weight quantization pipeline, which packs INT4 or INT8 quantized weights into a format compatible with NVIDIA CUTLASS kernels for efficient GPU inference.
Core Components:
- FTQuantize (dataclass): The main quantization configuration. Supports INT4 and INT8 quantize dtypes with INT8 storage, float16 model dtype, and optional group sizes of 64 or 128. Key features:
  - quantize_model: Walks the model graph using nn.Mutator, replacing nn.Linear layers with FTQuantizeLinear. Layers that cannot use FT quantization (final FC layers, float32 output layers, or layers with incompatible dimensions) fall back to GroupQuantize with group size 32.
  - quantize_weight: Performs the actual weight quantization on CUDA. It creates a TVM Relax function that computes per-group scales, quantizes and packs weights, then calls cutlass.ft_preprocess_weight to reorder weights for the CUTLASS kernel layout. The compiled quantization functions are cached per weight shape.
  - _quantize: The underlying tensor expression that computes max absolute values per group, derives scales, quantizes weights to the target integer type, and bit-packs multiple quantized elements into a single storage element.
  - fallback_group_quantize: Creates a GroupQuantize configuration for layers that cannot use FT quantization.
- FTQuantizeLinear: The quantized linear module that stores packed weights (q_weight) in transposed layout [k, ceildiv(n, num_elem_per_storage)] and per-group scales (q_scale) in shape [ceildiv(k, group_size), n]. The forward pass delegates to faster_transformer_dequantize_gemm which calls the CUTLASS kernel.
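The per-shape caching described for quantize_weight can be pictured as a dictionary keyed by weight shape. The following is a minimal sketch of that idea only, not the module's code: `build_quantize_fn`, `get_quantize_fn`, and `build_count` are hypothetical names, and the expensive TVM build step is replaced by a plain closure so the example runs without TVM or CUDA.

```python
from typing import Callable, Dict, List, Tuple

Shape = Tuple[int, int]

# Hypothetical cache: one "compiled" quantization function per weight shape.
_cache: Dict[Shape, Callable[[List[List[float]]], Shape]] = {}
build_count = 0  # counts how often the expensive build step runs


def build_quantize_fn(shape: Shape) -> Callable[[List[List[float]]], Shape]:
    """Stand-in for the expensive build step (assumption: no TVM involved here)."""
    global build_count
    build_count += 1

    def quantize(weight: List[List[float]]) -> Shape:
        # A real function would quantize; this stub only checks the shape.
        assert (len(weight), len(weight[0])) == shape
        return shape

    return quantize


def get_quantize_fn(shape: Shape) -> Callable[[List[List[float]]], Shape]:
    """Return the cached function for `shape`, building it on first use."""
    if shape not in _cache:
        _cache[shape] = build_quantize_fn(shape)
    return _cache[shape]


f1 = get_quantize_fn((4, 8))
f2 = get_quantize_fn((4, 8))  # cache hit: same object, no rebuild
f3 = get_quantize_fn((8, 8))  # new shape: one more build
```

With this pattern, layers that share a weight shape pay the build cost once.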
Quantization Flow:
- Per-group max absolute values are computed over the weight tensor
- Scale factors are derived as max_abs / max_int_value
- Weights are scaled, rounded, and clamped to the integer range
- Multiple quantized values are bit-packed into storage elements
- CUTLASS preprocessing reorders the packed weights for kernel compatibility
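The first four steps of this flow can be sketched in plain Python. This is an illustration under stated assumptions, not the TVM tensor expression used by the module: the helper name `quantize_and_pack` is hypothetical, the clamp range [-max_int - 1, max_int], the low-bits-first packing order, and the guard for all-zero groups are assumptions of the sketch, Python's `round` (ties to even) stands in for the kernel's rounding, and the CUTLASS reordering step is omitted.

```python
from typing import List, Tuple


def ceildiv(a: int, b: int) -> int:
    return -(-a // b)


def quantize_and_pack(
    weight: List[List[float]],  # shape [n, k]
    group_size: int,
    bits: int = 4,
    storage_bits: int = 8,
) -> Tuple[List[List[int]], List[List[float]]]:
    n, k = len(weight), len(weight[0])
    max_int = (1 << (bits - 1)) - 1    # 7 for int4, 127 for int8
    per_storage = storage_bits // bits  # 2 for int4, 1 for int8
    num_groups = ceildiv(k, group_size)

    # Steps 1-2: per-group max |w| along k, then scale = max_abs / max_int.
    q_scale = [[0.0] * n for _ in range(num_groups)]
    for g in range(num_groups):
        for i in range(n):
            group = weight[i][g * group_size:(g + 1) * group_size]
            max_abs = max(abs(w) for w in group)
            # Guard for all-zero groups (an addition of this sketch).
            q_scale[g][i] = max_abs / max_int if max_abs > 0 else 1.0

    # Step 3: scale, round, and clamp one element.
    def quant(i: int, j: int) -> int:
        q = round(weight[i][j] / q_scale[j // group_size][i])
        return max(-max_int - 1, min(max_int, q))

    # Step 4: pack along n into the transposed layout [k, ceildiv(n, per_storage)].
    mask = (1 << bits) - 1
    cols = ceildiv(n, per_storage)
    q_weight = [[0] * cols for _ in range(k)]
    for j in range(k):
        for c in range(cols):
            packed = 0
            for r in range(per_storage):
                i = c * per_storage + r
                if i < n:
                    packed |= (quant(i, j) & mask) << (r * bits)
            # Reinterpret the packed bits as a signed storage value.
            if packed >= 1 << (storage_bits - 1):
                packed -= 1 << storage_bits
            q_weight[j][c] = packed
    return q_weight, q_scale


q_weight, q_scale = quantize_and_pack([[7.0, -3.0], [-3.5, 1.0]], group_size=2)
```

For this 2x2 input with a single group, each output row gets one scale and the two INT4 values of a column share one INT8 storage element.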
Usage
Use this module to deploy INT4 or INT8 quantized models with CUTLASS-based inference. FT quantization requires CUTLASS to be enabled in the TVM runtime. It suits deployments where the CUTLASS kernels are preferred over other quantization approaches. Layers that do not meet the FT quantization constraints automatically fall back to group quantization.
Code Reference
Source Location
Signature
@dataclass
class FTQuantize:
    name: str
    kind: str
    quantize_dtype: Literal["int4", "int8"]
    storage_dtype: Literal["int8"]
    model_dtype: Literal["float16"]
    group_size: Optional[int] = None  # None, 64, or 128

    def quantize_model(self, model: nn.Module, quant_map: QuantizeMapping, name_prefix: str) -> nn.Module
    def quantize_weight(self, weight: Tensor) -> List[Tensor]
    def fallback_group_quantize(self) -> GroupQuantize

class FTQuantizeLinear(nn.Module):
    def __init__(self, in_features, out_features, config: FTQuantize, bias=True, out_dtype=None)
    @staticmethod
    def from_linear(src: nn.Linear, config: FTQuantize) -> "FTQuantizeLinear"
    def forward(self, x: nn.Tensor) -> nn.Tensor
Import
from mlc_llm.quantization.ft_quantization import FTQuantize, FTQuantizeLinear
I/O Contract
FTQuantize Configuration
| Field | Type | Constraints | Description |
|---|---|---|---|
| name | str | -- | Configuration name |
| kind | str | Must be "ft-quant" | Quantization kind identifier |
| quantize_dtype | str | "int4" or "int8" | Quantization precision |
| storage_dtype | str | "int8" | Storage dtype for packed weights |
| model_dtype | str | "float16" only | Model computation dtype |
| group_size | Optional[int] | None, 64, or 128 | Per-channel group size (None = full channel) |
FTQuantizeLinear Parameters
| Parameter | Shape | Dtype | Description |
|---|---|---|---|
| q_weight | [in_features, ceildiv(out_features, num_elem_per_storage)] | int8 | Bit-packed quantized weights (transposed layout) |
| q_scale | [ceildiv(in_features, group_size), out_features] | float16 | Per-group scale factors |
| bias | [out_features] | float16 or out_dtype | Optional bias parameter |
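The shape formulas above can be checked with a few lines of arithmetic. This is a minimal sketch: `ft_param_shapes` is a hypothetical helper, it assumes an integer group_size (the None case is not covered), and it takes num_elem_per_storage to be the storage width divided by the quantized width.

```python
from typing import Tuple


def ceildiv(a: int, b: int) -> int:
    return -(-a // b)


def ft_param_shapes(
    in_features: int,
    out_features: int,
    quantize_dtype: str,  # "int4" or "int8"
    group_size: int,      # assumed to be an integer here (64 or 128)
    storage_bits: int = 8,
) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    bits = {"int4": 4, "int8": 8}[quantize_dtype]
    # Assumption: number of quantized elements per storage element.
    num_elem_per_storage = storage_bits // bits
    q_weight = (in_features, ceildiv(out_features, num_elem_per_storage))
    q_scale = (ceildiv(in_features, group_size), out_features)
    return q_weight, q_scale


q_weight_shape, q_scale_shape = ft_param_shapes(4096, 11008, "int4", 128)
```

For a 4096-to-11008 projection in INT4 with group size 128, two output channels share each INT8 storage element, and each output channel has 32 scale groups.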
FTQuantizeLinear.forward
| Parameter | Type | Description |
|---|---|---|
| x | nn.Tensor | Input activation tensor, dtype float16 |

| Return | Type | Description |
|---|---|---|
| result | nn.Tensor | Output tensor after quantized GEMM + optional bias |
quantize_weight
| Parameter | Type | Description |
|---|---|---|
| weight | Tensor | Original float16 weight tensor of shape [n, k] |

| Return | Type | Description |
|---|---|---|
| [q_weight, q_scale] | List[Tensor] | Packed quantized weight and per-group scales |
Fallback Conditions
Layers fall back to GroupQuantize when any of the following conditions are met:
| Condition | Reason |
|---|---|
| Final FC layer (lm_head) | See issue #1723; direct quantization degrades performance |
| out_dtype == "float32" | Incompatible with FT quantization output format |
| INT4 and out_features % 8 != 0 | CUTLASS alignment requirement |
| INT8 and out_features % 4 != 0 | CUTLASS alignment requirement |
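The conditions in this table combine into a single predicate. The sketch below restates the table only. `needs_group_quantize_fallback` and its parameters are hypothetical names, and the check for whether a layer is the final FC layer is passed in as a flag rather than derived from the layer name.

```python
from typing import Optional


def needs_group_quantize_fallback(
    is_final_fc: bool,
    out_dtype: Optional[str],
    quantize_dtype: str,  # "int4" or "int8"
    out_features: int,
) -> bool:
    """Return True when a linear layer should use GroupQuantize instead of FT."""
    if is_final_fc:
        return True  # final FC layer (lm_head)
    if out_dtype == "float32":
        return True  # float32 output is not supported by the FT path
    if quantize_dtype == "int4" and out_features % 8 != 0:
        return True  # CUTLASS alignment requirement for INT4
    if quantize_dtype == "int8" and out_features % 4 != 0:
        return True  # CUTLASS alignment requirement for INT8
    return False


aligned = needs_group_quantize_fallback(False, None, "int4", 4096)
misaligned = needs_group_quantize_fallback(False, None, "int4", 4100)
```

Note that 4100 output features fails the INT4 check (4100 % 8 == 4) but would pass the INT8 check (4100 % 4 == 0).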
Usage Examples
from mlc_llm.quantization.ft_quantization import FTQuantize

# Define FT quantization config for INT4
ft_config = FTQuantize(
    name="ft_int4",
    kind="ft-quant",
    quantize_dtype="int4",
    storage_dtype="int8",
    model_dtype="float16",
    group_size=128,
)

# Quantize the model.
# `model` (an nn.Module) and `quant_map` (a QuantizeMapping) are assumed to exist already.
quantized_model = ft_config.quantize_model(model, quant_map, name_prefix="model")

# quantize_weight is used during weight conversion and requires CUTLASS
# to be available in the TVM runtime. `original_weight` is assumed to be
# a float16 weight tensor of shape [n, k].
q_weight, q_scale = ft_config.quantize_weight(original_weight)