Implementation:Mlc ai Mlc llm FT Quantization

Knowledge Sources	Mlc_ai_Mlc_llm
Domains	Quantization, FasterTransformer, CUTLASS, Linear Layers
Last Updated	2026-02-09 19:00 GMT

Overview

FasterTransformer (FT) quantization configuration and quantized linear layer implementation for INT4/INT8 weight quantization using CUTLASS kernels in MLC LLM.

Description

The ft_quantization module implements the FasterTransformer-style weight quantization pipeline, which packs INT4 or INT8 quantized weights into a format compatible with NVIDIA CUTLASS kernels for efficient GPU inference.

Core Components:

FTQuantize (dataclass): The main quantization configuration. Supports INT4 and INT8 quantize dtypes with INT8 storage, float16 model dtype, and optional group sizes of 64 or 128. Key features:
- quantize_model: Walks the model graph using nn.Mutator, replacing nn.Linear layers with FTQuantizeLinear. Layers that cannot use FT quantization (final FC layers, float32 output layers, or layers with incompatible dimensions) fall back to GroupQuantize with group size 32.
- quantize_weight: Performs the actual weight quantization on CUDA. It creates a TVM Relax function that computes per-group scales, quantizes and packs weights, then calls cutlass.ft_preprocess_weight to reorder weights for the CUTLASS kernel layout. The compiled quantization functions are cached per weight shape.
- _quantize: The underlying tensor expression that computes max absolute values per group, derives scales, quantizes weights to the target integer type, and bit-packs multiple quantized elements into a single storage element.
- fallback_group_quantize: Creates a GroupQuantize configuration for layers that cannot use FT quantization.
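
The per-shape caching mentioned for quantize_weight can be pictured with a small, library-free sketch. ShapeKeyedCache and the build callback are hypothetical names used only for illustration, not part of MLC LLM. In the real module the cached object would be the compiled TVM quantization function.

```python
from typing import Any, Callable, Dict, Tuple


class ShapeKeyedCache:
    """Hypothetical helper: build an expensive object once per (shape, dtype)."""

    def __init__(self, build: Callable[[Tuple[int, ...], str], Any]):
        self._build = build
        self._cache: Dict[Tuple[Tuple[int, ...], str], Any] = {}
        self.builds = 0  # number of times the build callback actually ran

    def get(self, shape, dtype: str) -> Any:
        key = (tuple(shape), dtype)
        if key not in self._cache:
            # Cache miss: compile (here: call the stand-in callback) and remember it.
            self._cache[key] = self._build(key[0], dtype)
            self.builds += 1
        return self._cache[key]


# Stand-in for "compile a quantization function for this weight shape".
cache = ShapeKeyedCache(lambda shape, dtype: f"quantize_fn<{shape}, {dtype}>")
fn_a = cache.get((4096, 4096), "float16")
fn_b = cache.get((4096, 4096), "float16")  # reused, no second build
fn_c = cache.get((11008, 4096), "float16")  # new shape, built once
```

Many layers in a transformer share the same weight shape, so keying on shape keeps the number of compilations small.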

FTQuantizeLinear: The quantized linear module that stores packed weights (q_weight) in transposed layout [k, ceildiv(n, num_elem_per_storage)] and per-group scales (q_scale) in shape [ceildiv(k, group_size), n]. The forward pass delegates to faster_transformer_dequantize_gemm which calls the CUTLASS kernel.
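
As a quick check of the shapes above, the following sketch computes the expected q_weight and q_scale shapes for a layer. ft_param_shapes is a hypothetical helper, and it assumes that num_elem_per_storage equals 8 divided by the quantized bit width (int8 storage) and that a group_size of None means one group spanning all of in_features.

```python
def ceildiv(a: int, b: int) -> int:
    """Integer division rounded up."""
    return -(-a // b)


def ft_param_shapes(in_features, out_features, bits, group_size=None):
    """Return (q_weight_shape, q_scale_shape) for an FT-quantized linear layer.

    Assumptions: int8 storage, so 8 // bits values share one storage element;
    group_size=None is treated as a single group over in_features.
    """
    num_elem_per_storage = 8 // bits
    g = in_features if group_size is None else group_size
    q_weight_shape = (in_features, ceildiv(out_features, num_elem_per_storage))
    q_scale_shape = (ceildiv(in_features, g), out_features)
    return q_weight_shape, q_scale_shape


# INT4 with group size 128 on a 4096 -> 11008 projection
int4_shapes = ft_param_shapes(4096, 11008, bits=4, group_size=128)
# INT8 with no grouping on the same projection
int8_shapes = ft_param_shapes(4096, 11008, bits=8)
```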

Quantization Flow:

1. Per-group max absolute values are computed over the weight tensor
2. Scale factors are derived as max_abs / max_int_value
3. Weights are scaled, rounded, and clamped to the integer range
4. Multiple quantized values are bit-packed into storage elements
5. CUTLASS preprocessing reorders the packed weights for kernel compatibility
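
The first four steps of this flow can be sketched in NumPy. This is an illustration under stated assumptions, not the module's tensor expression: quantize_sketch is a hypothetical name, max_int_value is taken to be 2^(bits-1) - 1, the low bits hold the first packed element, zero scales are replaced by 1 to avoid division by zero, and the dimensions are assumed to divide evenly. The final CUTLASS reordering step is omitted.

```python
import numpy as np


def quantize_sketch(weight, bits=4, group_size=None):
    """Quantize a float16 weight of shape [n, k]; return (packed, scale).

    packed: int8, shape [k, n // (8 // bits)]  (transposed, bit-packed)
    scale:  float16, shape [k // g, n]
    """
    n, k = weight.shape
    g = k if group_size is None else group_size
    per = 8 // bits  # quantized values per int8 storage element
    assert k % g == 0 and n % per == 0  # simplification for this sketch
    max_int = 2 ** (bits - 1) - 1  # assumed max_int_value (7 for int4, 127 for int8)
    w = weight.astype(np.float32)

    # Step 1: per-group max absolute value along k, laid out as [k // g, n]
    max_abs = np.abs(w).reshape(n, k // g, g).max(axis=2).T
    # Step 2: scale = max_abs / max_int_value (zero scales guarded, an assumption)
    scale = max_abs / max_int
    scale = np.where(scale == 0, 1.0, scale)
    # Step 3: scale, round, and clamp to the signed integer range
    per_elem_scale = np.repeat(scale.T, g, axis=1)  # broadcast back to [n, k]
    q = np.clip(np.rint(w / per_elem_scale), -max_int - 1, max_int).astype(np.int8)
    # Step 4: bit-pack `per` neighbours along n into one byte, output is [k, n // per]
    qt = (q.T.astype(np.uint8) & ((1 << bits) - 1)).reshape(k, n // per, per)
    packed = np.zeros((k, n // per), dtype=np.uint8)
    for r in range(per):
        packed |= qt[:, :, r] << (bits * r)
    return packed.view(np.int8), scale.astype(np.float16)


# A weight of all 7.0 gives scale 1.0 and the value 7 in every 4-bit slot.
packed, scale = quantize_sketch(np.full((2, 4), 7.0, dtype=np.float16), bits=4)
```

Note that np.rint rounds halves to even, which may differ from the rounding mode used by the actual kernel-side implementation.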

Usage

Use this module to deploy INT4 or INT8 weight-quantized models with CUTLASS-based inference. FT quantization requires CUTLASS to be enabled in the TVM runtime. It is best suited to models where the CUTLASS kernels are the preferred inference path over other quantization approaches. A layer that does not meet the FT quantization constraints automatically falls back to group quantization.

Code Reference

Source Location

Repository: Mlc_ai_Mlc_llm
File: python/mlc_llm/quantization/ft_quantization.py

Signature

@dataclass
class FTQuantize:
    name: str
    kind: str
    quantize_dtype: Literal["int4", "int8"]
    storage_dtype: Literal["int8"]
    model_dtype: Literal["float16"]
    group_size: Optional[int] = None  # None, 64, or 128

    def quantize_model(self, model: nn.Module, quant_map: QuantizeMapping, name_prefix: str) -> nn.Module
    def quantize_weight(self, weight: Tensor) -> List[Tensor]
    def fallback_group_quantize(self) -> GroupQuantize

class FTQuantizeLinear(nn.Module):
    def __init__(self, in_features, out_features, config: FTQuantize, bias=True, out_dtype=None)
    @staticmethod
    def from_linear(src: nn.Linear, config: FTQuantize) -> "FTQuantizeLinear"
    def forward(self, x: nn.Tensor) -> nn.Tensor

Import

from mlc_llm.quantization.ft_quantization import FTQuantize, FTQuantizeLinear

I/O Contract

FTQuantize Configuration

Field	Type	Constraints	Description
name	str	--	Configuration name
kind	str	Must be "ft-quant"	Quantization kind identifier
quantize_dtype	str	"int4" or "int8"	Quantization precision
storage_dtype	str	"int8"	Storage dtype for packed weights
model_dtype	str	"float16" only	Model computation dtype
group_size	Optional[int]	None, 64, or 128	Group size along the input dimension (None = one group spanning the full channel)

FTQuantizeLinear Parameters

Parameter	Shape	Dtype	Description
q_weight	[in_features, ceildiv(out_features, num_elem_per_storage)]	int8	Bit-packed quantized weights (transposed layout)
q_scale	[ceildiv(in_features, group_size), out_features]	float16	Per-group scale factors
bias	[out_features]	float16 or out_dtype	Optional bias parameter

FTQuantizeLinear.forward

Parameter	Type	Description
x	nn.Tensor	Input activation tensor, dtype float16

Return	Type	Description
result	nn.Tensor	Output tensor after quantized GEMM + optional bias
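
To make the arithmetic of the forward pass concrete, here is a NumPy reference for INT4 that unpacks, dequantizes, and multiplies. It is only a mathematical sketch: reference_dequantize_gemm and unpack_int4 are hypothetical names, the low nibble is assumed to hold the first element, and the packed weights are assumed to be in the plain [k, n / 2] layout before any CUTLASS reordering. The real kernel fuses these steps and reads the reordered layout.

```python
import numpy as np


def unpack_int4(packed, n):
    """int8 [k, ceil(n / 2)] -> signed int8 [k, n], low nibble first (assumption)."""
    u = packed.view(np.uint8)
    lo = (u & 0x0F).astype(np.int8)
    hi = (u >> 4).astype(np.int8)
    q = np.stack([lo, hi], axis=-1).reshape(u.shape[0], -1)[:, :n]
    # Sign-extend 4-bit two's complement values (8..15 map to -8..-1).
    return np.where(q >= 8, q - 16, q).astype(np.int8)


def reference_dequantize_gemm(x, packed, scale, group_size, bias=None):
    """x: [m, k] float16, packed: [k, n / 2] int8, scale: [k / g, n] float16."""
    k, n = packed.shape[0], scale.shape[1]
    q = unpack_int4(packed, n).astype(np.float32)
    # Each row of scale covers group_size consecutive rows of the [k, n] weight.
    w = q * np.repeat(scale.astype(np.float32), group_size, axis=0)[:k]
    out = x.astype(np.float32) @ w
    if bias is not None:
        out = out + bias.astype(np.float32)
    return out.astype(np.float16)


# Every 4-bit slot holds 7 and every scale is 1, so each output sums four 7s.
packed = np.full((4, 1), 0x77, dtype=np.uint8).view(np.int8)
scale = np.ones((1, 2), dtype=np.float16)
x = np.ones((1, 4), dtype=np.float16)
out = reference_dequantize_gemm(x, packed, scale, group_size=4)
```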

quantize_weight

Parameter	Type	Description
weight	Tensor	Original float16 weight tensor of shape [n, k]

Return	Type	Description
[q_weight, q_scale]	List[Tensor]	Packed quantized weight and per-group scales

Fallback Conditions

Layers fall back to GroupQuantize when any of the following conditions are met:

Condition	Reason
Final FC layer (lm_head)	See issue #1723; direct quantization degrades performance
out_dtype == "float32"	Incompatible with FT quantization output format
INT4 and out_features % 8 != 0	CUTLASS alignment requirement
INT8 and out_features % 4 != 0	CUTLASS alignment requirement
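
The table can be read as a single predicate. The sketch below restates it in plain Python; needs_group_quantize_fallback and the is_final_fc flag are hypothetical names for illustration, and how the real module detects the final FC layer is not shown here.

```python
def needs_group_quantize_fallback(quantize_dtype, out_features,
                                  out_dtype=None, is_final_fc=False):
    """Return True when a linear layer should use GroupQuantize instead of FT."""
    if is_final_fc:  # final FC layer (lm_head)
        return True
    if out_dtype == "float32":  # float32 output is not supported by the FT path
        return True
    if quantize_dtype == "int4" and out_features % 8 != 0:
        return True  # alignment requirement for INT4
    if quantize_dtype == "int8" and out_features % 4 != 0:
        return True  # alignment requirement for INT8
    return False


# 4100 is divisible by 4 but not by 8, so only the INT4 case falls back.
int4_falls_back = needs_group_quantize_fallback("int4", 4100)
int8_falls_back = needs_group_quantize_fallback("int8", 4100)
```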

Usage Examples

from mlc_llm.quantization.ft_quantization import FTQuantize

# Define FT quantization config for INT4
ft_config = FTQuantize(
    name="ft_int4",
    kind="ft-quant",
    quantize_dtype="int4",
    storage_dtype="int8",
    model_dtype="float16",
    group_size=128,
)

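# Quantize the model. `model` (an nn.Module) and `quant_map` (a QuantizeMapping)
# are assumed to have been created elsewhere.
quantized_model = ft_config.quantize_model(model, quant_map, name_prefix="model")

# quantize_weight is used during weight conversion and requires CUTLASS
# to be available in the TVM runtime. `original_weight` is assumed to be
# a float16 weight tensor of shape [n, k].
q_weight, q_scale = ft_config.quantize_weight(original_weight)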
