

Implementation:Mlc ai Mlc llm FT Quantization

From Leeroopedia


Knowledge Sources
Domains: Quantization, FasterTransformer, CUTLASS, Linear Layers
Last Updated: 2026-02-09 19:00 GMT

Overview

FasterTransformer (FT) quantization configuration and quantized linear layer implementation for INT4/INT8 weight quantization using CUTLASS kernels in MLC LLM.

Description

The ft_quantization module implements the FasterTransformer-style weight quantization pipeline, which packs INT4 or INT8 quantized weights into a format compatible with NVIDIA CUTLASS kernels for efficient GPU inference.

Core Components:

  • FTQuantize (dataclass): The main quantization configuration. Supports INT4 and INT8 quantize dtypes with INT8 storage, float16 model dtype, and an optional group size of 64 or 128. Key methods:
    • quantize_model: Walks the model graph using nn.Mutator, replacing nn.Linear layers with FTQuantizeLinear. Layers that cannot use FT quantization (final FC layers, float32 output layers, or layers with incompatible dimensions) fall back to GroupQuantize with group size 32.
    • quantize_weight: Performs the actual weight quantization on CUDA. It creates a TVM Relax function that computes per-group scales, quantizes and packs weights, then calls cutlass.ft_preprocess_weight to reorder weights for the CUTLASS kernel layout. The compiled quantization functions are cached per weight shape.
    • _quantize: The underlying tensor expression that computes max absolute values per group, derives scales, quantizes weights to the target integer type, and bit-packs multiple quantized elements into a single storage element.
    • fallback_group_quantize: Creates a GroupQuantize configuration for layers that cannot use FT quantization.
  • FTQuantizeLinear: The quantized linear module that stores packed weights (q_weight) in transposed layout [k, ceildiv(n, num_elem_per_storage)] and per-group scales (q_scale) in shape [ceildiv(k, group_size), n]. The forward pass delegates to faster_transformer_dequantize_gemm which calls the CUTLASS kernel.

Quantization Flow:

  1. Per-group max absolute values are computed over the weight tensor
  2. Scale factors are derived as max_abs / max_int_value
  3. Weights are scaled, rounded, and clamped to the integer range
  4. Multiple quantized values are bit-packed into storage elements
  5. CUTLASS preprocessing reorders the packed weights for kernel compatibility
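
Steps 1 to 4 above can be sketched in plain NumPy. This is an illustrative sketch only, not the module's TVM tensor expression: the function name `quantize_and_pack`, the low-bits-first packing order, the 1e-4 floor on the per-group maximum, and NumPy's round-half-to-even behaviour are all assumptions. Step 5 (the reordering done by cutlass.ft_preprocess_weight) is omitted.

```python
import numpy as np

def quantize_and_pack(weight, bits=4, group_size=128):
    """Hypothetical helper. weight: float16 array of shape [n, k].

    Returns (packed, scale) with
      packed: int8    [k, ceildiv(n, elems_per_storage)]
      scale:  float16 [ceildiv(k, group_size), n]
    """
    n, k = weight.shape
    max_int = 2 ** (bits - 1) - 1      # 7 for int4, 127 for int8
    elems = 8 // bits                  # quantized values per int8 storage element
    num_groups = -(-k // group_size)   # ceildiv(k, group_size)

    w = weight.astype(np.float32)
    w_pad = np.pad(w, ((0, 0), (0, num_groups * group_size - k)))

    # Step 1: per-group max absolute value along k -> [n, num_groups]
    max_abs = np.abs(w_pad).reshape(n, num_groups, group_size).max(axis=2)
    max_abs = np.maximum(max_abs, 1e-4)  # assumed floor to avoid division by zero

    # Step 2: scale = max_abs / max_int_value, stored as [num_groups, n]
    scale = (max_abs / max_int).T

    # Step 3: scale, round, clamp; the result is in transposed layout [k, n]
    per_elem_scale = np.repeat(scale, group_size, axis=0)[:k]
    q = np.clip(np.rint(w.T / per_elem_scale), -max_int - 1, max_int).astype(np.int8)

    # Step 4: bit-pack `elems` neighbouring values along n into one byte
    q = np.pad(q, ((0, 0), (0, (-n) % elems)))
    q_u = (q.astype(np.uint8) & ((1 << bits) - 1)).reshape(k, -1, elems)
    packed = np.zeros(q_u.shape[:2], dtype=np.uint8)
    for r in range(elems):
        packed |= q_u[:, :, r] << (bits * r)  # assumed order: low bits first
    return packed.view(np.int8), scale.astype(np.float16)
```

For example, a `[16, 256]` float16 weight with `bits=4` and `group_size=128` yields a packed array of shape `(256, 8)` and a scale array of shape `(2, 16)`, matching the q_weight and q_scale layouts described above.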

Usage

Use this module to deploy INT4 or INT8 weight-quantized models with CUTLASS-based inference. FT quantization requires CUTLASS to be enabled in the TVM runtime. It suits deployments that prefer the CUTLASS kernels over other quantization approaches. Layers that violate the FT quantization constraints automatically fall back to group quantization.

Code Reference

Source Location

Signature

@dataclass
class FTQuantize:
    name: str
    kind: str
    quantize_dtype: Literal["int4", "int8"]
    storage_dtype: Literal["int8"]
    model_dtype: Literal["float16"]
    group_size: Optional[int] = None  # None, 64, or 128

    def quantize_model(self, model: nn.Module, quant_map: QuantizeMapping, name_prefix: str) -> nn.Module
    def quantize_weight(self, weight: Tensor) -> List[Tensor]
    def fallback_group_quantize(self) -> GroupQuantize

class FTQuantizeLinear(nn.Module):
    def __init__(self, in_features, out_features, config: FTQuantize, bias=True, out_dtype=None)
    @staticmethod
    def from_linear(src: nn.Linear, config: FTQuantize) -> "FTQuantizeLinear"
    def forward(self, x: nn.Tensor) -> nn.Tensor

Import

from mlc_llm.quantization.ft_quantization import FTQuantize, FTQuantizeLinear

I/O Contract

FTQuantize Configuration

| Field | Type | Constraints | Description |
|---|---|---|---|
| name | str | -- | Configuration name |
| kind | str | Must be "ft-quant" | Quantization kind identifier |
| quantize_dtype | str | "int4" or "int8" | Quantization precision |
| storage_dtype | str | "int8" | Storage dtype for packed weights |
| model_dtype | str | "float16" only | Model computation dtype |
| group_size | Optional[int] | None, 64, or 128 | Per-channel group size (None = full channel) |

FTQuantizeLinear Parameters

| Parameter | Shape | Dtype | Description |
|---|---|---|---|
| q_weight | [in_features, ceildiv(out_features, num_elem_per_storage)] | int8 | Bit-packed quantized weights (transposed layout) |
| q_scale | [ceildiv(in_features, group_size), out_features] | float16 | Per-group scale factors |
| bias | [out_features] | float16 or out_dtype | Optional bias parameter |
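
The parameter shapes above follow from a few integer divisions. The helper below is a hypothetical illustration (`ft_param_shapes` is not part of the module). It assumes that `group_size=None` means one group spanning all of `in_features`, per the "None = full channel" note in the configuration table, and that `num_elem_per_storage` is 8 divided by the quantized bit width.

```python
from typing import Optional, Tuple

def ft_param_shapes(
    in_features: int,
    out_features: int,
    quantize_dtype: str,
    group_size: Optional[int] = None,
) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return (q_weight_shape, q_scale_shape) for the given layer dimensions."""
    bits = {"int4": 4, "int8": 8}[quantize_dtype]
    num_elem_per_storage = 8 // bits  # int8 storage holds 2 int4 values or 1 int8 value

    def ceildiv(a: int, b: int) -> int:
        return -(-a // b)

    # Assumption: no group size means a single group covering the full channel.
    group = group_size if group_size else in_features
    q_weight_shape = (in_features, ceildiv(out_features, num_elem_per_storage))
    q_scale_shape = (ceildiv(in_features, group), out_features)
    return q_weight_shape, q_scale_shape

print(ft_param_shapes(4096, 11008, "int4", 128))  # ((4096, 5504), (32, 11008))
```

With INT4 the packed dimension is halved because two 4-bit values share each int8 storage element. With INT8 it is unchanged.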

FTQuantizeLinear.forward

| Parameter | Type | Description |
|---|---|---|
| x | nn.Tensor | Input activation tensor, dtype float16 |

| Return | Type | Description |
|---|---|---|
| result | nn.Tensor | Output tensor after quantized GEMM + optional bias |
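
As a mental model for the forward pass, the result can be thought of as a plain matrix multiply against the dequantized weight. The NumPy reference below is a sketch under assumptions, not the CUTLASS kernel: the names `unpack` and `dequantize_gemm_reference` are hypothetical, the packing order is assumed to be low bits first, and the packed weight is assumed NOT to have been reordered by cutlass.ft_preprocess_weight (the real q_weight has been, so this cannot be applied to it directly).

```python
import numpy as np

def unpack(packed, bits, n):
    """Unpack int8 storage [k, ceildiv(n, elems)] into signed integers [k, n]."""
    elems = 8 // bits
    u = packed.view(np.uint8)
    mask = (1 << bits) - 1
    # Extract each sub-element, assuming the first value sits in the low bits.
    cols = [(u >> (bits * r)) & mask for r in range(elems)]
    q = np.stack(cols, axis=-1).reshape(u.shape[0], -1)[:, :n].astype(np.int16)
    # Sign-extend from `bits`-wide two's complement.
    return np.where(q >= (1 << (bits - 1)), q - (1 << bits), q)

def dequantize_gemm_reference(x, packed, scale, bits=4, group_size=None, bias=None):
    """x: [m, k] float16; packed: [k, ceildiv(n, elems)] int8; scale: [groups, n]."""
    k = packed.shape[0]
    n = scale.shape[1]
    gs = group_size if group_size else k  # None is treated as one group over k
    q = unpack(packed, bits, n).astype(np.float32)
    # Broadcast each group's scale over the rows of k that belong to it.
    w = q * np.repeat(scale.astype(np.float32), gs, axis=0)[:k]
    out = x.astype(np.float32) @ w
    if bias is not None:
        out = out + bias.astype(np.float32)
    return out.astype(np.float16)
```

A reference like this is mainly useful for checking a packing routine in isolation, before the CUTLASS-specific reordering is applied.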

quantize_weight

| Parameter | Type | Description |
|---|---|---|
| weight | Tensor | Original float16 weight tensor of shape [n, k] |

| Return | Type | Description |
|---|---|---|
| [q_weight, q_scale] | List[Tensor] | Packed quantized weight and per-group scales |

Fallback Conditions

Layers fall back to GroupQuantize when any of the following conditions are met:

| Condition | Reason |
|---|---|
| Final FC layer (lm_head) | See issue #1723; direct quantization degrades performance |
| out_dtype == "float32" | Incompatible with FT quantization output format |
| INT4 and out_features % 8 != 0 | CUTLASS alignment requirement |
| INT8 and out_features % 4 != 0 | CUTLASS alignment requirement |
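
The conditions above combine into a single predicate. The function below is a hypothetical sketch (`falls_back_to_group_quantize` is not the module's API). In particular, detecting the final FC layer by checking whether the last component of the layer name is `lm_head` is an assumption made for illustration, and the real detection logic may differ.

```python
from typing import Optional

def falls_back_to_group_quantize(
    name: str,
    out_features: int,
    quantize_dtype: str,
    out_dtype: Optional[str] = None,
) -> bool:
    """Return True if a linear layer would use GroupQuantize instead of FT quantization."""
    # Assumption: the final FC layer is identified by its name.
    is_final_fc = name.split(".")[-1] == "lm_head"
    return (
        is_final_fc
        or out_dtype == "float32"
        # CUTLASS alignment requirements on the output dimension
        or (quantize_dtype == "int4" and out_features % 8 != 0)
        or (quantize_dtype == "int8" and out_features % 4 != 0)
    )

print(falls_back_to_group_quantize("lm_head", 32000, "int4"))                        # True
print(falls_back_to_group_quantize("model.layers.0.mlp.down_proj", 4096, "int4"))    # False
print(falls_back_to_group_quantize("model.layers.0.mlp.down_proj", 4100, "int4"))    # True
```

Note that 4100 is divisible by 4 but not by 8, so the same layer would pass the INT8 alignment check while failing the INT4 one.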

Usage Examples

from mlc_llm.quantization.ft_quantization import FTQuantize

# Define FT quantization config for INT4
ft_config = FTQuantize(
    name="ft_int4",
    kind="ft-quant",
    quantize_dtype="int4",
    storage_dtype="int8",
    model_dtype="float16",
    group_size=128,
)

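# Quantize the model. `model` (an nn.Module) and `quant_map` (a QuantizeMapping)
# are assumed to have been created by the caller beforehand.
quantized_model = ft_config.quantize_model(model, quant_map, name_prefix="model")

# quantize_weight is used during weight conversion and requires CUTLASS to be
# available in the TVM runtime. `original_weight` is a float16 weight of shape [n, k].
q_weight, q_scale = ft_config.quantize_weight(original_weight)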
