Implementation:NVIDIA TransformerEngine NVFP4Tensor

From Leeroopedia


Field | Value
Sources | TransformerEngine
Domains | Deep_Learning, PyTorch, Quantization
Last Updated | 2026-02-07 14:00 GMT

Overview

Implements NVFP4 tensors (4-bit floating point, E2M1 format) with NVIDIA block scaling and optional Random Hadamard Transform (RHT) preprocessing to improve quantization accuracy.
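To make the format concrete, the following is a minimal simulation of block-scaled E2M1 rounding in plain PyTorch. It is a sketch under stated assumptions, not TransformerEngine's kernel: the names fake_quantize_e2m1, E2M1_VALUES, and BLOCK are hypothetical, a block length of 16 is assumed, scales stay in FP32 instead of being encoded in a low-precision format, the global tensor-level scale is omitted, and ties are broken toward the smaller magnitude.

```python
import torch

# Non-negative magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # assumed block length along the flattened tensor


def fake_quantize_e2m1(x: torch.Tensor) -> torch.Tensor:
    """Round x to the E2M1 grid with one FP32 scale per block (simulation only)."""
    blocks = x.float().reshape(-1, BLOCK)
    amax = blocks.abs().amax(dim=1, keepdim=True)
    # Map each block's largest magnitude onto the largest E2M1 value (6.0).
    scale = torch.where(amax > 0, amax / E2M1_VALUES[-1], torch.ones_like(amax))
    scaled = blocks / scale
    # Pick the nearest grid magnitude for every element, then restore the sign.
    idx = (scaled.abs().unsqueeze(-1) - E2M1_VALUES).abs().argmin(dim=-1)
    quantized = torch.sign(scaled) * E2M1_VALUES[idx]
    return (quantized * scale).reshape(x.shape)
```

Because every block is rescaled so that its largest magnitude lands on 6.0, a single outlier in a block coarsens the grid for the other 15 values, which is the problem that per-block scaling limits and RHT preprocessing further reduces.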

Description

NVFP4Quantizer quantizes high-precision tensors to NVFP4. It supports RHT preprocessing, in which a cached 16x16 Hadamard matrix combined with random sign masks is applied to the data before the FP4 cast; 2D block quantization for weights; stochastic rounding for gradients; and optional amax reduction across distributed process groups. The helper functions get_hadamard_matrix and get_rht_matrix build and cache the transform matrices. NVFP4Tensor stores the quantized data at 4 bits per element (two values packed into one byte), together with per-block scale factors and a global tensor-level scale.
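The "two values packed into one byte" layout can be illustrated with a short round-trip sketch. The helper names pack_fp4_codes and unpack_fp4_codes are hypothetical, and the nibble order (even positions in the low nibble) is an assumption made for illustration; the source does not state which order the library uses.

```python
import torch


def pack_fp4_codes(codes: torch.Tensor) -> torch.Tensor:
    """Pack pairs of 4-bit codes (uint8 values in 0..15) into single bytes."""
    assert codes.dtype == torch.uint8 and codes.shape[-1] % 2 == 0
    low = codes[..., 0::2]   # even positions go to the low nibble (assumed order)
    high = codes[..., 1::2]  # odd positions go to the high nibble
    return low | (high << 4)


def unpack_fp4_codes(packed: torch.Tensor) -> torch.Tensor:
    """Recover the original sequence of 4-bit codes from packed bytes."""
    low = packed & 0x0F
    high = packed >> 4
    # Interleave low and high nibbles back into their original positions.
    return torch.stack((low, high), dim=-1).flatten(start_dim=-2)
```

Under this layout the packed tensor has half as many elements along the last dimension as the logical tensor, so the last dimension must be even.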

Usage

NVFP4 enables aggressive 4-bit quantization, giving greater memory savings and compute throughput than FP8. It targets Blackwell GPUs. RHT preprocessing spreads the magnitude of outlier values across neighboring elements, which reduces quantization error.
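The outlier-spreading effect of RHT can be seen in a small, CPU-only sketch. This is not the library's get_hadamard_matrix or get_rht_matrix: the functions hadamard and random_hadamard are hypothetical stand-ins, the Sylvester construction is one standard way to build a Hadamard matrix, and applying the sign mask on the left of the Hadamard matrix is an assumed ordering.

```python
import torch


def hadamard(n: int) -> torch.Tensor:
    """Sylvester construction of an n x n Hadamard matrix, scaled to be orthonormal."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = torch.ones(1, 1)
    while h.shape[0] < n:
        h = torch.cat((torch.cat((h, h), dim=1), torch.cat((h, -h), dim=1)), dim=0)
    return h / n ** 0.5


def random_hadamard(n: int, seed: int = 0) -> torch.Tensor:
    """Hadamard matrix with a random +/-1 sign applied to each row (assumed ordering)."""
    gen = torch.Generator().manual_seed(seed)
    signs = torch.randint(0, 2, (n,), generator=gen).float() * 2 - 1
    return torch.diag(signs) @ hadamard(n)


rht = random_hadamard(16)
x = torch.zeros(4, 16)
x[:, 3] = 8.0          # a single outlier in each 16-element row
y = x @ rht            # transformed rows: the outlier is spread over all 16 entries
x_back = y @ rht.T     # the transform is orthogonal, so its transpose inverts it
```

Each transformed row has the same norm as the original, but the single value of 8.0 becomes sixteen entries of magnitude 2.0, so a block scale derived from the row maximum is four times smaller.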

Code Reference

Source Location

Repository: NVIDIA/TransformerEngine
File: transformer_engine/pytorch/tensor/nvfp4_tensor.py
Lines: 1-961

Signature

def get_hadamard_matrix(hadamard_dimension: int, device: int) -> torch.Tensor: ...
def get_rht_matrix(with_random_sign_mask: bool, device: int) -> torch.Tensor: ...

class NVFP4Quantizer(Quantizer):
    def __init__(self, fp8_dtype, block_dims=None, with_rht=False, ...): ...
    def quantize(self, tensor, ...): ...
    def set_usage(self, rowwise=False, columnwise=False): ...

class NVFP4Tensor(NVFP4TensorStorage, QuantizedTensor):
    def __init__(self, *, rowwise_data, columnwise_data, ...): ...
    def dequantize(self, dtype=None) -> torch.Tensor: ...
    @classmethod
    def __torch_dispatch__(cls, func, types, args, kwargs): ...

Import

from transformer_engine.pytorch.tensor.nvfp4_tensor import (
    NVFP4Quantizer,
    NVFP4Tensor,
)

I/O Contract

Inputs

Name | Type | Required | Description
tensor | torch.Tensor | Yes | High-precision tensor to quantize to FP4
fp8_dtype | torch.dtype | Yes | Target dtype for block scale factors
with_rht | bool | No | Whether to apply Random Hadamard Transform preprocessing
block_dims | tuple | No | Block dimensions for scaling (e.g., (1, 16) for activations, (16, 16) for weights)
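The two example block shapes differ in how many scale factors they produce. The sketch below uses a hypothetical helper, block_amax, to compute the per-block absolute maximum, the quantity a block scale is typically derived from. It only illustrates the tiling and is not how the library computes or stores its scales.

```python
import torch


def block_amax(x: torch.Tensor, block_dims: tuple) -> torch.Tensor:
    """Absolute maximum of each (bh x bw) tile of a 2D tensor."""
    bh, bw = block_dims
    rows, cols = x.shape
    assert rows % bh == 0 and cols % bw == 0, "shape must be divisible by the block"
    # Split rows and columns into (tile index, offset within tile) pairs.
    tiles = x.abs().reshape(rows // bh, bh, cols // bw, bw)
    return tiles.amax(dim=(1, 3))


x = torch.arange(32 * 64, dtype=torch.float32).reshape(32, 64)
activation_amax = block_amax(x, (1, 16))   # one value per 16 elements within a row
weight_amax = block_amax(x, (16, 16))      # one value per 16x16 tile
```

For this 32x64 input, the (1, 16) blocks yield a 32x4 grid of maxima while the (16, 16) blocks yield a 2x4 grid.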

Outputs

Name | Type | Description
nvfp4_tensor | NVFP4Tensor | 4-bit quantized tensor with block scales and global scale

Usage Examples

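import torch
from transformer_engine.pytorch.tensor.nvfp4_tensor import NVFP4Quantizer

# Example input: a high-precision tensor on the GPU (shape chosen for illustration;
# both dimensions are multiples of the 16-element block size).
input_tensor = torch.randn(128, 256, dtype=torch.bfloat16, device="cuda")

# Create an NVFP4 quantizer with RHT for better accuracy
quantizer = NVFP4Quantizer(
    fp8_dtype=torch.float8_e4m3fn,
    with_rht=True,
)
# Produce both row-wise and column-wise quantized data
quantizer.set_usage(rowwise=True, columnwise=True)

# Quantize to NVFP4, then dequantize back to bfloat16 for inspection
nvfp4_tensor = quantizer.quantize(input_tensor)
output = nvfp4_tensor.dequantize(dtype=torch.bfloat16)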
