Implementation:NVIDIA TransformerEngine NVFP4Tensor
| Field | Value |
|---|---|
| Sources | TransformerEngine |
| Domains | Deep_Learning, PyTorch, Quantization |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Implements NVFP4 (4-bit floating point E2M1) tensors with NVIDIA block scaling, supporting optional Random Hadamard Transform (RHT) preprocessing for improved quantization accuracy.
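To make the number format concrete, here is a minimal pure-Python sketch of block-scaled E2M1 quantization. It is not TransformerEngine's kernel. It assumes one scale per 16 consecutive elements, keeps that scale as a plain Python float rather than an FP8 value, omits the global tensor-level scale, and breaks rounding ties toward the smaller magnitude. The names `round_to_e2m1`, `quantize_block`, and `fake_quantize` are hypothetical.

```python
# Magnitudes representable by E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_MAX = 6.0
BLOCK_SIZE = 16  # assumption for this sketch: one scale per 16 elements


def round_to_e2m1(x: float) -> float:
    """Round x to the nearest signed E2M1 value; anything beyond 6 clamps to 6."""
    magnitude = min(E2M1_VALUES, key=lambda v: abs(abs(x) - v))
    return -magnitude if x < 0 else magnitude


def quantize_block(block):
    """Return (scale, codes) such that scale * code approximates each element."""
    amax = max(abs(v) for v in block)
    # Map the largest magnitude in the block onto the largest E2M1 value.
    scale = amax / E2M1_MAX if amax > 0 else 1.0
    return scale, [round_to_e2m1(v / scale) for v in block]


def fake_quantize(values):
    """Quantize and immediately dequantize a flat list, one block at a time."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(scale * c for c in codes)
    return out
```

Because every block carries its own scale, a large value in one block does not reduce the resolution available to the other blocks.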
Description
NVFP4Quantizer supports four features: RHT preprocessing, 2D block quantization for weights, stochastic rounding for gradients, and optional amax reduction across distributed groups. For RHT, the quantizer builds a cached 16x16 Hadamard matrix combined with a random sign mask and applies it to the data before the FP4 cast. The helper functions get_hadamard_matrix and get_rht_matrix build and cache these transform matrices. NVFP4Tensor stores quantized data at 4 bits per element (two values packed into one byte), together with per-block scale factors and a global tensor-level scale.
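The transform matrix described above can be sketched in pure Python. This is an illustration only, not the code behind get_hadamard_matrix or get_rht_matrix. It assumes the Sylvester construction, a 1/sqrt(n) normalization, and a sign mask applied as a diagonal matrix on the left. The names `hadamard`, `random_hadamard`, and `matvec_rows` are hypothetical.

```python
import random


def hadamard(n: int):
    """Unnormalized n x n Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # Sylvester step: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def random_hadamard(n: int, seed: int = 0):
    """Return diag(signs) @ H / sqrt(n), which is an orthogonal matrix."""
    rng = random.Random(seed)
    signs = [rng.choice((-1, 1)) for _ in range(n)]
    norm = n ** 0.5
    return [[signs[i] * v / norm for v in row] for i, row in enumerate(hadamard(n))]


def matvec_rows(x, m):
    """Compute the row vector x @ m."""
    return [sum(x[i] * m[i][j] for i in range(len(x))) for j in range(len(m[0]))]
```

With this sketch, `matvec_rows([16.0] + [0.0] * 15, random_hadamard(16))` yields 16 entries of magnitude 4.0: the single outlier is spread evenly across the block, so the block's amax drops from 16 to 4 while the vector norm is unchanged.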
Usage
Enables aggressive 4-bit quantization for even greater memory savings and compute throughput than FP8. Targets Blackwell GPUs. The RHT preprocessing redistributes outlier values to reduce quantization error.
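A back-of-the-envelope estimate shows where the memory saving comes from. The sketch below assumes 4 bits per element, one 1-byte scale per block of 16 elements, and a 4-byte global scale. It ignores padding, alignment, and any second (columnwise) copy of the data. The function name `nvfp4_bytes` is hypothetical.

```python
def nvfp4_bytes(num_elements: int, block_size: int = 16) -> int:
    """Estimate storage in bytes for one NVFP4 copy of a tensor."""
    data = (num_elements + 1) // 2           # two 4-bit values per byte
    scales = -(-num_elements // block_size)  # ceil division, 1 byte per block
    global_scale = 4                         # assumption: one 32-bit float
    return data + scales + global_scale


n = 4096 * 4096
fp4_size = nvfp4_bytes(n)
fp8_size = n       # one byte per element, scales ignored
bf16_size = 2 * n  # two bytes per element
```

Under these assumptions the block scales add half a bit per element, for about 4.5 bits per element in total. A 4096x4096 tensor comes to roughly 9 MiB, compared with 16 MiB at one byte per element and 32 MiB in BF16.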
Code Reference
Source Location
- Repository: NVIDIA/TransformerEngine
- File: transformer_engine/pytorch/tensor/nvfp4_tensor.py
- Lines: 1-961
Signature
def get_hadamard_matrix(hadamard_dimension: int, device: int) -> torch.Tensor: ...
def get_rht_matrix(with_random_sign_mask: bool, device: int) -> torch.Tensor: ...

class NVFP4Quantizer(Quantizer):
    def __init__(self, fp8_dtype, block_dims=None, with_rht=False, ...): ...
    def quantize(self, tensor, ...): ...
    def set_usage(self, rowwise=False, columnwise=False): ...

class NVFP4Tensor(NVFP4TensorStorage, QuantizedTensor):
    def __init__(self, *, rowwise_data, columnwise_data, ...): ...
    def dequantize(self, dtype=None) -> torch.Tensor: ...
    @classmethod
    def __torch_dispatch__(cls, func, types, args, kwargs): ...
Import
from transformer_engine.pytorch.tensor.nvfp4_tensor import (
    NVFP4Quantizer,
    NVFP4Tensor,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tensor | torch.Tensor | Yes | High-precision tensor to quantize to FP4 |
| fp8_dtype | torch.dtype | Yes | Target dtype for block scale factors |
| with_rht | bool | No | Whether to apply Random Hadamard Transform preprocessing |
| block_dims | tuple | No | Block dimensions for scaling (e.g., (1, 16) for activations, (16, 16) for weights) |
Outputs
| Name | Type | Description |
|---|---|---|
| nvfp4_tensor | NVFP4Tensor | 4-bit quantized tensor with block scales and global scale |
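The packed layout, with two 4-bit values per byte, can be illustrated in a few lines. The sketch below uses the standard E2M1 bit pattern (sign in bit 3, magnitude index in bits 0 to 2) and assumes the first value of each pair goes into the low nibble; the nibble order actually used by NVFP4Tensor is not stated here. The names `encode`, `pack`, and `unpack` are hypothetical.

```python
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def encode(value: float) -> int:
    """Map an exactly representable E2M1 value to its 4-bit code."""
    code = E2M1_MAGNITUDES.index(abs(value))  # raises ValueError if not representable
    return (code | 0x8) if value < 0 else code


def pack(values) -> bytes:
    """Pack an even-length sequence of E2M1 values, two per byte."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    # Assumption: first value of each pair in the low nibble.
    return bytes(encode(a) | (encode(b) << 4) for a, b in zip(values[0::2], values[1::2]))


def unpack(data: bytes):
    """Inverse of pack: recover the list of E2M1 values."""
    out = []
    for byte in data:
        for nibble in (byte & 0xF, byte >> 4):
            magnitude = E2M1_MAGNITUDES[nibble & 0x7]
            out.append(-magnitude if nibble & 0x8 else magnitude)
    return out
```

Multiplying each unpacked value by its block scale and the global scale is what dequantization then has to do.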
Usage Examples
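import torch
from transformer_engine.pytorch.tensor.nvfp4_tensor import NVFP4Quantizer

# Placeholder input for illustration (assumed shape, dtype, and device)
input_tensor = torch.randn(128, 256, dtype=torch.bfloat16, device="cuda")

# Create NVFP4 quantizer with RHT for better accuracy
quantizer = NVFP4Quantizer(
    fp8_dtype=torch.float8_e4m3fn,
    with_rht=True,
)
# Produce both rowwise and columnwise quantized data
quantizer.set_usage(rowwise=True, columnwise=True)
nvfp4_tensor = quantizer.quantize(input_tensor)

# Convert back to high precision
output = nvfp4_tensor.dequantize(dtype=torch.bfloat16)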