Implementation:NVIDIA TransformerEngine NVFP4Tensor

Field	Value
Sources	TransformerEngine
Domains	Deep_Learning, PyTorch, Quantization
Last Updated	2026-02-07 14:00 GMT

Overview

This module implements NVFP4 tensors (4-bit floating point in the E2M1 format: 1 sign bit, 2 exponent bits, 1 mantissa bit) with NVIDIA block scaling. An optional Random Hadamard Transform (RHT) can be applied before quantization to improve accuracy.
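To make the E2M1 format concrete, here is a minimal, dependency-free sketch of rounding a value onto the E2M1 grid. The helper name round_to_e2m1 is hypothetical and is not part of TransformerEngine. The tie-breaking rule here (ties go to the smaller magnitude) is an assumption made for simplicity and may differ from the hardware rounding mode.

```python
# Non-negative magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_to_e2m1(x: float) -> float:
    """Round x to the nearest E2M1 value, saturating at +/-6 (hypothetical helper)."""
    sign = -1.0 if x < 0 else 1.0
    magnitude = min(abs(x), E2M1_MAGNITUDES[-1])
    # min() returns the first of equally close candidates, i.e. the smaller magnitude.
    nearest = min(E2M1_MAGNITUDES, key=lambda v: abs(v - magnitude))
    return sign * nearest


samples = [0.2, 0.8, 2.4, 3.6, 5.5, 100.0, -1.3]
rounded = [round_to_e2m1(x) for x in samples]
```

With only eight magnitudes available, the spacing between neighbors grows with the value. This is why per-block scale factors are needed to map each block onto the usable range.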

Description

NVFP4Quantizer supports RHT preprocessing, in which a cached 16x16 Hadamard matrix combined with random sign masks is applied to the data before the FP4 cast. It also supports 2D block quantization for weights, stochastic rounding for gradients, and optional amax reduction across distributed groups. The helper functions get_hadamard_matrix and get_rht_matrix build and cache the transform matrices. NVFP4Tensor stores the quantized data at 4 bits per element (two values packed into one byte), together with per-block scale factors and a global tensor-level scale.
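The following sketch illustrates what a 16x16 Hadamard matrix with a random sign mask looks like. It is written in plain Python and is not the TransformerEngine implementation. The function names, the 1/sqrt(n) normalization, and applying the sign mask as a row-wise diagonal are all assumptions made for illustration.

```python
import math
import random


def hadamard(n: int) -> list[list[int]]:
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # Build [[H, H], [H, -H]] from the current H.
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def random_hadamard(n: int, seed: int = 0) -> list[list[float]]:
    """Return diag(signs) @ H / sqrt(n), an orthonormal randomized transform (assumed form)."""
    rng = random.Random(seed)
    signs = [rng.choice((-1, 1)) for _ in range(n)]
    scale = 1.0 / math.sqrt(n)
    return [[signs[i] * v * scale for v in row] for i, row in enumerate(hadamard(n))]


def gram(m: list[list[float]]) -> list[list[float]]:
    """Compute m @ m^T."""
    n = len(m)
    return [[sum(m[i][k] * m[j][k] for k in range(n)) for j in range(n)] for i in range(n)]


rht = random_hadamard(16)
g = gram(rht)
# An orthonormal transform satisfies M @ M^T == I, so it can be undone exactly.
is_orthonormal = all(
    abs(g[i][j] - (1.0 if i == j else 0.0)) < 1e-9 for i in range(16) for j in range(16)
)
```

Flipping row signs does not affect orthogonality, so the randomized matrix stays invertible by its transpose.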

Usage

NVFP4 enables more aggressive quantization than FP8: 4-bit storage yields greater memory savings and higher compute throughput. It targets Blackwell GPUs. The RHT preprocessing redistributes outlier values across the elements it mixes, which reduces quantization error.
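The outlier-redistribution effect can be seen on a toy vector. This sketch uses a plain normalized Hadamard transform and omits the random sign mask so the result is deterministic. It is an illustration under those assumptions, not the library's kernel.

```python
import math


def hadamard(n: int) -> list[list[int]]:
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def transform(vec: list[float]) -> list[float]:
    """Multiply vec by the normalized Hadamard matrix H / sqrt(n)."""
    n = len(vec)
    scale = 1.0 / math.sqrt(n)
    return [scale * sum(h_ij * v for h_ij, v in zip(row, vec)) for row in hadamard(n)]


# A 16-element block whose energy sits in a single outlier.
x = [0.0] * 16
x[3] = 16.0

y = transform(x)
peak_before = max(abs(v) for v in x)
peak_after = max(abs(v) for v in y)

# The Sylvester matrix is symmetric, so applying the normalized transform twice restores x.
x_restored = transform(y)
```

After the transform the outlier's energy is spread evenly over all 16 entries, so the block's largest magnitude is much smaller and a 4-bit grid wastes less of its range on a single value.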

Code Reference

Source Location

Repository: NVIDIA/TransformerEngine
File: transformer_engine/pytorch/tensor/nvfp4_tensor.py
Lines: 1-961

Signature

def get_hadamard_matrix(hadamard_dimension: int, device: int) -> torch.Tensor: ...
def get_rht_matrix(with_random_sign_mask: bool, device: int) -> torch.Tensor: ...

class NVFP4Quantizer(Quantizer):
    def __init__(self, fp8_dtype, block_dims=None, with_rht=False, ...): ...
    def quantize(self, tensor, ...): ...
    def set_usage(self, rowwise=False, columnwise=False): ...

class NVFP4Tensor(NVFP4TensorStorage, QuantizedTensor):
    def __init__(self, *, rowwise_data, columnwise_data, ...): ...
    def dequantize(self, dtype=None) -> torch.Tensor: ...
    @classmethod
    def __torch_dispatch__(cls, func, types, args, kwargs): ...

Import

from transformer_engine.pytorch.tensor.nvfp4_tensor import (
    NVFP4Quantizer,
    NVFP4Tensor,
)

I/O Contract

Inputs

Name	Type	Required	Description
tensor	`torch.Tensor`	Yes	High-precision tensor to quantize to FP4
fp8_dtype	`torch.dtype`	Yes	Target dtype for block scale factors
with_rht	`bool`	No	Whether to apply Random Hadamard Transform preprocessing
block_dims	`tuple`	No	Block dimensions for scaling (e.g., (1, 16) for activations, (16, 16) for weights)
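To illustrate how the block shape changes the scaling granularity, the sketch below computes one amax per block for the two shapes mentioned in the table. The helper block_amax and the toy matrix are hypothetical. The real quantizer derives its scale factors inside fused kernels.

```python
def block_amax(matrix, block_rows, block_cols):
    """Return the max |value| of each (block_rows x block_cols) tile, in row-major order."""
    rows, cols = len(matrix), len(matrix[0])
    if rows % block_rows or cols % block_cols:
        raise ValueError("matrix shape must be divisible by the block shape")
    out = []
    for r0 in range(0, rows, block_rows):
        out_row = []
        for c0 in range(0, cols, block_cols):
            out_row.append(
                max(
                    abs(matrix[r][c])
                    for r in range(r0, r0 + block_rows)
                    for c in range(c0, c0 + block_cols)
                )
            )
        out.append(out_row)
    return out


# 16 x 32 toy matrix: values grow with the row index, plus one outlier at (5, 20).
matrix = [[float(r + 1) for _ in range(32)] for r in range(16)]
matrix[5][20] = 100.0

amax_1d = block_amax(matrix, 1, 16)   # (1, 16) blocks: a 16 x 2 grid of amax values
amax_2d = block_amax(matrix, 16, 16)  # (16, 16) blocks: a 1 x 2 grid of amax values
```

With (1, 16) blocks the outlier inflates the scale of only one 16-element row segment. With (16, 16) blocks it inflates the scale shared by all 256 elements of its tile.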

Outputs

Name	Type	Description
nvfp4_tensor	`NVFP4Tensor`	4-bit quantized tensor with block scales and global scale
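As a rough picture of what such a tensor holds, the sketch below quantizes a flat list with 16-element blocks, packs two 4-bit codes into each byte, and keeps per-block scales plus one global scale. Several details are assumptions made for illustration: the nibble order (first element in the low nibble), choosing the global scale so that block scales stay within the FP8 E4M3 maximum of 448, and keeping the block scales as Python floats instead of rounding them to FP8.

```python
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # elements sharing one block scale


def encode(x: float) -> int:
    """Map a pre-scaled value to a 4-bit code: sign bit (bit 3) plus 3-bit magnitude index."""
    mag = min(abs(x), 6.0)
    idx = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - mag))
    return (8 if x < 0 else 0) | idx


def decode(code: int) -> float:
    """Inverse of encode for a single 4-bit code."""
    value = E2M1_MAGNITUDES[code & 7]
    return -value if code & 8 else value


def quantize(values):
    """Return (packed bytes, per-block scales, global scale) for a flat list of floats."""
    assert len(values) % BLOCK == 0
    # Assumed convention: the global scale keeps every block scale <= 448.
    global_scale = max(abs(v) for v in values) / (6.0 * 448.0) or 1.0
    codes, block_scales = [], []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        scale = max(abs(v) for v in block) / 6.0 / global_scale or 1.0
        block_scales.append(scale)
        codes += [encode(v / (scale * global_scale)) for v in block]
    # Two codes per byte: even-indexed element in the low nibble (assumed order).
    packed = bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))
    return packed, block_scales, global_scale


def dequantize(packed, block_scales, global_scale):
    """Unpack nibbles and undo both scaling levels."""
    codes = [nibble for byte in packed for nibble in (byte & 0xF, byte >> 4)]
    return [decode(c) * block_scales[i // BLOCK] * global_scale for i, c in enumerate(codes)]


grid = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0, 0.0]
values = grid + [0.25 * v for v in grid]  # two blocks with different dynamic ranges
packed, block_scales, global_scale = quantize(values)
restored = dequantize(packed, block_scales, global_scale)
```

The 32 input values occupy 16 bytes of packed data plus two block scales and one global scale. Because the second block has its own scale, its smaller values are recovered as accurately as those of the first block.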

Usage Examples

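import torch
from transformer_engine.pytorch.tensor.nvfp4_tensor import NVFP4Quantizer

# High-precision input on the GPU (shape chosen only for illustration)
input_tensor = torch.randn(128, 256, dtype=torch.bfloat16, device="cuda")

# Create NVFP4 quantizer with RHT for better accuracy
quantizer = NVFP4Quantizer(
    fp8_dtype=torch.float8_e4m3fn,
    with_rht=True,
)
# Request both row-wise and column-wise quantized data
quantizer.set_usage(rowwise=True, columnwise=True)

# Quantize to NVFP4, then dequantize back to bfloat16
nvfp4_tensor = quantizer.quantize(input_tensor)
output = nvfp4_tensor.dequantize(dtype=torch.bfloat16)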
