Implementation:NVIDIA TransformerEngine NVFP4 Storage

Field	Value
Sources	TransformerEngine
Domains	Deep_Learning, PyTorch, Quantization
Last Updated	2026-02-07 14:00 GMT

Overview

Mixin storage class holding raw NVFP4 (4-bit) quantized data with block scale factors and amax values for both rowwise and columnwise orientations.

Description

NVFP4TensorStorage stores the following fields for each orientation: _rowwise_data/_columnwise_data (packed FP4 values, two per byte), _rowwise_scale_inv/_columnwise_scale_inv (per-block scale factors), and _amax_rowwise/_amax_columnwise (per-tensor absolute maximums used to compute the tensor-level scales). It also carries a flag indicating whether the scales are swizzled. The module includes _fp4_e2m1_vals(), which returns the 16 representable FP4 E2M1 values: the magnitudes 0, 0.5, 1, 1.5, 2, 3, 4 and 6, each with a positive and a negative sign. _FromNVFP4Func is a custom autograd function that dequantizes the stored data via tex.dequantize. The two-level scaling hierarchy (one global tensor scale plus fine-grained per-block scales) helps maintain accuracy at this extremely low precision.
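To make the E2M1 value set and the two-per-byte packing concrete, here is a minimal pure-Python sketch. It is not the library's code: the helper names `e2m1_to_float` and `unpack_fp4` are hypothetical, and the nibble order (low nibble first) is an assumption that may not match the layout TransformerEngine uses.

```python
# Illustrative only: decode 4-bit E2M1 codes and unpack two codes per byte.
# Assumption: the low nibble of each byte holds the first element of the pair.

def e2m1_to_float(code: int) -> float:
    """Decode a 4-bit E2M1 code (1 sign bit, 2 exponent bits, 1 mantissa bit)."""
    sign = -1.0 if code & 0b1000 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 0b1
    if exponent == 0:
        # Subnormal range: no implicit leading one, so only 0 and 0.5 occur.
        magnitude = 0.5 * mantissa
    else:
        # Normal range: implicit leading one and an exponent bias of 1.
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude

# Lookup table indexed by the 4-bit code.
E2M1_VALUES = [e2m1_to_float(code) for code in range(16)]

def unpack_fp4(packed: bytes) -> list:
    """Expand each byte into two decoded FP4 values (low nibble first)."""
    out = []
    for byte in packed:
        out.append(E2M1_VALUES[byte & 0x0F])
        out.append(E2M1_VALUES[byte >> 4])
    return out

print(E2M1_VALUES[:8])                  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print(unpack_fp4(bytes([0x72, 0xF1])))  # [1.0, 6.0, 0.5, -6.0]
```

The lookup-table approach mirrors the idea of returning all representable values once and indexing into them, rather than decoding bit fields per element.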

Usage

Serves as the data layer for 4-bit quantization in the NVFP4 format, and is used as a mixin base class for NVFP4Tensor.

Code Reference

Source Location

Repository: NVIDIA/TransformerEngine
File: transformer_engine/pytorch/tensor/storage/nvfp4_tensor_storage.py
Lines: 1-338

Signature

def _fp4_e2m1_vals(device: torch.device, dtype: torch.dtype) -> torch.Tensor: ...

class _FromNVFP4Func(torch.autograd.Function):
    @staticmethod
    def forward(ctx, tensor_storage, dtype): ...
    @staticmethod
    def backward(ctx, grad): ...

class NVFP4TensorStorage(QuantizedTensorStorage):
    _rowwise_data: Optional[torch.Tensor]
    _columnwise_data: Optional[torch.Tensor]
    _rowwise_scale_inv: Optional[torch.Tensor]
    _columnwise_scale_inv: Optional[torch.Tensor]
    _amax_rowwise: Optional[torch.Tensor]
    _amax_columnwise: Optional[torch.Tensor]

    def get_metadata(self) -> dict: ...
    def prepare_for_saving(self) -> list: ...
    def restore_from_saved(self, tensors) -> None: ...
    def clear(self) -> None: ...

Import

from transformer_engine.pytorch.tensor.storage.nvfp4_tensor_storage import (
    NVFP4TensorStorage,
)

I/O Contract

Inputs

Name	Type	Required	Description
rowwise_data	`torch.Tensor`	No	Packed FP4 rowwise data (two values per byte)
columnwise_data	`torch.Tensor`	No	Packed FP4 columnwise data
rowwise_scale_inv	`torch.Tensor`	No	Per-block inverse scale factors (rowwise)
columnwise_scale_inv	`torch.Tensor`	No	Per-block inverse scale factors (columnwise)
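amax_rowwise	`torch.Tensor`	No	Per-tensor absolute max for global scale computation (rowwise)
amax_columnwise	`torch.Tensor`	No	Per-tensor absolute max for global scale computation (columnwise)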

Outputs

Name	Type	Description
dequantized	`torch.Tensor`	High-precision tensor reconstructed from FP4 data
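
How the packed values, block scales, and amax combine into a high-precision result can be sketched in plain Python. This is a rough illustration of two-level scaling, not the tex.dequantize kernel. It assumes 16-element blocks, an FP4 maximum of 6.0, an FP8 E4M3 maximum of 448.0 for the block scales, and a global scale of amax / (6.0 * 448.0); block scales are kept as ordinary floats instead of being rounded to FP8. The function names are hypothetical.

```python
# Illustrative sketch of two-level (tensor + block) scaling; not the library kernel.
# Assumptions: 16-element blocks, FP4 max 6.0, FP8 E4M3 max 448.0, and block
# scales kept as plain floats rather than rounded to FP8.

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX, FP8_MAX, BLOCK = 6.0, 448.0, 16

def round_to_fp4(x: float) -> float:
    """Round to the nearest representable E2M1 magnitude, keeping the sign."""
    magnitude = min(FP4_GRID, key=lambda g: abs(g - abs(x)))
    return -magnitude if x < 0 else magnitude

def quantize(values):
    """Return (fp4_values, block_scale_inv, amax) for a flat list of floats."""
    amax = max(abs(v) for v in values) or 1.0
    global_scale = amax / (FP4_MAX * FP8_MAX)  # tensor-level decode scale
    fp4_values, block_scale_inv = [], []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        block_amax = max(abs(v) for v in block)
        # Per-block decode scale, expressed relative to the global scale so
        # that it never exceeds FP8_MAX.
        scale = (block_amax / FP4_MAX) / global_scale if block_amax else 1.0
        block_scale_inv.append(scale)
        fp4_values.extend(round_to_fp4(v / (scale * global_scale)) for v in block)
    return fp4_values, block_scale_inv, amax

def dequantize(fp4_values, block_scale_inv, amax):
    """Reconstruct floats: fp4 value * block scale * global scale."""
    global_scale = amax / (FP4_MAX * FP8_MAX)
    return [
        v * block_scale_inv[i // BLOCK] * global_scale
        for i, v in enumerate(fp4_values)
    ]
```

For example, `dequantize(*quantize(data))` returns an approximation of `data` whose error within each block is bounded by that block's largest magnitude divided by 6, since the widest gap on the E2M1 grid is 2 (between 4 and 6).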

Usage Examples

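# NVFP4TensorStorage is a mixin base class of NVFP4Tensor, so its
# attributes are reached through a quantized tensor.
from transformer_engine.pytorch.tensor.nvfp4_tensor import NVFP4Tensor

# Assumes `quantizer` is an already-configured NVFP4 quantizer and
# `input_tensor` is a high-precision tensor; neither is defined here.
nvfp4_tensor = quantizer.quantize(input_tensor)

# Each field is Optional and may be None.
row_data = nvfp4_tensor._rowwise_data         # packed FP4 in uint8, two values per byte
row_scales = nvfp4_tensor._rowwise_scale_inv  # per-block inverse scale factors
amax = nvfp4_tensor._amax_rowwise             # per-tensor absolute maximum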
