Implementation:NVIDIA TransformerEngine NVFP4 Storage
| Field | Value |
|---|---|
| Sources | TransformerEngine |
| Domains | Deep_Learning, PyTorch, Quantization |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Mixin storage class holding raw NVFP4 (4-bit) quantized data with block scale factors and amax values for both rowwise and columnwise orientations.
Description
The class stores six optional buffers and one flag: _rowwise_data/_columnwise_data (packed FP4 values, two per byte), _rowwise_scale_inv/_columnwise_scale_inv (per-block scale factors), _amax_rowwise/_amax_columnwise (per-tensor absolute maximums used to compute the tensor-level scales), and a flag recording whether the scales are swizzled. The module also provides _fp4_e2m1_vals(), which returns the 16 FP4 E2M1 code values (0, 0.5, 1, 1.5, 2, 3, 4, 6 and their negatives), and _FromNVFP4Func, a custom autograd function that dequantizes via tex.dequantize. The two-level scaling hierarchy (one tensor-wide scale plus fine-grained per-block scales) helps maintain accuracy at 4-bit precision.
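To make the E2M1 value set and the two-per-byte packing concrete, here is a minimal pure-Python sketch. It is not the library's code: decode_e2m1 and unpack_fp4 are hypothetical helper names, and the nibble order (low nibble holds the first element) is an assumption that should be checked against the actual kernels.

```python
from typing import List


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 0b1
    if exponent == 0:
        # Subnormal range: no implicit leading 1, so only 0 and 0.5 are possible.
        magnitude = 0.5 * mantissa
    else:
        # Normal range: implicit leading 1 and an exponent bias of 1.
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


# All 16 code points; codes 0-7 are non-negative, codes 8-15 are their negatives.
E2M1_VALUES = [decode_e2m1(code) for code in range(16)]


def unpack_fp4(packed: bytes) -> List[float]:
    """Expand packed bytes into FP4 values, two per byte.

    ASSUMPTION: the low nibble is the first element and the high nibble the second.
    """
    values = []
    for byte in packed:
        values.append(E2M1_VALUES[byte & 0x0F])
        values.append(E2M1_VALUES[byte >> 4])
    return values
```

The decoded table reproduces the magnitudes listed above, which is a quick way to sanity-check any lookup table used when inspecting the packed uint8 buffers by hand.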
Usage
Data layer for 4-bit quantization with NVFP4 format. Used as a mixin base class for NVFP4Tensor.
Code Reference
Source Location
- Repository: NVIDIA/TransformerEngine
- File: transformer_engine/pytorch/tensor/storage/nvfp4_tensor_storage.py
- Lines: 1-338
Signature
def _fp4_e2m1_vals(device: torch.device, dtype: torch.dtype) -> torch.Tensor: ...

class _FromNVFP4Func(torch.autograd.Function):
    @staticmethod
    def forward(ctx, tensor_storage, dtype): ...
    @staticmethod
    def backward(ctx, grad): ...

class NVFP4TensorStorage(QuantizedTensorStorage):
    _rowwise_data: Optional[torch.Tensor]
    _columnwise_data: Optional[torch.Tensor]
    _rowwise_scale_inv: Optional[torch.Tensor]
    _columnwise_scale_inv: Optional[torch.Tensor]
    _amax_rowwise: Optional[torch.Tensor]
    _amax_columnwise: Optional[torch.Tensor]
    def get_metadata(self) -> dict: ...
    def prepare_for_saving(self) -> list: ...
    def restore_from_saved(self, tensors) -> None: ...
    def clear(self) -> None: ...
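The prepare_for_saving / restore_from_saved / clear trio suggests a detach-and-reattach pattern for the stored buffers. The sketch below illustrates that pattern with a toy class, following the return types shown in the signature above. ToyStorage and its _FIELDS tuple are hypothetical, plain Python objects stand in for tensors, and the real methods may behave differently (for example in what they return).

```python
class ToyStorage:
    """Hypothetical stand-in for a storage mixin; not the TransformerEngine class."""

    _FIELDS = (
        "_rowwise_data",
        "_columnwise_data",
        "_rowwise_scale_inv",
        "_columnwise_scale_inv",
        "_amax_rowwise",
        "_amax_columnwise",
    )

    def __init__(self, **buffers):
        # Every buffer is optional; anything not supplied stays None.
        for name in self._FIELDS:
            setattr(self, name, buffers.get(name))

    def prepare_for_saving(self) -> list:
        # Hand the buffers to the caller and drop our own references to them.
        saved = [getattr(self, name) for name in self._FIELDS]
        self.clear()
        return saved

    def restore_from_saved(self, tensors) -> None:
        # Reattach buffers in the same order prepare_for_saving emitted them.
        for name, value in zip(self._FIELDS, tensors):
            setattr(self, name, value)

    def clear(self) -> None:
        # Release every buffer reference held by this object.
        for name in self._FIELDS:
            setattr(self, name, None)
```

Keeping a single ordered field list means the save and restore paths cannot drift out of sync, which matters when some buffers are None.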
Import
from transformer_engine.pytorch.tensor.storage.nvfp4_tensor_storage import (
    NVFP4TensorStorage,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| rowwise_data | torch.Tensor | No | Packed FP4 rowwise data (two values per byte) |
| columnwise_data | torch.Tensor | No | Packed FP4 columnwise data |
| rowwise_scale_inv | torch.Tensor | No | Per-block inverse scale factors (rowwise) |
| columnwise_scale_inv | torch.Tensor | No | Per-block inverse scale factors (columnwise) |
| amax_rowwise | torch.Tensor | No | Per-tensor absolute max for global scale computation |
Outputs
| Name | Type | Description |
|---|---|---|
| dequantized | torch.Tensor | High-precision tensor reconstructed from FP4 data |
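The reconstruction of the dequantized output from the inputs above can be sketched as follows. This is an illustration of two-level scaling, not the tex.dequantize kernel. It assumes the commonly described NVFP4 convention: 16-element blocks, and a tensor-level decode scale of amax / (6 * 448), where 6 is the largest E2M1 magnitude and 448 the largest FP8 E4M3 magnitude. The function name, block size, and formula are all assumptions here; the page itself only states that block scales and a per-tensor amax are stored.

```python
from typing import List, Sequence

FP4_E2M1_MAX = 6.0    # largest magnitude representable in E2M1
FP8_E4M3_MAX = 448.0  # largest magnitude representable in E4M3
BLOCK_SIZE = 16       # ASSUMPTION: elements sharing one block scale


def dequantize_two_level(
    fp4_values: Sequence[float],
    block_scale_inv: Sequence[float],
    amax: float,
) -> List[float]:
    """Rebuild high-precision values from decoded FP4 values and two scale levels."""
    # Tensor-level decode scale derived from the per-tensor absolute maximum (assumed formula).
    global_scale_inv = amax / (FP4_E2M1_MAX * FP8_E4M3_MAX)
    out = []
    for index, value in enumerate(fp4_values):
        block = index // BLOCK_SIZE  # which block scale applies to this element
        out.append(value * block_scale_inv[block] * global_scale_inv)
    return out
```

With this convention a block whose largest element equals the tensor amax gets a block scale of 448, so an FP4 value of 6 in that block decodes back to amax; blocks with smaller magnitudes get proportionally smaller block scales.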
Usage Examples
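# NVFP4TensorStorage is a mixin base class; its attributes are reached through NVFP4Tensor
from transformer_engine.pytorch.tensor.nvfp4_tensor import NVFP4Tensor
# `quantizer` (an NVFP4 quantizer) and `input_tensor` (a high-precision tensor)
# are assumed to have been created earlier
nvfp4_tensor = quantizer.quantize(input_tensor)
# Access storage attributes through the tensor; each is Optional and may be None
row_data = nvfp4_tensor._rowwise_data  # packed FP4 values stored as uint8, two per byte
row_scales = nvfp4_tensor._rowwise_scale_inv  # per-block scale factors
amax = nvfp4_tensor._amax_rowwise  # per-tensor absolute maximum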