Implementation:NVIDIA TransformerEngine PyTorch Quantizer Cpp
| Field | Value |
|---|---|
| Sources | TransformerEngine |
| Domains | Deep_Learning, PyTorch, Quantization |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Implements all C++ quantizer class methods for creating, converting, and quantizing tensors across all supported low-precision formats (FP8 delayed, FP8 current scaling, FP8 block, MXFP8, NVFP4).
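The two per-tensor FP8 recipes named above, delayed scaling and current scaling, differ mainly in where amax comes from. The sketch below is a host-side illustration that assumes the common rule scale = (fp8_max / amax) / 2^margin and a "max over history" policy for delayed scaling. The function names are hypothetical and not part of TE. In TE this arithmetic is performed by the library's kernels, not by code like this.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Largest finite magnitudes of the two FP8 encodings.
constexpr float kE4M3Max = 448.0f;
constexpr float kE5M2Max = 57344.0f;

// Hypothetical helper: scale = (fp8_max / amax) / 2^margin.
// Falls back to 1.0 when amax is zero or non-finite (a simplification;
// a real recipe may instead keep the previous scale).
float scale_from_amax(float amax, float fp8_max, int margin = 0) {
  if (!(amax > 0.0f) || !std::isfinite(amax)) return 1.0f;
  return std::ldexp(fp8_max / amax, -margin);
}

// Delayed scaling (sketch): amax is the maximum over a history of earlier steps.
float delayed_scale(const std::vector<float>& amax_history, float fp8_max, int margin = 0) {
  assert(!amax_history.empty());
  const float amax = *std::max_element(amax_history.begin(), amax_history.end());
  return scale_from_amax(amax, fp8_max, margin);
}

// Current scaling (sketch): amax is reduced from the tensor about to be cast.
float current_scale(const std::vector<float>& x, float fp8_max, int margin = 0) {
  float amax = 0.0f;
  for (float v : x) amax = std::max(amax, std::fabs(v));
  return scale_from_amax(amax, fp8_max, margin);
}
```

For example, delayed_scale({3.0f, 7.0f, 5.0f}, kE4M3Max) gives 448 / 7 = 64, and scale_inv = 1 / scale is what dequantization multiplies by. Delayed scaling reuses statistics from earlier steps. Current scaling needs an extra amax reduction over the live tensor, and that is the pass a fused path such as quantize_with_amax can combine with the cast.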
Description
Each quantizer class implements four methods:
- create_tensor: allocates the data and scale_inv tensors and constructs the Python tensor object via pybind.
- convert_and_update_tensor: wraps an existing Python tensor as a TensorWrapper.
- set_quantization_params: sets scale, amax, and scale_inv on a TensorWrapper.
- quantize: calls the appropriate NVTE cast kernel.

Format-specific behavior:
- Float8Quantizer handles delayed scaling with amax history and scale/scale_inv.
- Float8CurrentScalingQuantizer adds quantize_with_amax for fused amax computation.
- MXFP8Quantizer manages separate rowwise and columnwise E8M0 scale_inv tensors.
- NVFP4Quantizer handles FP4 data (two 4-bit values packed per byte), global scale factors, block-level scale factors, and an optional Randomized Hadamard Transform (RHT) with post-RHT amax.

Helper functions such as make_transpose_shape and convert_shape_for_fp4 handle the shape transformations each format requires.
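To make FP4 half-byte packing and block-level scaling concrete, here is a simplified host-side sketch. It assumes the standard E2M1 code table (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6 with a sign bit), a block decode scale chosen so the block amax maps to 6.0, and that the element at the even index occupies the low nibble. That nibble order is an assumption to check against the actual kernel. Real NVFP4 uses 16-element blocks, stores each block scale as FP8 E4M3, and also applies a per-tensor FP32 global scale, all of which this sketch simplifies away. to_e2m1 and pack_fp4_block are hypothetical names.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// E2M1 magnitudes indexed by their 3-bit code; bit 3 of a nibble is the sign.
constexpr std::array<float, 8> kE2M1 = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};
constexpr float kE2M1Max = 6.0f;

// Round an already-scaled value to the nearest E2M1 code, saturating at +-6.
// Ties pick the even code, mirroring round-to-nearest-even.
std::uint8_t to_e2m1(float v) {
  const std::uint8_t sign = std::signbit(v) ? 0x8 : 0x0;
  const float a = std::fabs(v);
  std::uint8_t best = 0;
  for (std::uint8_t c = 1; c < 8; ++c) {
    const float d = std::fabs(a - kE2M1[c]);
    const float d_best = std::fabs(a - kE2M1[best]);
    if (d < d_best || (d == d_best && c % 2 == 0)) best = c;
  }
  return static_cast<std::uint8_t>(sign | best);
}

// Hypothetical block quantizer: choose a decode scale so the block amax maps
// to 6.0, then pack two codes per byte (element 2i in the low nibble, assumed).
std::vector<std::uint8_t> pack_fp4_block(const std::vector<float>& x, float& scale_inv) {
  assert(x.size() % 2 == 0);
  float amax = 0.0f;
  for (float v : x) amax = std::max(amax, std::fabs(v));
  scale_inv = amax > 0.0f ? amax / kE2M1Max : 1.0f;  // kept as float32 here, not E4M3
  std::vector<std::uint8_t> packed(x.size() / 2);
  for (std::size_t i = 0; i < x.size(); i += 2) {
    const std::uint8_t lo = to_e2m1(x[i] / scale_inv);
    const std::uint8_t hi = to_e2m1(x[i + 1] / scale_inv);
    packed[i / 2] = static_cast<std::uint8_t>(lo | (hi << 4));
  }
  return packed;
}
```

Packing {6.0, -3.0} yields codes 0x7 and 0xD, so the byte is 0xD7. Halving the byte count this way is also why FP4 shapes have their innermost dimension divided by two.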
Usage
At 1706 lines, this is the largest C++ implementation file. It contains the complete quantization logic for every supported format and forms the core of TE's low-precision infrastructure.
Code Reference
Source Location
- Repository: NVIDIA/TransformerEngine
- File: transformer_engine/pytorch/csrc/quantizer.cpp
- Lines: 1-1706
Signature
namespace transformer_engine::pytorch {
// Float8Quantizer methods
py::object Float8Quantizer::create_tensor(const std::vector<size_t>& shape, ...);
TensorWrapper Float8Quantizer::convert_and_update_tensor(py::handle tensor, ...);
void Float8Quantizer::set_quantization_params(TensorWrapper& tensor);
py::object Float8Quantizer::quantize(TensorWrapper input, ...);
// Float8CurrentScalingQuantizer methods
py::object Float8CurrentScalingQuantizer::quantize_with_amax(TensorWrapper input, ...);
// MXFP8Quantizer methods
py::object MXFP8Quantizer::create_tensor(const std::vector<size_t>& shape, ...);
// NVFP4Quantizer methods
py::object NVFP4Quantizer::create_tensor(const std::vector<size_t>& shape, ...);
py::object NVFP4Quantizer::quantize(TensorWrapper input, ...);
// Helpers
std::vector<size_t> make_transpose_shape(const std::vector<size_t>& shape);
std::vector<size_t> convert_shape_for_fp4(const std::vector<size_t>& shape, bool transpose);
}
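The two helpers at the end of the signature list can be approximated as follows. This is a sketch under assumptions: make_transpose_shape is taken to rotate the last dimension to the front (the columnwise layout of a tensor viewed as a 2D matrix), and the transpose flag of convert_shape_for_fp4 is assumed to apply that rotation before the innermost dimension is halved for two-per-byte FP4 storage. The _sketch suffix marks these as illustrations, not TE's code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch (assumed semantics): move the last dimension to the front,
// e.g. {B, S, H} -> {H, B, S}.
std::vector<std::size_t> make_transpose_shape_sketch(const std::vector<std::size_t>& shape) {
  std::vector<std::size_t> out;
  if (shape.empty()) return out;
  out.reserve(shape.size());
  out.push_back(shape.back());
  out.insert(out.end(), shape.begin(), shape.end() - 1);
  return out;
}

// Sketch (assumed semantics): FP4 packs two values per byte, so the byte
// tensor halves the innermost dimension; transpose=true rotates first.
std::vector<std::size_t> convert_shape_for_fp4_sketch(const std::vector<std::size_t>& shape,
                                                      bool transpose) {
  assert(!shape.empty());
  std::vector<std::size_t> out = transpose ? make_transpose_shape_sketch(shape) : shape;
  assert(out.back() % 2 == 0 && "innermost dimension must be even to pack FP4 pairs");
  out.back() /= 2;
  return out;
}
```

Under these assumptions, an {8, 32} FP4 tensor occupies {8, 16} bytes rowwise and {32, 4} bytes columnwise.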
Import
#include "common.h"
#include "pybind.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input | TensorWrapper | Yes | Input tensor to quantize (quantize) |
| shape | std::vector<size_t> | Yes | Shape for tensor creation (create_tensor) |
| tensor | py::handle | No | Existing Python tensor to wrap (convert_and_update_tensor) |
Outputs
| Name | Type | Description |
|---|---|---|
| quantized_tensor | py::object | Python quantized tensor object (Float8Tensor, MXFP8Tensor, etc.) |
| tensor_wrapper | TensorWrapper | C++ tensor wrapper for kernel dispatch |
Usage Examples
// Internal C++ usage (called from cast.cpp and other extensions).
// Schematic only: constructor and quantize() arguments are simplified.
auto quantizer = std::make_shared<Float8Quantizer>(scale, scale_inv, amax, fp8_dtype);
auto result = quantizer->quantize(input_wrapper);
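The Float8Quantizer example uses one scale per tensor. For comparison, the per-block E8M0 scale_inv that MXFP8Quantizer manages can be sketched with the rule from the OCP Microscaling (MX) specification: 32-element blocks, a shared exponent of floor(log2(amax)) - 8 for E4M3 elements, stored with a bias of 127. TE's kernels may round the exponent differently (for example upward, to avoid saturation), so treat this as an illustration. All function names here are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kMxBlock = 32;  // elements sharing one scale in MX formats
constexpr int kE4M3Emax = 8;          // largest E4M3 exponent: 448 = 1.75 * 2^8
constexpr int kE8M0Bias = 127;        // E8M0 encodes 2^(code - 127); 0xFF is NaN

// Hypothetical: biased E8M0 exponent for one block, following the OCP MX rule
// shared_exp = floor(log2(amax)) - emax_elem.
std::uint8_t e8m0_scale_for_block(const float* x, std::size_t n) {
  float amax = 0.0f;
  for (std::size_t i = 0; i < n; ++i) amax = std::max(amax, std::fabs(x[i]));
  if (amax == 0.0f) return static_cast<std::uint8_t>(kE8M0Bias);  // all-zero block: use 1.0
  int exp = 0;
  std::frexp(amax, &exp);  // amax = m * 2^exp, m in [0.5, 1), so floor(log2(amax)) = exp - 1
  const int shared = (exp - 1) - kE4M3Emax;
  return static_cast<std::uint8_t>(std::clamp(shared + kE8M0Bias, 0, 254));
}

// Hypothetical rowwise scale_inv: one E8M0 byte per 32 consecutive elements.
std::vector<std::uint8_t> mxfp8_rowwise_scale_inv(const std::vector<float>& row) {
  assert(row.size() % kMxBlock == 0);
  std::vector<std::uint8_t> scales;
  for (std::size_t b = 0; b < row.size(); b += kMxBlock)
    scales.push_back(e8m0_scale_for_block(row.data() + b, kMxBlock));
  return scales;
}

// Decode an E8M0 byte to the float multiplier applied during dequantization.
float e8m0_to_float(std::uint8_t code) {
  return std::ldexp(1.0f, static_cast<int>(code) - kE8M0Bias);
}
```

A block whose amax is 448 gets code 127 (scale 1.0), and a block whose amax is 1.0 gets code 119 (scale 2^-8). A columnwise scale_inv would apply the same computation to 32-element blocks along the other axis, which is consistent with the quantizer keeping separate rowwise and columnwise tensors.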