Implementation:NVIDIA TransformerEngine Cast C API
| Field | Value |
|---|---|
| Sources | TransformerEngine |
| Domains | Deep_Learning, Optimization |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Declares the C API for quantizing (casting) tensors to FP8, MXFP8, and blockwise FP8 formats, including fused operations that combine quantization with bias-gradient reduction and activation backward passes.
Description
cast.h provides the casting/quantization API central to TransformerEngine's FP8 training pipeline. The output tensor's NVTEScalingMode determines the quantization strategy:
- Per-tensor delayed scaling (NVTE_DELAYED_TENSOR_SCALING): Single scale factor for the entire tensor, using a precalculated amax.
- MXFP8 1D block scaling (NVTE_MXFP8_1D_SCALING): One scale factor per block of 32 elements along rows (1x32) or columns (32x1).
- FP8 block scaling (NVTE_BLOCK_SCALING_1D / NVTE_BLOCK_SCALING_2D): Scale factors per 1x128 or 128x128 tiles.
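To make the granularity of these modes concrete, the following is a minimal sketch in plain C that counts how many scale factors each layout implies for a rows x cols matrix, and derives a per-tensor scale from a precalculated amax. The names SketchLayout, sketch_scale_count, and sketch_delayed_scale are hypothetical and are not part of cast.h. The counts ignore any padding or alignment the library may apply to its scale tensors, and the scale formula assumes the simplest recipe (no margin term).

```c
#include <stddef.h>

/* Hypothetical stand-ins for the layouts described above; these are NOT the
   NVTEScalingMode enumerators from the library. */
typedef enum {
    SKETCH_PER_TENSOR,    /* one scale for the whole tensor */
    SKETCH_MXFP8_ROWWISE, /* 1x32 blocks along each row */
    SKETCH_MXFP8_COLWISE, /* 32x1 blocks along each column */
    SKETCH_BLOCK_1D,      /* 1x128 tiles */
    SKETCH_BLOCK_2D       /* 128x128 tiles */
} SketchLayout;

static size_t sketch_ceil_div(size_t a, size_t b) { return (a + b - 1) / b; }

/* Number of scale factors for a rows x cols matrix, ignoring padding. */
size_t sketch_scale_count(SketchLayout layout, size_t rows, size_t cols) {
    switch (layout) {
        case SKETCH_PER_TENSOR:    return 1;
        case SKETCH_MXFP8_ROWWISE: return rows * sketch_ceil_div(cols, 32);
        case SKETCH_MXFP8_COLWISE: return sketch_ceil_div(rows, 32) * cols;
        case SKETCH_BLOCK_1D:      return rows * sketch_ceil_div(cols, 128);
        case SKETCH_BLOCK_2D:      return sketch_ceil_div(rows, 128) *
                                          sketch_ceil_div(cols, 128);
    }
    return 0;
}

/* Per-tensor scale from a precalculated amax: multiplying by this scale maps
   amax onto fp8_max, the largest finite value of the target format
   (448 for FP8 E4M3, 57344 for FP8 E5M2). Falls back to 1 when amax is 0. */
float sketch_delayed_scale(float amax, float fp8_max) {
    return (amax > 0.0f) ? fp8_max / amax : 1.0f;
}
```

For a 256 x 1024 matrix, this sketch gives 1 scale for per-tensor scaling, 8192 for either MXFP8 orientation, 2048 for 1x128 tiles, and 16 for 128x128 tiles, which shows how much scaling metadata each mode trades for finer granularity.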
Fused operations reduce memory bandwidth by combining multiple backward-pass operations:
- nvte_quantize_dbias: Quantize + column reduction for bias gradient
- nvte_quantize_dbias_dgelu/dsilu/drelu/dqgelu/dsrelu: Quantize + dbias + activation backward
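The fused semantics can be illustrated with a CPU reference, shown here as a sketch rather than the library's kernel. It makes one pass over a row-major rows x cols gradient, optionally multiplies by an activation derivative (ReLU is used because it needs no math library), writes the scaled output, and accumulates the per-column sum for the bias gradient. All names are hypothetical, and the multiplication by scale only stands in for the FP8 cast: the output stays in float so the example is portable.

```c
#include <stddef.h>

/* Derivative of ReLU with respect to its input. */
float sketch_drelu(float x) { return x > 0.0f ? 1.0f : 0.0f; }

/* CPU reference for "quantize + dbias (+ activation backward)".
   grad, act_input, out: row-major rows x cols; dbias: cols elements.
   dact may be NULL, which corresponds to the plain quantize + dbias case;
   act_input is only read when dact is non-NULL. */
void sketch_quantize_dbias(const float *grad, const float *act_input,
                           float (*dact)(float), float scale,
                           float *out, float *dbias,
                           size_t rows, size_t cols) {
    for (size_t j = 0; j < cols; ++j) {
        dbias[j] = 0.0f;
    }
    for (size_t i = 0; i < rows; ++i) {
        for (size_t j = 0; j < cols; ++j) {
            size_t idx = i * cols + j;
            float g = grad[idx];
            if (dact != NULL) {
                g *= dact(act_input[idx]); /* activation backward */
            }
            out[idx] = g * scale;          /* stand-in for the FP8 cast */
            dbias[j] += g;                 /* reduction down each column */
        }
    }
}
```

Doing all three steps in one loop means each gradient element is read once, which is the memory-bandwidth saving the fused functions are after. An unfused version would read the gradient separately for the activation backward, the reduction, and the cast.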
Usage
Use these functions at layer boundaries in the FP8 training pipeline, wherever tensors are cast between high-precision and FP8 representations.
Code Reference
Source Location
- Repository: NVIDIA/TransformerEngine
- File: transformer_engine/common/include/transformer_engine/cast.h
- Lines: 1-438
Signature
void nvte_quantize(const NVTETensor input, NVTETensor output, cudaStream_t stream);
void nvte_group_quantize(const NVTEGroupedTensor input, NVTEGroupedTensor output,
cudaStream_t stream);
void nvte_quantize_noop(const NVTETensor input, NVTETensor output,
NVTETensor noop, cudaStream_t stream);
void nvte_quantize_v2(const NVTETensor input, NVTETensor output,
const NVTEQuantizationConfig quant_config, cudaStream_t stream);
void nvte_quantize_dbias(const NVTETensor input, NVTETensor output,
NVTETensor dbias, NVTETensor workspace, cudaStream_t stream);
void nvte_quantize_dbias_dgelu(const NVTETensor input, const NVTETensor act_input,
NVTETensor output, NVTETensor dbias,
NVTETensor workspace, cudaStream_t stream);
Import
#include "transformer_engine/cast.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input | NVTETensor | Yes | Input tensor to be quantized |
| stream | cudaStream_t | Yes | CUDA stream for the operation |
Outputs
| Name | Type | Description |
|---|---|---|
| output | NVTETensor | Quantized output tensor (FP8/MXFP8/blockwise FP8) |
| dbias | NVTETensor | Bias gradient from column reduction (fused variants) |
Usage Examples
#include "transformer_engine/cast.h"
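// Basic quantization: the scaling mode set on fp8_output selects the strategy.
// input, fp8_output, and stream are assumed to have been created by the caller.
nvte_quantize(input, fp8_output, stream);
// Fused quantize + bias gradient + GeLU backward.
// grad_output is the incoming gradient; gelu_input is the input of the forward GeLU.
nvte_quantize_dbias_dgelu(grad_output, gelu_input, fp8_output,
                          dbias, workspace, stream);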