

Implementation:NVIDIA TransformerEngine Cast C API



Field | Value
Sources | TransformerEngine
Domains | Deep_Learning, Optimization
Last Updated | 2026-02-07 14:00 GMT

Overview

Declares the C API for quantizing (casting) tensors to FP8, MXFP8, and blockwise FP8 formats, including fused operations that combine quantization with bias-gradient reduction and activation backward passes.

Description

cast.h provides the casting/quantization API central to TransformerEngine's FP8 training pipeline. The output tensor's NVTEScalingMode determines the quantization strategy:

  • Per-tensor delayed scaling (NVTE_DELAYED_TENSOR_SCALING): a single scale factor for the entire tensor, derived from a precalculated amax.
  • MXFP8 1D block scaling (NVTE_MXFP8_1D_SCALING): one scale factor per block of 32 elements, along rows (1x32) or along columns (32x1).
  • FP8 block scaling (NVTE_BLOCK_SCALING_1D / NVTE_BLOCK_SCALING_2D): one scale factor per 1x128 tile (1D) or per 128x128 tile (2D).
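
To make the block sizes above concrete, the following sketch counts how many scale factors each mode implies for a row-major tensor of a given shape. It is plain C written for illustration only: SketchScalingMode and sketch_num_scales are hypothetical names, not part of cast.h, and the count ignores any padding or alignment the library may apply to its scale-factor storage.

```c
#include <assert.h>
#include <stddef.h>

/* Ceiling division: number of blocks of size `block` needed to cover n elements. */
static size_t ceil_div(size_t n, size_t block) {
    assert(block > 0);
    return (n + block - 1) / block;
}

/* Hypothetical enum for this sketch only; it is NOT the library's NVTEScalingMode. */
typedef enum {
    SKETCH_PER_TENSOR,     /* one scale for the whole tensor */
    SKETCH_MXFP8_ROWWISE,  /* 1x32 blocks along each row */
    SKETCH_MXFP8_COLWISE,  /* 32x1 blocks along each column */
    SKETCH_BLOCK_1D,       /* 1x128 tiles */
    SKETCH_BLOCK_2D        /* 128x128 tiles */
} SketchScalingMode;

/* Number of scale factors implied by each mode for a rows x cols tensor. */
static size_t sketch_num_scales(SketchScalingMode mode, size_t rows, size_t cols) {
    switch (mode) {
        case SKETCH_PER_TENSOR:    return 1;
        case SKETCH_MXFP8_ROWWISE: return rows * ceil_div(cols, 32);
        case SKETCH_MXFP8_COLWISE: return ceil_div(rows, 32) * cols;
        case SKETCH_BLOCK_1D:      return rows * ceil_div(cols, 128);
        case SKETCH_BLOCK_2D:      return ceil_div(rows, 128) * ceil_div(cols, 128);
    }
    return 0; /* unreachable for valid modes */
}
```

For a 256 x 1024 tensor this gives 1, 8192, 8192, 2048, and 16 scale factors for the five cases in order, which shows how much finer the scaling granularity becomes as the block shrinks.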

Fused operations reduce memory traffic by performing several backward-pass steps in a single call, so the input gradient is not re-read once per step:

  • nvte_quantize_dbias: Quantize + column reduction for bias gradient
  • nvte_quantize_dbias_dgelu/dsilu/drelu/dqgelu/dsrelu: Quantize + dbias + activation backward
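
The semantics of the quantize + dbias fusion can be sketched on the CPU as a single pass that produces both results. This is a reference illustration under stated assumptions, not the library's kernel: sketch_quantize_dbias is a hypothetical name, a single per-tensor scale is assumed, and the cast is approximated by scaling and clamping to the E4M3 range (rounding to the FP8 grid is omitted).

```c
#include <assert.h>
#include <stddef.h>

/* Largest finite magnitude representable in FP8 E4M3. */
#define SKETCH_FP8_E4M3_MAX 448.0f

static float sketch_clamp(float x, float lo, float hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}

/* Hypothetical CPU reference: one pass over a row-major [rows x cols] matrix
 * that yields the scaled (and clamped) output plus the per-column sum.
 * dbias is accumulated from the unscaled input, in full precision. */
static void sketch_quantize_dbias(const float *input, size_t rows, size_t cols,
                                  float scale, float *output, float *dbias) {
    assert(input && output && dbias);
    for (size_t j = 0; j < cols; ++j) {
        dbias[j] = 0.0f;
    }
    for (size_t i = 0; i < rows; ++i) {
        for (size_t j = 0; j < cols; ++j) {
            const float x = input[i * cols + j];
            dbias[j] += x; /* column reduction for the bias gradient */
            output[i * cols + j] = sketch_clamp(x * scale,
                                                -SKETCH_FP8_E4M3_MAX,
                                                SKETCH_FP8_E4M3_MAX);
        }
    }
}
```

Each input element is read once and contributes to both outputs, which is the traffic saving that the fusion is meant to provide.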

Usage

Call these functions wherever the FP8 training pipeline casts a tensor between a high-precision representation and an FP8 representation, typically at layer boundaries.

Code Reference

Source Location

Repository: NVIDIA/TransformerEngine
File: transformer_engine/common/include/transformer_engine/cast.h
Lines: 1-438

Signature

void nvte_quantize(const NVTETensor input, NVTETensor output, cudaStream_t stream);
void nvte_group_quantize(const NVTEGroupedTensor input, NVTEGroupedTensor output,
                         cudaStream_t stream);
void nvte_quantize_noop(const NVTETensor input, NVTETensor output,
                        NVTETensor noop, cudaStream_t stream);
void nvte_quantize_v2(const NVTETensor input, NVTETensor output,
                      const NVTEQuantizationConfig quant_config, cudaStream_t stream);
void nvte_quantize_dbias(const NVTETensor input, NVTETensor output,
                         NVTETensor dbias, NVTETensor workspace, cudaStream_t stream);
void nvte_quantize_dbias_dgelu(const NVTETensor input, const NVTETensor act_input,
                               NVTETensor output, NVTETensor dbias,
                               NVTETensor workspace, cudaStream_t stream);

Import

#include "transformer_engine/cast.h"

I/O Contract

Inputs

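Name | Type | Required | Description
input | NVTETensor (NVTEGroupedTensor for nvte_group_quantize) | Yes | Input tensor to be quantized
act_input | NVTETensor | Fused activation-backward variants only | Input to the forward activation (for example, the GELU input for nvte_quantize_dbias_dgelu)
noop | NVTETensor | nvte_quantize_noop only | No-op flag tensor
quant_config | NVTEQuantizationConfig | nvte_quantize_v2 only | Quantization configuration
workspace | NVTETensor | dbias variants only | Workspace tensor
stream | cudaStream_t | Yes | CUDA stream for the operation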

Outputs

Name | Type | Description
output | NVTETensor | Quantized output tensor (FP8/MXFP8/blockwise FP8)
dbias | NVTETensor | Bias gradient from column reduction (fused variants)

Usage Examples

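#include "transformer_engine/cast.h"

// Fragment: assumes every NVTETensor below and the cudaStream_t `stream`
// were created elsewhere and are valid; tensor setup is omitted.

// Basic quantization: the scaling mode of fp8_output selects the strategy
nvte_quantize(input, fp8_output, stream);

// Fused quantize + bias gradient + GELU backward
nvte_quantize_dbias_dgelu(grad_output, gelu_input, fp8_output,
                          dbias, workspace, stream);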
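
What a fused activation-backward call computes can be illustrated with a CPU reference. The sketch below uses ReLU rather than GELU because its derivative is trivial; it is a hypothetical helper (sketch_quantize_dbias_drelu is not a library function), assumes a single per-tensor scale, and leaves out clamping and rounding to the FP8 grid so that the order of operations stays visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU reference for "quantize + dbias + ReLU backward" on
 * row-major [rows x cols] matrices. The activation gradient is applied first;
 * both the column reduction and the scaled output are taken from that result. */
static void sketch_quantize_dbias_drelu(const float *grad, const float *act_input,
                                        size_t rows, size_t cols, float scale,
                                        float *output, float *dbias) {
    assert(grad && act_input && output && dbias);
    for (size_t j = 0; j < cols; ++j) {
        dbias[j] = 0.0f;
    }
    for (size_t i = 0; i < rows; ++i) {
        for (size_t j = 0; j < cols; ++j) {
            const size_t idx = i * cols + j;
            /* ReLU backward: pass the gradient only where the forward input was positive. */
            const float dact = act_input[idx] > 0.0f ? grad[idx] : 0.0f;
            dbias[j] += dact;            /* bias gradient: per-column sum of dact */
            output[idx] = dact * scale;  /* stand-in for the FP8 cast of dact */
        }
    }
}
```

The point of the sketch is the data flow: the incoming gradient and the saved activation input are each read once, and the intermediate activation gradient never has to be written out in high precision.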
