Implementation:NVIDIA TransformerEngine CUDA Graph

Field	Value
Sources	TransformerEngine
Domains	Deep_Learning, PyTorch, Optimization
Last Updated	2026-02-07 14:00 GMT

Overview

Provides CUDA Graph capture support for TransformerEngine modules with FP8-aware graph recording and replay for reduced kernel launch overhead.

Description

make_graphed_callables is the main API. It runs warmup iterations, captures the forward and backward passes of each module into torch.cuda.CUDAGraph objects, and returns Graphed wrapper callables that replay the recorded graphs.

FP8 needs special handling: save_fp8_tensors and restore_fp8_tensors preserve FP8 scaling metadata (amax history, scale factors) across graph captures. The implementation also supports an interleaved capture order (for pipeline parallelism), delayed weight gradient computation, autocast contexts, and graph-safe RNG state management.

_graph_context_wrapper works around a PyTorch bug in which garbage collection during graph capture causes CUDA errors. The _IS_GRAPH_CAPTURING flag lets other modules detect when they are being captured.
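The capture flag and the garbage-collection workaround described above can be illustrated with a minimal, pure-Python sketch. The flag helpers follow the names in the signature list, but `capture_guard` is a hypothetical helper, and combining the flag and the GC pause in one context manager is a choice made here for brevity, not necessarily how graph.py arranges them. In real use the inner context would be something like `torch.cuda.graph(...)`; a no-op context stands in for it so the sketch runs without a GPU.

```python
import contextlib
import gc

# Module-level flag, mirroring the _IS_GRAPH_CAPTURING idea from the text.
_IS_GRAPH_CAPTURING = False


def set_capture_start() -> None:
    """Mark that a graph capture is in progress."""
    global _IS_GRAPH_CAPTURING
    _IS_GRAPH_CAPTURING = True


def set_capture_end() -> None:
    """Mark that the graph capture has finished."""
    global _IS_GRAPH_CAPTURING
    _IS_GRAPH_CAPTURING = False


def is_graph_capturing() -> bool:
    """Let other code check whether it is running inside a capture."""
    return _IS_GRAPH_CAPTURING


@contextlib.contextmanager
def capture_guard(inner_context):
    """Hypothetical helper: enter inner_context with automatic GC paused
    and the capture flag set, then restore both on exit."""
    gc_was_enabled = gc.isenabled()
    gc.disable()  # avoid collector runs while the capture is active
    set_capture_start()
    try:
        with inner_context:
            yield
    finally:
        set_capture_end()
        if gc_was_enabled:
            gc.enable()


# A no-op context stands in for the real graph-capture context.
with capture_guard(contextlib.nullcontext()):
    inside = is_graph_capturing()
```

Restoring the previous GC setting in a `finally` block means an exception during capture does not leave the collector disabled or the flag stuck on.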

Usage

Use it to reduce per-iteration CPU overhead from kernel launches. The gain is largest for small-batch inference and short-sequence training, where launch overhead dominates.

Code Reference

Source Location

Repository: NVIDIA/TransformerEngine
File: transformer_engine/pytorch/graph.py
Lines: 1-1267

Signature

def set_capture_start() -> None: ...
def set_capture_end() -> None: ...
def is_graph_capturing() -> bool: ...
def graph_pool_handle(): ...
def save_fp8_tensors(modules, amax_history): ...
def restore_fp8_tensors(modules, amax_history): ...

def make_graphed_callables(
    modules, sample_args, num_warmup_iters=3,
    allow_unused_input=False, sample_kwargs=None, ...
): ...

class Graphed: ...
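The save/restore pair in the signature list follows a snapshot-and-restore pattern, which can be sketched in plain Python. Everything here is a toy under stated assumptions: `ToyFP8Module`, `snapshot_scaling_state`, `restore_scaling_state`, and `run_warmup` are hypothetical names, and the update rule inside `run_warmup` is arbitrary. The sketch only shows why a snapshot is useful: passes run on sample data would otherwise leave their statistics in the scaling metadata.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToyFP8Module:
    """Hypothetical stand-in for a module carrying FP8 scaling metadata."""
    scale: List[float] = field(default_factory=lambda: [1.0])
    amax_history: List[float] = field(default_factory=lambda: [0.0, 0.0])


def snapshot_scaling_state(modules: List[ToyFP8Module]) -> List[Dict[str, Any]]:
    """Deep-copy each module's scaling metadata so later passes cannot alter it."""
    return [
        copy.deepcopy({"scale": m.scale, "amax_history": m.amax_history})
        for m in modules
    ]


def restore_scaling_state(
    modules: List[ToyFP8Module], saved: List[Dict[str, Any]]
) -> None:
    """Write the saved metadata back onto the modules."""
    for module, state in zip(modules, saved):
        module.scale = copy.deepcopy(state["scale"])
        module.amax_history = copy.deepcopy(state["amax_history"])


def run_warmup(module: ToyFP8Module, amax: float) -> None:
    """Toy pass on sample data: records a new amax and rescales (arbitrary rule)."""
    module.amax_history = [amax] + module.amax_history[:-1]
    module.scale = [1.0 / max(amax, 1e-12)]


modules = [ToyFP8Module(), ToyFP8Module()]
saved = snapshot_scaling_state(modules)

for m in modules:
    run_warmup(m, amax=4.0)  # sample-data passes mutate the metadata

restore_scaling_state(modules, saved)  # metadata is back to its pre-warmup values
```

Deep copies are used on both sides so that the snapshot and the live metadata never share storage.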

Import

from transformer_engine.pytorch.graph import (
    make_graphed_callables,
    is_graph_capturing,
    save_fp8_tensors,
    restore_fp8_tensors,
)

I/O Contract

Inputs

Name	Type	Required	Description
modules	`Union[nn.Module, tuple]`	Yes	Module(s) to capture into CUDA graphs
sample_args	`tuple`	Yes	Sample input arguments for graph capture
num_warmup_iters	`int`	No	Number of warmup iterations before capture (default: 3)
allow_unused_input	`bool`	No	Whether to allow inputs that are unused in the backward pass (default: False)
sample_kwargs	`dict`	No	Sample keyword arguments for graph capture (default: None)

Outputs

Name	Type	Description
graphed_callables	`Graphed` or `tuple[Graphed]`	A single callable when one module is passed, a tuple when a tuple of modules is passed; each replays its captured CUDA graph

Usage Examples

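import torch
import transformer_engine.pytorch as te
from transformer_engine.pytorch.graph import make_graphed_callables

# Example module and input; the layer and batch sizes are arbitrary placeholders
te_model = te.Linear(1024, 1024)
sample_input = torch.randn(32, 1024, device="cuda")

# Capture a TE module into a CUDA graph
graphed_model = make_graphed_callables(
    te_model,
    sample_args=(sample_input,),
    num_warmup_iters=3,
)

# Use the graphed callable (replays the CUDA graph).
# Inputs should match the shape and dtype of the sample arguments.
input_tensor = torch.randn_like(sample_input)
output = graphed_model(input_tensor)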
