Implementation:NVIDIA TransformerEngine CPU Offload

From Leeroopedia
Revision as of 15:57, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/NVIDIA_TransformerEngine_CPU_Offload.md)


Field | Value
Sources | TransformerEngine
Domains | Deep_Learning, PyTorch, Optimization
Last Updated | 2026-02-07 14:00 GMT

Overview

Implements GPU-to-CPU offloading of activation tensors saved for the backward pass. This reduces GPU memory usage during training by moving the tensors with asynchronous device-to-host (D2H) and host-to-device (H2D) transfers.

Description

The module uses PyTorch's saved_tensors_hooks to intercept tensors that autograd saves during the forward pass. While the context from get_cpu_offload_context is active, an OffloadSynchronizer (either DefaultOffloadSynchronizer or ManualOffloadSynchronizer) manages the asynchronous D2H and H2D transfers using CUDA streams and events. TensorGroup bundles tensors with their synchronization events. TensorGroupProcessor optimizes the offload by deduplicating tensors and by switching views to their base tensors before transfer. start_offload marks tensors as ready for offload by recording CUDA events, while mark_not_offload prevents specific tensors from being offloaded. For backward compatibility, the legacy V1 code path remains available through the NVTE_CPU_OFFLOAD_V1 environment variable.
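The interception mechanism described above can be illustrated with plain PyTorch. The following is a minimal sketch of saved_tensors_hooks only, not the TransformerEngine implementation: pack_to_cpu, unpack_from_cpu, and packed_shapes are hypothetical names introduced for illustration, and the copies here are synchronous rather than stream-based.

```python
import torch
from torch.autograd.graph import saved_tensors_hooks

packed_shapes = []  # shapes of every tensor autograd saved for backward

def pack_to_cpu(tensor):
    # Called once for each tensor saved during the forward pass.
    packed_shapes.append(tuple(tensor.shape))
    # Remember the original device and keep only a host copy.
    return tensor.device, tensor.to("cpu")

def unpack_from_cpu(packed):
    # Called when the backward pass needs the saved tensor again.
    device, cpu_tensor = packed
    return cpu_tensor.to(device)

# Falls back to CPU so the sketch also runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, 8, device=device, requires_grad=True)
w = torch.randn(8, 2, device=device, requires_grad=True)

# Only tensors saved inside this block go through the hooks.
with saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    y = (x @ w).sum()

# Backward triggers unpack_from_cpu for each saved tensor.
y.backward()
```

PyTorch also ships torch.autograd.graph.save_on_cpu, a ready-made context manager built on the same hooks. The module described here adds stream and event management on top of this basic pattern.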

Usage

Enable this when GPU memory is a bottleneck during training. It trades transfer overhead for memory savings: activation tensors are copied asynchronously to CPU memory during the forward pass and retrieved during the backward pass. Step time can increase when the copies cannot be fully overlapped with compute.
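The asynchronous transfer pattern mentioned above can be sketched with standard PyTorch CUDA primitives. This is an illustrative sketch under stated assumptions, not the code in cpu_offload.py: offload_to_host and reload_to_device are hypothetical helpers, and a CPU fallback is included so the example runs without a GPU.

```python
import torch

def offload_to_host(tensor, copy_stream=None):
    """Copy a tensor to host memory. Returns (host_tensor, event_or_None)."""
    if not tensor.is_cuda:
        # CPU fallback: nothing to overlap, so just copy.
        return tensor.clone(), None
    copy_stream = copy_stream or torch.cuda.Stream()
    # Pinned (page-locked) host memory is needed for a truly asynchronous copy.
    host = torch.empty(tensor.shape, dtype=tensor.dtype, device="cpu", pin_memory=True)
    # The copy must not start before the kernels producing the tensor finish.
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        host.copy_(tensor, non_blocking=True)
    # Tell the caching allocator that the side stream still uses this memory.
    tensor.record_stream(copy_stream)
    done = torch.cuda.Event()
    done.record(copy_stream)
    return host, done

def reload_to_device(host, done, device):
    """Bring an offloaded tensor back once its D2H copy has completed."""
    if done is not None:
        done.synchronize()  # wait for the D2H copy before reading host memory
    return host.to(device, non_blocking=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
activation = torch.randn(1024, device=device)
host_copy, done = offload_to_host(activation)
restored = reload_to_device(host_copy, done, device)
```

Running the copy on a side stream is what lets the transfer overlap with compute on the main stream. The event is the handle a synchronizer would wait on before the GPU memory is reused or the host copy is read.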

Code Reference

Source Location

Repository: NVIDIA/TransformerEngine
File: transformer_engine/pytorch/cpu_offload.py
Lines: 1-915

Signature

def is_cpu_offload_enabled(): ...
def mark_activation_offload(*tensors): ...
def mark_not_offload(*tensors: torch.Tensor): ...
def start_offload(*tensors: torch.Tensor, offload_base_tensor: bool = False): ...
def get_cpu_offload_context(enabled: bool = False): ...

class TensorGroup: ...
class TensorGroupProcessor: ...
class OffloadSynchronizer: ...
class DefaultOffloadSynchronizer(OffloadSynchronizer): ...
class ManualOffloadSynchronizer(OffloadSynchronizer): ...

Import

from transformer_engine.pytorch.cpu_offload import (
    get_cpu_offload_context,
    is_cpu_offload_enabled,
    mark_activation_offload,
    mark_not_offload,
    start_offload,
)

I/O Contract

Inputs

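Name | Type | Required | Description
enabled | bool | No (default: False) | Whether to enable CPU offloading (get_cpu_offload_context)
*tensors | torch.Tensor | No | Tensors passed to mark_activation_offload, mark_not_offload, or start_offload
offload_base_tensor | bool | No (default: False) | Whether start_offload should offload the base tensors of views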
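To illustrate why offloading the base tensor of a view can help, here is a minimal sketch in plain PyTorch. It is not the TensorGroupProcessor logic: base_of is a hypothetical helper, and keying on data_ptr() is an assumption chosen for brevity.

```python
import torch

def base_of(t):
    # Hypothetical helper: a view exposes its owner via `_base`;
    # a tensor that is not a view has `_base` set to None.
    return t if t._base is None else t._base

activations = torch.randn(3, 4)
row_view = activations[1]      # view of one row
col_view = activations[:, 2]   # strided view of one column

# Deduplicate: views of the same buffer collapse to a single base entry.
to_offload = {}
for t in (activations, row_view, col_view):
    base = base_of(t)
    to_offload.setdefault(base.data_ptr(), base)

# Stand-in for the D2H copy of the single base buffer.
base = next(iter(to_offload.values()))
host_copy = base.clone()

# A view can be rebuilt from the copied base using its size, stride, and offset.
rebuilt_row = host_copy.as_strided(
    row_view.size(), row_view.stride(), row_view.storage_offset()
)
```

Three tensors were requested, but only one buffer needs to be transferred. The views are then reconstructed from metadata instead of being copied separately.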

Outputs

Name | Type | Description
context | contextmanager | Context manager that enables offloading within its scope
synchronizer | OffloadSynchronizer | Object managing offload synchronization

Usage Examples

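from transformer_engine.pytorch.cpu_offload import get_cpu_offload_context

# model, loss_fn, input_data, and target are assumed to be defined by the caller.
# The Outputs table lists two return values. This sketch assumes they come back
# as a (context, sync_function) pair; verify against your installed version.
offload_context, sync_function = get_cpu_offload_context(enabled=True)

# Run the forward pass inside the context so saved activations are intercepted.
with offload_context:
    output = model(input_data)
output = sync_function(output)

loss = loss_fn(output, target)
loss.backward()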
