Implementation:NVIDIA TransformerEngine CPU Offload
| Field | Value |
|---|---|
| Sources | TransformerEngine |
| Domains | Deep_Learning, PyTorch, Optimization |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Implements GPU-to-CPU offloading of activation tensors saved for the backward pass, reducing GPU memory usage during training via asynchronous device-to-host (D2H) and host-to-device (H2D) transfers.
Description
Uses PyTorch's saved_tensors_hooks to intercept tensors saved during the forward pass. When the context from get_cpu_offload_context is active, an OffloadSynchronizer (either DefaultOffloadSynchronizer or ManualOffloadSynchronizer) manages asynchronous D2H/H2D transfers using CUDA streams and events. TensorGroup bundles tensors with their synchronization events, and TensorGroupProcessor optimizes offloading by deduplicating tensors and replacing views with their base tensors before transfer. start_offload marks tensors as ready for offload by recording CUDA events, while mark_not_offload excludes specific tensors from offloading. The legacy V1 code path remains available through the NVTE_CPU_OFFLOAD_V1 environment variable for backward compatibility.
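The interception mechanism described above can be sketched with plain PyTorch. This is a minimal illustration, not TransformerEngine's implementation: it uses the real `torch.autograd.graph.saved_tensors_hooks` API, but the helper names `pack_to_cpu`, `unpack_from_cpu`, and `packed_log` are hypothetical, and the copies are synchronous (no CUDA streams, events, or pinned buffers). On a machine without a GPU the moves are no-ops, so the sketch still runs.

```python
import torch
from torch.autograd.graph import saved_tensors_hooks

packed_log = []  # shapes of tensors intercepted at save time (illustration only)


def pack_to_cpu(tensor):
    """Runs when autograd saves a tensor for backward: keep a CPU copy instead."""
    packed_log.append(tuple(tensor.shape))
    return tensor.device, tensor.to("cpu")


def unpack_from_cpu(packed):
    """Runs when backward needs the tensor: move it back to its original device."""
    device, cpu_tensor = packed
    return cpu_tensor.to(device)


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, 8, device=device, requires_grad=True)
w = torch.randn(8, 2, device=device, requires_grad=True)

# Every tensor saved for backward inside this block goes through the hooks.
with saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    y = (x @ w).relu().sum()

# Backward triggers unpack_from_cpu for each saved tensor it needs.
y.backward()
print(f"intercepted {len(packed_log)} saved tensors; grad shape {tuple(x.grad.shape)}")
```

PyTorch also ships `torch.autograd.graph.save_on_cpu`, a ready-made context manager built on the same hooks. The synchronizers described here presumably add what this sketch leaves out: overlapping the copies with compute.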
Usage
Enable when GPU memory is the bottleneck during training. Trades host-device transfer overhead for memory savings by asynchronously copying activation tensors to CPU memory during the forward pass and copying them back during the backward pass.
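To make the asynchronous transfer concrete, here is a hedged sketch of a D2H/H2D round trip on a side CUDA stream with a pinned host buffer. The function names `offload_async`, `reload_async`, and `roundtrip` are hypothetical and do not come from cpu_offload.py; the sketch uses only standard PyTorch stream and event calls, and it falls back to a plain copy when CUDA is unavailable.

```python
import torch


def offload_async(gpu_tensor, copy_stream):
    """Start a non-blocking D2H copy on copy_stream; return (cpu_buffer, event)."""
    cpu_buffer = torch.empty(
        gpu_tensor.shape, dtype=gpu_tensor.dtype, device="cpu", pin_memory=True
    )
    # The copy must not start before the compute stream has produced the tensor.
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        cpu_buffer.copy_(gpu_tensor, non_blocking=True)
    # Tell the allocator the side stream still reads this GPU memory.
    gpu_tensor.record_stream(copy_stream)
    done = torch.cuda.Event()
    done.record(copy_stream)
    return cpu_buffer, done


def reload_async(cpu_buffer, device, copy_stream, offload_done):
    """Start a non-blocking H2D copy and make the compute stream wait for it."""
    copy_stream.wait_event(offload_done)
    with torch.cuda.stream(copy_stream):
        gpu_tensor = cpu_buffer.to(device, non_blocking=True)
    torch.cuda.current_stream().wait_stream(copy_stream)
    # The tensor was allocated on the side stream but is used on the compute stream.
    gpu_tensor.record_stream(torch.cuda.current_stream())
    return gpu_tensor


def roundtrip(tensor):
    """Offload and reload a tensor; plain copy when the tensor is not on a GPU."""
    if not tensor.is_cuda:
        return tensor.clone()
    stream = torch.cuda.Stream()
    cpu_buffer, done = offload_async(tensor, stream)
    return reload_async(cpu_buffer, tensor.device, stream, done)


device = "cuda" if torch.cuda.is_available() else "cpu"
activation = torch.arange(12, dtype=torch.float32, device=device).reshape(3, 4)
restored = roundtrip(activation)
```

Pinned (page-locked) host memory is what allows the `non_blocking=True` copies to overlap with GPU compute; with pageable memory the transfer would typically serialize with the host thread.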
Code Reference
Source Location
- Repository: NVIDIA/TransformerEngine
- File: transformer_engine/pytorch/cpu_offload.py
- Lines: 1-915
Signature
def is_cpu_offload_enabled(): ...
def mark_activation_offload(*tensors): ...
def mark_not_offload(*tensors: torch.Tensor): ...
def start_offload(*tensors: torch.Tensor, offload_base_tensor: bool = False): ...
def get_cpu_offload_context(enabled: bool = False): ...
class TensorGroup: ...
class TensorGroupProcessor: ...
class OffloadSynchronizer: ...
class DefaultOffloadSynchronizer(OffloadSynchronizer): ...
class ManualOffloadSynchronizer(OffloadSynchronizer): ...
Import
from transformer_engine.pytorch.cpu_offload import (
    get_cpu_offload_context,
    is_cpu_offload_enabled,
    mark_activation_offload,
    start_offload,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| enabled | bool | No (default False) | Whether to enable CPU offloading |
| tensors | torch.Tensor (variadic) | No | Tensors to mark for offloading |
| offload_base_tensor | bool | No (default False) | Whether to offload base tensors of views |
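The distinction between a view and its base tensor, which `offload_base_tensor` and the deduplication step depend on, can be illustrated in plain PyTorch. This is a sketch under assumptions: `resolve_base` and `dedup_by_storage` are hypothetical helpers, and keying on the storage pointer is one plausible deduplication criterion, not necessarily the one TensorGroupProcessor uses. `untyped_storage()` requires PyTorch 2.0 or newer.

```python
import torch


def resolve_base(tensor):
    """Return the tensor that owns the storage a view points into."""
    return tensor._base if tensor._base is not None else tensor


def dedup_by_storage(tensors):
    """Keep one base tensor per underlying storage, in first-seen order."""
    unique = {}
    for t in tensors:
        base = resolve_base(t)
        unique.setdefault(base.untyped_storage().data_ptr(), base)
    return list(unique.values())


full = torch.zeros(4, 6)
left, right = full[:, :3], full[:, 3:]  # two views into the same storage
other = torch.ones(2)                   # independent tensor

# Four inputs, but only two distinct storages need to be transferred.
to_transfer = dedup_by_storage([left, right, other, full])
print([tuple(t.shape) for t in to_transfer])
```

Transferring the base once avoids copying overlapping memory several times, and the views can be re-created from the reloaded base using their shapes, strides, and storage offsets.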
Outputs
| Name | Type | Description |
|---|---|---|
| context | contextmanager | Context manager that enables offloading within its scope |
| synchronizer | OffloadSynchronizer | Object managing offload synchronization |
Usage Examples
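from transformer_engine.pytorch.cpu_offload import get_cpu_offload_context

# The call returns the context manager together with a synchronizer (see Outputs),
# so unpack both instead of entering the return value directly.
offload_context, synchronizer = get_cpu_offload_context(enabled=True)

# Enable CPU offloading for a training step
with offload_context:
    output = model(input_data)
# Assumption: the synchronizer is applied to the output after the offloaded forward pass
output = synchronizer(output)
loss = loss_fn(output, target)
loss.backward()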