Implementation:NVIDIA TransformerEngine CPU Offload V1
| Field | Value |
|---|---|
| Sources | TransformerEngine |
| Domains | Deep_Learning, PyTorch, Optimization |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Legacy (V1) implementation of CPU offloading for activation tensors during training, using an offload handler abstraction with synchronous and async double-buffered strategies.
Description
CpuOffloadSavedTensorHook is a context manager that installs custom pack/unpack hooks for saved tensors through PyTorch's internal _push_saved_tensors_default_hooks. CpuOffloadHookWithOffloadHandler extends it and delegates the actual offload and reload work to an OffloadHandler. SynchronizedGroupOffloadHandler performs synchronous device-to-host (D2H) and host-to-device (H2D) copies in groups. AsyncDoubleBufferGroupOffloadHandler adds double buffering on separate CUDA streams so that data transfers overlap with computation. GroupCommitFunction is a custom autograd Function that triggers group commits during the forward pass. The mark_activation_offload function tags tensors and their data sub-tensors for offloading, and includes a needs_force_clear flag for QuantizedTensorStorage.
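The pack/unpack mechanism described above can be illustrated with PyTorch's public torch.autograd.graph.saved_tensors_hooks context manager, which is swapped in here for the internal _push_saved_tensors_default_hooks call. This is a minimal sketch rather than the module's code: ToyOffloadHandler, its push and pop methods, and grad_with_offload are hypothetical names, and the handler simply keeps a CPU copy of each saved tensor with no grouping or streams.

```python
import torch
from torch.autograd.graph import saved_tensors_hooks


class ToyOffloadHandler:
    """Hypothetical handler: stores a CPU copy of every tensor autograd saves."""

    def __init__(self):
        self.store = {}  # tag -> (original device, CPU copy)

    def push(self, tensor):
        # Pack hook: called when autograd saves a tensor for backward.
        tag = len(self.store)
        self.store[tag] = (tensor.device, tensor.detach().to("cpu"))
        return tag  # autograd keeps this tag instead of the tensor

    def pop(self, tag):
        # Unpack hook: called when backward needs the saved tensor again.
        device, cpu_copy = self.store[tag]
        return cpu_copy.to(device)


def grad_with_offload(x, w):
    """Return d(loss)/dw and the number of tensors routed through the handler."""
    handler = ToyOffloadHandler()
    with saved_tensors_hooks(handler.push, handler.pop):
        loss = (x @ w).relu().sum()
    (grad_w,) = torch.autograd.grad(loss, w)
    return grad_w, len(handler.store)
```

The hooks only change where saved tensors live between forward and backward, so the gradient should match the one computed without hooks.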
Usage
Retained for backward compatibility and enabled through the NVTE_CPU_OFFLOAD_V1 environment variable. The newer cpu_offload.py is the default path.
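A flag of this kind is typically read once from the process environment. The sketch below shows one way a caller might check it. legacy_offload_requested is a hypothetical helper, and treating "1" as the enabling value is an assumption taken from the usage example on this page, not a confirmed detail of how the library parses the variable.

```python
import os


def legacy_offload_requested(environ=None):
    """Hypothetical helper: True when NVTE_CPU_OFFLOAD_V1 is set to "1"."""
    # Accept an explicit mapping so the check can be exercised without
    # touching the real process environment.
    env = os.environ if environ is None else environ
    return env.get("NVTE_CPU_OFFLOAD_V1", "0") == "1"
```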
Code Reference
Source Location
- Repository: NVIDIA/TransformerEngine
- File: transformer_engine/pytorch/cpu_offload_v1.py
- Lines: 1-743
Signature
def mark_activation_offload(*tensors): ...
def is_cpu_offload_enabled() -> bool: ...
def is_current_layer_offloaded() -> bool: ...
def get_cpu_offload_context(enabled: bool = False): ...
class CpuOffloadSavedTensorHook: ...
class CpuOffloadHookWithOffloadHandler(CpuOffloadSavedTensorHook): ...
class OffloadHandler: ...
class GroupCommitFunction(torch.autograd.Function): ...
class SynchronizedGroupOffloadHandler(OffloadHandler): ...
class AsyncDoubleBufferGroupOffloadHandler(SynchronizedGroupOffloadHandler): ...
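To make the handler hierarchy above more concrete, here is a minimal pure-Python sketch of group-based offloading. It is illustrative only: ToyGroupOffloadHandler, commit_group, offloaded_tags, the (group, index) tag layout, and the rule that only the first num_offload_groups groups go to the host are all assumptions made for this example, and plain Python objects stand in for tensors and device copies.

```python
class ToyOffloadHandler:
    """Hypothetical base interface: push when saving, pop when reloading."""

    def tensor_push(self, tensor):
        raise NotImplementedError

    def tensor_pop(self, tag):
        raise NotImplementedError


class ToyGroupOffloadHandler(ToyOffloadHandler):
    """Groups saved objects and 'offloads' only the first few groups."""

    def __init__(self, num_offload_groups):
        self.num_offload_groups = num_offload_groups
        self.current_group = 0
        self.index_in_group = 0
        self.host = {}    # tag -> object held in the simulated host store
        self.device = {}  # tag -> object left in the simulated device store

    def commit_group(self):
        # Close the current group; later pushes belong to the next one.
        self.current_group += 1
        self.index_in_group = 0

    def tensor_push(self, tensor):
        tag = (self.current_group, self.index_in_group)
        self.index_in_group += 1
        if self.current_group < self.num_offload_groups:
            self.host[tag] = tensor    # stands in for a D2H copy
        else:
            self.device[tag] = tensor  # kept where it is
        return tag

    def tensor_pop(self, tag):
        # Reload from whichever store holds the tag.
        if tag in self.host:
            return self.host.pop(tag)  # stands in for an H2D copy
        return self.device.pop(tag)

    def offloaded_tags(self):
        return sorted(self.host)
```

In this picture, a group commit marks the boundary between layers, so each layer's saved activations share a group index and can be moved together.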
Import
from transformer_engine.pytorch.cpu_offload_v1 import (
get_cpu_offload_context,
is_cpu_offload_enabled,
is_current_layer_offloaded,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| enabled | bool | No (default False) | Whether to enable CPU offloading |
| num_layers | int | No | Number of transformer layers (for buffer management) |
| tensors | torch.Tensor | No | Tensors to mark for offloading |
Outputs
| Name | Type | Description |
|---|---|---|
| context | contextmanager | Context manager enabling offloading within its scope |
Usage Examples
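import os

# Opt in to the legacy V1 path before importing the module.
os.environ["NVTE_CPU_OFFLOAD_V1"] = "1"

from transformer_engine.pytorch.cpu_offload_v1 import get_cpu_offload_context

# model, input_data, loss_fn and target are assumed to be defined by the caller.
with get_cpu_offload_context(enabled=True):
    output = model(input_data)
loss = loss_fn(output, target)
loss.backward()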