Implementation:NVIDIA TransformerEngine CPU Offload V1

Field	Value
Sources	TransformerEngine
Domains	Deep_Learning, PyTorch, Optimization
Last Updated	2026-02-07 14:00 GMT

Overview

Legacy (V1) implementation of CPU offloading for activation tensors during training. It is built on an offload handler abstraction with two strategies: synchronous group copies and asynchronous double-buffered transfers.

Description

CpuOffloadSavedTensorHook is a context manager that pushes custom pack/unpack hooks via PyTorch's internal _push_saved_tensors_default_hooks. CpuOffloadHookWithOffloadHandler extends this to delegate actual offload/reload to an OffloadHandler. SynchronizedGroupOffloadHandler performs synchronous D2H/H2D copies in groups. AsyncDoubleBufferGroupOffloadHandler uses double buffering with separate CUDA streams for overlapping data transfers with computation. GroupCommitFunction is a custom autograd Function that triggers group commits during the forward pass. The mark_activation_offload function tags tensors and their data sub-tensors for offloading, including a needs_force_clear flag for QuantizedTensorStorage.
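
As a rough illustration of the pack/unpack mechanism, the sketch below uses PyTorch's public torch.autograd.graph.saved_tensors_hooks context manager in place of the internal _push_saved_tensors_default_hooks call named above. The class CpuOffloadSketch and its counters are hypothetical and are not part of TransformerEngine. On a CPU-only machine the host copy is a no-op, so the example only demonstrates where the hooks fire.

```python
import torch


class CpuOffloadSketch:
    """Hypothetical helper: moves saved tensors to host memory and back."""

    def __init__(self):
        self.packed = 0    # number of tensors saved for backward
        self.unpacked = 0  # number of tensors restored during backward

    def pack(self, tensor):
        # Called when autograd saves a tensor in the forward pass.
        # Keep the original device so the tensor can be restored later.
        self.packed += 1
        return tensor.device, tensor.to("cpu")

    def unpack(self, state):
        # Called when the backward pass needs the saved tensor again.
        self.unpacked += 1
        device, host_copy = state
        return host_copy.to(device)


hooks = CpuOffloadSketch()
x = torch.randn(4, 3, requires_grad=True)

# Only tensors saved inside this scope go through the hooks.
with torch.autograd.graph.saved_tensors_hooks(hooks.pack, hooks.unpack):
    y = (x * x).sum()

y.backward()
```

The V1 module delegates the work done here in pack and unpack to an OffloadHandler, which is what allows the synchronous and double-buffered strategies to share the same hook class.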

Usage

Retained for backward compatibility and selected by setting the NVTE_CPU_OFFLOAD_V1 environment variable. When the variable is not set, the newer cpu_offload.py is the default path.

Code Reference

Source Location

Repository: NVIDIA/TransformerEngine
File: transformer_engine/pytorch/cpu_offload_v1.py
Lines: 1-743

Signature

def mark_activation_offload(*tensors): ...
def is_cpu_offload_enabled() -> bool: ...
def is_current_layer_offloaded() -> bool: ...
def get_cpu_offload_context(enabled: bool = False): ...

class CpuOffloadSavedTensorHook: ...
class CpuOffloadHookWithOffloadHandler(CpuOffloadSavedTensorHook): ...
class OffloadHandler: ...
class GroupCommitFunction(torch.autograd.Function): ...
class SynchronizedGroupOffloadHandler(OffloadHandler): ...
class AsyncDoubleBufferGroupOffloadHandler(SynchronizedGroupOffloadHandler): ...
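
The relationship between the hook classes and the handler classes listed above can be pictured with a toy model. This is a sketch under stated assumptions, not the library's code: ToyOffloadHandler, ToyHook, and every method name below are hypothetical, and plain strings in a dictionary stand in for tensors copied to host memory.

```python
class ToyOffloadHandler:
    """Hypothetical stand-in for an offload handler that stores items in groups."""

    def __init__(self):
        self.current_group = 0
        self.host_store = {}  # (group, index) -> payload kept on the "host"
        self._next_index = 0

    def push(self, payload):
        # Store the payload and return a small tag; the autograd graph
        # would hold on to the tag instead of the payload itself.
        tag = (self.current_group, self._next_index)
        self.host_store[tag] = payload
        self._next_index += 1
        return tag

    def pop(self, tag):
        # Retrieve the payload when the backward pass asks for it.
        return self.host_store.pop(tag)

    def commit_group(self):
        # Close the current group (for example at a layer boundary).
        self.current_group += 1
        self._next_index = 0


class ToyHook:
    """Hypothetical hook object that delegates save/load to a handler."""

    def __init__(self, handler):
        self.handler = handler

    def on_save(self, payload):
        return self.handler.push(payload)

    def on_load(self, tag):
        return self.handler.pop(tag)


handler = ToyOffloadHandler()
hook = ToyHook(handler)

# "Forward": two layers, each saving two activations, with a commit after each.
tags = []
for layer in range(2):
    tags.append(hook.on_save(f"act{layer}a"))
    tags.append(hook.on_save(f"act{layer}b"))
    handler.commit_group()

# "Backward": saved values are requested in reverse order.
restored = [hook.on_load(tag) for tag in reversed(tags)]
```

Grouping by layer is what gives a handler a natural unit to copy: a synchronous handler can move a whole group at once, while a double-buffered one can prefetch the next group on a separate stream while the current one is in use.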

Import

from transformer_engine.pytorch.cpu_offload_v1 import (
    get_cpu_offload_context,
    is_cpu_offload_enabled,
    is_current_layer_offloaded,
)

I/O Contract

Inputs

Name	Type	Required	Description
enabled	`bool`	No	Whether to enable CPU offloading (defaults to `False`)
num_layers	`int`	No	Number of transformer layers (for buffer management)
tensors	`torch.Tensor`	No	Tensors to mark for offloading (variadic `*tensors` argument of `mark_activation_offload`)

Outputs

Name	Type	Description
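context	`contextmanager`	Context manager enabling offloading within its scope
sync_function	`callable`	Applied to a layer's output tensor to commit the current offload group; returned together with `context` as a pair

Usage Examples

import os

# Opt in to the legacy V1 path before importing the module.
os.environ["NVTE_CPU_OFFLOAD_V1"] = "1"
from transformer_engine.pytorch.cpu_offload_v1 import get_cpu_offload_context

# The upstream function returns a (context, sync_function) pair, not a bare
# context manager; check this against the installed TransformerEngine version.
offload_context, sync_function = get_cpu_offload_context(enabled=True)

with offload_context:
    output = model(input_data)
output = sync_function(output)  # commit the offload group for this layer

loss = loss_fn(output, target)
loss.backward()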
