
Heuristic:NVIDIA TransformerEngine Build Optimization Tips

From Leeroopedia




Knowledge Sources
Domains Infrastructure, Optimization
Last Updated 2026-02-07 21:00 GMT

Overview

Build-time optimization techniques for compiling TransformerEngine, including architecture targeting, parallel job control, and ccache to reduce compilation time from hours to minutes.

Description

TransformerEngine compiles CUDA kernels for multiple GPU architectures by default, which can result in long build times (30+ minutes) and high memory usage. By targeting only the GPU architectures you actually use, enabling ccache, and controlling parallel job counts, build times can be reduced dramatically. This is especially important when iterating during development or when building in CI environments with limited resources.

Usage

Use this heuristic when building TransformerEngine from source and experiencing long compilation times, out-of-memory errors during compilation, or when you only need support for specific GPU architectures.

The Insight (Rule of Thumb)

  • Action 1: Set `NVTE_CUDA_ARCHS` to target only your GPU architecture.
  • Value: `NVTE_CUDA_ARCHS="80;90"` compiles only for A100 and H100, skipping Volta, Turing, Ada, Blackwell.
  • Action 2: Limit parallel jobs for memory-constrained systems.
  • Value: `MAX_JOBS=1 NVTE_BUILD_THREADS_PER_JOB=1` prevents build OOM.
  • Action 3: Enable ccache for incremental builds.
  • Value: `NVTE_USE_CCACHE=1` caches compilation results.
  • Action 4: Select frameworks to build.
  • Value: `NVTE_FRAMEWORK=pytorch` skips building JAX extensions.
  • Trade-off: Architecture targeting means the binary won't work on GPUs outside the target list. FlashAttention-2 compilation is particularly resource-intensive.
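Taken together, the actions above might be combined like this. This is a hedged sketch: the environment variable names come from this page, but the commented-out pip invocation and the source-checkout layout are assumptions about a typical from-source build, not TransformerEngine's documented procedure.

```python
import os

# Assemble a build environment for a memory-constrained development build:
# target only A100/H100, serialize compilation, cache results, PyTorch only.
build_env = dict(os.environ)
build_env.update({
    "NVTE_CUDA_ARCHS": "80;90",         # A100 + H100 only
    "MAX_JOBS": "1",                    # one compilation unit at a time
    "NVTE_BUILD_THREADS_PER_JOB": "1",  # one thread per unit
    "NVTE_USE_CCACHE": "1",             # reuse prior compilation results
    "NVTE_FRAMEWORK": "pytorch",        # skip JAX extensions
})

# From a TransformerEngine source checkout one would then run something like:
# subprocess.run(["pip", "install", "-v", "."], env=build_env, check=True)
```

Passing a modified copy of `os.environ` rather than mutating it keeps the settings scoped to the build subprocess.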

Reasoning

Each target GPU architecture requires a separate kernel compilation pass, and each pass adds significant compile time. The default configuration targets 4-6 architectures, but for development you typically need only your own GPU, and for deployment only the target deployment GPU. The `MAX_JOBS` limit matters in particular because FlashAttention-2 CUDA kernel compilation consumes enormous amounts of memory per compilation unit.
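The memory reasoning can be turned into a rough sizing rule. The helper below is hypothetical, and the ~4 GB-per-job figure is an illustrative assumption about heavy CUDA compilation units, not a number from the TransformerEngine source:

```python
def safe_max_jobs(ram_gb: float, gb_per_job: float = 4.0) -> int:
    """Pick a parallel job count that keeps peak build memory under RAM.

    gb_per_job is an assumed per-compilation-unit peak for heavy CUDA
    kernels such as FlashAttention-2; tune it for your machine.
    """
    return max(1, int(ram_gb // gb_per_job))
```

Under this assumption, a 16 GB CI runner would run 4 parallel jobs, while a 4 GB container falls back to a fully serialized build.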

Code Evidence

Architecture selection from `build_tools/utils.py:258-264`:

# Default architecture selection based on CUDA version
if cuda_version >= (13, 0):
    archs = "75;80;89;90;100;120"
elif cuda_version >= (12, 8):
    archs = "70;80;89;90;100;120"
else:
    archs = "70;80;89;90"
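The same selection logic can be written as a standalone helper for experimentation. This is a paraphrase of the snippet above, not the library's actual API:

```python
def default_archs(cuda_version: tuple) -> str:
    """Mirror of the default architecture selection shown above."""
    if cuda_version >= (13, 0):
        return "75;80;89;90;100;120"
    elif cuda_version >= (12, 8):
        return "70;80;89;90;100;120"
    return "70;80;89;90"
```

Note how every default list covers 4-6 architectures, which is exactly why overriding it with `NVTE_CUDA_ARCHS="80;90"` cuts so much compile time.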

Environment variable list from `docs/envvars.rst`:

# Build-time optimization variables
NVTE_CUDA_ARCHS="80;90"        # Target specific architectures
NVTE_BUILD_MAX_JOBS=4          # Limit parallel compilation jobs
NVTE_BUILD_THREADS_PER_JOB=2   # Threads per compilation job
NVTE_USE_CCACHE=1              # Enable compiler cache
NVTE_FRAMEWORK=pytorch         # Build only PyTorch extensions
NVTE_RELEASE_BUILD=1           # Optimized release build
NVTE_BUILD_DEBUG=0             # Disable debug symbols (default)
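A build script might consume these variables roughly as follows. This is a sketch: the variable names are from the list above, but the parsing logic and fallback defaults are illustrative assumptions, not TransformerEngine's actual behavior:

```python
import os

def read_build_config(env=os.environ):
    """Parse the build-time variables listed above (illustrative defaults)."""
    return {
        "archs": env.get("NVTE_CUDA_ARCHS", "70;80;89;90").split(";"),
        "max_jobs": int(env.get("NVTE_BUILD_MAX_JOBS", "4")),
        "threads_per_job": int(env.get("NVTE_BUILD_THREADS_PER_JOB", "2")),
        "use_ccache": env.get("NVTE_USE_CCACHE", "0") == "1",
        "framework": env.get("NVTE_FRAMEWORK", "pytorch"),
    }
```

Accepting the environment as a parameter (defaulting to `os.environ`) makes the parsing easy to unit-test with a plain dict.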
