Heuristic: NVIDIA TransformerEngine Build Optimization Tips
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Optimization |
| Last Updated | 2026-02-07 21:00 GMT |
Overview
Build-time optimization techniques for compiling TransformerEngine, including architecture targeting, parallel job control, and ccache to reduce compilation time from hours to minutes.
Description
TransformerEngine compiles CUDA kernels for multiple GPU architectures by default, which can result in long build times (30+ minutes) and high memory usage. By targeting only the GPU architectures you actually use, enabling ccache, and controlling parallel job counts, build times can be reduced dramatically. This is especially important when iterating during development or when building in CI environments with limited resources.
Usage
Use this heuristic when building TransformerEngine from source and experiencing long compilation times, out-of-memory errors during compilation, or when you only need support for specific GPU architectures.
The Insight (Rule of Thumb)
- Action 1: Set `NVTE_CUDA_ARCHS` to target only your GPU architecture.
- Value: `NVTE_CUDA_ARCHS="80;90"` compiles only for A100 and H100, skipping Volta, Turing, Ada, Blackwell.
- Action 2: Limit parallel jobs for memory-constrained systems.
- Value: `NVTE_BUILD_MAX_JOBS=1 NVTE_BUILD_THREADS_PER_JOB=1` prevents out-of-memory failures during the build.
- Action 3: Enable ccache for incremental builds.
- Value: `NVTE_USE_CCACHE=1` caches compilation results.
- Action 4: Select frameworks to build.
- Value: `NVTE_FRAMEWORK=pytorch` skips building JAX extensions.
- Trade-off: Architecture targeting means the resulting binary won't work on GPUs outside the target list. FlashAttention-2 compilation is particularly resource-intensive, so memory-constrained machines benefit most from limiting jobs.
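Putting the four actions together, a targeted development build might look like the sketch below. The variable names come from the env var list later in this document; the `pip install` invocation is an assumption about how you drive the build, not a verified upstream command.

```shell
# Sketch of a targeted source build; variable names follow the env var
# list in this document, and the pip invocation is an assumption.
export NVTE_CUDA_ARCHS="80;90"         # A100 (sm_80) + H100 (sm_90) only
export NVTE_FRAMEWORK=pytorch          # skip JAX extensions
export NVTE_USE_CCACHE=1               # reuse results across rebuilds
export NVTE_BUILD_MAX_JOBS=4           # cap parallel compile jobs
export NVTE_BUILD_THREADS_PER_JOB=2    # cap threads within each job
# On a machine with the CUDA toolchain installed, the build itself would
# then be something like:
# pip install --no-build-isolation -v .
echo "archs=${NVTE_CUDA_ARCHS} framework=${NVTE_FRAMEWORK}"
```

On a memory-constrained CI runner, drop `NVTE_BUILD_MAX_JOBS` and `NVTE_BUILD_THREADS_PER_JOB` to 1, trading wall-clock time for predictable peak memory.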
Reasoning
Each target GPU architecture requires a separate kernel compilation pass, so every extra architecture adds significant compile time. The default configuration targets 4-6 architectures, but for development you typically need only your specific GPU, and for deployment only the target deployment GPU. The `NVTE_BUILD_MAX_JOBS` limit is specifically important because FlashAttention-2 CUDA kernel compilation consumes enormous amounts of memory per compilation unit, so running fewer units in parallel keeps peak memory under control.
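The linear scaling argument can be made concrete with a toy model. This is an illustrative sketch, not TransformerEngine code: `parse_archs` and `relative_build_cost` are hypothetical helpers, and the per-architecture cost is an arbitrary placeholder.

```python
# Illustrative sketch (not TE's actual build code): parse an
# NVTE_CUDA_ARCHS-style list and model compile cost as one unit of
# work per target architecture.
def parse_archs(archs: str) -> list[str]:
    """Split a semicolon-separated arch list like '80;90'."""
    return [a for a in archs.split(";") if a]

def relative_build_cost(archs: str, minutes_per_arch: float = 5.0) -> float:
    """Rough model: compile time grows linearly with the target count."""
    return len(parse_archs(archs)) * minutes_per_arch

print(relative_build_cost("70;80;89;90;100;120"))  # default-style list
print(relative_build_cost("80;90"))                # targeted list
```

Under this model, trimming a six-architecture default down to two targets cuts kernel compilation work by roughly two thirds, which matches the intuition behind Action 1.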
Code Evidence
Architecture selection from `build_tools/utils.py:258-264`:
```python
# Default architecture selection based on CUDA version
if cuda_version >= (13, 0):
    archs = "75;80;89;90;100;120"
elif cuda_version >= (12, 8):
    archs = "70;80;89;90;100;120"
else:
    archs = "70;80;89;90"
```
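The same version-based defaults can be wrapped with the `NVTE_CUDA_ARCHS` override described earlier. This is a minimal self-contained sketch; the override-takes-precedence behavior is an assumption based on this document, not verified against the upstream build scripts.

```python
import os

# Sketch: version-based defaults (mirroring the snippet above) with an
# NVTE_CUDA_ARCHS environment override. Override precedence here is an
# assumption, not verified upstream.
def cuda_archs(cuda_version: tuple) -> str:
    override = os.environ.get("NVTE_CUDA_ARCHS")
    if override:
        return override
    if cuda_version >= (13, 0):
        return "75;80;89;90;100;120"
    if cuda_version >= (12, 8):
        return "70;80;89;90;100;120"
    return "70;80;89;90"

print(cuda_archs((12, 4)))
```

With `NVTE_CUDA_ARCHS="80;90"` exported, the function returns the two-target list regardless of CUDA version, which is the mechanism the heuristic relies on.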
Environment variable list from `docs/envvars.rst`:
```shell
# Build-time optimization variables
NVTE_CUDA_ARCHS="80;90"       # Target specific architectures
NVTE_BUILD_MAX_JOBS=4         # Limit parallel compilation jobs
NVTE_BUILD_THREADS_PER_JOB=2  # Threads per compilation job
NVTE_USE_CCACHE=1             # Enable compiler cache
NVTE_FRAMEWORK=pytorch        # Build only PyTorch extensions
NVTE_RELEASE_BUILD=1          # Optimized release build
NVTE_BUILD_DEBUG=0            # Disable debug symbols (default)
```