Implementation:Deepspeedai DeepSpeed OpBuilder
| Knowledge Sources | |
|---|---|
| Domains | Build_System, Compilation, CUDA, Extensions |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
OpBuilder is the abstract base class for building and loading DeepSpeed C++/CUDA extension operators.
Description
The OpBuilder class provides a framework for compiling and loading custom C++/CUDA operations in DeepSpeed. It supports both ahead-of-time (AOT) compilation during installation and just-in-time (JIT) compilation at runtime. The class handles platform-specific compilation flags, CUDA/ROCm compatibility checks, compute capability detection, and CPU architecture optimization. Each DeepSpeed operator (such as adam, transformer, async_io) inherits from either OpBuilder or its subclass CUDAOpBuilder to implement operator-specific build configurations.
The builder system manages complex compilation scenarios including cross-platform builds (CUDA/ROCm/CPU), version compatibility validation between PyTorch and CUDA, and hardware-specific optimizations like SIMD width detection and compute capability targeting. It also provides utility functions for system introspection including CPU architecture detection, CUDA version checking, and library dependency testing.
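The SIMD width detection mentioned above can be pictured as a mapping from CPU feature flags to a preprocessor define. The following is a minimal sketch under stated assumptions: the function name `pick_simd_macro` and the macro names are illustrative and are not taken from builder.py.

```python
from typing import Iterable


def pick_simd_macro(arch: str, flags: Iterable[str]) -> str:
    """Map CPU feature flags to a (hypothetical) SIMD-width define."""
    flag_set = {f.lower() for f in flags}
    if arch.lower() in ("x86_64", "amd64"):
        # Prefer the widest vector extension the CPU reports.
        if "avx512f" in flag_set or "avx512" in flag_set:
            return "-D__AVX512__"
        if "avx2" in flag_set:
            return "-D__AVX256__"
    # Fall back to scalar code on unknown or non-x86 CPUs.
    return "-D__SCALAR__"


print(pick_simd_macro("x86_64", ["sse4_2", "avx2"]))
```

Passing the flags in as an argument, rather than probing the machine inside the function, keeps the decision logic easy to test on any host.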
Usage
Use OpBuilder as a base class when implementing custom DeepSpeed operators that need to be compiled from C++/CUDA source. The class is typically subclassed to create operator-specific builders (e.g., AsyncIOBuilder, FusedAdamBuilder) that define source files, include paths, and compilation flags. During package installation, builders determine which operators to pre-compile based on environment variables and hardware compatibility. At runtime, operators are loaded via the load() method, which either imports the pre-compiled module or triggers JIT compilation.
Code Reference
Source Location
- Repository: DeepSpeed
- File: op_builder/builder.py
Signature
class OpBuilder(ABC):
    """Abstract base class for building DeepSpeed C++/CUDA extensions."""

    def __init__(self, name: str):
        """Initialize builder with operator name."""

    @abstractmethod
    def absolute_name(self) -> str:
        """Returns absolute module path for pre-installed op."""

    @abstractmethod
    def sources(self) -> List[str]:
        """Returns list of source files relative to deepspeed package root."""

    def include_paths(self) -> List[str]:
        """Returns list of include directories."""

    def nvcc_args(self) -> List[str]:
        """Returns nvcc compiler flags for CUDA compilation."""

    def cxx_args(self) -> List[str]:
        """Returns C++ compiler flags."""

    def is_compatible(self, verbose: bool = False) -> bool:
        """Check if all dependencies are satisfied to build this op."""

    def load(self, verbose: bool = False):
        """Load pre-compiled op or trigger JIT compilation."""

    def jit_load(self, verbose: bool = True):
        """Just-in-time compile and load the operator."""

    def builder(self):
        """Returns torch.utils.cpp_extension builder object."""


class CUDAOpBuilder(OpBuilder):
    """Builder for CUDA-enabled operations."""

    def compute_capability_args(self, cross_compile_archs: str = None) -> List[str]:
        """Returns nvcc compute capability compile flags."""

    def filter_ccs(self, ccs: List[str]) -> List[str]:
        """Prune incompatible compute capabilities."""

    def version_dependent_macros(self) -> List[str]:
        """Returns version-specific preprocessor macros."""


class TorchCPUOpBuilder(CUDAOpBuilder):
    """Builder for CPU-optimized operations with optional CUDA support."""
Import
from op_builder.builder import OpBuilder, CUDAOpBuilder, TorchCPUOpBuilder


# Subclass to create custom operator builder
class MyOpBuilder(CUDAOpBuilder):

    def __init__(self):
        super().__init__(name="my_op")

    def absolute_name(self):
        return "deepspeed.ops.my_op"

    def sources(self):
        return ["csrc/my_op/my_op.cpp", "csrc/my_op/my_op_cuda.cu"]
I/O Contract
OpBuilder Methods
| Method | Input | Output | Description |
|---|---|---|---|
| __init__ | name: str | None | Initialize builder with operator name |
| absolute_name | None | str | Returns module path like "deepspeed.ops.adam.cpu_adam" |
| sources | None | List[str] | Returns source file paths relative to deepspeed root |
| include_paths | None | List[str] | Returns include directory paths |
| nvcc_args | None | List[str] | Returns NVCC compiler flags (e.g., -std=c++17, -O3) |
| cxx_args | None | List[str] | Returns C++ compiler flags (e.g., -O3, -fopenmp) |
| extra_ldflags | None | List[str] | Returns linker flags |
| is_compatible | verbose: bool | bool | True if op can be built on current system |
| load | verbose: bool | module | Returns loaded operator module |
| jit_load | verbose: bool | module | JIT compile and return operator module |
| builder | None | Extension | Returns a torch.utils.cpp_extension extension object (CppExtension or CUDAExtension) for setuptools |
CUDAOpBuilder Specific Methods
| Method | Input | Output | Description |
|---|---|---|---|
| compute_capability_args | cross_compile_archs: str | List[str] | Returns -gencode flags for target GPU architectures |
| filter_ccs | ccs: List[str] | List[str] | Filters out incompatible compute capabilities |
| version_dependent_macros | None | List[str] | Returns version macros like -DVERSION_GE_1_5 |
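To make the compute_capability_args row concrete, the sketch below shows one way an architecture list could be turned into -gencode flags. It is an illustration under assumptions, not the code in builder.py: the helper name `gencode_flags` is hypothetical, and the real method additionally filters architectures through filter_ccs.

```python
from typing import List


def gencode_flags(arch_list: str) -> List[str]:
    """Turn a list such as "8.0;8.6+PTX" into nvcc -gencode flags."""
    flags = []
    # Accept both semicolon- and space-separated lists.
    for entry in arch_list.replace(" ", ";").split(";"):
        if not entry:
            continue
        want_ptx = entry.endswith("+PTX")
        cc = entry[:-len("+PTX")] if want_ptx else entry
        num = cc.replace(".", "")  # "8.6" -> "86"
        # Native machine code (SASS) for this architecture.
        flags.append(f"-gencode=arch=compute_{num},code=sm_{num}")
        if want_ptx:
            # Also embed PTX so newer GPUs can JIT the kernel at load time.
            flags.append(f"-gencode=arch=compute_{num},code=compute_{num}")
    return flags


print(gencode_flags("8.0;8.6+PTX"))
```

The `+PTX` suffix is the usual way to ask for forward compatibility: the embedded PTX lets the driver compile for GPUs newer than any listed architecture.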
Static Utility Methods
| Method | Input | Output | Description |
|---|---|---|---|
| is_rocm_pytorch | None | bool | Returns True if using ROCm/HIP backend |
| is_sycl_enabled | None | bool | Returns True if Intel SYCL compiler available |
| installed_rocm_version | None | Tuple[int, int] | Returns (major, minor) ROCm version |
| get_rocm_gpu_arch | None | str | Returns GPU architecture (e.g., "gfx908") |
Environment Variables
| Variable | Type | Default | Description |
|---|---|---|---|
| DS_BUILD_OPS | int | 0 (Linux), 1 (Windows) | Enable/disable pre-compilation of ops |
| DS_BUILD_{OP_NAME} | int | DS_BUILD_OPS value | Enable specific op (e.g., DS_BUILD_CPU_ADAM=1) |
| TORCH_CUDA_ARCH_LIST | str | Auto-detected | Semicolon-separated compute capabilities (e.g., "6.1;7.5;8.6") |
| DS_SKIP_CUDA_CHECK | int | 0 | Skip CUDA version mismatch validation (use with caution) |
| DS_ENABLE_NINJA | int | 0 | Enable ninja build system |
| DS_NVCC_THREADS | int | min(cpu_count, 8) | Number of parallel nvcc threads |
| DS_DEBUG_CUDA_BUILD | int | 0 | Enable verbose CUDA compilation output |
| CUDA_HOME | str | Auto-detected | Path to CUDA toolkit installation |
| ROCM_HOME | str | Auto-detected | Path to ROCm installation |
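The interaction between DS_BUILD_OPS and the per-operator variables in the table can be sketched as a two-level lookup. This is a simplified illustration: `op_enabled` is a hypothetical helper, and deriving the variable name from the op name is an assumption (real builders declare their own BUILD_VAR).

```python
import os
import sys
from typing import Mapping, Optional


def op_enabled(op_name: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Resolve whether an op should be pre-compiled at install time."""
    env = os.environ if env is None else env
    # Platform default from the table: 1 on Windows, 0 elsewhere.
    platform_default = 1 if sys.platform == "win32" else 0
    global_default = int(env.get("DS_BUILD_OPS", platform_default))
    # A per-op variable such as DS_BUILD_CPU_ADAM overrides the global value.
    per_op = env.get(f"DS_BUILD_{op_name.upper()}", global_default)
    return bool(int(per_op))


# Only cpu_adam is requested; everything else stays JIT-only.
env = {"DS_BUILD_OPS": "0", "DS_BUILD_CPU_ADAM": "1"}
print(op_enabled("cpu_adam", env), op_enabled("fused_adam", env))
```

Taking the environment as a parameter makes the precedence rule testable without mutating os.environ.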
Usage Examples
Creating a Custom Operator Builder
from op_builder.builder import CUDAOpBuilder


class TransformerBuilder(CUDAOpBuilder):
    BUILD_VAR = "DS_BUILD_TRANSFORMER"
    NAME = "transformer"

    def __init__(self):
        super().__init__(name=self.NAME)

    def absolute_name(self):
        return f'deepspeed.ops.transformer.{self.NAME}_op'

    def sources(self):
        return [
            'csrc/transformer/ds_transformer_cuda.cpp',
            'csrc/transformer/cublas_wrappers.cu',
            'csrc/transformer/transform_kernels.cu',
        ]

    def include_paths(self):
        return ['csrc/includes']

    def cxx_args(self):
        args = super().cxx_args()
        args += ['-O3', '-std=c++17', '-fopenmp']
        return args
Loading an Operator (AOT or JIT)
from op_builder import TransformerBuilder

# Initialize builder
builder = TransformerBuilder()

# Check if operator is compatible with current system
if not builder.is_compatible(verbose=True):
    print(f"Transformer op not compatible: {builder.error_log}")
    raise SystemExit(1)

# Load operator (uses pre-compiled if available, otherwise JIT compiles)
transformer_op = builder.load(verbose=True)

# Use the loaded operator. Illustrative call only: the exported function
# names and arguments depend on the op's C++ bindings, and input_tensor,
# weights, and config must be defined by the caller.
output = transformer_op.forward(input_tensor, weights, config)
Checking Compatibility Before Build
from op_builder import CPUAdamBuilder
from op_builder.builder import assert_no_cuda_mismatch

builder = CPUAdamBuilder()

# Check basic compatibility
if builder.is_compatible(verbose=True):
    print("CPU Adam op is compatible")

# Check CUDA version compatibility
try:
    assert_no_cuda_mismatch("cpu_adam")
    print("CUDA versions match")
except Exception as e:
    print(f"CUDA mismatch: {e}")
Detecting Hardware Capabilities
from op_builder import CPUAdamBuilder
from op_builder.builder import OpBuilder, get_default_compute_capabilities

# Check if using ROCm (static helpers can be called on the class itself)
if OpBuilder.is_rocm_pytorch():
    print(f"Using ROCm version: {OpBuilder.installed_rocm_version()}")
    print(f"GPU Architecture: {OpBuilder.get_rocm_gpu_arch()}")
    print(f"Wavefront Size: {OpBuilder.get_rocm_wavefront_size()}")

# Get default compute capabilities for current system
compute_caps = get_default_compute_capabilities()
print(f"Target compute capabilities: {compute_caps}")

# Check CPU architecture. OpBuilder is abstract and cannot be instantiated
# directly, so use a concrete builder such as CPUAdamBuilder.
builder = CPUAdamBuilder()
print(f"CPU architecture flag: {builder.cpu_arch()}")
print(f"SIMD width: {builder.simd_width()}")
Building for Specific Compute Capabilities
import os
from op_builder import FusedAdamBuilder

# Set target architectures (8.0 and 8.6 are Ampere, 9.0 is Hopper)
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0;8.6;9.0"

builder = FusedAdamBuilder()

# Get compute capability flags
cc_args = builder.compute_capability_args()
print(f"Compile flags: {cc_args}")
# Expected output, assuming no architecture is filtered out:
# ['-gencode=arch=compute_80,code=sm_80',
#  '-gencode=arch=compute_86,code=sm_86',
#  '-gencode=arch=compute_90,code=sm_90']

# Build extension
ext = builder.builder()
Testing Library Dependencies
from op_builder.builder import OpBuilder


class MyOpBuilder(OpBuilder):

    def is_compatible(self, verbose=False):
        # Check if cuBLAS is available
        if not self.has_function('cublasCreate', ['cublas'], verbose=verbose):
            self.warning("cuBLAS library not found")
            return False

        # Check if a required command-line tool is on the PATH
        if not self.command_exists('custom-tool'):
            self.warning("custom-tool command not found")
            return False

        return super().is_compatible(verbose)
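For readers who want to try the command check outside a builder, the sketch below is a standalone stand-in built on the standard library. It is not DeepSpeed's implementation of command_exists; the helper `missing_tools` is hypothetical.

```python
import shutil
from typing import List


def command_exists(cmd: str) -> bool:
    """Return True if `cmd` resolves to an executable on PATH."""
    return shutil.which(cmd) is not None


def missing_tools(*cmds: str) -> List[str]:
    """List the required commands that are not available."""
    return [c for c in cmds if not command_exists(c)]


# Report every missing dependency at once instead of failing on the first.
print(missing_tools("nvcc", "ninja"))
```

Collecting all missing tools before returning gives users one complete error message rather than a sequence of install-and-retry cycles.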
Related Pages
- Setup - Package installation and build configuration
- CPU_Adam - Example operator using OpBuilder
- Async_IO - I/O operator using TorchCPUOpBuilder
- Transformer - Transformer operator using CUDAOpBuilder
- JIT_Compilation - Just-in-time compilation system
- Accelerator - Hardware accelerator abstraction layer
Build Process Flow
The OpBuilder system follows this workflow:
1. Installation Time (setup.py):
   - Reads DS_BUILD_OPS and DS_BUILD_{OP_NAME} environment variables
   - Calls is_compatible() on each operator builder
   - For compatible ops with building enabled, calls builder() to create Extension objects
   - Passes extensions to setuptools for ahead-of-time compilation
2. Runtime (first load):
   - Application calls builder.load()
   - Checks if op was pre-compiled during installation
   - If pre-compiled: validates torch/CUDA versions and imports module
   - If not pre-compiled: calls jit_load() to compile on-demand
3. JIT Compilation:
   - Verifies ninja build system availability
   - Detects current GPU compute capabilities
   - Generates appropriate compiler flags
   - Calls torch.utils.cpp_extension.load() with sources and flags
   - Caches compiled module for future loads
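The runtime branch of this workflow (import if pre-compiled, otherwise JIT) can be sketched as a small dispatch function. This is a simplified model under assumptions: `load_op`, the `installed_ops` mapping, and the `jit_compile` callback are hypothetical stand-ins, and the version validation step is omitted.

```python
import importlib
from typing import Callable, Mapping


def load_op(name: str, module_path: str, installed_ops: Mapping[str, bool],
            jit_compile: Callable[[], object]) -> object:
    """Import a pre-built op if it is recorded as installed, else JIT compile."""
    if installed_ops.get(name, False):
        # AOT path: the extension was compiled at install time.
        return importlib.import_module(module_path)
    # JIT path: defer to a caller-supplied compile step.
    return jit_compile()


# The stdlib "json" module stands in for a pre-compiled extension here.
mod = load_op("demo", "json", {"demo": True}, jit_compile=lambda: None)
print(mod.__name__)
```

Injecting the JIT step as a callable keeps the decision logic separate from the expensive compilation it may trigger.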
Hardware Support
The OpBuilder system supports multiple hardware platforms:
| Platform | Detection Method | Key Features |
|---|---|---|
| NVIDIA CUDA | torch.version.cuda | Compute capability targeting, cuBLAS/cuRAND linking, NVCC compilation |
| AMD ROCm | torch.version.hip | HIP code generation via hipify, gfx architecture detection, wavefront size configuration |
| Intel CPU | cpuinfo/lscpu | AVX512/AVX2 detection, -march=native optimization, OpenMP threading |
| Intel SYCL | c2s command | SYCL extension transformation for Intel GPUs |
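The first two detection methods in the table rely on attributes of torch.version. The sketch below shows that classification against any object exposing `cuda` and `hip` attributes, so it runs without PyTorch installed; `detect_backend` is a hypothetical helper, not a DeepSpeed function.

```python
from types import SimpleNamespace


def detect_backend(version) -> str:
    """Classify a torch.version-like object as 'rocm', 'cuda', or 'cpu'."""
    # ROCm builds of PyTorch set version.hip; CUDA builds set version.cuda.
    if getattr(version, "hip", None):
        return "rocm"
    if getattr(version, "cuda", None):
        return "cuda"
    return "cpu"


# Stand-in for torch.version so the sketch runs without PyTorch installed.
fake_version = SimpleNamespace(cuda="12.1", hip=None)
print(detect_backend(fake_version))
```

With PyTorch available, the same function could be called as `detect_backend(torch.version)`.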
Common Build Patterns
Conditional CUDA/CPU Compilation:
def sources(self):
    sources = ['csrc/adam/cpu_adam.cpp']
    if not self.build_for_cpu:
        sources.append('csrc/adam/cuda_adam.cu')
    return sources
Platform-Specific Flags:
import sys


def cxx_args(self):
    args = ['-O3', '-std=c++17']
    if sys.platform == "win32":
        args = ['-O2']  # Different optimization on Windows
    else:
        args += ['-fopenmp', '-Wno-reorder']
    return args
Version-Dependent Compilation:
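from op_builder.builder import installed_cuda_version


def nvcc_args(self):
    # Pick the C++ standard based on the installed CUDA toolkit version
    cuda_major, cuda_minor = installed_cuda_version()
    if cuda_major > 10:
        std_lib = '-std=c++17'
    else:
        std_lib = '-std=c++14'
    return ['-O3', std_lib, '--use_fast_math']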
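The version check above depends on knowing the toolkit's major and minor version. As a rough sketch of how such a pair could be obtained, the code below parses the "release X.Y" token that `nvcc -V` prints. Both helper names are hypothetical, and this is not necessarily how installed_cuda_version() is implemented.

```python
import re
from typing import Tuple


def parse_nvcc_release(nvcc_output: str) -> Tuple[int, int]:
    """Extract (major, minor) from `nvcc -V` style output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' token found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def pick_std_flag(cuda_major: int) -> str:
    """Apply the same rule as the snippet above: C++17 for CUDA 11 and later."""
    return "-std=c++17" if cuda_major > 10 else "-std=c++14"


sample = "Cuda compilation tools, release 11.8, V11.8.89"
major, minor = parse_nvcc_release(sample)
print(major, minor, pick_std_flag(major))
```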