Implementation:Deepspeedai DeepSpeed OpBuilder

Knowledge Sources	DeepSpeed
Domains	Build_System, Compilation, CUDA, Extensions
Last Updated	2026-02-09 00:00 GMT

Overview

OpBuilder is the abstract base class for building and loading DeepSpeed C++/CUDA extension operators.

Description

The OpBuilder class provides a framework for compiling and loading custom C++/CUDA operations in DeepSpeed. It supports both ahead-of-time (AOT) compilation during installation and just-in-time (JIT) compilation at runtime. The class handles platform-specific compilation flags, CUDA/ROCm compatibility checks, compute capability detection, and CPU architecture optimization. Each DeepSpeed operator (such as adam, transformer, async_io) inherits from either OpBuilder or its subclass CUDAOpBuilder to implement operator-specific build configurations.

The builder system manages complex compilation scenarios including cross-platform builds (CUDA/ROCm/CPU), version compatibility validation between PyTorch and CUDA, and hardware-specific optimizations like SIMD width detection and compute capability targeting. It also provides utility functions for system introspection including CPU architecture detection, CUDA version checking, and library dependency testing.
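
As a rough illustration of the SIMD width detection mentioned above, the sketch below maps CPU feature flags to a preprocessor define. The function name `simd_define` and the exact macro strings are assumptions made for this example; check `simd_width()` in op_builder/builder.py for the values DeepSpeed actually emits.

```python
from typing import List


def simd_define(arch: str, flags: List[str]) -> str:
    """Pick a SIMD preprocessor define from CPU feature flags.

    Hypothetical helper: the macro names are assumed, not taken from the source page.
    """
    if arch.upper() == "X86_64":
        # Prefer the widest vector extension the CPU reports.
        if "avx512f" in flags or "avx512" in flags:
            return "-D__AVX512__"
        if "avx2" in flags:
            return "-D__AVX256__"
    # Anything else falls back to scalar code paths.
    return "-D__SCALAR__"


print(simd_define("X86_64", ["sse4_2", "avx2"]))
print(simd_define("ARM_8", ["neon"]))
```

Falling back to a scalar define keeps the extension buildable on CPUs whose flags are unknown or unreadable.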

Usage

Use OpBuilder as a base class when implementing custom DeepSpeed operators that must be compiled from C++/CUDA source. It is typically subclassed to create operator-specific builders (e.g., AsyncIOBuilder, FusedAdamBuilder) that define source files, include paths, and compilation flags. During package installation, builders determine which operators to pre-compile based on environment variables and hardware compatibility. At runtime, an operator is loaded via the load() method, which either imports the pre-compiled module or triggers JIT compilation.

Code Reference

Source Location

Repository: DeepSpeed
File: op_builder/builder.py

Signature

class OpBuilder(ABC):
    """Abstract base class for building DeepSpeed C++/CUDA extensions."""

    def __init__(self, name: str):
        """Initialize builder with operator name."""

    @abstractmethod
    def absolute_name(self) -> str:
        """Returns absolute module path for pre-installed op."""

    @abstractmethod
    def sources(self) -> List[str]:
        """Returns list of source files relative to deepspeed package root."""

    def include_paths(self) -> List[str]:
        """Returns list of include directories."""

    def nvcc_args(self) -> List[str]:
        """Returns nvcc compiler flags for CUDA compilation."""

    def cxx_args(self) -> List[str]:
        """Returns C++ compiler flags."""

    def is_compatible(self, verbose: bool = False) -> bool:
        """Check if all dependencies are satisfied to build this op."""

    def load(self, verbose: bool = False):
        """Load pre-compiled op or trigger JIT compilation."""

    def jit_load(self, verbose: bool = True):
        """Just-in-time compile and load the operator."""

    def builder(self):
        """Returns torch.utils.cpp_extension builder object."""

class CUDAOpBuilder(OpBuilder):
    """Builder for CUDA-enabled operations."""

    def compute_capability_args(self, cross_compile_archs: str = None) -> List[str]:
        """Returns nvcc compute capability compile flags."""

    def filter_ccs(self, ccs: List[str]) -> List[str]:
        """Prune incompatible compute capabilities."""

    def version_dependent_macros(self) -> List[str]:
        """Returns version-specific preprocessor macros."""

class TorchCPUOpBuilder(CUDAOpBuilder):
    """Builder for CPU-optimized operations with optional CUDA support."""

Import

from op_builder.builder import OpBuilder, CUDAOpBuilder, TorchCPUOpBuilder

# Subclass to create custom operator builder
class MyOpBuilder(CUDAOpBuilder):
    def __init__(self):
        super().__init__(name="my_op")

    def absolute_name(self):
        return "deepspeed.ops.my_op"

    def sources(self):
        return ["csrc/my_op/my_op.cpp", "csrc/my_op/my_op_cuda.cu"]

I/O Contract

OpBuilder Methods

Method	Input	Output	Description
__init__	name: str	None	Initialize builder with operator name
absolute_name	None	str	Returns module path like "deepspeed.ops.adam.cpu_adam"
sources	None	List[str]	Returns source file paths relative to deepspeed root
include_paths	None	List[str]	Returns include directory paths
nvcc_args	None	List[str]	Returns NVCC compiler flags (e.g., -std=c++17, -O3)
cxx_args	None	List[str]	Returns C++ compiler flags (e.g., -O3, -fopenmp)
extra_ldflags	None	List[str]	Returns linker flags
is_compatible	verbose: bool	bool	True if op can be built on current system
load	verbose: bool	module	Returns loaded operator module
jit_load	verbose: bool	module	JIT compile and return operator module
builder	None	Extension	Returns torch.utils.cpp_extension builder object

CUDAOpBuilder Specific Methods

Method	Input	Output	Description
compute_capability_args	cross_compile_archs: str	List[str]	Returns -gencode flags for target GPU architectures
filter_ccs	ccs: List[str]	List[str]	Filters out incompatible compute capabilities
version_dependent_macros	None	List[str]	Returns version macros like -DVERSION_GE_1_5
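
To make the compute_capability_args row concrete, here is a minimal sketch of turning an architecture list into -gencode flags. The helper `gencode_flags` is hypothetical and is not DeepSpeed's implementation; the `+PTX` suffix handling follows the TORCH_CUDA_ARCH_LIST convention used by PyTorch and is an assumption about how a builder might treat it.

```python
from typing import List


def gencode_flags(arch_list: str) -> List[str]:
    """Translate e.g. "8.0;8.6;9.0+PTX" into nvcc -gencode flags (illustrative)."""
    flags = []
    # Accept both semicolon- and space-separated lists.
    for cc in arch_list.replace(" ", ";").split(";"):
        cc = cc.strip()
        if not cc:
            continue
        want_ptx = cc.endswith("+PTX")
        num = cc.replace("+PTX", "").replace(".", "")
        # Native machine code (SASS) for this architecture.
        flags.append(f"-gencode=arch=compute_{num},code=sm_{num}")
        if want_ptx:
            # Also embed PTX so newer GPUs can JIT-compile the kernel.
            flags.append(f"-gencode=arch=compute_{num},code=compute_{num}")
    return flags


print(gencode_flags("8.0;9.0+PTX"))
# ['-gencode=arch=compute_80,code=sm_80',
#  '-gencode=arch=compute_90,code=sm_90',
#  '-gencode=arch=compute_90,code=compute_90']
```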

Static Utility Methods

Method	Input	Output	Description
is_rocm_pytorch	None	bool	Returns True if using ROCm/HIP backend
is_sycl_enabled	None	bool	Returns True if Intel SYCL compiler available
installed_rocm_version	None	Tuple[int, int]	Returns (major, minor) ROCm version
get_rocm_gpu_arch	None	str	Returns GPU architecture (e.g., "gfx908")

Environment Variables

Variable	Type	Default	Description
DS_BUILD_OPS	int	0 (Linux), 1 (Windows)	Enable/disable pre-compilation of ops
DS_BUILD_{OP_NAME}	int	DS_BUILD_OPS value	Enable specific op (e.g., DS_BUILD_CPU_ADAM=1)
TORCH_CUDA_ARCH_LIST	str	Auto-detected	Semicolon-separated compute capabilities (e.g., "6.1;7.5;8.6")
DS_SKIP_CUDA_CHECK	int	0	Skip CUDA version mismatch validation (use with caution)
DS_ENABLE_NINJA	int	0	Enable ninja build system
DS_NVCC_THREADS	int	min(cpu_count, 8)	Number of parallel nvcc threads
DS_DEBUG_CUDA_BUILD	int	0	Enable verbose CUDA compilation output
CUDA_HOME	str	Auto-detected	Path to CUDA toolkit installation
ROCM_HOME	str	Auto-detected	Path to ROCm installation
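
The interaction between DS_BUILD_OPS and DS_BUILD_{OP_NAME} in the table above can be sketched as follows. The function `op_build_enabled` and its `platform` parameter are hypothetical names for this example; the logic only mirrors the defaults listed in the table and may differ from what setup.py does in detail.

```python
from typing import Mapping


def op_build_enabled(op_name: str, env: Mapping[str, str], platform: str = "linux") -> bool:
    """Decide whether an op should be pre-compiled (illustrative sketch)."""
    # Platform default from the table: 1 on Windows, 0 elsewhere.
    platform_default = 1 if platform == "win32" else 0
    # DS_BUILD_OPS overrides the platform default for every op.
    global_default = int(env.get("DS_BUILD_OPS", platform_default))
    # A per-op variable, when set, overrides the global setting.
    per_op = env.get(f"DS_BUILD_{op_name.upper()}")
    if per_op is not None:
        return bool(int(per_op))
    return bool(global_default)


print(op_build_enabled("cpu_adam", {"DS_BUILD_OPS": "1"}))
print(op_build_enabled("cpu_adam", {"DS_BUILD_OPS": "1", "DS_BUILD_CPU_ADAM": "0"}))
```

In practice the mapping passed in would be `os.environ`; taking it as a parameter keeps the sketch easy to test.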

Usage Examples

Creating a Custom Operator Builder

from op_builder.builder import CUDAOpBuilder

class TransformerBuilder(CUDAOpBuilder):
    BUILD_VAR = "DS_BUILD_TRANSFORMER"
    NAME = "transformer"

    def __init__(self):
        super().__init__(name=self.NAME)

    def absolute_name(self):
        return f'deepspeed.ops.transformer.{self.NAME}_op'

    def sources(self):
        return [
            'csrc/transformer/ds_transformer_cuda.cpp',
            'csrc/transformer/cublas_wrappers.cu',
            'csrc/transformer/transform_kernels.cu',
        ]

    def include_paths(self):
        return ['csrc/includes']

    def cxx_args(self):
        args = super().cxx_args()
        args += ['-O3', '-std=c++17', '-fopenmp']
        return args

Loading an Operator (AOT or JIT)

from op_builder import TransformerBuilder

# Initialize builder
builder = TransformerBuilder()

# Check if operator is compatible with current system
if not builder.is_compatible(verbose=True):
    # SystemExit with a message prints it to stderr and exits with status 1
    raise SystemExit(f"Transformer op not compatible: {builder.error_log}")

# Load operator (uses pre-compiled if available, otherwise JIT compiles)
transformer_op = builder.load(verbose=True)

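# Use the loaded operator (illustrative call: the exported function names and
# arguments depend on the op's C++ bindings; input_tensor, weights and config
# are placeholders defined elsewhere)
output = transformer_op.forward(input_tensor, weights, config)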

Checking Compatibility Before Build

from op_builder import CPUAdamBuilder
from op_builder.builder import assert_no_cuda_mismatch

builder = CPUAdamBuilder()

# Check basic compatibility
if builder.is_compatible(verbose=True):
    print("CPU Adam op is compatible")

# Check CUDA version compatibility
try:
    assert_no_cuda_mismatch("cpu_adam")
    print("CUDA versions match")
except Exception as e:
    print(f"CUDA mismatch: {e}")

Detecting Hardware Capabilities

from op_builder.builder import OpBuilder, get_default_compute_capabilities

# Check if using ROCm
if OpBuilder.is_rocm_pytorch():
    print(f"Using ROCm version: {OpBuilder.installed_rocm_version()}")
    print(f"GPU Architecture: {OpBuilder.get_rocm_gpu_arch()}")
    print(f"Wavefront Size: {OpBuilder.get_rocm_wavefront_size()}")

# Get default compute capabilities for current system
compute_caps = get_default_compute_capabilities()
print(f"Target compute capabilities: {compute_caps}")

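# Check CPU architecture. OpBuilder is abstract and cannot be instantiated
# directly, so query a concrete builder instead.
from op_builder import CPUAdamBuilder

builder = CPUAdamBuilder()
print(f"CPU architecture flag: {builder.cpu_arch()}")
print(f"SIMD width: {builder.simd_width()}")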

Building for Specific Compute Capabilities

import os
from op_builder import FusedAdamBuilder

# Set target architectures (Ampere: 8.0 and 8.6; Hopper: 9.0)
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0;8.6;9.0"

builder = FusedAdamBuilder()

# Get compute capability flags
cc_args = builder.compute_capability_args()
print(f"Compile flags: {cc_args}")
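# Expected output when the list is taken from TORCH_CUDA_ARCH_LIST
# (exact flags may vary by DeepSpeed version and build mode):
#         ['-gencode=arch=compute_80,code=sm_80',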
#          '-gencode=arch=compute_86,code=sm_86',
#          '-gencode=arch=compute_90,code=sm_90']

# Build extension
ext = builder.builder()

Testing Library Dependencies

from op_builder.builder import OpBuilder

class MyOpBuilder(OpBuilder):
    def is_compatible(self, verbose=False):
        # Check if cuBLAS is available
        if not self.has_function('cublasCreate', ['cublas'], verbose=verbose):
            self.warning("cuBLAS library not found")
            return False

        # Check if a required command-line tool is available on PATH
        if not self.command_exists('custom-tool'):
            self.warning("custom-tool command not found")
            return False

        return super().is_compatible(verbose)

Related Pages

Setup - Package installation and build configuration
CPU_Adam - Example operator using OpBuilder
Async_IO - I/O operator using TorchCPUOpBuilder
Transformer - Transformer operator using CUDAOpBuilder
JIT_Compilation - Just-in-time compilation system
Accelerator - Hardware accelerator abstraction layer

Build Process Flow

The OpBuilder system follows this workflow:

1. Installation Time (setup.py):

Reads DS_BUILD_OPS and DS_BUILD_{OP_NAME} environment variables
Calls is_compatible() on each operator builder
For compatible ops with building enabled, calls builder() to create Extension objects
Passes extensions to setuptools for ahead-of-time compilation

2. Runtime (first load):

Application calls builder.load()
Checks if op was pre-compiled during installation
If pre-compiled: validates torch/CUDA versions and imports module
If not pre-compiled: calls jit_load() to compile on-demand

3. JIT Compilation:

Verifies ninja build system availability
Detects current GPU compute capabilities
Generates appropriate compiler flags
Calls torch.utils.cpp_extension.load() with sources and flags
Caches compiled module for future loads
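
The runtime decision in step 2 can be sketched as a try-import-then-fallback helper. This is a simplified illustration, not the body of load(): the name `load_op` and the `jit_compile` callback are hypothetical, and the real method may decide differently (for example by consulting a record of ops installed at build time and validating torch/CUDA versions first).

```python
import importlib
from types import ModuleType
from typing import Callable


def load_op(absolute_name: str, jit_compile: Callable[[], ModuleType]) -> ModuleType:
    """Import a pre-compiled op if present, otherwise compile on demand."""
    try:
        # Pre-compiled path: the extension was built at install time.
        return importlib.import_module(absolute_name)
    except ImportError:
        # JIT path: delegate to a caller-supplied compile function.
        return jit_compile()


# "math" stands in for an installed extension module in this demo.
module = load_op("math", jit_compile=lambda: importlib.import_module("json"))
print(module.__name__)
```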

Hardware Support

The OpBuilder system supports multiple hardware platforms:

Platform	Detection Method	Key Features
NVIDIA CUDA	torch.version.cuda	Compute capability targeting, cuBLAS/cuRAND linking, NVCC compilation
AMD ROCm	torch.version.hip	HIP code generation via hipify, gfx architecture detection, wavefront size configuration
Intel CPU	cpuinfo/lscpu	AVX512/AVX2 detection, -march=native optimization, OpenMP threading
Intel SYCL	c2s command	SYCL extension transformation for Intel GPUs
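
The first two detection methods in the table rely on attributes of torch.version: `hip` is set on ROCm builds of PyTorch and `cuda` on CUDA builds, and each is None otherwise. The sketch below shows that check in isolation; `detect_gpu_backend` is a hypothetical helper, and it takes the version object as a parameter so it can run without PyTorch installed.

```python
from types import SimpleNamespace


def detect_gpu_backend(version) -> str:
    """Classify a PyTorch build from a torch.version-like object (illustrative)."""
    # Check HIP first, mirroring the ROCm row of the table.
    if getattr(version, "hip", None) is not None:
        return "rocm"
    if getattr(version, "cuda", None) is not None:
        return "cuda"
    return "cpu"


# Stand-ins for torch.version on different builds.
print(detect_gpu_backend(SimpleNamespace(cuda="12.1", hip=None)))
print(detect_gpu_backend(SimpleNamespace(cuda=None, hip="6.0")))
```

With PyTorch available, the call would be `detect_gpu_backend(torch.version)`.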

Common Build Patterns

Conditional CUDA/CPU Compilation:

def sources(self):
    sources = ['csrc/adam/cpu_adam.cpp']
    if not self.build_for_cpu:
        sources.append('csrc/adam/cuda_adam.cu')
    return sources

Platform-Specific Flags:

import sys  # module-level import needed for the platform check

def cxx_args(self):
    args = ['-O3', '-std=c++17']
    if sys.platform == "win32":
        args = ['-O2']  # Different optimization on Windows
    else:
        args += ['-fopenmp', '-Wno-reorder']
    return args

Version-Dependent Compilation:

from op_builder.builder import installed_cuda_version  # module-level import

def nvcc_args(self):
    cuda_major, cuda_minor = installed_cuda_version()
    if cuda_major > 10:
        std_lib = '-std=c++17'
    else:
        std_lib = '-std=c++14'
    return ['-O3', std_lib, '--use_fast_math']
