Implementation:Deepspeedai DeepSpeed OpBuilder

From Leeroopedia


Knowledge Sources
Domains: Build_System, Compilation, CUDA, Extensions
Last Updated: 2026-02-09 00:00 GMT

Overview

OpBuilder is the abstract base class for building and loading DeepSpeed C++/CUDA extension operators.

Description

The OpBuilder class provides a framework for compiling and loading custom C++/CUDA operations in DeepSpeed. It supports both ahead-of-time (AOT) compilation during installation and just-in-time (JIT) compilation at runtime. The class handles platform-specific compilation flags, CUDA/ROCm compatibility checks, compute capability detection, and CPU architecture optimization. Each DeepSpeed operator (such as adam, transformer, async_io) has its own builder class, which inherits from OpBuilder or one of its subclasses (such as CUDAOpBuilder) and implements the operator-specific build configuration.

The builder system manages complex compilation scenarios including cross-platform builds (CUDA/ROCm/CPU), version compatibility validation between PyTorch and CUDA, and hardware-specific optimizations like SIMD width detection and compute capability targeting. It also provides utility functions for system introspection including CPU architecture detection, CUDA version checking, and library dependency testing.

Usage

Use OpBuilder as a base class when implementing custom DeepSpeed operators that need to be compiled from C++/CUDA source. It is typically subclassed to create operator-specific builders (e.g., AsyncIOBuilder, FusedAdamBuilder) that define source files, include paths, and compilation flags. During package installation, builders determine which operators to pre-compile based on environment variables (DS_BUILD_OPS and DS_BUILD_{OP_NAME}) and hardware compatibility. At runtime, an operator is loaded via the load() method, which either imports the pre-compiled module or triggers JIT compilation.

Code Reference

Source Location

Signature

class OpBuilder(ABC):
    """Abstract base class for building DeepSpeed C++/CUDA extensions."""

    def __init__(self, name: str):
        """Initialize builder with operator name."""

    @abstractmethod
    def absolute_name(self) -> str:
        """Returns absolute module path for pre-installed op."""

    @abstractmethod
    def sources(self) -> List[str]:
        """Returns list of source files relative to deepspeed package root."""

    def include_paths(self) -> List[str]:
        """Returns list of include directories."""

    def nvcc_args(self) -> List[str]:
        """Returns nvcc compiler flags for CUDA compilation."""

    def cxx_args(self) -> List[str]:
        """Returns C++ compiler flags."""

    def is_compatible(self, verbose: bool = False) -> bool:
        """Check if all dependencies are satisfied to build this op."""

    def load(self, verbose: bool = False):
        """Load pre-compiled op or trigger JIT compilation."""

    def jit_load(self, verbose: bool = True):
        """Just-in-time compile and load the operator."""

    def builder(self):
        """Returns torch.utils.cpp_extension builder object."""

class CUDAOpBuilder(OpBuilder):
    """Builder for CUDA-enabled operations."""

    def compute_capability_args(self, cross_compile_archs: str = None) -> List[str]:
        """Returns nvcc compute capability compile flags."""

    def filter_ccs(self, ccs: List[str]) -> List[str]:
        """Prune incompatible compute capabilities."""

    def version_dependent_macros(self) -> List[str]:
        """Returns version-specific preprocessor macros."""

class TorchCPUOpBuilder(CUDAOpBuilder):
    """Builder for CPU-optimized operations with optional CUDA support."""

Import

from op_builder.builder import OpBuilder, CUDAOpBuilder, TorchCPUOpBuilder

# Subclass to create custom operator builder
class MyOpBuilder(CUDAOpBuilder):
    def __init__(self):
        super().__init__(name="my_op")

    def absolute_name(self):
        return "deepspeed.ops.my_op"

    def sources(self):
        return ["csrc/my_op/my_op.cpp", "csrc/my_op/my_op_cuda.cu"]

I/O Contract

OpBuilder Methods

Method | Input | Output | Description
__init__ | name: str | None | Initialize builder with operator name
absolute_name | None | str | Returns module path like "deepspeed.ops.adam.cpu_adam"
sources | None | List[str] | Returns source file paths relative to deepspeed root
include_paths | None | List[str] | Returns include directory paths
nvcc_args | None | List[str] | Returns NVCC compiler flags (e.g., -std=c++17, -O3)
cxx_args | None | List[str] | Returns C++ compiler flags (e.g., -O3, -fopenmp)
extra_ldflags | None | List[str] | Returns linker flags
is_compatible | verbose: bool | bool | True if op can be built on current system
load | verbose: bool | module | Returns loaded operator module
jit_load | verbose: bool | module | JIT compile and return operator module
builder | None | Extension | Returns torch.utils.cpp_extension builder object
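
To show how these methods fit together, here is a minimal sketch (not DeepSpeed's actual code) that collects a builder's outputs into keyword arguments for torch.utils.cpp_extension.load(). The helper jit_load_kwargs and the StubBuilder class are hypothetical names introduced for illustration; the keyword names are the ones torch.utils.cpp_extension.load() accepts.

```python
from typing import Any, Dict, List


class StubBuilder:
    """Hypothetical stand-in exposing the method names from the table above."""

    name = "my_op"

    def sources(self) -> List[str]:
        return ["csrc/my_op/my_op.cpp"]

    def include_paths(self) -> List[str]:
        return ["csrc/includes"]

    def cxx_args(self) -> List[str]:
        return ["-O3", "-std=c++17"]

    def nvcc_args(self) -> List[str]:
        return ["-O3"]

    def extra_ldflags(self) -> List[str]:
        return []


def jit_load_kwargs(builder: Any, verbose: bool = False) -> Dict[str, Any]:
    """Collect builder outputs into kwargs for torch.utils.cpp_extension.load()."""
    return {
        "name": builder.name,
        "sources": builder.sources(),
        "extra_include_paths": builder.include_paths(),
        "extra_cflags": builder.cxx_args(),
        "extra_cuda_cflags": builder.nvcc_args(),
        "extra_ldflags": builder.extra_ldflags(),
        "verbose": verbose,
    }


kwargs = jit_load_kwargs(StubBuilder())
# With PyTorch and a compiler toolchain installed, one could then call:
#   module = torch.utils.cpp_extension.load(**kwargs)
```

The sketch passes relative paths through unchanged; a real build would most likely need them resolved against the package root first.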

CUDAOpBuilder Specific Methods

Method | Input | Output | Description
compute_capability_args | cross_compile_archs: str | List[str] | Returns -gencode flags for target GPU architectures
filter_ccs | ccs: List[str] | List[str] | Filters out incompatible compute capabilities
version_dependent_macros | None | List[str] | Returns version macros like -DVERSION_GE_1_5

Static Utility Methods

Method | Input | Output | Description
is_rocm_pytorch | None | bool | Returns True if using ROCm/HIP backend
is_sycl_enabled | None | bool | Returns True if Intel SYCL compiler available
installed_rocm_version | None | Tuple[int, int] | Returns (major, minor) ROCm version
get_rocm_gpu_arch | None | str | Returns GPU architecture (e.g., "gfx908")

Environment Variables

Variable | Type | Default | Description
DS_BUILD_OPS | int | 0 (Linux), 1 (Windows) | Enable/disable pre-compilation of ops
DS_BUILD_{OP_NAME} | int | DS_BUILD_OPS value | Enable specific op (e.g., DS_BUILD_CPU_ADAM=1)
TORCH_CUDA_ARCH_LIST | str | Auto-detected | Semicolon-separated compute capabilities (e.g., "6.1;7.5;8.6")
DS_SKIP_CUDA_CHECK | int | 0 | Skip CUDA version mismatch validation (use with caution)
DS_ENABLE_NINJA | int | 0 | Enable ninja build system
DS_NVCC_THREADS | int | min(cpu_count, 8) | Number of parallel nvcc threads
DS_DEBUG_CUDA_BUILD | int | 0 | Enable verbose CUDA compilation output
CUDA_HOME | str | Auto-detected | Path to CUDA toolkit installation
ROCM_HOME | str | Auto-detected | Path to ROCm installation
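
The defaults in this table imply a fallback chain: a per-op variable falls back to DS_BUILD_OPS, which falls back to a platform default. The following is a minimal sketch of that resolution using only the defaults stated above. The function names env_int, op_build_enabled, and nvcc_threads are hypothetical, and how DeepSpeed handles empty or invalid values may differ.

```python
import os
import sys
from typing import Mapping, Optional


def env_int(environ: Mapping[str, str], key: str, default: int) -> int:
    """Read an integer environment variable, falling back to a default."""
    value = environ.get(key)
    return default if value is None or value == "" else int(value)


def op_build_enabled(build_var: str, environ: Mapping[str, str], platform: str) -> bool:
    """Resolve whether one op should be pre-compiled at install time."""
    platform_default = 1 if platform == "win32" else 0
    global_default = env_int(environ, "DS_BUILD_OPS", platform_default)
    return bool(env_int(environ, build_var, global_default))


def nvcc_threads(environ: Mapping[str, str], cpu_count: Optional[int]) -> int:
    """Resolve DS_NVCC_THREADS, defaulting to min(cpu_count, 8)."""
    fallback = min(cpu_count or 1, 8)
    try:
        threads = int(environ.get("DS_NVCC_THREADS", ""))
    except ValueError:
        return fallback
    return threads if threads > 0 else fallback


# Example: query the current process environment.
cpu_adam_enabled = op_build_enabled("DS_BUILD_CPU_ADAM", os.environ, sys.platform)
threads = nvcc_threads(os.environ, os.cpu_count())
```

Passing the environment and platform as arguments keeps the functions easy to test without mutating os.environ.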

Usage Examples

Creating a Custom Operator Builder

from op_builder.builder import CUDAOpBuilder

class TransformerBuilder(CUDAOpBuilder):
    BUILD_VAR = "DS_BUILD_TRANSFORMER"
    NAME = "transformer"

    def __init__(self):
        super().__init__(name=self.NAME)

    def absolute_name(self):
        return f'deepspeed.ops.transformer.{self.NAME}_op'

    def sources(self):
        return [
            'csrc/transformer/ds_transformer_cuda.cpp',
            'csrc/transformer/cublas_wrappers.cu',
            'csrc/transformer/transform_kernels.cu',
        ]

    def include_paths(self):
        return ['csrc/includes']

    def cxx_args(self):
        args = super().cxx_args()
        args += ['-O3', '-std=c++17', '-fopenmp']
        return args

Loading an Operator (AOT or JIT)

from op_builder import TransformerBuilder

# Initialize builder
builder = TransformerBuilder()

# Check if operator is compatible with current system
if not builder.is_compatible(verbose=True):
    print(f"Transformer op not compatible: {builder.error_log}")
    raise SystemExit(1)

# Load operator (uses pre-compiled if available, otherwise JIT compiles)
transformer_op = builder.load(verbose=True)

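# Use the loaded operator (illustrative call: the functions a module exports
# depend on the op's C++ bindings, and the arguments here are placeholders)
output = transformer_op.forward(input_tensor, weights, config)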

Checking Compatibility Before Build

from op_builder import CPUAdamBuilder
from op_builder.builder import assert_no_cuda_mismatch

builder = CPUAdamBuilder()

# Check basic compatibility
if builder.is_compatible(verbose=True):
    print("CPU Adam op is compatible")

# Check CUDA version compatibility
try:
    assert_no_cuda_mismatch("cpu_adam")
    print("CUDA versions match")
except Exception as e:
    print(f"CUDA mismatch: {e}")
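
To make the idea of a CUDA version check concrete, here is a minimal sketch of one possible policy: identical versions pass, a minor-version difference is tolerated, and a major-version difference raises unless DS_SKIP_CUDA_CHECK=1 is set. The policy details, the function check_cuda_match, and the exception CudaMismatchError are assumptions for illustration, not the behavior of assert_no_cuda_mismatch.

```python
import os
from typing import Mapping, Tuple


class CudaMismatchError(RuntimeError):
    """Hypothetical exception type used only in this sketch."""


def check_cuda_match(
    system: Tuple[int, int],
    torch_built: Tuple[int, int],
    environ: Mapping[str, str] = os.environ,
) -> bool:
    """Return True on an exact match, False on a tolerated mismatch.

    Raises CudaMismatchError when the major versions differ and the check
    has not been disabled with DS_SKIP_CUDA_CHECK=1 (assumed policy).
    """
    if system == torch_built:
        return True
    skip = environ.get("DS_SKIP_CUDA_CHECK", "0") == "1"
    if system[0] != torch_built[0] and not skip:
        raise CudaMismatchError(
            f"system CUDA {system[0]}.{system[1]} does not match the CUDA "
            f"{torch_built[0]}.{torch_built[1]} that PyTorch was built with"
        )
    return False


# Example: a minor-version difference is tolerated but reported as a mismatch.
exact = check_cuda_match((12, 1), (12, 1), {})
minor_diff = check_cuda_match((12, 4), (12, 1), {})
```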

Detecting Hardware Capabilities

from op_builder.builder import OpBuilder, get_default_compute_capabilities

# Check if using ROCm
if OpBuilder.is_rocm_pytorch():
    print(f"Using ROCm version: {OpBuilder.installed_rocm_version()}")
    print(f"GPU Architecture: {OpBuilder.get_rocm_gpu_arch()}")
    print(f"Wavefront Size: {OpBuilder.get_rocm_wavefront_size()}")

# Get default compute capabilities for current system
compute_caps = get_default_compute_capabilities()
print(f"Target compute capabilities: {compute_caps}")

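# Check CPU architecture. OpBuilder is abstract and cannot be instantiated
# directly, so use a concrete builder such as CPUAdamBuilder.
from op_builder import CPUAdamBuilder

builder = CPUAdamBuilder()
print(f"CPU architecture flag: {builder.cpu_arch()}")
print(f"SIMD width: {builder.simd_width()}")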
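
For intuition about SIMD width detection, here is a minimal sketch that maps CPU feature flags to a preprocessor define. The function simd_define is hypothetical, and the macro names and flag names (avx512f, avx2) are illustrative assumptions rather than a statement of what simd_width() returns.

```python
from typing import Iterable


def simd_define(arch: str, flags: Iterable[str]) -> str:
    """Pick a SIMD preprocessor define from CPU feature flags.

    The macro names below are illustrative assumptions.
    """
    flag_set = set(flags)
    if arch.lower() in ("x86_64", "amd64"):
        # Prefer the widest vector extension the CPU reports.
        if "avx512f" in flag_set:
            return "-D__AVX512__"
        if "avx2" in flag_set:
            return "-D__AVX256__"
    # Unknown architecture or no wide vector support: scalar fallback.
    return "-D__SCALAR__"


# Example: a CPU that reports AVX2 but not AVX-512.
define = simd_define("x86_64", ["sse4_2", "avx", "avx2"])
```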

Building for Specific Compute Capabilities

import os
from op_builder import FusedAdamBuilder

# Set target architectures (Ampere: 8.0 and 8.6; Hopper: 9.0)
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0;8.6;9.0"

builder = FusedAdamBuilder()

# Get compute capability flags
cc_args = builder.compute_capability_args()
print(f"Compile flags: {cc_args}")
# Output: ['-gencode=arch=compute_80,code=sm_80',
#          '-gencode=arch=compute_86,code=sm_86',
#          '-gencode=arch=compute_90,code=sm_90']

# Build extension
ext = builder.builder()

Testing Library Dependencies

from op_builder.builder import OpBuilder

class MyOpBuilder(OpBuilder):
    def is_compatible(self, verbose=False):
        # Check if cuBLAS is available
        if not self.has_function('cublasCreate', ['cublas'], verbose=verbose):
            self.warning("cuBLAS library not found")
            return False

        # Check if a required command-line tool is available
        if not self.command_exists('custom-tool'):
            self.warning("custom-tool command not found")
            return False

        return super().is_compatible(verbose)
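
Outside a builder class, a similar command check can be approximated with the standard library. The sketch below uses shutil.which as a stand-in for command_exists; the helper missing_commands is hypothetical and is not part of DeepSpeed.

```python
import shutil
from typing import Iterable, List


def missing_commands(commands: Iterable[str]) -> List[str]:
    """Return the commands that cannot be found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


# Example: report which required tools are absent before attempting a build.
missing = missing_commands(["surely-not-a-real-tool-3f9a"])
if missing:
    print(f"Missing required commands: {', '.join(missing)}")
```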

Related Pages

  • Setup - Package installation and build configuration
  • CPU_Adam - Example operator using OpBuilder
  • Async_IO - I/O operator using TorchCPUOpBuilder
  • Transformer - Transformer operator using CUDAOpBuilder
  • JIT_Compilation - Just-in-time compilation system
  • Accelerator - Hardware accelerator abstraction layer

Build Process Flow

The OpBuilder system follows this workflow:

1. Installation Time (setup.py):

  • Reads DS_BUILD_OPS and DS_BUILD_{OP_NAME} environment variables
  • Calls is_compatible() on each operator builder
  • For compatible ops with building enabled, calls builder() to create Extension objects
  • Passes extensions to setuptools for ahead-of-time compilation

2. Runtime (first load):

  • Application calls builder.load()
  • Checks if op was pre-compiled during installation
  • If pre-compiled: validates torch/CUDA versions and imports module
  • If not pre-compiled: calls jit_load() to compile on-demand

3. JIT Compilation:

  • Verifies ninja build system availability
  • Detects current GPU compute capabilities
  • Generates appropriate compiler flags
  • Calls torch.utils.cpp_extension.load() with sources and flags
  • Caches compiled module for future loads
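
The runtime step of this workflow (import the pre-compiled module if it exists, otherwise compile on demand) can be sketched as follows. This is a simplified illustration: the function load_op is hypothetical, it uses an ImportError probe as a stand-in for checking whether the op was pre-compiled, and it omits the torch/CUDA version validation.

```python
import importlib
from types import ModuleType
from typing import Callable


def load_op(absolute_name: str, jit_load: Callable[[], ModuleType]) -> ModuleType:
    """Import a pre-compiled op module if present, else fall back to JIT."""
    try:
        return importlib.import_module(absolute_name)
    except ImportError:
        # Not installed ahead of time: compile on demand instead.
        return jit_load()


# Example with a stub in place of a real JIT build.
fallback = ModuleType("jit_compiled_stub")
op = load_op("deepspeed_sketch_missing.ops.my_op", lambda: fallback)
```

Passing the JIT step as a callable keeps the dispatch logic separate from the compilation itself.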

Hardware Support

The OpBuilder system supports multiple hardware platforms:

Platform | Detection Method | Key Features
NVIDIA CUDA | torch.version.cuda | Compute capability targeting, cuBLAS/cuRAND linking, NVCC compilation
AMD ROCm | torch.version.hip | HIP code generation via hipify, gfx architecture detection, wavefront size configuration
Intel CPU | cpuinfo/lscpu | AVX512/AVX2 detection, march=native optimization, OpenMP threading
Intel SYCL | c2s command | SYCL extension transformation for Intel GPUs
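
The first two detection methods in this table can be illustrated with a small classifier over torch.version-style attributes. The function detect_platform is hypothetical; it takes the version object as an argument so the sketch runs without PyTorch installed.

```python
from types import SimpleNamespace
from typing import Any


def detect_platform(version: Any) -> str:
    """Classify a PyTorch build from its torch.version-style attributes."""
    # Check HIP first: ROCm builds set version.hip to a version string.
    if getattr(version, "hip", None):
        return "rocm"
    if getattr(version, "cuda", None):
        return "cuda"
    return "cpu"


# Example with a stand-in object; with PyTorch installed one could call
# detect_platform(torch.version) instead.
platform_name = detect_platform(SimpleNamespace(cuda="12.1", hip=None))
```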

Common Build Patterns

Conditional CUDA/CPU Compilation:

def sources(self):
    sources = ['csrc/adam/cpu_adam.cpp']
    if not self.build_for_cpu:
        sources.append('csrc/adam/cuda_adam.cu')
    return sources

Platform-Specific Flags:

import sys  # needed for the platform check below

def cxx_args(self):
    args = ['-O3', '-std=c++17']
    if sys.platform == "win32":
        args = ['-O2']  # Windows: replaces the flag list with a lower optimization level
    else:
        args += ['-fopenmp', '-Wno-reorder']
    return args

Version-Dependent Compilation:

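from op_builder.builder import installed_cuda_version

def nvcc_args(self):
    # Only the major version decides which C++ standard to request
    cuda_major, _ = installed_cuda_version()
    if cuda_major > 10:
        std_lib = '-std=c++17'
    else:
        std_lib = '-std=c++14'
    return ['-O3', std_lib, '--use_fast_math']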
