
Implementation:OpenRLHF DeepspeedStrategy setup_distributed

From Leeroopedia


Knowledge Sources
Domains: Distributed_Computing, Training_Infrastructure
Last Updated: 2026-02-07 00:00 GMT

Overview

A concrete tool for initializing the distributed training backend, provided by OpenRLHF's DeepspeedStrategy.

Description

The setup_distributed method on DeepspeedStrategy initializes the NCCL distributed backend, sets CUDA devices, configures random seeds for reproducibility, and creates a 3D device mesh for data/sequence/tensor parallelism. It also computes the gradient accumulation steps from the configured batch sizes and world size.
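The gradient-accumulation computation mentioned above can be sketched as follows. This is a simplified illustration, not the OpenRLHF source: the actual formula may also factor in the ring attention group size.

```python
def accumulated_gradient(train_batch_size: int,
                         micro_train_batch_size: int,
                         world_size: int) -> int:
    # Each optimizer step must consume train_batch_size samples in total,
    # while each forward/backward pass consumes micro_train_batch_size
    # samples on each of the world_size processes.
    per_step = micro_train_batch_size * world_size
    if train_batch_size % per_step != 0:
        raise ValueError("train_batch_size must be divisible by "
                         "micro_train_batch_size * world_size")
    return train_batch_size // per_step
```

For example, a global batch of 128 with micro-batch 4 on 8 GPUs yields 4 accumulation steps.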

Usage

Call this method on a strategy object immediately after creating it with get_strategy and before loading any models or data. It must be called exactly once.
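The call-exactly-once contract can be enforced with a small wrapper. This is an illustrative helper, not part of OpenRLHF's API:

```python
class OnceGuard:
    """Wraps a strategy-like object and rejects repeated initialization."""

    def __init__(self, strategy):
        self._strategy = strategy
        self._done = False

    def setup_distributed(self, **kwargs):
        if self._done:
            raise RuntimeError("setup_distributed must be called exactly once")
        self._strategy.setup_distributed(**kwargs)
        self._done = True
```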

Code Reference

Source Location

  • Repository: OpenRLHF
  • File: openrlhf/utils/deepspeed/deepspeed.py
  • Lines: L79-113

Signature

def setup_distributed(self, timeout=timedelta(minutes=60)) -> None:
    """
    Initialize distributed training backend.

    Args:
        timeout (timedelta): Timeout for distributed initialization.
            Default: 60 minutes. Increase for large clusters.

    Side Effects:
        - Initializes NCCL backend via deepspeed.init_distributed()
        - Sets CUDA device based on LOCAL_RANK
        - Creates device mesh with (dp, sp, tp) dimensions
        - Computes accumulated_gradient from batch sizes
        - Sets up ring attention group if ring_attn_size > 1
    """

Import

from openrlhf.utils.deepspeed import DeepspeedStrategy

I/O Contract

Inputs

Name     Type       Required  Description
timeout  timedelta  No        Distributed init timeout (default: 60 minutes)

Outputs

Name                       Type        Description
(side effect)              None        Initializes distributed backend in-place
self.world_size            int         Total number of processes
self.accumulated_gradient  int         Gradient accumulation steps
self.ds_device_mesh        DeviceMesh  3D (dp, sp, tp) device mesh
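The shape of the (dp, sp, tp) mesh above can be sketched as a simple divisibility computation. The assumption that the data-parallel dimension absorbs whatever ranks remain after the sequence- and tensor-parallel groups is illustrative, not taken from the OpenRLHF source:

```python
def mesh_shape(world_size: int, sp_size: int = 1, tp_size: int = 1) -> tuple:
    # Assumption: dp is the residual dimension after carving out
    # sequence-parallel (sp) and tensor-parallel (tp) groups.
    if world_size % (sp_size * tp_size) != 0:
        raise ValueError("world_size must be divisible by sp_size * tp_size")
    dp_size = world_size // (sp_size * tp_size)
    return (dp_size, sp_size, tp_size)
```

For instance, 16 processes with sp=2 and tp=2 give a (4, 2, 2) mesh.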

Usage Examples

Standard Setup

from datetime import timedelta
from openrlhf.utils.utils import get_strategy

strategy = get_strategy(args)
strategy.setup_distributed(timeout=timedelta(minutes=60))

# Now ready for model loading and training
print(f"World size: {strategy.world_size}")
print(f"Gradient accumulation: {strategy.accumulated_gradient}")

Related Pages

Implements Principle

Requires Environment
