
Implementation:Deepspeedai DeepSpeed Initialize For RM

From Leeroopedia


Overview

A concrete tool, provided by the DeepSpeed library, for initializing a DeepSpeed engine for reward model training in the RLHF pipeline.

Description

Calls deepspeed.initialize() to wrap the reward model in a standard DeepSpeedEngine. The reward model is typically the SFT model with a linear head added for scalar reward prediction. The configuration uses ZeRO Stage 2 or 3 with the same mixed-precision settings as SFT. Because reward model training involves only forward and backward passes, with no text generation, the hybrid engine is not required.
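As an illustration of the configuration described above, the following is a minimal sketch of a helper that builds such a config dict. The helper name make_rm_config and the default batch values are assumptions made for this example, not part of DeepSpeed; the config keys themselves are standard DeepSpeed settings. It also makes explicit the relation DeepSpeed expects between the global batch size, the per-GPU micro-batch size, the gradient accumulation steps, and the number of data-parallel ranks.

```python
def make_rm_config(zero_stage=2, micro_batch_per_gpu=4, grad_accum_steps=1, world_size=8):
    """Build a DeepSpeed config dict for reward model training (illustrative values)."""
    if zero_stage not in (2, 3):
        raise ValueError("this sketch only covers ZeRO stage 2 or 3")
    return {
        # DeepSpeed expects train_batch_size ==
        #   train_micro_batch_size_per_gpu * gradient_accumulation_steps * world size
        "train_batch_size": micro_batch_per_gpu * grad_accum_steps * world_size,
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps,
        "zero_optimization": {"stage": zero_stage},
        # Same mixed-precision setting as assumed for the SFT stage
        "bf16": {"enabled": True},
    }


rm_config = make_rm_config(zero_stage=2, micro_batch_per_gpu=4, grad_accum_steps=1, world_size=8)
print(rm_config["train_batch_size"])  # 32
```

The resulting dict can be passed as the config argument of deepspeed.initialize(). Switching zero_stage to 3 additionally partitions the model parameters, which may help when the reward model does not fit in a single device's memory.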

The reward model architecture adds a scalar projection head to the base SFT model. During initialization, deepspeed.initialize() wraps this modified model in a DeepSpeedEngine that handles ZeRO-partitioned optimizer states, gradient communication across data-parallel ranks, and mixed-precision training. The resulting engine supports the standard training loop pattern where comparison pairs are processed, the Bradley-Terry loss is computed, and gradients are propagated through both the scalar head and the base transformer.
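To make the Bradley-Terry loss mentioned above concrete, here is a small framework-free sketch that computes it on plain Python floats. The function name bradley_terry_loss is hypothetical, and a real training loop would compute the same quantity on tensors. The sketch uses the identity -log(sigmoid(m)) = log(1 + exp(-m)) and branches on the sign of the margin so that large margins do not overflow.

```python
import math


def bradley_terry_loss(chosen_rewards, rejected_rewards):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over a batch of scalar rewards."""
    if len(chosen_rewards) != len(rejected_rewards) or not chosen_rewards:
        raise ValueError("need equally sized, non-empty reward lists")
    total = 0.0
    for r_c, r_r in zip(chosen_rewards, rejected_rewards):
        margin = r_c - r_r
        # -log(sigmoid(m)) == log(1 + exp(-m)); pick the branch whose exp() cannot overflow
        if margin >= 0:
            total += math.log1p(math.exp(-margin))
        else:
            total += -margin + math.log1p(math.exp(margin))
    return total / len(chosen_rewards)


# Equal rewards give log(2); the loss shrinks as the chosen reward pulls ahead
print(bradley_terry_loss([0.0], [0.0]))
print(bradley_terry_loss([5.0], [0.0]))
```

The loss is small when the preferred response scores well above the rejected one and grows roughly linearly when the ordering is reversed, which is what pushes the scalar head to rank comparison pairs correctly.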

Code Reference

Property     Value
Repository   https://github.com/deepspeedai/DeepSpeed
File         deepspeed/__init__.py (L80-252), deepspeed/runtime/engine.py (L206-420)
Signature    def initialize(args=None, model=None, optimizer=None, model_parameters=None, training_data=None, lr_scheduler=None, distributed_port=29500, mpu=None, dist_init_required=None, collate_fn=None, config=None, mesh_param=None, config_params=None) -> Tuple[DeepSpeedEngine, Optimizer, DataLoader, LRScheduler]
Import       import deepspeed

I/O Contract

Inputs

Name              Type                      Required  Description
model             torch.nn.Module           Yes       SFT model with reward scalar head appended
config            dict or str               Yes       DeepSpeed configuration with ZeRO settings
training_data     torch.utils.data.Dataset  No        Comparison dataset with preferred and rejected pairs
model_parameters  iterable                  No        Parameters to optimize
optimizer         Optimizer                 No        Custom optimizer (otherwise created from config)
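The comparison dataset passed as training_data has to be batched into preferred and rejected sequences of a common length. The following is a minimal sketch of a collate function for that purpose, written on plain Python lists for clarity. The function name collate_pairs, the "chosen" and "rejected" example keys, and the pad_id default are all assumptions of this example; a real pipeline would return tensors and usually attention masks as well.

```python
def collate_pairs(examples, pad_id=0):
    """Pad chosen/rejected token-id lists to one shared length per batch."""
    # Longest sequence across both sides of every pair in the batch
    max_len = max(len(seq) for ex in examples for seq in (ex["chosen"], ex["rejected"]))

    def pad(seq):
        # Right-pad with pad_id up to max_len
        return seq + [pad_id] * (max_len - len(seq))

    return {
        "chosen_input_ids": [pad(ex["chosen"]) for ex in examples],
        "rejected_input_ids": [pad(ex["rejected"]) for ex in examples],
    }


batch = collate_pairs([
    {"chosen": [1, 2, 3], "rejected": [4]},
    {"chosen": [5], "rejected": [6, 7]},
])
print(batch["chosen_input_ids"])    # [[1, 2, 3], [5, 0, 0]]
print(batch["rejected_input_ids"])  # [[4, 0, 0], [6, 7, 0]]
```

A function of this shape could be supplied through the collate_fn argument of deepspeed.initialize() together with training_data, in which case the returned dataloader would yield batches with these two keys.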

Outputs

Name          Type             Description
engine        DeepSpeedEngine  Wrapped reward model for distributed training
optimizer     Optimizer        Wrapped optimizer instance
dataloader    DataLoader       DataLoader if training_data was provided, otherwise None
lr_scheduler  LRScheduler      Learning rate scheduler if configured, otherwise None

Usage Example

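import deepspeed
import torch.nn.functional as F

# RewardModel is assumed to be user-defined: the SFT model plus a scalar head
reward_model = RewardModel.from_pretrained("sft_checkpoint/")
rm_config = {
    "train_batch_size": 32,
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},
    # No optimizer is passed to initialize(), so the config defines one
    # (optimizer type and learning rate are illustrative)
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}
engine, _, _, _ = deepspeed.initialize(
    model=reward_model,
    config=rm_config,
    model_parameters=reward_model.parameters(),
)
# Reward model training loop; rm_dataloader is assumed to yield batches of
# comparison pairs as token-id tensors
for batch in rm_dataloader:
    chosen_rewards = engine(batch["chosen_input_ids"].to(engine.device))
    rejected_rewards = engine(batch["rejected_input_ids"].to(engine.device))
    # Bradley-Terry loss; logsigmoid is the numerically stable form of log(sigmoid(x))
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    engine.backward(loss)
    engine.step()
engine.save_checkpoint("reward_model_checkpoint/")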


Last updated: 2026-02-09 00:00 GMT
