Implementation:Deepspeedai DeepSpeed Initialize For RM

Overview

A concrete tool, provided by the DeepSpeed library, for initializing a DeepSpeed engine for reward model training in the RLHF pipeline.

Description

Reward model training calls deepspeed.initialize(), which wraps the model in a standard DeepSpeedEngine. The reward model is typically the SFT model with a linear head added for scalar reward prediction. The configuration uses ZeRO Stage 2 or 3 with the same mixed-precision settings as SFT. Because reward model training involves only forward and backward passes, with no text generation, the hybrid engine is not required.
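
As an illustration of the configuration described above, the following minimal sketch builds a config dict for either ZeRO stage and either mixed-precision mode. The helper name make_rm_config and the default learning rate are assumptions made for this example and are not part of DeepSpeed; the keys themselves ("train_batch_size", "zero_optimization", "bf16", "fp16", "optimizer") are standard DeepSpeed config keys.

```python
from typing import Any, Dict


def make_rm_config(
    zero_stage: int = 2,
    precision: str = "bf16",
    train_batch_size: int = 32,
    lr: float = 1e-5,  # illustrative value, not a recommendation
) -> Dict[str, Any]:
    """Build a DeepSpeed config dict for reward model training (hypothetical helper)."""
    if zero_stage not in (2, 3):
        raise ValueError("this page describes ZeRO stage 2 or 3 only")
    if precision not in ("bf16", "fp16"):
        raise ValueError("precision must be 'bf16' or 'fp16'")
    return {
        "train_batch_size": train_batch_size,
        "zero_optimization": {"stage": zero_stage},
        # Exactly one mixed-precision section is enabled.
        precision: {"enabled": True},
        # Declared here so that initialize() can build the optimizer
        # when none is passed in explicitly.
        "optimizer": {"type": "AdamW", "params": {"lr": lr}},
    }


if __name__ == "__main__":
    print(make_rm_config(zero_stage=3, precision="fp16"))
```

The resulting dict can be passed as the config argument of deepspeed.initialize(). No hybrid engine section is included, since the reward model never generates text.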

The reward model architecture adds a scalar projection head to the base SFT model. During initialization, deepspeed.initialize() wraps this modified model in a DeepSpeedEngine that handles ZeRO-partitioned optimizer states, gradient communication across data-parallel ranks, and mixed-precision training. The resulting engine supports the standard training loop pattern where comparison pairs are processed, the Bradley-Terry loss is computed, and gradients are propagated through both the scalar head and the base transformer.
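
The Bradley-Terry loss mentioned above can be worked through on plain numbers. The sketch below is a framework-free illustration, assuming the scalar rewards for each comparison pair have already been computed; the function names are chosen for this example. It uses a numerically stable form of log(sigmoid(x)), which avoids overflow when the reward margin is large.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Stable log(sigmoid(x)) = min(x, 0) - log(1 + exp(-|x|))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def bradley_terry_loss(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over all comparison pairs."""
    if len(chosen) != len(rejected) or len(chosen) == 0:
        raise ValueError("chosen and rejected must be non-empty and equal in length")
    total = sum(log_sigmoid(c - r) for c, r in zip(chosen, rejected))
    return -total / len(chosen)


if __name__ == "__main__":
    # Equal rewards: the model cannot tell the pair apart, so the loss is log(2).
    print(bradley_terry_loss([0.0], [0.0]))
    # A larger margin in favour of the chosen response lowers the loss.
    print(bradley_terry_loss([2.0], [0.0]))
```

The loss shrinks toward zero as the chosen reward exceeds the rejected reward by a wider margin, and grows roughly linearly when the ordering is reversed.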

Code Reference

Property	Value
Repository	https://github.com/deepspeedai/DeepSpeed
File	`deepspeed/__init__.py` (L80-252), `deepspeed/runtime/engine.py` (L206-420)
Signature	`def initialize(args=None, model=None, optimizer=None, model_parameters=None, training_data=None, lr_scheduler=None, distributed_port=29500, mpu=None, dist_init_required=None, collate_fn=None, config=None, mesh_param=None, config_params=None) -> Tuple[DeepSpeedEngine, Optimizer, DataLoader, LRScheduler]`
Import	`import deepspeed`

I/O Contract

Inputs

Name	Type	Required	Description
model	torch.nn.Module	Yes	SFT model with reward scalar head appended
config	dict or str	Yes	DeepSpeed configuration with ZeRO settings
training_data	torch.utils.data.Dataset	No	Comparison dataset with preferred and rejected pairs
model_parameters	iterable	No	Parameters to optimize
optimizer	Optimizer	No	Custom optimizer (otherwise created from config)

Outputs

Name	Type	Description
engine	DeepSpeedEngine	Wrapped reward model for distributed training
optimizer	Optimizer	Wrapped optimizer instance
dataloader	DataLoader	DataLoader if training_data was provided, otherwise None
lr_scheduler	LRScheduler	Learning rate scheduler if configured, otherwise None

Usage Example

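import deepspeed
import torch.nn.functional as F

# RewardModel is a user-defined class (not part of DeepSpeed) that adds a
# scalar head to the SFT model.
reward_model = RewardModel.from_pretrained("sft_checkpoint/")
rm_config = {
    "train_batch_size": 32,
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},
    # No optimizer is passed to initialize(), so one is declared here.
    # The optimizer type and learning rate are illustrative values.
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}
engine, _, _, _ = deepspeed.initialize(
    model=reward_model,
    config=rm_config,
    model_parameters=reward_model.parameters(),
)
# Reward model training loop. rm_dataloader is assumed to be defined elsewhere
# and to yield batches of tokenized comparison pairs.
for batch in rm_dataloader:
    chosen_rewards = engine(batch["chosen_input_ids"].to(engine.device))
    rejected_rewards = engine(batch["rejected_input_ids"].to(engine.device))
    # Bradley-Terry loss; logsigmoid is the numerically stable form of
    # log(sigmoid(x)).
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    engine.backward(loss)
    engine.step()
engine.save_checkpoint("reward_model_checkpoint/")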
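
The usage example sets only train_batch_size. DeepSpeed requires train_batch_size to equal train_micro_batch_size_per_gpu times gradient_accumulation_steps times the number of data-parallel ranks. The small sketch below checks that arithmetic for a given world size; the helper name micro_batch_per_gpu is hypothetical and is used only for illustration.

```python
def micro_batch_per_gpu(
    train_batch_size: int,
    world_size: int,
    gradient_accumulation_steps: int = 1,
) -> int:
    """Return the per-GPU micro-batch size implied by the global batch size."""
    if world_size <= 0 or gradient_accumulation_steps <= 0:
        raise ValueError("world_size and gradient_accumulation_steps must be positive")
    denom = world_size * gradient_accumulation_steps
    if train_batch_size % denom != 0:
        raise ValueError(
            "train_batch_size must be divisible by "
            "world_size * gradient_accumulation_steps"
        )
    return train_batch_size // denom


if __name__ == "__main__":
    # A global batch of 32 on 8 data-parallel ranks, no accumulation.
    print(micro_batch_per_gpu(32, world_size=8))
```

Running this check before launching a job can catch a global batch size that does not divide evenly across the chosen number of GPUs.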

Related Pages

Principle:Deepspeedai_DeepSpeed_Reward_Model_Training

Knowledge Sources

Last updated: 2026-02-09 00:00 GMT
