Implementation:Deepspeedai DeepSpeed Initialize For RM
Overview
A concrete tool, provided by the DeepSpeed library, for initializing a DeepSpeed engine for reward model training in the RLHF pipeline.
Description
Reward model training calls deepspeed.initialize() and uses the standard DeepSpeedEngine. The reward model is typically the SFT model with a linear head added for scalar reward prediction. The configuration uses ZeRO Stage 2 or 3 with the same mixed precision settings as SFT. Because reward model training involves only forward and backward passes, with no text generation, the hybrid engine is not required.
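A minimal sketch of such a configuration is shown below. The helper name build_rm_config and its arguments are hypothetical, and bf16 is assumed as the mixed precision setting; the dictionary keys themselves are standard DeepSpeed configuration keys.

```python
def build_rm_config(micro_batch_per_gpu, grad_accum_steps, world_size, zero_stage=2):
    """Hypothetical helper that assembles a DeepSpeed config dict for reward model training."""
    if zero_stage not in (2, 3):
        raise ValueError("this sketch only covers ZeRO Stage 2 or 3")
    return {
        # DeepSpeed expects train_batch_size to equal
        # micro batch per GPU * gradient accumulation steps * data-parallel world size.
        "train_batch_size": micro_batch_per_gpu * grad_accum_steps * world_size,
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps,
        "zero_optimization": {"stage": zero_stage},
        # Assumption: bf16 mirrors the mixed precision setting used for SFT.
        "bf16": {"enabled": True},
        # No "hybrid_engine" section: reward model training does not generate text.
    }


rm_config = build_rm_config(micro_batch_per_gpu=4, grad_accum_steps=2, world_size=4)
```

Deriving train_batch_size from the other three values keeps the batch-size settings consistent with each other, so DeepSpeed does not reject the configuration at initialization.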
The reward model architecture adds a scalar projection head to the base SFT model. During initialization, deepspeed.initialize() wraps this modified model in a DeepSpeedEngine that handles ZeRO-partitioned optimizer states, gradient communication across data-parallel ranks, and mixed-precision training. The resulting engine supports the standard training loop pattern where comparison pairs are processed, the Bradley-Terry loss is computed, and gradients are propagated through both the scalar head and the base transformer.
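The Bradley-Terry loss mentioned above can be illustrated without any tensor library. The sketch below works on plain Python floats rather than tensors, and the function names log_sigmoid and bradley_terry_loss are hypothetical; it is meant only to show the arithmetic applied to each comparison pair.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # For x >= 0: log(sigmoid(x)) = -log(1 + exp(-x)).
    # For x < 0:  log(sigmoid(x)) = x - log(1 + exp(x)).
    # Each branch only exponentiates a non-positive number, so exp cannot overflow.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_loss(chosen_rewards, rejected_rewards):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over all comparison pairs."""
    if len(chosen_rewards) != len(rejected_rewards) or not chosen_rewards:
        raise ValueError("expected two non-empty reward lists of equal length")
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    return -sum(log_sigmoid(m) for m in margins) / len(margins)
```

The loss shrinks as the reward for the preferred response exceeds the reward for the rejected one, which is what drives the scalar head to rank the preferred response higher.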
Code Reference
| Property | Value |
|---|---|
| Repository | https://github.com/deepspeedai/DeepSpeed |
| File | deepspeed/__init__.py (L80-252), deepspeed/runtime/engine.py (L206-420) |
| Signature | def initialize(args=None, model=None, optimizer=None, model_parameters=None, training_data=None, lr_scheduler=None, distributed_port=29500, mpu=None, dist_init_required=None, collate_fn=None, config=None, mesh_param=None, config_params=None) -> Tuple[DeepSpeedEngine, Optimizer, DataLoader, LRScheduler] |
| Import | import deepspeed |
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | torch.nn.Module | Yes | SFT model with reward scalar head appended |
| config | dict or str | Yes | DeepSpeed configuration, given as a dict or a path to a JSON file, including ZeRO settings |
| training_data | torch.utils.data.Dataset | No | Comparison dataset with preferred and rejected pairs |
| model_parameters | iterable | No | Parameters to optimize |
| optimizer | Optimizer | No | Custom optimizer (otherwise created from config) |
Outputs
| Name | Type | Description |
|---|---|---|
| engine | DeepSpeedEngine | Wrapped reward model for distributed training |
| optimizer | Optimizer | Wrapped optimizer instance |
| dataloader | DataLoader | DataLoader if training_data was provided, otherwise None |
| lr_scheduler | LRScheduler | Learning rate scheduler if configured, otherwise None |
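The contract above can be sketched as a small wrapper that passes the comparison dataset through training_data so that DeepSpeed builds the DataLoader. The helper name init_reward_engine and its arguments are hypothetical; only deepspeed.initialize() and its keyword arguments come from the signature listed in the Code Reference.

```python
def init_reward_engine(reward_model, comparison_dataset, ds_config, collate_fn=None):
    """Hypothetical helper: wrap a reward model and build its DataLoader in one call."""
    # Imported lazily so the helper can be defined on machines without DeepSpeed installed.
    import deepspeed

    engine, optimizer, dataloader, lr_scheduler = deepspeed.initialize(
        model=reward_model,
        model_parameters=reward_model.parameters(),
        training_data=comparison_dataset,
        collate_fn=collate_fn,
        config=ds_config,
    )
    # dataloader is None when training_data is None; lr_scheduler is None
    # when no scheduler is configured or passed in.
    return engine, optimizer, dataloader, lr_scheduler
```

Callers that already build their own DataLoader can pass comparison_dataset=None and ignore the third return value.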
Usage Example
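import torch
import torch.nn.functional as F
import deepspeed

# RewardModel is a user-defined class (not part of DeepSpeed) that adds a scalar head to the SFT model
reward_model = RewardModel.from_pretrained("sft_checkpoint/")

rm_config = {
    "train_batch_size": 32,
    # An optimizer must come from the config or be passed to initialize(); these values are illustrative
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},
}

engine, _, _, _ = deepspeed.initialize(
    model=reward_model,
    config=rm_config,
    model_parameters=reward_model.parameters(),
)

# Reward model training loop; rm_dataloader is assumed to yield batches of comparison pairs
for batch in rm_dataloader:
    chosen_rewards = engine(batch["chosen_input_ids"].to(engine.device))
    rejected_rewards = engine(batch["rejected_input_ids"].to(engine.device))
    # Bradley-Terry loss; logsigmoid is the numerically stable form of log(sigmoid(x))
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    engine.backward(loss)
    engine.step()

engine.save_checkpoint("reward_model_checkpoint/")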
Related Pages
Knowledge Sources
- https://github.com/deepspeedai/DeepSpeed
- https://arxiv.org/abs/2203.02155
- https://arxiv.org/abs/1706.03741
Last updated: 2026-02-09 00:00 GMT