Implementation:Allenai Open instruct Reward Modeling Main
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning from Human Feedback, Reward Modeling, Preference Learning, Distributed Training |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for end-to-end reward model training using the Bradley-Terry preference loss, provided by Open Instruct.
Description
The main() function in reward_modeling.py orchestrates the complete reward model training pipeline. It handles:
- Tokenizer and model setup: Loads the tokenizer and initializes the reward model from a pre-trained checkpoint with a single-output score head.
- Dataset preparation: Loads and caches preference datasets, creates data loaders with appropriate collation.
- Distributed training infrastructure: Configures HuggingFace Accelerate for multi-GPU and multi-node training with gradient accumulation.
- Training loop: Iterates over preference pairs, computes Bradley-Terry loss, updates model parameters, and logs metrics.
- Evaluation: Periodically evaluates the model on a held-out preference dataset.
- Model saving and publishing: Saves the trained model and optionally pushes it to HuggingFace Hub.
The training loop concatenates chosen and rejected sequences into a single batch, performs one forward pass through the reward model, then splits the resulting rewards to compute the preference loss. A single combined forward pass is more efficient than separate passes for the chosen and rejected sequences because it keeps the GPU saturated with one larger batch and amortizes per-call kernel-launch and framework overhead.
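The concatenate-then-split pattern can be sketched with a toy scorer in place of the real transformer; this is an illustrative example (the linear "reward head" and mean-pooling are assumptions, not open_instruct code) showing that one forward pass over the concatenated batch yields the same rewards and Bradley-Terry loss as two separate passes:

```python
# Toy demonstration (not the Open Instruct model): a linear "reward head"
# over mean-pooled embeddings stands in for the transformer scorer.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
scorer = torch.nn.Linear(16, 1)       # stand-in for model + score head
chosen = torch.randn(4, 8, 16)        # (batch, seq_len, hidden)
rejected = torch.randn(4, 8, 16)

def reward(x):
    # Mean-pool over the sequence, then project to a scalar reward
    return scorer(x.mean(dim=1)).squeeze(-1)

# One forward pass over the concatenated batch, then split
both = torch.cat((chosen, rejected), dim=0)
r = reward(both)
chosen_r, rejected_r = r[:chosen.shape[0]], r[chosen.shape[0]:]
loss_joint = -F.logsigmoid(chosen_r - rejected_r).mean()

# Two separate forward passes give an identical result
loss_split = -F.logsigmoid(reward(chosen) - reward(rejected)).mean()
assert torch.allclose(loss_joint, loss_split)
```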
Usage
Use this function as the main entry point for reward model training. It is typically invoked via command-line argument parsing but can also be called programmatically by constructing the appropriate dataclass arguments.
Code Reference
Source Location
- Repository: Open Instruct
- File:
open_instruct/reward_modeling.py, lines 165-419
Signature
def main(args: Args, tc: TokenizerConfig, model_config: ModelConfig) -> None:
Import
from open_instruct.reward_modeling import main, Args
from open_instruct.dataset_transformation import TokenizerConfig
from open_instruct.model_utils import ModelConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| args | Args | Yes | Training configuration dataclass containing hyperparameters and experiment settings. |
| tc | TokenizerConfig | Yes | Tokenizer configuration including tokenizer name/path, chat template, and special token settings. |
| model_config | ModelConfig | Yes | Model configuration including model checkpoint path, revision, attention implementation, and gradient checkpointing settings. |
Key Parameters in Args
| Parameter | Type | Default | Description |
|---|---|---|---|
| per_device_train_batch_size | int | 1 | Forward batch size per GPU (micro batch size). |
| gradient_accumulation_steps | int | 8 | Number of micro-batches to accumulate before an optimizer step. |
| learning_rate | float | 2e-5 | Initial learning rate for the AdamW optimizer. |
| num_train_epochs | int | 1 | Number of full passes over the training dataset. |
| lr_scheduler_type | str | "linear" | Learning rate scheduler type. Options: linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup. |
| warm_up_steps | int | 0 | Number of warmup steps for the learning rate scheduler. |
| max_token_length | int | 512 | Maximum token length for sequences in the preference dataset. |
| max_prompt_token_length | int | 256 | Maximum token length for the prompt portion of sequences. |
| num_evals | int | 1 | Number of evaluation runs distributed throughout training. |
| seed | int | 1 | Random seed for reproducibility. |
| with_tracking | bool | False | Whether to log metrics to Weights & Biases. |
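The micro batch size, gradient accumulation, and world size jointly determine the effective global batch size and the number of optimizer steps. The arithmetic below follows the usual Accelerate conventions and is a sketch, not code lifted from open_instruct/reward_modeling.py:

```python
# Back-of-envelope batching arithmetic (assumed formulas, standard
# Accelerate convention: global batch = micro batch * accumulation * GPUs).
def effective_batch_size(per_device: int, grad_accum: int, world_size: int) -> int:
    """Preference pairs consumed per optimizer step across all GPUs."""
    return per_device * grad_accum * world_size

def total_optimizer_steps(num_examples: int, per_device: int, grad_accum: int,
                          world_size: int, epochs: int) -> int:
    """Optimizer steps over the whole run (drop-last batching)."""
    global_bs = effective_batch_size(per_device, grad_accum, world_size)
    return (num_examples // global_bs) * epochs

# Table defaults on a hypothetical 8-GPU node:
print(effective_batch_size(1, 8, 8))              # 64 pairs per step
print(total_optimizer_steps(64_000, 1, 8, 8, 1))  # 1000 steps
```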
Outputs
| Name | Type | Description |
|---|---|---|
| Trained model | Saved to disk | The trained reward model is saved to args.output_dir via save_with_accelerate; includes model weights, tokenizer, and configuration. |
| Metrics | Logged to W&B/TensorBoard | Training metrics (accuracy, loss, chosen/rejected scores, reward margin, learning rate) are logged at each optimizer step. Evaluation metrics are logged at configured intervals. |
| HuggingFace Hub upload | Optional | If args.push_to_hub is True, the trained model is uploaded to the specified HuggingFace Hub repository. |
Usage Examples
Command-Line Usage
python open_instruct/reward_modeling.py \
--dataset_mixer_list allenai/tulu-3-wildchat-reused-on-policy-8b 1.0 \
--model_name_or_path allenai/tulu-2-7b \
--num_labels 1 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--learning_rate 2e-5 \
--num_train_epochs 1 \
--lr_scheduler_type linear \
--output_dir output/reward_model \
--with_tracking
Programmatic Usage
from open_instruct.reward_modeling import main, Args
from open_instruct.dataset_transformation import TokenizerConfig
from open_instruct.model_utils import ModelConfig
args = Args(
dataset_mixer_list=["allenai/tulu-3-wildchat-reused-on-policy-8b", "1.0"],
per_device_train_batch_size=1,
gradient_accumulation_steps=8,
learning_rate=2e-5,
num_train_epochs=1,
output_dir="output/reward_model",
)
tc = TokenizerConfig(tokenizer_name_or_path="allenai/tulu-2-7b")
model_config = ModelConfig(model_name_or_path="allenai/tulu-2-7b")
main(args, tc, model_config)
Training Loop Detail
The core training loop performs the following operations at each step:
# Concatenate chosen and rejected into a single batch for efficient forward pass
query_responses = torch.cat(
(data[CHOSEN_INPUT_IDS_KEY], data[REJECTED_INPUT_IDS_KEY]), dim=0
)
# Forward pass: get scalar rewards for all sequences
_, predicted_reward, _ = get_reward(
model, query_responses, tokenizer.pad_token_id, 0
)
# Split rewards back into chosen and rejected
chosen_reward = predicted_reward[:data[CHOSEN_INPUT_IDS_KEY].shape[0]]
rejected_reward = predicted_reward[data[CHOSEN_INPUT_IDS_KEY].shape[0]:]
# Compute Bradley-Terry loss
accuracy = (chosen_reward > rejected_reward).float().mean()
loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()
# Backward pass and optimizer step
accelerator.backward(loss)
optimizer.step()
optimizer.zero_grad()
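The same step can be exercised end to end on toy data. In this hedged sketch a linear scorer plays the role of the reward model and random feature vectors play the role of tokenized sequences (get_reward and Accelerate are omitted); it shows the Bradley-Terry loss driving chosen rewards above rejected ones:

```python
# Miniature version of the step above; illustrative only, not the
# open_instruct implementation.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
scorer = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(scorer.parameters(), lr=1e-2)

# Toy preference data: "chosen" features carry a consistent positive offset
chosen = torch.randn(64, 16) + 0.5
rejected = torch.randn(64, 16) - 0.5

for _ in range(50):
    # Single forward pass over the concatenated batch, then split
    rewards = scorer(torch.cat((chosen, rejected), dim=0)).squeeze(-1)
    chosen_reward, rejected_reward = rewards[:64], rewards[64:]
    loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

accuracy = (chosen_reward > rejected_reward).float().mean()
print(f"final accuracy {accuracy:.2f}, loss {loss.item():.4f}")
```

After a few dozen steps the accuracy approaches 1.0 and the loss falls well below ln(2) ≈ 0.693, the value at which chosen and rejected rewards are indistinguishable.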
Dependencies
| Package | Module | Purpose |
|---|---|---|
| accelerate | Accelerator | Distributed training orchestration (multi-GPU, multi-node) |
| deepspeed | deepspeed | ZeRO optimizer state sharding for memory-efficient training |
| transformers | AutoModelForSequenceClassification | Reward model architecture (transformer + score head) |
| transformers | get_scheduler | Learning rate scheduling (linear, cosine, etc.) |
| torch | torch.nn.functional | Loss computation (F.logsigmoid) |
| torch | torch.optim | AdamW optimizer |
| wandb | wandb | Experiment tracking and metric logging |
| open_instruct | dataset_transformation | Dataset loading, tokenization, and collation |
| open_instruct | model_utils | Reward extraction (get_reward), model saving, dropout disabling |
| open_instruct | reward_modeling_eval | Periodic evaluation via the evaluate function |
Related Pages
Implements Principle
- Principle:Allenai_Open_instruct_Reward_Model_Training
Requires Environment
- Environment:Allenai_Open_instruct_CUDA_GPU_Training