
Implementation:Allenai Open instruct Reward Modeling Main

From Leeroopedia


Knowledge Sources
Domains: Reinforcement Learning from Human Feedback, Reward Modeling, Preference Learning, Distributed Training
Last Updated: 2026-02-07 00:00 GMT

Overview

Concrete tool for end-to-end reward model training using the Bradley-Terry preference loss, provided by Open Instruct.

Description

The main() function in reward_modeling.py orchestrates the complete reward model training pipeline. It handles:

  1. Tokenizer and model setup: Loads the tokenizer and initializes the reward model from a pre-trained checkpoint with a single-output score head.
  2. Dataset preparation: Loads and caches preference datasets, creates data loaders with appropriate collation.
  3. Distributed training infrastructure: Configures HuggingFace Accelerate for multi-GPU and multi-node training with gradient accumulation.
  4. Training loop: Iterates over preference pairs, computes Bradley-Terry loss, updates model parameters, and logs metrics.
  5. Evaluation: Periodically evaluates the model on a held-out preference dataset.
  6. Model saving and publishing: Saves the trained model and optionally pushes it to HuggingFace Hub.

The training loop concatenates chosen and rejected sequences into a single batch, performs one forward pass through the reward model, then splits the resulting rewards to compute the preference loss. This is more efficient than running separate forward passes for the chosen and rejected sequences: it halves the number of forward passes and keeps the GPU saturated with a larger batch. Because transformers normalize activations per sequence (layer normalization, not batch normalization), the combined batch produces the same rewards as two separate passes.

Usage

Use this function as the main entry point for reward model training. It is typically invoked via command-line argument parsing but can also be called programmatically by constructing the appropriate dataclass arguments.

Code Reference

Source Location

  • Repository: Open Instruct
  • File: open_instruct/reward_modeling.py, lines 165-419

Signature

def main(args: Args, tc: TokenizerConfig, model_config: ModelConfig) -> None:

Import

from open_instruct.reward_modeling import main, Args
from open_instruct.dataset_transformation import TokenizerConfig
from open_instruct.model_utils import ModelConfig

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| args | Args | Yes | Training configuration dataclass containing hyperparameters and experiment settings. |
| tc | TokenizerConfig | Yes | Tokenizer configuration including tokenizer name/path, chat template, and special token settings. |
| model_config | ModelConfig | Yes | Model configuration including model checkpoint path, revision, attention implementation, and gradient checkpointing settings. |

Key Parameters in Args

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| per_device_train_batch_size | int | 1 | Forward batch size per GPU (micro batch size). |
| gradient_accumulation_steps | int | 8 | Number of micro-batches to accumulate before an optimizer step. |
| learning_rate | float | 2e-5 | Initial learning rate for the AdamW optimizer. |
| num_train_epochs | int | 1 | Number of full passes over the training dataset. |
| lr_scheduler_type | str | "linear" | Learning rate scheduler type. Options: linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup. |
| warm_up_steps | int | 0 | Number of warmup steps for the learning rate scheduler. |
| max_token_length | int | 512 | Maximum token length for sequences in the preference dataset. |
| max_prompt_token_length | int | 256 | Maximum token length for the prompt portion of sequences. |
| num_evals | int | 1 | Number of evaluation runs distributed throughout training. |
| seed | int | 1 | Random seed for reproducibility. |
| with_tracking | bool | False | Whether to log metrics to Weights & Biases. |
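The batch-size parameters above combine multiplicatively: each optimizer step consumes per_device_train_batch_size × gradient_accumulation_steps × (number of GPUs) sequences. A minimal sketch of that arithmetic (the helper functions, the 8-GPU node, and the 64,000-example dataset size are illustrative assumptions, not part of Args):

```python
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int) -> int:
    """Sequences consumed per optimizer step across all devices."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_gpus

def optimizer_steps_per_epoch(dataset_size: int, global_batch: int) -> int:
    """Optimizer updates in one pass over the dataset (remainder dropped)."""
    return dataset_size // global_batch

# Defaults from the table above, on a hypothetical 8-GPU node:
global_batch = effective_batch_size(1, 8, 8)             # 64 sequences per step
steps = optimizer_steps_per_epoch(64_000, global_batch)  # 1000 steps per epoch
```

Keeping per_device_train_batch_size small and scaling gradient_accumulation_steps trades wall-clock speed for lower peak memory while leaving the effective batch size, and hence the optimization dynamics, unchanged.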

Outputs

| Name | Type | Description |
| --- | --- | --- |
| Trained model | Saved to disk | The trained reward model is saved to args.output_dir using save_with_accelerate. Includes model weights, tokenizer, and configuration. |
| Metrics | Logged to W&B/TensorBoard | Training metrics (accuracy, loss, chosen/rejected scores, reward margin, learning rate) are logged at each optimizer step. Evaluation metrics are logged at configured intervals. |
| HuggingFace Hub upload | Optional | If args.push_to_hub is True, the trained model is uploaded to the specified HuggingFace Hub repository. |

Usage Examples

Command-Line Usage

python open_instruct/reward_modeling.py \
    --dataset_mixer_list allenai/tulu-3-wildchat-reused-on-policy-8b 1.0 \
    --model_name_or_path allenai/tulu-2-7b \
    --num_labels 1 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 2e-5 \
    --num_train_epochs 1 \
    --lr_scheduler_type linear \
    --output_dir output/reward_model \
    --with_tracking

Programmatic Usage

from open_instruct.reward_modeling import main, Args
from open_instruct.dataset_transformation import TokenizerConfig
from open_instruct.model_utils import ModelConfig

args = Args(
    dataset_mixer_list=["allenai/tulu-3-wildchat-reused-on-policy-8b", "1.0"],
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=1,
    output_dir="output/reward_model",
)
tc = TokenizerConfig(tokenizer_name_or_path="allenai/tulu-2-7b")
model_config = ModelConfig(model_name_or_path="allenai/tulu-2-7b")

main(args, tc, model_config)

Training Loop Detail

The core training loop performs the following operations at each step:

# Concatenate chosen and rejected into a single batch for efficient forward pass
query_responses = torch.cat(
    (data[CHOSEN_INPUT_IDS_KEY], data[REJECTED_INPUT_IDS_KEY]), dim=0
)

# Forward pass: get scalar rewards for all sequences
_, predicted_reward, _ = get_reward(
    model, query_responses, tokenizer.pad_token_id, 0
)

# Split rewards back into chosen and rejected
chosen_reward = predicted_reward[:data[CHOSEN_INPUT_IDS_KEY].shape[0]]
rejected_reward = predicted_reward[data[CHOSEN_INPUT_IDS_KEY].shape[0]:]

# Compute Bradley-Terry loss
accuracy = (chosen_reward > rejected_reward).float().mean()
loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Backward pass and optimizer step
accelerator.backward(loss)
optimizer.step()
optimizer.zero_grad()
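The Bradley-Terry objective above depends only on the reward margin between the chosen and rejected responses. A framework-free sketch of the same loss and accuracy computation, using plain Python lists as a stand-in for the tensor operations (this is an illustration, not the Open Instruct implementation):

```python
import math

def bradley_terry_loss(chosen_rewards, rejected_rewards):
    """Mean of -log(sigmoid(chosen - rejected)) over preference pairs,
    i.e. the negative log-likelihood that chosen beats rejected."""
    losses = [math.log1p(math.exp(-(c - r)))  # softplus(-margin) == -log(sigmoid(margin))
              for c, r in zip(chosen_rewards, rejected_rewards)]
    return sum(losses) / len(losses)

def preference_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs where the chosen response scores higher."""
    wins = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return wins / len(chosen_rewards)

# Toy scalar rewards for three preference pairs:
chosen = [1.0, 2.5, -0.3]
rejected = [0.0, 3.0, -1.0]
loss = bradley_terry_loss(chosen, rejected)   # shrinks as positive margins grow
acc = preference_accuracy(chosen, rejected)   # 2 of 3 pairs ranked correctly
```

Note that the loss saturates: once a margin is large and positive, further widening it yields a vanishing gradient, which is why reward margin and accuracy are logged alongside the loss during training.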

Dependencies

| Package | Module | Purpose |
| --- | --- | --- |
| accelerate | Accelerator | Distributed training orchestration (multi-GPU, multi-node) |
| deepspeed | deepspeed | ZeRO optimizer state sharding for memory-efficient training |
| transformers | AutoModelForSequenceClassification | Reward model architecture (transformer + score head) |
| transformers | get_scheduler | Learning rate scheduling (linear, cosine, etc.) |
| torch | torch.nn.functional | Loss computation (F.logsigmoid) |
| torch | torch.optim | AdamW optimizer |
| wandb | wandb | Experiment tracking and metric logging |
| open_instruct | dataset_transformation | Dataset loading, tokenization, and collation |
| open_instruct | model_utils | Reward extraction (get_reward), model saving, dropout disabling |
| open_instruct | reward_modeling_eval | Periodic evaluation via the evaluate function |
