
Implementation:Allenai Open instruct Reward Modeling Main

From Leeroopedia


Knowledge Sources
Domains: Reinforcement Learning from Human Feedback, Reward Modeling, Preference Learning, Distributed Training
Last Updated: 2026-02-07 00:00 GMT

Overview

Concrete tool for end-to-end reward model training using the Bradley-Terry preference loss, provided by Open Instruct.

Description

The main() function in reward_modeling.py orchestrates the complete reward model training pipeline. It handles:

  1. Tokenizer and model setup: Loads the tokenizer and initializes the reward model from a pre-trained checkpoint with a single-output score head.
  2. Dataset preparation: Loads and caches preference datasets, creates data loaders with appropriate collation.
  3. Distributed training infrastructure: Configures HuggingFace Accelerate for multi-GPU and multi-node training with gradient accumulation.
  4. Training loop: Iterates over preference pairs, computes Bradley-Terry loss, updates model parameters, and logs metrics.
  5. Evaluation: Periodically evaluates the model on a held-out preference dataset.
  6. Model saving and publishing: Saves the trained model and optionally pushes it to HuggingFace Hub.

The training loop concatenates chosen and rejected sequences into a single batch, performs one forward pass through the reward model, then splits the resulting rewards to compute the preference loss. This is more efficient than running separate forward passes for the chosen and rejected sequences: it halves the number of forward passes and keeps the GPU saturated with a larger batch. Because transformers normalize activations per sequence (layer normalization, not batch normalization), the combined batch produces the same rewards as two separate passes.

Usage

Use this function as the main entry point for reward model training. It is typically invoked via command-line argument parsing but can also be called programmatically by constructing the appropriate dataclass arguments.

Code Reference

Source Location

  • Repository: Open Instruct
  • File: open_instruct/reward_modeling.py, lines 165-419

Signature

def main(args: Args, tc: TokenizerConfig, model_config: ModelConfig) -> None:

Import

from open_instruct.reward_modeling import main, Args
from open_instruct.dataset_transformation import TokenizerConfig
from open_instruct.model_utils import ModelConfig

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| args | Args | Yes | Training configuration dataclass containing hyperparameters and experiment settings. |
| tc | TokenizerConfig | Yes | Tokenizer configuration including tokenizer name/path, chat template, and special token settings. |
| model_config | ModelConfig | Yes | Model configuration including model checkpoint path, revision, attention implementation, and gradient checkpointing settings. |

Key Parameters in Args

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| per_device_train_batch_size | int | 1 | Forward batch size per GPU (micro batch size). |
| gradient_accumulation_steps | int | 8 | Number of micro-batches to accumulate before an optimizer step. |
| learning_rate | float | 2e-5 | Initial learning rate for the AdamW optimizer. |
| num_train_epochs | int | 1 | Number of full passes over the training dataset. |
| lr_scheduler_type | str | "linear" | Learning rate scheduler type. Options: linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup. |
| warm_up_steps | int | 0 | Number of warmup steps for the learning rate scheduler. |
| max_token_length | int | 512 | Maximum token length for sequences in the preference dataset. |
| max_prompt_token_length | int | 256 | Maximum token length for the prompt portion of sequences. |
| num_evals | int | 1 | Number of evaluation runs distributed throughout training. |
| seed | int | 1 | Random seed for reproducibility. |
| with_tracking | bool | False | Whether to log metrics to Weights & Biases. |
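The batch-size parameters above combine multiplicatively: each optimizer step consumes per_device_train_batch_size × gradient_accumulation_steps × (number of GPUs) sequences. A minimal sketch of that arithmetic (the helper functions, the 8-GPU node, and the 64,000-example dataset size are illustrative assumptions, not part of Args):

```python
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int) -> int:
    """Sequences consumed per optimizer step across all devices."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_gpus

def optimizer_steps_per_epoch(dataset_size: int, global_batch: int) -> int:
    """Optimizer updates in one pass over the dataset (remainder dropped)."""
    return dataset_size // global_batch

# Defaults from the table above, on a hypothetical 8-GPU node:
global_batch = effective_batch_size(1, 8, 8)             # 64 sequences per step
steps = optimizer_steps_per_epoch(64_000, global_batch)  # 1000 steps per epoch
```

Keeping per_device_train_batch_size small and scaling gradient_accumulation_steps trades wall-clock speed for lower peak memory while leaving the effective batch size, and hence the optimization dynamics, unchanged.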

Outputs

| Name | Type | Description |
| --- | --- | --- |
| Trained model | Saved to disk | The trained reward model is saved to args.output_dir using save_with_accelerate. Includes model weights, tokenizer, and configuration. |
| Metrics | Logged to W&B/TensorBoard | Training metrics (accuracy, loss, chosen/rejected scores, reward margin, learning rate) are logged at each optimizer step. Evaluation metrics are logged at configured intervals. |
| HuggingFace Hub upload | Optional | If args.push_to_hub is True, the trained model is uploaded to the specified HuggingFace Hub repository. |

Usage Examples

Command-Line Usage

python open_instruct/reward_modeling.py \
    --dataset_mixer_list allenai/tulu-3-wildchat-reused-on-policy-8b 1.0 \
    --model_name_or_path allenai/tulu-2-7b \
    --num_labels 1 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 2e-5 \
    --num_train_epochs 1 \
    --lr_scheduler_type linear \
    --output_dir output/reward_model \
    --with_tracking

Programmatic Usage

from open_instruct.reward_modeling import main, Args
from open_instruct.dataset_transformation import TokenizerConfig
from open_instruct.model_utils import ModelConfig

args = Args(
    dataset_mixer_list=["allenai/tulu-3-wildchat-reused-on-policy-8b", "1.0"],
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=1,
    output_dir="output/reward_model",
)
tc = TokenizerConfig(tokenizer_name_or_path="allenai/tulu-2-7b")
model_config = ModelConfig(model_name_or_path="allenai/tulu-2-7b")

main(args, tc, model_config)

Training Loop Detail

The core training loop performs the following operations at each step:

# Concatenate chosen and rejected into a single batch for efficient forward pass
query_responses = torch.cat(
    (data[CHOSEN_INPUT_IDS_KEY], data[REJECTED_INPUT_IDS_KEY]), dim=0
)

# Forward pass: get scalar rewards for all sequences
_, predicted_reward, _ = get_reward(
    model, query_responses, tokenizer.pad_token_id, 0
)

# Split rewards back into chosen and rejected
chosen_reward = predicted_reward[:data[CHOSEN_INPUT_IDS_KEY].shape[0]]
rejected_reward = predicted_reward[data[CHOSEN_INPUT_IDS_KEY].shape[0]:]

# Compute Bradley-Terry loss
accuracy = (chosen_reward > rejected_reward).float().mean()
loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Backward pass and optimizer step
accelerator.backward(loss)
optimizer.step()
optimizer.zero_grad()
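The Bradley-Terry objective above depends only on the reward margin between the chosen and rejected responses. A framework-free sketch of the same loss and accuracy computation, using plain Python lists as a stand-in for the tensor operations (this is an illustration, not the Open Instruct implementation):

```python
import math

def bradley_terry_loss(chosen_rewards, rejected_rewards):
    """Mean of -log(sigmoid(chosen - rejected)) over preference pairs,
    i.e. the negative log-likelihood that chosen beats rejected."""
    losses = [math.log1p(math.exp(-(c - r)))  # softplus(-margin) == -log(sigmoid(margin))
              for c, r in zip(chosen_rewards, rejected_rewards)]
    return sum(losses) / len(losses)

def preference_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs where the chosen response scores higher."""
    wins = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return wins / len(chosen_rewards)

# Toy scalar rewards for three preference pairs:
chosen = [1.0, 2.5, -0.3]
rejected = [0.0, 3.0, -1.0]
loss = bradley_terry_loss(chosen, rejected)   # shrinks as positive margins grow
acc = preference_accuracy(chosen, rejected)   # 2 of 3 pairs ranked correctly
```

Note that the loss saturates: once a margin is large and positive, further widening it yields a vanishing gradient, which is why reward margin and accuracy are logged alongside the loss during training.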

Dependencies

| Package | Module | Purpose |
| --- | --- | --- |
| accelerate | Accelerator | Distributed training orchestration (multi-GPU, multi-node) |
| deepspeed | deepspeed | ZeRO optimizer state sharding for memory-efficient training |
| transformers | AutoModelForSequenceClassification | Reward model architecture (transformer + score head) |
| transformers | get_scheduler | Learning rate scheduling (linear, cosine, etc.) |
| torch | torch.nn.functional | Loss computation (F.logsigmoid) |
| torch | torch.optim | AdamW optimizer |
| wandb | wandb | Experiment tracking and metric logging |
| open_instruct | dataset_transformation | Dataset loading, tokenization, and collation |
| open_instruct | model_utils | Reward extraction (get_reward), model saving, dropout disabling |
| open_instruct | reward_modeling_eval | Periodic evaluation via the evaluate function |
