Implementation: Allenai Open Instruct DPO Tune Cache Main
| Component | Details |
|---|---|
| Type | Function (entry point) |
| Source | open_instruct/dpo_tune_cache.py (Lines 115-761) |
| Repository | Open Instruct |
| Dependencies | accelerate, deepspeed, transformers, torch, wandb, peft, huggingface_hub, datasets |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for running the complete DPO training pipeline -- from initialization through training to model saving -- provided by the Open Instruct library.
Description
main(args, tc) is the primary entry point for DPO training in Open Instruct. It orchestrates the entire training workflow:
Accelerator Setup:
- Initializes HuggingFace Accelerate with optional DeepSpeed ZeRO (stages 0-3), configurable via zero_stage, offload_optimizer, and offload_param.
- Supports gradient accumulation with optional per-batch synchronization (sync_each_batch).
- Configures W&B tracking when with_tracking=True.
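The three ZeRO-related flags ultimately populate DeepSpeed's zero_optimization config block. A minimal pure-Python sketch of that mapping (field names follow DeepSpeed's documented zero_optimization schema; the helper function itself is hypothetical, not part of Open Instruct):

```python
def build_zero_config(zero_stage: int, offload_optimizer: bool, offload_param: bool) -> dict:
    """Map the three flags onto DeepSpeed's zero_optimization config block."""
    zero = {"stage": zero_stage}
    if offload_optimizer:
        # Page optimizer states out to CPU memory (ZeRO-Offload).
        zero["offload_optimizer"] = {"device": "cpu"}
    if offload_param:
        # Page parameters to CPU as well (meaningful only at ZeRO stage 3).
        zero["offload_param"] = {"device": "cpu"}
    return {"zero_optimization": zero}

cfg = build_zero_config(zero_stage=2, offload_optimizer=True, offload_param=False)
```

Stages 0-2 shard optimizer state and gradients; stage 3 additionally shards the parameters themselves, which is why offload_param only applies there.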
Model and Data:
- Loads the pre-trained causal language model with optional flash attention, QLoRA quantization, and LoRA adapters.
- Prepares the tokenized preference dataset using the dataset transformation pipeline.
- Creates a DataLoader with either standard padding collation or padding-free packing.
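The difference between the two collation modes can be sketched in pure Python (illustrative only, not the library's actual collators): standard collation pads every sequence to the batch maximum, while padding-free packing concatenates sequences into one flat stream and tracks boundaries instead.

```python
def pad_collate(seqs, pad_id=0):
    """Standard collation: pad each sequence to the longest in the batch."""
    width = max(len(s) for s in seqs)
    return [s + [pad_id] * (width - len(s)) for s in seqs]

def packed_collate(seqs):
    """Padding-free packing: one flat token stream plus cumulative lengths,
    so attention can be restricted per-sequence without any pad tokens."""
    flat, cu_lens = [], [0]
    for s in seqs:
        flat.extend(s)
        cu_lens.append(cu_lens[-1] + len(s))
    return flat, cu_lens

batch = [[1, 2, 3], [4, 5], [6]]
padded = pad_collate(batch)       # -> [[1, 2, 3], [4, 5, 0], [6, 0, 0]]
flat, cu = packed_collate(batch)  # -> ([1, 2, 3, 4, 5, 6], [0, 3, 5, 6])
```

Packing avoids wasting compute on pad tokens, which is why it pairs well with flash attention kernels that accept cumulative-length tensors.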
Optimization:
- Uses AdamW optimizer with optional 8-bit or paged variants (bitsandbytes).
- Supports configurable learning rate schedulers (linear, cosine, etc.) with warmup.
- Implements gradient clipping via max_grad_norm.
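The warmup-plus-decay schedule can be sketched in a few lines of pure Python (the script itself obtains its scheduler from the transformers library; this sketch just illustrates the shape):

```python
import math

def lr_at(step, total_steps, base_lr, warmup_steps, kind="linear"):
    """Learning rate with linear warmup, then linear or cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup period.
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if kind == "cosine":
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    return base_lr * (1.0 - progress)  # linear decay to zero

assert lr_at(0, 100, 5e-7, 10) == 0.0      # start of warmup
assert lr_at(10, 100, 5e-7, 10) == 5e-7    # peak at end of warmup
```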
Training Loop:
- Caches reference model logprobs before training if the loss type requires them.
- For each batch: runs the forward function, computes the DPO loss variant, backpropagates, and updates the model.
- Tracks and logs metrics including loss, learning rate, implicit rewards, reward accuracy, reward margin, tokens per second, and MFU.
- Supports checkpoint save/resume at configurable step or epoch intervals.
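The per-pair loss for the standard "dpo" loss type, and the implicit rewards logged as metrics, reduce to a few lines. A pure-Python sketch of the published DPO objective (not the library's exact implementation; inputs are summed per-response log-probabilities):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    The reference logprobs (ref_*) are cached once before training starts,
    which is what makes the reference model free during the main loop.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    reward_chosen = beta * (pi_chosen - ref_chosen)
    reward_rejected = beta * (pi_rejected - ref_rejected)
    margin = reward_chosen - reward_rejected  # logged as "reward margin"
    loss = math.log(1.0 + math.exp(-margin))  # -logsigmoid(margin)
    return loss, reward_chosen, reward_rejected

loss, rc, rr = dpo_loss(-10.0, -14.0, -11.0, -13.0, beta=0.1)
# Reward accuracy counts the fraction of pairs where rc > rr.
```

The loss shrinks as the policy increases the chosen response's likelihood relative to the reference faster than the rejected one's, with beta controlling how strongly deviations from the reference are penalized.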
Post-Training:
- Saves the final model and tokenizer.
- Optionally pushes to HuggingFace Hub with metadata.
- Can launch downstream evaluation jobs on Beaker.
Usage
Import and call main() with an ExperimentConfig and TokenizerConfig to run a full DPO training job. Typically invoked via command-line argument parsing in the module's __main__ block.
Code Reference
Source Location
- Repository: Open Instruct
- File: open_instruct/dpo_tune_cache.py (Lines 115-761)
Signature
def main(args: dpo_utils.ExperimentConfig, tc: TokenizerConfig):
"""Run the full DPO training pipeline.
Args:
args: Complete experiment configuration including model, training,
DPO, dataset, LoRA, logging, hub, checkpoint, and eval settings.
tc: Tokenizer configuration for loading and configuring the tokenizer.
"""
...
Import
from open_instruct.dpo_tune_cache import main
I/O Contract
Inputs
| Parameter | Type | Description |
|---|---|---|
| args | dpo_utils.ExperimentConfig | Full experiment configuration dataclass. Key fields include model, training, DPO, dataset, LoRA, logging, hub, checkpoint, and eval settings. |
| tc | TokenizerConfig | Tokenizer configuration including tokenizer name/path and revision. |
Outputs
| Output | Description |
|---|---|
| Saved model | Final model weights and tokenizer saved to args.output_dir. |
| HuggingFace Hub | If push_to_hub=True, model pushed to the configured HF repository. |
| W&B logs | If with_tracking=True, training metrics logged to Weights & Biases. |
| Checkpoints | Intermediate checkpoints saved at configured intervals in args.output_dir. |
Usage Examples
from open_instruct.dpo_tune_cache import main
from open_instruct.dpo_utils import ExperimentConfig
from open_instruct.dataset_transformation import TokenizerConfig
# Configure the experiment
args = ExperimentConfig(
model_name_or_path="allenai/tulu-2-7b",
loss_type="dpo",
beta=0.1,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=5e-7,
num_epochs=1,
max_grad_norm=1.0,
output_dir="output/dpo_experiment",
)
tc = TokenizerConfig()
# Run training
main(args, tc)
Alternatively, launch from the command line:
accelerate launch --config_file configs/ds_configs/deepspeed_zero2.yaml \
open_instruct/dpo_tune_cache.py \
--model_name_or_path allenai/tulu-2-7b \
--loss_type dpo \
--beta 0.1 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--learning_rate 5e-7