Implementation:Allenai Open instruct DPO Tune Cache Main

From Leeroopedia


Component Type: Function (entry point)
Source: open_instruct/dpo_tune_cache.py (Lines 115-761)
Repository: Open Instruct
Dependencies: accelerate, deepspeed, transformers, torch, wandb, peft, huggingface_hub, datasets
Last Updated: 2026-02-07 00:00 GMT

Overview

Concrete tool for running the complete DPO training pipeline -- from initialization through training to model saving -- provided by the Open Instruct library.

Description

main(args, tc) is the primary entry point for DPO training in Open Instruct. It orchestrates the entire training workflow:

Accelerator Setup:

  • Initializes HuggingFace Accelerate with optional DeepSpeed ZeRO (stages 0-3), configurable via zero_stage, offload_optimizer, and offload_param.
  • Supports gradient accumulation with optional per-batch synchronization (sync_each_batch).
  • Configures W&B tracking when with_tracking=True.
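The setup above can be sketched with HuggingFace Accelerate's public API. This is an illustrative configuration fragment, not the actual wiring inside dpo_tune_cache.py; the field names mirror the arguments described above (zero_stage, offload_optimizer, offload_param, with_tracking).

```python
# Sketch: configuring Accelerate with a DeepSpeed ZeRO plugin and W&B tracking.
# Illustrative only -- the real script assembles this from ExperimentConfig.
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,                    # ZeRO stages 0-3 are supported
    offload_optimizer_device="cpu",  # corresponds to offload_optimizer
    offload_param_device="none",     # corresponds to offload_param
)

accelerator = Accelerator(
    deepspeed_plugin=deepspeed_plugin,
    gradient_accumulation_steps=4,
    log_with="wandb",                # only set when with_tracking=True
)
```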

Model and Data:

  • Loads the pre-trained causal language model with optional flash attention, QLoRA quantization, and LoRA adapters.
  • Prepares the tokenized preference dataset using the dataset transformation pipeline.
  • Creates a DataLoader with either standard padding collation or padding-free packing.
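The two collation strategies differ in how variable-length examples become a batch. As an illustration of the standard padding path (not the library's actual collator), a minimal right-padding collator in plain Python:

```python
# Illustrative right-padding collator (not Open Instruct's actual code):
# pads each sequence of token ids to the batch maximum and builds the
# matching attention mask. Padding-free packing instead concatenates
# sequences into one row so that no pad tokens are computed at all.
def pad_collate(batch, pad_token_id=0):
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_collate([[5, 6, 7], [8, 9]], pad_token_id=0)
# batch["input_ids"] == [[5, 6, 7], [8, 9, 0]]
```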

Optimization:

  • Uses AdamW optimizer with optional 8-bit or paged variants (bitsandbytes).
  • Supports configurable learning rate schedulers (linear, cosine, etc.) with warmup.
  • Implements gradient clipping via max_grad_norm.
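For intuition, the linear-with-warmup schedule can be written in a few lines. This is an illustrative formula, not the transformers scheduler the script actually selects via its scheduler argument:

```python
def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then linear decay to zero.

    Illustrative only; the real run uses a scheduler object from the
    transformers library chosen by the configured scheduler type.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

half = linear_warmup_lr(50, 5e-7, 100, 1000)  # halfway through warmup: base_lr / 2
```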

Training Loop:

  • Caches reference model logprobs before training if the loss type requires them.
  • For each batch: runs the forward function, computes the DPO loss variant, backpropagates, and updates the model.
  • Tracks and logs metrics including loss, learning rate, implicit rewards, reward accuracy, reward margin, tokens per second, and MFU.
  • Supports checkpoint save/resume at configurable step or epoch intervals.
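The cached reference logprobs feed directly into the loss. A minimal pure-Python sketch of the standard dpo loss variant and the implicit-reward statistics it logs (the real implementation operates on batched tensors; beta follows the config described below):

```python
import math

def dpo_loss_and_rewards(policy_chosen_lp, policy_rejected_lp,
                         ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard DPO loss for one preference pair, plus implicit rewards.

    Pure-Python sketch for a single pair; not Open Instruct's actual code.
    """
    chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp)
    margin = chosen_reward - rejected_reward
    # loss = -log sigmoid(margin), in a numerically stable form
    if margin > 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return loss, chosen_reward, rejected_reward

loss, cr, rr = dpo_loss_and_rewards(-12.0, -20.0, -14.0, -18.0, beta=0.1)
# chosen_reward = 0.1 * 2.0 = 0.2; rejected_reward = 0.1 * (-2.0) = -0.2
# margin = 0.4, so loss = -log sigmoid(0.4)
```

Over a batch, the logged reward accuracy is the fraction of pairs with chosen_reward > rejected_reward, and the reward margin is the mean of chosen_reward - rejected_reward.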

Post-Training:

  • Saves the final model and tokenizer.
  • Optionally pushes to HuggingFace Hub with metadata.
  • Can launch downstream evaluation jobs on Beaker.

Usage

Import and call main() with an ExperimentConfig and TokenizerConfig to run a full DPO training job. Typically invoked via command-line argument parsing in the module's __main__ block.

Code Reference

Source Location

  • Repository: Open Instruct
  • File: open_instruct/dpo_tune_cache.py (Lines 115-761)

Signature

def main(args: dpo_utils.ExperimentConfig, tc: TokenizerConfig):
    """Run the full DPO training pipeline.

    Args:
        args: Complete experiment configuration including model, training,
              DPO, dataset, LoRA, logging, hub, checkpoint, and eval settings.
        tc: Tokenizer configuration for loading and configuring the tokenizer.
    """
    ...

Import

from open_instruct.dpo_tune_cache import main

I/O Contract

Inputs

args (dpo_utils.ExperimentConfig): Full experiment configuration dataclass. Key fields include:
  • loss_type: DPO loss variant (dpo, dpo_norm, simpo, wpo)
  • beta: Temperature parameter for the DPO loss
  • per_device_train_batch_size: Batch size per GPU
  • gradient_accumulation_steps: Steps to accumulate before each update
  • learning_rate: Initial learning rate for AdamW
  • max_grad_norm: Gradient clipping threshold (-1 to disable)
  • num_epochs: Number of training epochs
  • model_name_or_path: Pre-trained model identifier

tc (TokenizerConfig): Tokenizer configuration including tokenizer name/path and revision.

Outputs

Saved model: Final model weights and tokenizer saved to args.output_dir.
HuggingFace Hub: If push_to_hub=True, the model is pushed to the configured HF repository.
W&B logs: If with_tracking=True, training metrics are logged to Weights & Biases.
Checkpoints: Intermediate checkpoints saved at configured intervals in args.output_dir.

Usage Examples

from open_instruct.dpo_tune_cache import main
from open_instruct.dpo_utils import ExperimentConfig
from open_instruct.dataset_transformation import TokenizerConfig

# Configure the experiment
args = ExperimentConfig(
    model_name_or_path="allenai/tulu-2-7b",
    loss_type="dpo",
    beta=0.1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,
    num_epochs=1,
    max_grad_norm=1.0,
    output_dir="output/dpo_experiment",
)
tc = TokenizerConfig()

# Run training
main(args, tc)

Alternatively, launch from the command line:

accelerate launch --config_file configs/ds_configs/deepspeed_zero2.yaml \
    open_instruct/dpo_tune_cache.py \
    --model_name_or_path allenai/tulu-2-7b \
    --loss_type dpo \
    --beta 0.1 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-7

Related Pages

Implements Principle
