Implementation: Allenai Open Instruct DPO Tune Cache Main
| Component | Details |
|---|---|
| Type | Function (entry point) |
| Source | open_instruct/dpo_tune_cache.py (Lines 115-761) |
| Repository | Open Instruct |
| Dependencies | accelerate, deepspeed, transformers, torch, wandb, peft, huggingface_hub, datasets |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for running the complete DPO training pipeline -- from initialization through training to model saving -- provided by the Open Instruct library.
Description
main(args, tc) is the primary entry point for DPO training in Open Instruct. It orchestrates the entire training workflow:
Accelerator Setup:
- Initializes HuggingFace Accelerate with optional DeepSpeed ZeRO (stages 0-3), configurable via zero_stage, offload_optimizer, and offload_param.
- Supports gradient accumulation with optional per-batch synchronization (sync_each_batch).
- Configures W&B tracking when with_tracking=True.
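The three ZeRO-related flags ultimately populate DeepSpeed's zero_optimization config block. A minimal pure-Python sketch of that mapping (field names follow DeepSpeed's documented zero_optimization schema; the helper function itself is hypothetical, not part of Open Instruct):

```python
def build_zero_config(zero_stage: int, offload_optimizer: bool, offload_param: bool) -> dict:
    """Map the three flags onto DeepSpeed's zero_optimization config block."""
    zero = {"stage": zero_stage}
    if offload_optimizer:
        # Page optimizer states out to CPU memory (ZeRO-Offload).
        zero["offload_optimizer"] = {"device": "cpu"}
    if offload_param:
        # Page parameters to CPU as well (meaningful only at ZeRO stage 3).
        zero["offload_param"] = {"device": "cpu"}
    return {"zero_optimization": zero}

cfg = build_zero_config(zero_stage=2, offload_optimizer=True, offload_param=False)
```

Stages 0-2 shard optimizer state and gradients; stage 3 additionally shards the parameters themselves, which is why offload_param only applies there.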
Model and Data:
- Loads the pre-trained causal language model with optional flash attention, QLoRA quantization, and LoRA adapters.
- Prepares the tokenized preference dataset using the dataset transformation pipeline.
- Creates a DataLoader with either standard padding collation or padding-free packing.
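The difference between the two collation modes can be sketched in pure Python (illustrative only, not the library's actual collators): standard collation pads every sequence to the batch maximum, while padding-free packing concatenates sequences into one flat stream and tracks boundaries instead.

```python
def pad_collate(seqs, pad_id=0):
    """Standard collation: pad each sequence to the longest in the batch."""
    width = max(len(s) for s in seqs)
    return [s + [pad_id] * (width - len(s)) for s in seqs]

def packed_collate(seqs):
    """Padding-free packing: one flat token stream plus cumulative lengths,
    so attention can be restricted per-sequence without any pad tokens."""
    flat, cu_lens = [], [0]
    for s in seqs:
        flat.extend(s)
        cu_lens.append(cu_lens[-1] + len(s))
    return flat, cu_lens

batch = [[1, 2, 3], [4, 5], [6]]
padded = pad_collate(batch)       # -> [[1, 2, 3], [4, 5, 0], [6, 0, 0]]
flat, cu = packed_collate(batch)  # -> ([1, 2, 3, 4, 5, 6], [0, 3, 5, 6])
```

Packing avoids wasting compute on pad tokens, which is why it pairs well with flash attention kernels that accept cumulative-length tensors.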
Optimization:
- Uses AdamW optimizer with optional 8-bit or paged variants (bitsandbytes).
- Supports configurable learning rate schedulers (linear, cosine, etc.) with warmup.
- Implements gradient clipping via max_grad_norm.
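The warmup-plus-decay schedule can be sketched in a few lines of pure Python (the script itself obtains its scheduler from the transformers library; this sketch just illustrates the shape):

```python
import math

def lr_at(step, total_steps, base_lr, warmup_steps, kind="linear"):
    """Learning rate with linear warmup, then linear or cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup period.
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if kind == "cosine":
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    return base_lr * (1.0 - progress)  # linear decay to zero

assert lr_at(0, 100, 5e-7, 10) == 0.0      # start of warmup
assert lr_at(10, 100, 5e-7, 10) == 5e-7    # peak at end of warmup
```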
Training Loop:
- Caches reference model logprobs before training if the loss type requires them.
- For each batch: runs the forward function, computes the DPO loss variant, backpropagates, and updates the model.
- Tracks and logs metrics including loss, learning rate, implicit rewards, reward accuracy, reward margin, tokens per second, and MFU.
- Supports checkpoint save/resume at configurable step or epoch intervals.
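The per-pair loss for the standard "dpo" loss type, and the implicit rewards logged as metrics, reduce to a few lines. A pure-Python sketch of the published DPO objective (not the library's exact implementation; inputs are summed per-response log-probabilities):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    The reference logprobs (ref_*) are cached once before training starts,
    which is what makes the reference model free during the main loop.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    reward_chosen = beta * (pi_chosen - ref_chosen)
    reward_rejected = beta * (pi_rejected - ref_rejected)
    margin = reward_chosen - reward_rejected  # logged as "reward margin"
    loss = math.log(1.0 + math.exp(-margin))  # -logsigmoid(margin)
    return loss, reward_chosen, reward_rejected

loss, rc, rr = dpo_loss(-10.0, -14.0, -11.0, -13.0, beta=0.1)
# Reward accuracy counts the fraction of pairs where rc > rr.
```

The loss shrinks as the policy increases the chosen response's likelihood relative to the reference faster than the rejected one's, with beta controlling how strongly deviations from the reference are penalized.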
Post-Training:
- Saves the final model and tokenizer.
- Optionally pushes to HuggingFace Hub with metadata.
- Can launch downstream evaluation jobs on Beaker.
Usage
Import and call main() with an ExperimentConfig and TokenizerConfig to run a full DPO training job. Typically invoked via command-line argument parsing in the module's __main__ block.
Code Reference
Source Location
- Repository: Open Instruct
- File: open_instruct/dpo_tune_cache.py (Lines 115-761)
Signature
def main(args: dpo_utils.ExperimentConfig, tc: TokenizerConfig):
"""Run the full DPO training pipeline.
Args:
args: Complete experiment configuration including model, training,
DPO, dataset, LoRA, logging, hub, checkpoint, and eval settings.
tc: Tokenizer configuration for loading and configuring the tokenizer.
"""
...
Import
from open_instruct.dpo_tune_cache import main
I/O Contract
Inputs
| Parameter | Type | Description |
|---|---|---|
| args | dpo_utils.ExperimentConfig | Full experiment configuration dataclass. Key fields include model, training, DPO, dataset, LoRA, logging, hub, checkpoint, and eval settings. |
| tc | TokenizerConfig | Tokenizer configuration including tokenizer name/path and revision. |
Outputs
| Output | Description |
|---|---|
| Saved model | Final model weights and tokenizer saved to args.output_dir. |
| HuggingFace Hub | If push_to_hub=True, model pushed to the configured HF repository. |
| W&B logs | If with_tracking=True, training metrics logged to Weights & Biases. |
| Checkpoints | Intermediate checkpoints saved at configured intervals in args.output_dir. |
Usage Examples
from open_instruct.dpo_tune_cache import main
from open_instruct.dpo_utils import ExperimentConfig
from open_instruct.dataset_transformation import TokenizerConfig
# Configure the experiment
args = ExperimentConfig(
model_name_or_path="allenai/tulu-2-7b",
loss_type="dpo",
beta=0.1,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=5e-7,
num_epochs=1,
max_grad_norm=1.0,
output_dir="output/dpo_experiment",
)
tc = TokenizerConfig()
# Run training
main(args, tc)
Alternatively, launch from the command line:
accelerate launch --config_file configs/ds_configs/deepspeed_zero2.yaml \
open_instruct/dpo_tune_cache.py \
--model_name_or_path allenai/tulu-2-7b \
--loss_type dpo \
--beta 0.1 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--learning_rate 5e-7