Implementation: axolotl-ai-cloud/axolotl HFRLTrainerBuilder build
| Knowledge Sources | |
|---|---|
| Domains | Alignment, Training |
| Last Updated | 2026-02-06 23:00 GMT |
Overview
Concrete tool for building and configuring the DPO alignment trainer instance provided by the Axolotl framework.
Description
The HFRLTrainerBuilder class assembles all components for DPO training. The build() method selects the appropriate strategy class for the configured RL method (e.g. DPOStrategy for the DPO family, KTOStrategy for KTO), configures DPO-specific training arguments (loss type, label smoothing, max length, generation during eval), and returns an AxolotlDPOTrainer instance ready for training.
AxolotlDPOTrainer extends TRL's DPOTrainer with Axolotl-specific mixins for custom optimizers, schedulers, RNG state management, and distributed save handling.
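The mixin layering described above relies on Python's cooperative method resolution order: each mixin overrides one concern and defers to the next class in the MRO via super(). A minimal, self-contained sketch of that pattern (BaseTrainer and SketchTrainer are stand-ins, not Axolotl classes):

```python
# Sketch of cooperative mixin composition, as used by AxolotlDPOTrainer.
# BaseTrainer stands in for TRL's DPOTrainer; names here are illustrative.

class BaseTrainer:
    def create_optimizer(self):
        return "base-optimizer"

class OptimizerMixin:
    def create_optimizer(self):
        # Customize optimizer creation, then defer to the next class in the MRO
        opt = super().create_optimizer()
        return f"custom({opt})"

class SketchTrainer(OptimizerMixin, BaseTrainer):
    """Mixins listed before the base class so their overrides win in the MRO."""

print(SketchTrainer().create_optimizer())  # custom(base-optimizer)
```

Listing the mixins before the base trainer in the class definition is what lets their overrides take precedence while still reusing the base implementation.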
DPOStrategy is a helper class that:
- Returns the correct trainer class (AxolotlDPOTrainer)
- Returns the correct training args class (AxolotlDPOConfig)
- Sets DPO-specific training argument kwargs (loss_type, label_smoothing, etc.)
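The three strategy responsibilities above can be sketched with stand-in classes. This is a hedged illustration of the pattern, not Axolotl's actual code; DummyTrainer, DummyConfig, and the specific cfg keys handled are assumptions:

```python
# Illustrative sketch of the strategy contract: return the trainer class,
# the training-args class, and method-specific kwargs derived from cfg.

class DummyTrainer:
    """Stand-in for AxolotlDPOTrainer."""

class DummyConfig:
    """Stand-in for AxolotlDPOConfig."""

class DPOStrategySketch:
    @classmethod
    def get_trainer_class(cls):
        return DummyTrainer

    @classmethod
    def get_training_args_class(cls):
        return DummyConfig

    @classmethod
    def set_training_args_kwargs(cls, cfg):
        # Translate config fields into DPO-specific trainer kwargs
        kwargs = {}
        if cfg.get("rl") == "ipo":
            kwargs["loss_type"] = "ipo"  # hypothetical mapping for illustration
        if cfg.get("dpo_label_smoothing") is not None:
            kwargs["label_smoothing"] = cfg["dpo_label_smoothing"]
        return kwargs

print(DPOStrategySketch.set_training_args_kwargs(
    {"rl": "ipo", "dpo_label_smoothing": 0.1}
))  # {'loss_type': 'ipo', 'label_smoothing': 0.1}
```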
Usage
This implementation is used internally when cfg.rl is set to "dpo", "ipo", or "simpo". The setup_trainer utility function routes to this builder for RL-type training.
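The routing decision can be sketched as a simple dispatch on cfg.rl. The function and the fallback builder name below are illustrative assumptions, not setup_trainer's exact code:

```python
# Hedged sketch of routing cfg.rl values to a trainer builder.

DPO_VARIANTS = {"dpo", "ipo", "simpo"}

def select_builder_name(cfg: dict) -> str:
    rl = cfg.get("rl")
    if rl in DPO_VARIANTS:
        return "HFRLTrainerBuilder"  # DPO-family path described above
    if rl in {"kto", "grpo"}:
        return "HFRLTrainerBuilder"  # also handled by the RL builder
    return "HFCausalTrainerBuilder"  # assumed default SFT path

print(select_builder_name({"rl": "dpo"}))  # HFRLTrainerBuilder
```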
Code Reference
Source Location
- Repository: axolotl
- File: src/axolotl/core/builders/rl.py (builder), src/axolotl/core/trainers/dpo/trainer.py (trainer), src/axolotl/core/trainers/dpo/__init__.py (strategy)
- Lines: rl.py L25-242 (class), L195-242 (build); dpo/trainer.py (AxolotlDPOTrainer); dpo/__init__.py L7-41 (DPOStrategy)
Signature
class HFRLTrainerBuilder:
    """Builder for reinforcement learning trainers (DPO, KTO, GRPO)."""

    def __init__(self, cfg, model, tokenizer, processor=None):
        """
        Args:
            cfg: Training configuration.
            model: Policy model to train.
            tokenizer: Tokenizer.
            processor: Optional multimodal processor.
        """

    def build(self, total_num_steps: int) -> AxolotlDPOTrainer:
        """Build a configured DPO trainer instance.

        Args:
            total_num_steps: Total training steps for scheduler.

        Returns:
            AxolotlDPOTrainer: Configured trainer ready for .train().
        """

class DPOStrategy:
    """Strategy for DPO training argument configuration."""

    @classmethod
    def get_trainer_class(cls):
        return AxolotlDPOTrainer

    @classmethod
    def set_training_args_kwargs(cls, cfg):
        """Return DPO-specific training argument overrides."""

class AxolotlDPOTrainer(
    RngLoaderMixin, SchedulerMixin, OptimizerMixin,
    OptimizerInitMixin, DPOTrainer, DistributedParallelMixin,
):
    """Extended TRL DPOTrainer with Axolotl-specific features."""
Import
from axolotl.core.builders.rl import HFRLTrainerBuilder
from axolotl.core.trainers.dpo import DPOStrategy
from axolotl.core.trainers.dpo.trainer import AxolotlDPOTrainer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| cfg | DictDefault | Yes | Config with rl type, dpo_label_smoothing, dpo_generate_during_eval, sequence_len, etc. |
| model | PreTrainedModel or PeftModel | Yes | Policy model to train |
| tokenizer | PreTrainedTokenizer | Yes | Tokenizer for data processing |
| total_num_steps | int | Yes (for build()) | Total training steps for scheduler |
| model_ref | PreTrainedModel or None | No | Reference model (None for LoRA auto-unwrap or ORPO) |
| peft_config | PeftConfig or None | No | PEFT config for LoRA training |
Outputs
| Name | Type | Description |
|---|---|---|
| trainer | AxolotlDPOTrainer | Configured DPO trainer instance |
| train() returns | TrainOutput | Training metrics including DPO-specific losses (chosen/rejected rewards) |
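The DPO-specific losses mentioned above surface as reward metrics in the trainer logs. The snippet below shows the general shape; key names follow TRL's DPOTrainer logging convention, but exact keys and values are illustrative and may vary by TRL version:

```python
# Illustrative DPO metric dict (values fabricated for the example).
metrics = {
    "loss": 0.52,
    "rewards/chosen": 0.31,     # implicit reward of preferred responses
    "rewards/rejected": -0.18,  # implicit reward of dispreferred responses
    "rewards/margins": 0.49,    # chosen minus rejected
}

# The margin is the gap the DPO objective pushes apart during training
margin = metrics["rewards/chosen"] - metrics["rewards/rejected"]
print(round(margin, 2))  # 0.49
```

A growing rewards/margins over training is the usual signal that the policy is learning the preference ordering.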
Usage Examples
Building DPO Trainer
from axolotl.core.builders.rl import HFRLTrainerBuilder
# cfg.rl = "dpo"
builder = HFRLTrainerBuilder(cfg, model, tokenizer)
trainer = builder.build(total_num_steps=1000)
# Attach train/eval datasets
trainer.train_dataset = train_dataset
trainer.eval_dataset = eval_dataset
# Execute DPO training
result = trainer.train()
High-Level DPO Training
from axolotl.train import train
# cfg.rl = "dpo" triggers DPO path automatically
model, tokenizer, trainer = train(cfg=cfg, dataset_meta=dataset_meta)