

Implementation:Axolotl ai cloud Axolotl HFRLTrainerBuilder Build



Knowledge Sources
Domains: Alignment, Training
Last Updated: 2026-02-06 23:00 GMT

Overview

A concrete builder for constructing and configuring the DPO alignment trainer instance provided by the Axolotl framework.

Description

The HFRLTrainerBuilder class assembles all components for DPO training. The build() method selects the appropriate strategy (DPO, KTO, GRPO) via DPOStrategy or KTOStrategy, configures DPO-specific training arguments (loss type, label smoothing, max length, generation during eval), and returns an AxolotlDPOTrainer instance ready for training.
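The loss arguments the builder wires up correspond to the standard sigmoid DPO objective. As a minimal sketch (pure Python, illustrative function names; the label-smoothing term follows TRL's conservative/"robust" DPO variant, not Axolotl's exact internals):

```python
import math

def log_sigmoid(x: float) -> float:
    """Plain log(sigmoid(x)); adequate for small illustrative values."""
    return -math.log1p(math.exp(-x))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp,
             beta=0.1, label_smoothing=0.0):
    # Log-ratio of policy vs. reference on chosen minus rejected responses.
    logits = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # label_smoothing > 0 yields the conservative DPO variant, which
    # assumes a small probability that preference labels are flipped.
    return (-(1 - label_smoothing) * log_sigmoid(beta * logits)
            - label_smoothing * log_sigmoid(-beta * logits))
```

When the policy prefers the chosen response more strongly than the reference does, the loss drops below log 2; `beta` scales how sharply deviations from the reference are rewarded or penalized.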

AxolotlDPOTrainer extends TRL's DPOTrainer with Axolotl-specific mixins for custom optimizers, schedulers, RNG state management, and distributed save handling.

DPOStrategy is a helper class that:

  • Returns the correct trainer class (AxolotlDPOTrainer)
  • Returns the correct training args class (AxolotlDPOConfig)
  • Sets DPO-specific training argument kwargs (loss_type, label_smoothing, etc.)
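The three responsibilities above amount to a classic strategy pattern. A self-contained sketch (the stub trainer/config classes and the exact kwargs mapping are illustrative stand-ins, not Axolotl's actual implementation):

```python
# Stand-ins for AxolotlDPOTrainer / AxolotlDPOConfig.
class StubDPOTrainer: ...
class StubDPOConfig: ...

class DPOStrategy:
    @classmethod
    def get_trainer_class(cls):
        return StubDPOTrainer

    @classmethod
    def get_training_args_class(cls):
        return StubDPOConfig

    @classmethod
    def set_training_args_kwargs(cls, cfg):
        # Translate config keys into trainer-argument overrides.
        kwargs = {}
        if cfg.get("dpo_label_smoothing") is not None:
            kwargs["label_smoothing"] = cfg["dpo_label_smoothing"]
        if cfg.get("rl") == "ipo":  # "ipo" is a DPO loss variant
            kwargs["loss_type"] = "ipo"
        return kwargs
```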

Usage

This implementation is used internally when cfg.rl is set to "dpo", "ipo", or "simpo". The setup_trainer utility function routes to this builder for RL-type training.
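The routing can be pictured as a simple dispatch on cfg.rl. A hypothetical sketch (the non-RL builder name and the exact set of routed values are assumptions, not taken from the source):

```python
# Illustrative dispatch mirroring the routing described above.
RL_VALUES = {"dpo", "ipo", "simpo", "kto", "grpo"}  # assumed routed set

def select_builder_name(cfg: dict) -> str:
    rl = cfg.get("rl")
    if rl in RL_VALUES:
        # DPO-family and other RL strategies go through the RL builder.
        return "HFRLTrainerBuilder"
    # Hypothetical name for the ordinary causal-LM training path.
    return "HFCausalTrainerBuilder"
```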

Code Reference

Source Location

  • Repository: axolotl
  • File: src/axolotl/core/builders/rl.py (builder), src/axolotl/core/trainers/dpo/trainer.py (trainer), src/axolotl/core/trainers/dpo/__init__.py (strategy)
  • Lines: rl.py L25-242 (class), L195-242 (build); dpo/trainer.py (AxolotlDPOTrainer); dpo/__init__.py L7-41 (DPOStrategy)

Signature

class HFRLTrainerBuilder:
    """Builder for reinforcement learning trainers (DPO, KTO, GRPO)."""

    def __init__(self, cfg, model, tokenizer, processor=None):
        """
        Args:
            cfg: Training configuration.
            model: Policy model to train.
            tokenizer: Tokenizer.
            processor: Optional multimodal processor.
        """

    def build(self, total_num_steps: int) -> AxolotlDPOTrainer:
        """Build a configured DPO trainer instance.

        Args:
            total_num_steps: Total training steps for scheduler.

        Returns:
            AxolotlDPOTrainer: Configured trainer ready for .train().
        """


class DPOStrategy:
    """Strategy for DPO training argument configuration."""

    @classmethod
    def get_trainer_class(cls):
        return AxolotlDPOTrainer

    @classmethod
    def set_training_args_kwargs(cls, cfg):
        """Return DPO-specific training argument overrides."""


class AxolotlDPOTrainer(
    RngLoaderMixin, SchedulerMixin, OptimizerMixin,
    OptimizerInitMixin, DPOTrainer, DistributedParallelMixin,
):
    """Extended TRL DPOTrainer with Axolotl-specific features."""

Import

from axolotl.core.builders.rl import HFRLTrainerBuilder
from axolotl.core.trainers.dpo import DPOStrategy
from axolotl.core.trainers.dpo.trainer import AxolotlDPOTrainer

I/O Contract

Inputs

  • cfg (DictDefault, required): config with rl type, dpo_label_smoothing, dpo_generate_during_eval, sequence_len, etc.
  • model (PreTrainedModel or PeftModel, required): policy model to train
  • tokenizer (PreTrainedTokenizer, required): tokenizer for data processing
  • total_num_steps (int, required for build()): total training steps for the scheduler
  • model_ref (PreTrainedModel or None, optional): reference model (None for LoRA auto-unwrap or ORPO)
  • peft_config (PeftConfig or None, optional): PEFT config for LoRA training
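A minimal configuration carrying the DPO-related fields listed above might look like the following plain-dict sketch (real Axolotl configs are YAML parsed into a DictDefault; the exact value choices here are assumptions for illustration):

```python
# Illustrative DPO config fragment; keys mirror the inputs above.
cfg = {
    "rl": "dpo",                      # selects the DPO strategy
    "dpo_label_smoothing": 0.1,       # conservative-DPO smoothing
    "dpo_generate_during_eval": False,
    "sequence_len": 2048,             # max prompt+completion length
}
assert cfg["rl"] in {"dpo", "ipo", "simpo"}
```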

Outputs

  • trainer (AxolotlDPOTrainer): configured DPO trainer instance
  • train() return value (TrainOutput): training metrics, including DPO-specific chosen/rejected reward statistics

Usage Examples

Building DPO Trainer

from axolotl.core.builders.rl import HFRLTrainerBuilder

# cfg.rl = "dpo"
builder = HFRLTrainerBuilder(cfg, model, tokenizer)
trainer = builder.build(total_num_steps=1000)

# Attach train/eval datasets
trainer.train_dataset = train_dataset
trainer.eval_dataset = eval_dataset

# Execute DPO training
result = trainer.train()

High-Level DPO Training

from axolotl.train import train

# cfg.rl = "dpo" triggers DPO path automatically
model, tokenizer, trainer = train(cfg=cfg, dataset_meta=dataset_meta)

