

Implementation:Axolotl ai cloud Axolotl HFRLTrainerBuilder Build



Knowledge Sources
Domains: Alignment, Training
Last Updated: 2026-02-06 23:00 GMT

Overview

A concrete builder for constructing and configuring the DPO alignment trainer instance provided by the Axolotl framework.

Description

The HFRLTrainerBuilder class assembles all components for DPO training. The build() method selects the appropriate strategy (DPO, KTO, GRPO) via DPOStrategy or KTOStrategy, configures DPO-specific training arguments (loss type, label smoothing, max length, generation during eval), and returns an AxolotlDPOTrainer instance ready for training.
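The loss arguments the builder wires up correspond to the standard sigmoid DPO objective. As a minimal sketch (pure Python, illustrative function names; the label-smoothing term follows TRL's conservative/"robust" DPO variant, not Axolotl's exact internals):

```python
import math

def log_sigmoid(x: float) -> float:
    """Plain log(sigmoid(x)); adequate for small illustrative values."""
    return -math.log1p(math.exp(-x))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp,
             beta=0.1, label_smoothing=0.0):
    # Log-ratio of policy vs. reference on chosen minus rejected responses.
    logits = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # label_smoothing > 0 yields the conservative DPO variant, which
    # assumes a small probability that preference labels are flipped.
    return (-(1 - label_smoothing) * log_sigmoid(beta * logits)
            - label_smoothing * log_sigmoid(-beta * logits))
```

When the policy prefers the chosen response more strongly than the reference does, the loss drops below log 2; `beta` scales how sharply deviations from the reference are rewarded or penalized.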

AxolotlDPOTrainer extends TRL's DPOTrainer with Axolotl-specific mixins for custom optimizers, schedulers, RNG state management, and distributed save handling.

DPOStrategy is a helper class that:

  • Returns the correct trainer class (AxolotlDPOTrainer)
  • Returns the correct training args class (AxolotlDPOConfig)
  • Sets DPO-specific training argument kwargs (loss_type, label_smoothing, etc.)
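The three responsibilities above amount to a classic strategy pattern. A self-contained sketch (the stub trainer/config classes and the exact kwargs mapping are illustrative stand-ins, not Axolotl's actual implementation):

```python
# Stand-ins for AxolotlDPOTrainer / AxolotlDPOConfig.
class StubDPOTrainer: ...
class StubDPOConfig: ...

class DPOStrategy:
    @classmethod
    def get_trainer_class(cls):
        return StubDPOTrainer

    @classmethod
    def get_training_args_class(cls):
        return StubDPOConfig

    @classmethod
    def set_training_args_kwargs(cls, cfg):
        # Translate config keys into trainer-argument overrides.
        kwargs = {}
        if cfg.get("dpo_label_smoothing") is not None:
            kwargs["label_smoothing"] = cfg["dpo_label_smoothing"]
        if cfg.get("rl") == "ipo":  # "ipo" is a DPO loss variant
            kwargs["loss_type"] = "ipo"
        return kwargs
```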

Usage

This implementation is used internally when cfg.rl is set to "dpo", "ipo", or "simpo". The setup_trainer utility function routes to this builder for RL-type training.
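The routing can be pictured as a simple dispatch on cfg.rl. A hypothetical sketch (the non-RL builder name and the exact set of routed values are assumptions, not taken from the source):

```python
# Illustrative dispatch mirroring the routing described above.
RL_VALUES = {"dpo", "ipo", "simpo", "kto", "grpo"}  # assumed routed set

def select_builder_name(cfg: dict) -> str:
    rl = cfg.get("rl")
    if rl in RL_VALUES:
        # DPO-family and other RL strategies go through the RL builder.
        return "HFRLTrainerBuilder"
    # Hypothetical name for the ordinary causal-LM training path.
    return "HFCausalTrainerBuilder"
```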

Code Reference

Source Location

  • Repository: axolotl
  • File: src/axolotl/core/builders/rl.py (builder), src/axolotl/core/trainers/dpo/trainer.py (trainer), src/axolotl/core/trainers/dpo/__init__.py (strategy)
  • Lines: rl.py L25-242 (class), L195-242 (build); dpo/trainer.py (AxolotlDPOTrainer); dpo/__init__.py L7-41 (DPOStrategy)

Signature

class HFRLTrainerBuilder:
    """Builder for reinforcement learning trainers (DPO, KTO, GRPO)."""

    def __init__(self, cfg, model, tokenizer, processor=None):
        """
        Args:
            cfg: Training configuration.
            model: Policy model to train.
            tokenizer: Tokenizer.
            processor: Optional multimodal processor.
        """

    def build(self, total_num_steps: int) -> AxolotlDPOTrainer:
        """Build a configured DPO trainer instance.

        Args:
            total_num_steps: Total training steps for scheduler.

        Returns:
            AxolotlDPOTrainer: Configured trainer ready for .train().
        """


class DPOStrategy:
    """Strategy for DPO training argument configuration."""

    @classmethod
    def get_trainer_class(cls):
        return AxolotlDPOTrainer

    @classmethod
    def set_training_args_kwargs(cls, cfg):
        """Return DPO-specific training argument overrides."""


class AxolotlDPOTrainer(
    RngLoaderMixin, SchedulerMixin, OptimizerMixin,
    OptimizerInitMixin, DPOTrainer, DistributedParallelMixin,
):
    """Extended TRL DPOTrainer with Axolotl-specific features."""

Import

from axolotl.core.builders.rl import HFRLTrainerBuilder
from axolotl.core.trainers.dpo import DPOStrategy
from axolotl.core.trainers.dpo.trainer import AxolotlDPOTrainer

I/O Contract

Inputs

  • cfg (DictDefault, required): config with rl type, dpo_label_smoothing, dpo_generate_during_eval, sequence_len, etc.
  • model (PreTrainedModel or PeftModel, required): policy model to train
  • tokenizer (PreTrainedTokenizer, required): tokenizer for data processing
  • total_num_steps (int, required for build()): total training steps for the scheduler
  • model_ref (PreTrainedModel or None, optional): reference model (None for LoRA auto-unwrap or ORPO)
  • peft_config (PeftConfig or None, optional): PEFT config for LoRA training
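A minimal configuration carrying the DPO-related fields listed above might look like the following plain-dict sketch (real Axolotl configs are YAML parsed into a DictDefault; the exact value choices here are assumptions for illustration):

```python
# Illustrative DPO config fragment; keys mirror the inputs above.
cfg = {
    "rl": "dpo",                      # selects the DPO strategy
    "dpo_label_smoothing": 0.1,       # conservative-DPO smoothing
    "dpo_generate_during_eval": False,
    "sequence_len": 2048,             # max prompt+completion length
}
assert cfg["rl"] in {"dpo", "ipo", "simpo"}
```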

Outputs

  • trainer (AxolotlDPOTrainer): configured DPO trainer instance
  • train() return value (TrainOutput): training metrics, including DPO-specific chosen/rejected reward statistics

Usage Examples

Building DPO Trainer

from axolotl.core.builders.rl import HFRLTrainerBuilder

# cfg.rl = "dpo"
builder = HFRLTrainerBuilder(cfg, model, tokenizer)
trainer = builder.build(total_num_steps=1000)

# Attach train/eval datasets
trainer.train_dataset = train_dataset
trainer.eval_dataset = eval_dataset

# Execute DPO training
result = trainer.train()

High-Level DPO Training

from axolotl.train import train

# cfg.rl = "dpo" triggers DPO path automatically
model, tokenizer, trainer = train(cfg=cfg, dataset_meta=dataset_meta)

