Workflow: Axolotl (axolotl-ai-cloud) DPO Preference Alignment
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Alignment, DPO, Preference_Optimization |
| Last Updated | 2026-02-06 22:00 GMT |
Overview
End-to-end process for aligning a fine-tuned language model with human preferences using Direct Preference Optimization (DPO) via Axolotl's unified YAML-driven configuration.
Description
This workflow covers preference-based alignment training where a model learns to distinguish between preferred (chosen) and dispreferred (rejected) responses. Axolotl supports DPO and related methods (IPO, KTO, ORPO) through a unified interface. The process involves loading paired preference data, setting up a policy model with optional LoRA adapters, configuring a reference model for KL-divergence regularization, and running the DPO training loop via TRL's DPO trainer. This is typically performed as a second stage after supervised fine-tuning to improve response quality and safety.
Usage
Execute this workflow when you have a preference dataset containing pairs of chosen and rejected responses for the same prompt, and you want to align a pre-trained or SFT-trained model to produce responses that better match human preferences. Common scenarios include improving helpfulness, reducing harmful outputs, and refining response style.
Execution Steps
Step 1: Configuration
Create a YAML configuration file specifying the base model (typically an SFT-trained checkpoint), the paired preference dataset, the RL method, and training parameters. The key differentiator from SFT configs is the inclusion of `rl: dpo` and the dataset fields `field_chosen` and `field_rejected`.
Key considerations:
- Set `rl: dpo` (or `ipo`, `kto`, `orpo`) in the config
- Dataset must contain both chosen and rejected response fields
- Use `chat_template` type for conversational preference data
- Configure `message_property_mappings` for role/content field names
- Sample packing is typically disabled for DPO training
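A minimal config sketch illustrating these points; the model path, dataset name, and hyperparameter values are placeholders, not recommendations:

```yaml
# Illustrative DPO config sketch; paths, dataset name, and
# hyperparameter values are placeholders.
base_model: ./outputs/sft-checkpoint   # SFT-trained starting point

rl: dpo                                # or ipo / kto / orpo

datasets:
  - path: my_org/preference-pairs      # hypothetical dataset
    type: chat_template.default
    field_chosen: chosen
    field_rejected: rejected

adapter: lora                          # optional, for memory efficiency
sequence_len: 2048
sample_packing: false                  # typically disabled for DPO

micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 1
learning_rate: 5e-6                    # DPO commonly uses a small LR
output_dir: ./outputs/dpo-out
```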
Step 2: Preference Dataset Loading
Load and format the paired preference dataset. Axolotl's preference data loader handles both standard DPO format (chosen/rejected pairs) and conversational format with message histories. The loader applies chat templates, tokenizes both chosen and rejected completions, and prepares the data for the DPO trainer.
Key considerations:
- Use `chat_template.default` type for conversational DPO data
- Ensure consistent schema across all preference pairs
- Roles mapping must match your dataset's structure (system/user/assistant)
- The loader handles conversation-level formatting automatically
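As a sketch, a conversational preference dataset whose messages store roles under `from` and text under `value` (hypothetical field names for this example) could be mapped like this:

```yaml
# Sketch of a conversational preference dataset mapping; the field
# names on the right ("conversation", "from", "value") describe a
# hypothetical dataset schema, not Axolotl defaults.
datasets:
  - path: ./data/preferences.jsonl
    type: chat_template.default
    field_messages: conversation       # shared prompt / message history
    field_chosen: chosen               # preferred final response
    field_rejected: rejected           # dispreferred final response
    message_property_mappings:
      role: from                       # dataset stores role under "from"
      content: value                   # and message text under "value"
```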
Step 3: Policy Model Loading
Load the policy model (the model to be aligned) with optional adapter configuration. For memory efficiency, LoRA or QLoRA adapters can be applied. The model is loaded with the same infrastructure as SFT training, including quantization and attention mechanism patches.
Key considerations:
- The base model is typically an SFT-trained checkpoint
- LoRA adapters reduce memory requirements significantly
- Flash attention and gradient checkpointing are recommended
- For ORPO, no reference model is needed
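The adapter and memory options referenced above can be sketched as follows; values are illustrative, not tuned recommendations:

```yaml
# Illustrative adapter / memory settings for loading the DPO policy
# model; all values are placeholders.
adapter: qlora
load_in_4bit: true                     # quantize frozen base weights
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true               # attach LoRA to all linear layers

flash_attention: true
gradient_checkpointing: true
```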
Step 4: Reference Model Setup
Configure the reference model used to compute KL-divergence regularization. This prevents the policy model from diverging too far from its initial behavior during alignment. When using LoRA adapters, Axolotl can use the base model (without adapter weights) as the implicit reference model, avoiding the need to load a separate copy.
Key considerations:
- With LoRA adapters, the reference model is implicit (base model without adapters)
- Without adapters, a separate copy of the model must be loaded
- ORPO does not require a reference model
- Reference model weights are always frozen
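The role of the reference model can be read off DPO's implicit reward, which measures how far the policy's log-probability of a response has drifted from the frozen reference:

```latex
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
```

With LoRA adapters, $\pi_{\mathrm{ref}}$ is obtained by simply disabling the adapter, which is why no second model copy is needed in that case.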
Step 5: DPO Training Execution
Run the DPO training loop using TRL's DPO trainer integrated into Axolotl. The trainer computes the DPO loss by comparing log-probabilities of chosen vs rejected responses under both the policy and reference models, then updates the policy to increase the probability gap in favor of chosen responses.
Key considerations:
- DPO loss encourages the model to prefer chosen over rejected responses
- The beta parameter controls the strength of KL regularization
- Training is typically shorter than SFT (fewer epochs)
- Monitor chosen/rejected reward margins during training
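The considerations above follow from the DPO objective itself. Writing $y_w$ for the chosen and $y_l$ for the rejected response, the trainer minimizes:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
-\,\mathbb{E}_{(x,\, y_w,\, y_l)}
\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
\right]
```

Here $\sigma$ is the logistic sigmoid. A larger $\beta$ penalizes deviation from the reference model more strongly; the chosen/rejected reward margin worth monitoring is exactly the difference of the two $\beta$-scaled log-ratio terms.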
Step 6: Aligned Model Saving
Save the aligned model (or adapter weights if using LoRA) to the output directory. The aligned model can be merged with the base model for deployment, or the adapter weights can be served alongside the base model.
Key considerations:
- Adapter-only saving produces small checkpoint files
- Use `axolotl merge-lora` for deployment-ready merged models
- Evaluate alignment quality before deployment
- Model card is generated with training metadata
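For deployment, the adapter weights can be merged back into the base model from the command line; a sketch of the invocation, where the config path and output directory are placeholders:

```shell
# Merge LoRA adapter weights into the base model for deployment.
# config.yml and the output path are placeholders for your own files.
axolotl merge-lora config.yml --lora-model-dir="./outputs/dpo-out"
```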