Workflow: Axolotl (axolotl-ai-cloud) DPO Preference Alignment
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Alignment, DPO, Preference_Optimization |
| Last Updated | 2026-02-06 22:00 GMT |
Overview
End-to-end process for aligning a fine-tuned language model with human preferences using Direct Preference Optimization (DPO) via Axolotl's unified YAML-driven configuration.
Description
This workflow covers preference-based alignment training where a model learns to distinguish between preferred (chosen) and dispreferred (rejected) responses. Axolotl supports DPO and related methods (IPO, KTO, ORPO) through a unified interface. The process involves loading paired preference data, setting up a policy model with optional LoRA adapters, configuring a reference model for KL-divergence regularization, and running the DPO training loop via TRL's DPO trainer. This is typically performed as a second stage after supervised fine-tuning to improve response quality and safety.
Usage
Execute this workflow when you have a preference dataset containing pairs of chosen and rejected responses for the same prompt, and you want to align a pre-trained or SFT-trained model to produce responses that better match human preferences. Common scenarios include improving helpfulness, reducing harmful outputs, and refining response style.
Execution Steps
Step 1: Configuration
Create a YAML configuration file specifying the base model (typically an SFT-trained checkpoint), the paired preference dataset, the RL method, and training parameters. The key differentiator from SFT configs is the inclusion of `rl: dpo` and the dataset fields `field_chosen` and `field_rejected`.
Key considerations:
- Set `rl: dpo` (or `ipo`, `kto`, `orpo`) in the config
- Dataset must contain both chosen and rejected response fields
- Use `chat_template` type for conversational preference data
- Configure `message_property_mappings` for role/content field names
- Sample packing is typically disabled for DPO training
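A minimal config sketch illustrating these points; the model path, dataset name, and hyperparameter values are placeholders, not recommendations:

```yaml
# Illustrative DPO config sketch; paths, dataset name, and
# hyperparameter values are placeholders.
base_model: ./outputs/sft-checkpoint   # SFT-trained starting point

rl: dpo                                # or ipo / kto / orpo

datasets:
  - path: my_org/preference-pairs      # hypothetical dataset
    type: chat_template.default
    field_chosen: chosen
    field_rejected: rejected

adapter: lora                          # optional, for memory efficiency
sequence_len: 2048
sample_packing: false                  # typically disabled for DPO

micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 1
learning_rate: 5e-6                    # DPO commonly uses a small LR
output_dir: ./outputs/dpo-out
```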
Step 2: Preference Dataset Loading
Load and format the paired preference dataset. Axolotl's preference data loader handles both standard DPO format (chosen/rejected pairs) and conversational format with message histories. The loader applies chat templates, tokenizes both chosen and rejected completions, and prepares the data for the DPO trainer.
Key considerations:
- Use `chat_template.default` type for conversational DPO data
- Ensure consistent schema across all preference pairs
- Roles mapping must match your dataset's structure (system/user/assistant)
- The loader handles conversation-level formatting automatically
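As a sketch, a conversational preference dataset whose messages store roles under `from` and text under `value` (hypothetical field names for this example) could be mapped like this:

```yaml
# Sketch of a conversational preference dataset mapping; the field
# names on the right ("conversation", "from", "value") describe a
# hypothetical dataset schema, not Axolotl defaults.
datasets:
  - path: ./data/preferences.jsonl
    type: chat_template.default
    field_messages: conversation       # shared prompt / message history
    field_chosen: chosen               # preferred final response
    field_rejected: rejected           # dispreferred final response
    message_property_mappings:
      role: from                       # dataset stores role under "from"
      content: value                   # and message text under "value"
```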
Step 3: Policy Model Loading
Load the policy model (the model to be aligned) with optional adapter configuration. For memory efficiency, LoRA or QLoRA adapters can be applied. The model is loaded with the same infrastructure as SFT training, including quantization and attention mechanism patches.
Key considerations:
- The base model is typically an SFT-trained checkpoint
- LoRA adapters reduce memory requirements significantly
- Flash attention and gradient checkpointing are recommended
- For ORPO, no reference model is needed
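The adapter and memory options referenced above can be sketched as follows; values are illustrative, not tuned recommendations:

```yaml
# Illustrative adapter / memory settings for loading the DPO policy
# model; all values are placeholders.
adapter: qlora
load_in_4bit: true                     # quantize frozen base weights
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true               # attach LoRA to all linear layers

flash_attention: true
gradient_checkpointing: true
```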
Step 4: Reference Model Setup
Configure the reference model used to compute KL-divergence regularization. This prevents the policy model from diverging too far from its initial behavior during alignment. When using LoRA adapters, Axolotl can use the base model (without adapter weights) as the implicit reference model, avoiding the need to load a separate copy.
Key considerations:
- With LoRA adapters, the reference model is implicit (base model without adapters)
- Without adapters, a separate copy of the model must be loaded
- ORPO does not require a reference model
- Reference model weights are always frozen
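The role of the reference model can be read off DPO's implicit reward, which measures how far the policy's log-probability of a response has drifted from the frozen reference:

```latex
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
```

With LoRA adapters, $\pi_{\mathrm{ref}}$ is obtained by simply disabling the adapter, which is why no second model copy is needed in that case.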
Step 5: DPO Training Execution
Run the DPO training loop using TRL's DPO trainer integrated into Axolotl. The trainer computes the DPO loss by comparing log-probabilities of chosen vs rejected responses under both the policy and reference models, then updates the policy to increase the probability gap in favor of chosen responses.
Key considerations:
- DPO loss encourages the model to prefer chosen over rejected responses
- The beta parameter controls the strength of KL regularization
- Training is typically shorter than SFT (fewer epochs)
- Monitor chosen/rejected reward margins during training
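The considerations above follow from the DPO objective itself. Writing $y_w$ for the chosen and $y_l$ for the rejected response, the trainer minimizes:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
-\,\mathbb{E}_{(x,\, y_w,\, y_l)}
\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
\right]
```

Here $\sigma$ is the logistic sigmoid. A larger $\beta$ penalizes deviation from the reference model more strongly; the chosen/rejected reward margin worth monitoring is exactly the difference of the two $\beta$-scaled log-ratio terms.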
Step 6: Aligned Model Saving
Save the aligned model (or adapter weights if using LoRA) to the output directory. The aligned model can be merged with the base model for deployment, or the adapter weights can be served alongside the base model.
Key considerations:
- Adapter-only saving produces small checkpoint files
- Use `axolotl merge-lora` for deployment-ready merged models
- Evaluate alignment quality before deployment
- Model card is generated with training metadata
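For deployment, the adapter weights can be merged back into the base model from the command line; a sketch of the invocation, where the config path and output directory are placeholders:

```shell
# Merge LoRA adapter weights into the base model for deployment.
# config.yml and the output path are placeholders for your own files.
axolotl merge-lora config.yml --lora-model-dir="./outputs/dpo-out"
```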