Workflow:Huggingface Alignment handbook SFT DPO Alignment Pipeline
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, Preference_Alignment |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
End-to-end two-stage process for aligning a base language model to follow instructions and match human preferences using supervised fine-tuning (SFT) followed by direct preference optimization (DPO).
Description
This workflow implements the standard alignment pipeline used to produce models like Zephyr-7B-Beta. The process takes a pretrained base model (e.g., Mistral-7B) and transforms it into an instruction-following chat model in two stages. First, supervised fine-tuning teaches the model to follow instructions by training on curated dialogue datasets (e.g., UltraChat 200k). Second, direct preference optimization refines the model's outputs to better align with human preferences by training on chosen/rejected response pairs (e.g., UltraFeedback binarized). The pipeline uses HuggingFace TRL trainers, supports distributed training with DeepSpeed ZeRO-3 or FSDP, and is fully config-driven via YAML recipe files parsed by TrlParser.
Usage
Execute this workflow when you have a pretrained base language model and want to create an instruction-following chat model that aligns with human preferences. This is the recommended approach when you have access to multi-GPU hardware (e.g., 8 x A100 80GB), separate SFT and preference datasets, and want maximum control over each training stage. The two-stage approach allows independent tuning of instruction-following and preference alignment.
Execution Steps
Step 1: Environment Setup and Configuration
Prepare the training environment by installing the alignment-handbook package with its dependencies (transformers, TRL, DeepSpeed, Flash Attention 2). Authenticate with the Hugging Face Hub for dataset access and model uploading. Select or create a YAML recipe config that specifies the base model, dataset, training hyperparameters, and distributed training strategy.
Key considerations:
- Pin PyTorch and Flash Attention versions for reproducibility
- Choose an accelerate config matching your hardware (DDP, FSDP, or DeepSpeed ZeRO-3)
- Ensure sufficient GPU memory for the chosen model size and batch configuration
Step 2: Dataset Preparation
Load and prepare the training datasets using the alignment-handbook's dataset loading utilities. For the SFT stage, data must be in chat message format with role/content pairs. For the DPO stage, data must include chosen and rejected response pairs. The library supports single datasets or weighted mixtures of multiple datasets with configurable column selection and train/test splitting.
Key considerations:
- SFT datasets require a messages column with role/content dicts
- DPO datasets require chosen and rejected columns
- Dataset mixtures allow weighted blending with the dataset_mixture config
- A custom chat template can be applied via the config to format prompts consistently
Step 3: SFT Training
Train the base model on the instruction-following dataset using the SFTTrainer from TRL. This stage teaches the model to generate helpful responses in a conversational format. The training script loads the base model with optional quantization, applies a chat template to the tokenizer, initializes the SFT trainer with the dataset, and runs the training loop with checkpoint management.
Key considerations:
- If no chat template exists on the tokenizer, ChatML is applied by default
- Gradient checkpointing reduces memory usage for large models
- The trained SFT model becomes the input for the DPO stage
- Model card creation and Hub pushing happen automatically if configured
Step 4: DPO Training
Align the SFT model with human preferences using DPOTrainer from TRL. This stage loads the SFT checkpoint as both the policy model and reference model, then trains on preference pairs to maximize the likelihood gap between chosen and rejected responses. The DPO beta parameter controls the strength of the KL divergence constraint.
Key considerations:
- The model_name_or_path in the DPO config should point to the SFT output
- A separate reference model is loaded to compute the KL penalty
- The beta parameter (typically 0.01-0.1) balances preference alignment vs. divergence from SFT
- Max prompt length and max total length must be set to control memory usage
Step 5: Model Saving and Publishing
Save the final aligned model, generate a model card with training metadata, and optionally push to the Hugging Face Hub. The generation config is updated to use the correct EOS token, and the KV cache is re-enabled for efficient inference.
Key considerations:
- The generation config EOS token is aligned with the tokenizer to prevent unbounded generation
- Model card includes dataset name and alignment-handbook tags
- Push-to-hub is controlled by the push_to_hub config flag
Step 6: Evaluation
Evaluate the aligned model on standard chat benchmarks to measure improvement from alignment. Recommended benchmarks include MT-Bench for multi-turn dialogue quality and AlpacaEval for single-turn helpfulness.
Key considerations:
- MT-Bench requires the model name to contain "zephyr" for correct chat template loading
- Both benchmarks use LLM-as-judge (GPT-4) which introduces evaluation biases
- Human evaluation via Chatbot Arena provides complementary signal