Workflow:Huggingface Alignment handbook SFT DPO Alignment Pipeline

Knowledge Sources	Alignment Handbook Zephyr Technical Report TRL Documentation Zephyr 7B Beta
Domains	LLMs, Fine_Tuning, Preference_Alignment
Last Updated	2026-02-07 00:00 GMT

Overview

End-to-end two-stage process for aligning a base language model to follow instructions and match human preferences using supervised fine-tuning (SFT) followed by direct preference optimization (DPO).

Description

This workflow implements the standard alignment pipeline used to produce models like Zephyr-7B-Beta. The process takes a pretrained base model (e.g., Mistral-7B) and transforms it into an instruction-following chat model in two stages. First, supervised fine-tuning teaches the model to follow instructions by training on curated dialogue datasets (e.g., UltraChat 200k). Second, direct preference optimization refines the model's outputs to better align with human preferences by training on chosen/rejected response pairs (e.g., UltraFeedback binarized). The pipeline uses HuggingFace TRL trainers, supports distributed training with DeepSpeed ZeRO-3 or FSDP, and is fully config-driven via YAML recipe files parsed by TrlParser.

Usage

Execute this workflow when you have a pretrained base language model and want to create an instruction-following chat model that aligns with human preferences. This is the recommended approach when you have access to multi-GPU hardware (e.g., 8 x A100 80GB), separate SFT and preference datasets, and want maximum control over each training stage. The two-stage approach allows independent tuning of instruction-following and preference alignment.

Execution Steps

Step 1: Environment Setup and Configuration

Prepare the training environment by installing the alignment-handbook package with its dependencies (transformers, TRL, DeepSpeed, Flash Attention 2). Authenticate with the Hugging Face Hub for dataset access and model uploading. Select or create a YAML recipe config that specifies the base model, dataset, training hyperparameters, and distributed training strategy.

Key considerations:

Pin PyTorch and Flash Attention versions for reproducibility
Choose an accelerate config matching your hardware (DDP, FSDP, or DeepSpeed ZeRO-3)
Ensure sufficient GPU memory for the chosen model size and batch configuration

Step 2: Dataset Preparation

Load and prepare the training datasets using the alignment-handbook's dataset loading utilities. For the SFT stage, data must be in chat message format with role/content pairs. For the DPO stage, data must include chosen and rejected response pairs. The library supports single datasets or weighted mixtures of multiple datasets with configurable column selection and train/test splitting.

Key considerations:

SFT datasets require a messages column with role/content dicts
DPO datasets require chosen and rejected columns
Dataset mixtures allow weighted blending with the dataset_mixture config
A custom chat template can be applied via the config to format prompts consistently

Step 3: SFT Training

Train the base model on the instruction-following dataset using the SFTTrainer from TRL. This stage teaches the model to generate helpful responses in a conversational format. The training script loads the base model with optional quantization, applies a chat template to the tokenizer, initializes the SFT trainer with the dataset, and runs the training loop with checkpoint management.

Key considerations:

If no chat template exists on the tokenizer, ChatML is applied by default
Gradient checkpointing reduces memory usage for large models
The trained SFT model becomes the input for the DPO stage
Model card creation and Hub pushing happen automatically if configured

Step 4: DPO Training

Align the SFT model with human preferences using DPOTrainer from TRL. This stage loads the SFT checkpoint as both the policy model and reference model, then trains on preference pairs to maximize the likelihood gap between chosen and rejected responses. The DPO beta parameter controls the strength of the KL divergence constraint.

Key considerations:

The model_name_or_path in the DPO config should point to the SFT output
A separate reference model is loaded to compute the KL penalty
The beta parameter (typically 0.01-0.1) balances preference alignment vs. divergence from SFT
Max prompt length and max total length must be set to control memory usage

Step 5: Model Saving and Publishing

Save the final aligned model, generate a model card with training metadata, and optionally push to the Hugging Face Hub. The generation config is updated to use the correct EOS token, and the KV cache is re-enabled for efficient inference.

Key considerations:

The generation config EOS token is aligned with the tokenizer to prevent unbounded generation
Model card includes dataset name and alignment-handbook tags
Push-to-hub is controlled by the push_to_hub config flag

Step 6: Evaluation

Evaluate the aligned model on standard chat benchmarks to measure improvement from alignment. Recommended benchmarks include MT-Bench for multi-turn dialogue quality and AlpacaEval for single-turn helpfulness.

Key considerations:

MT-Bench requires the model name to contain "zephyr" for correct chat template loading
Both benchmarks use LLM-as-judge (GPT-4) which introduces evaluation biases
Human evaluation via Chatbot Arena provides complementary signal

Execution Diagram

GitHub URL

Workflow Repository