Workflow: ContextualAI HALOs Offline SFT + Alignment Pipeline
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, Alignment, LLM_Ops |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
End-to-end process for supervised fine-tuning (SFT) of a base language model on instruction data, aligning it with a human preference method such as DPO, KTO, or GRPO, and evaluating the result.
Description
This workflow covers the most common alignment pipeline in HALOs: starting from a pretrained base model (e.g., Llama-3-8B), performing supervised fine-tuning on instruction-following data to produce a capable instruction model, then applying an offline preference alignment method to improve response quality based on human feedback signals. The pipeline uses Hydra for configuration management, Accelerate with FSDP for distributed training, and supports 11 different alignment losses. The final model is saved and can be evaluated on standard benchmarks.
Goals:
- Produce an aligned language model from a pretrained base
- First stage (SFT) teaches the model to follow instructions
- Second stage (alignment) optimizes the model toward preferred behaviors using human feedback
Scope:
- From a HuggingFace base model to a saved, aligned checkpoint
- Covers data loading, tokenization, model setup, training loop, and checkpoint saving
Strategy:
- Uses FSDP for memory-efficient multi-GPU training
- Supports LoRA for parameter-efficient fine-tuning
- Reference model logprobs can be cached to save GPU memory during alignment
- Modular loss configs allow switching alignment methods via a single YAML parameter
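The single-parameter switch works because each loss has its own config file under config/loss/ that names the trainer and dataloader to use. A hypothetical fragment is sketched below; the field names are illustrative assumptions, not the repository's actual schema (the class names KTOTrainer and UnpairedPreferenceDataLoader are the ones this document references elsewhere):

```yaml
# config/loss/kto.yaml -- illustrative sketch only; real field names may differ
name: kto
trainer: KTOTrainer                        # trainer class selected by this loss
dataloader: UnpairedPreferenceDataLoader   # KTO consumes unpaired binary feedback
beta: 0.1                                  # KL-constraint strength (assumed default)
```

Selecting a different alignment method then reduces to overriding one Hydra group, e.g. loss=dpo instead of loss=kto, with no code changes.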
Usage
Execute this workflow when you have a pretrained base language model and want to create an instruction-following, preference-aligned model. This is the standard two-stage pipeline: first SFT on instruction data (e.g., UltraFeedback binarized), then alignment with a preference loss (e.g., KTO on binary feedback or DPO on pairwise preferences). Use this when you have a static preference dataset and do not need iterative online alignment.
Execution Steps
Step 1: Environment_Setup
Install all required dependencies using the provided installation script. This creates a Conda environment with pinned versions of PyTorch, Transformers, PEFT, Accelerate, vLLM, and evaluation tools. Configure Weights and Biases for experiment tracking.
Key considerations:
- Package versions are pinned for reproducibility; changing them may break the code
- Flash Attention requires a compatible GPU architecture
- Set up wandb with wandb login before training; use wandb offline if GPUs lack internet access
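Because versions are pinned, a quick pre-flight check can catch an accidentally upgraded package before a long training run. A minimal sketch using only the standard library (the pin list shown is illustrative, not the repository's actual requirements):

```python
from importlib import metadata

def check_pins(pins):
    """Compare installed package versions against pinned ones.

    Returns {package: (installed_version_or_None, pinned_version, matches)}.
    """
    report = {}
    for pkg, pinned in pins.items():
        try:
            installed = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            installed = None  # package missing entirely
        report[pkg] = (installed, pinned, installed == pinned)
    return report

# Illustrative pins only -- use the versions from the install script.
for pkg, (installed, pinned, ok) in check_pins(
    {"torch": "2.1.2", "transformers": "4.36.2"}
).items():
    print(f"{pkg}: installed={installed} pinned={pinned} ok={ok}")
```

Running this before launching a job makes version drift visible immediately instead of surfacing as an obscure runtime error mid-training.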
Step 2: Data_Preparation
Select or prepare the training dataset. HALOs provides built-in loaders for 12+ datasets (UltraFeedback, SHP, HH, OASST, etc.) accessible by name. Custom datasets can be provided as JSON files following the binary feedback or pairwise feedback schema. The data is loaded through dataset-specific get_{name} functions in the data module.
Key considerations:
- SFT uses SFTDataLoader which formats examples with the chat template
- Alignment methods use either PairedPreferenceDataLoader (DPO, CDPO, IPO, SimPO, SLiC) or UnpairedPreferenceDataLoader (KTO)
- Datasets can be combined by passing a list of names
- Custom JSON datasets must match the expected schema (see examples/binary_feedback.json or examples/pairwise_feedback.json)
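For a custom binary-feedback dataset, each record pairs a prompt and response with a desirable/undesirable label. The field names below are assumptions for illustration; examples/binary_feedback.json in the repository is the authoritative schema:

```python
import json

# Hypothetical record layout -- field names are illustrative, not the
# repository's schema; check examples/binary_feedback.json for the real one.
records = [
    {
        "prompt": "Explain what FSDP does in one sentence.",
        "output": "FSDP shards model parameters, gradients, and optimizer "
                  "state across GPUs to reduce per-device memory.",
        "label": True,  # True = desirable response, False = undesirable
    },
]

with open("my_binary_feedback.json", "w") as f:
    json.dump(records, f, indent=2)
```

Pairwise feedback differs only in carrying two responses per prompt (a chosen and a rejected one) instead of a single labeled response.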
Step 3: SFT_Training
Fine-tune the base model on instruction-following data using the SFT loss. This loads the pretrained model, applies the tokenizer with a chat template, creates the SFT dataloader, and runs the training loop with a cosine learning rate schedule. The model is distributed across GPUs using FSDP.
What happens:
- Hydra composes the configuration from config/config.yaml, config/loss/sft.yaml, and config/model/{model}.yaml
- Accelerator initializes FSDP with the specified number of GPUs
- Tokenizer loads and applies chat template; special tokens are added if needed
- SFTDataLoader tokenizes and batches examples
- SFTTrainer runs the training loop with cross-entropy loss on assistant tokens
- Model checkpoint is saved to cache_dir/exp_name/FINAL
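The "cross-entropy loss on assistant tokens" detail is typically implemented by masking every non-assistant position out of the label sequence so the loss ignores it. A framework-free sketch (the -100 ignore index follows the common HuggingFace convention; whether HALOs uses exactly this mechanism is an assumption):

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the CE loss

def mask_labels(token_ids, assistant_mask):
    """Keep labels only where the token belongs to an assistant turn."""
    return [
        tok if is_assistant else IGNORE_INDEX
        for tok, is_assistant in zip(token_ids, assistant_mask)
    ]

# Toy sequence: three prompt tokens followed by three assistant tokens.
labels = mask_labels([11, 12, 13, 21, 22, 23],
                     [False, False, False, True, True, True])
print(labels)  # [-100, -100, -100, 21, 22, 23]
```

The effect is that gradient signal comes only from the model's own responses, not from reproducing the user's prompt text.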
Step 4: Alignment_Training
Load the SFT checkpoint and align it with a preference optimization method. The alignment stage loads both a policy model (initialized from SFT) and a reference model (frozen copy of SFT). The loss function is determined by the loss config (e.g., kto.yaml selects KTOTrainer). Training optimizes the policy to increase probability of preferred responses relative to the reference.
What happens:
- The loss config specifies which Trainer and DataLoader classes to use
- Reference model loads from the SFT checkpoint and is frozen
- If cache_reference_logprobs=true, reference logprobs are precomputed and the reference model is freed from GPU memory
- Policy model loads from the SFT checkpoint and is trainable
- Optional LoRA can be applied for parameter-efficient training
- Training runs with the specified alignment loss (DPO, KTO, GRPO, etc.)
- Aligned model is saved to cache_dir/exp_name/FINAL
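As a concrete instance of "increasing the probability of preferred responses relative to the reference", the per-example DPO loss can be written in a few lines. This is a sketch of the published DPO formula, not HALOs' internal implementation; the beta value is illustrative:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * margin).

    The margin is the policy's log-ratio advantage on the chosen vs. rejected
    response, measured relative to the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# If the policy already prefers the chosen response more than the reference
# does, the margin is positive and the loss drops below log(2) ~= 0.693.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

Caching the two reference log-probabilities per example is exactly what makes cache_reference_logprobs=true possible: once they are precomputed, the frozen reference model is no longer needed on the GPU.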
Step 5: Model_Saving
After training completes, the final model weights are saved along with the tokenizer and training metrics. If LoRA was used, the adapter weights are merged back into the base model before saving, producing a standalone model that can be loaded without PEFT.
Key considerations:
- The final checkpoint is saved in cache_dir/exp_name/FINAL
- Intermediate checkpoints can be enabled with intermediate_checkpoints=true
- A config.yaml is saved alongside the model for reproducibility
- Training metrics (including the example counter for resuming) are saved in metrics.json
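The example counter stored in metrics.json is what allows a run to resume where it left off. A minimal reader sketch using only the standard library; the "counter" key name is a hypothetical assumption, so inspect your own metrics.json for the actual field:

```python
import json
import os

def load_example_counter(checkpoint_dir):
    """Return the saved example counter, or 0 for a fresh run.

    The "counter" key is a hypothetical field name for illustration.
    """
    path = os.path.join(checkpoint_dir, "metrics.json")
    if not os.path.exists(path):
        return 0  # no prior run: start from the beginning
    with open(path) as f:
        return json.load(f).get("counter", 0)
```

Skipping the already-seen examples on restart keeps the effective number of epochs consistent across interruptions.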