Workflow: ContextualAI HALOs Offline SFT + Alignment Pipeline
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, Alignment, LLM_Ops |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
End-to-end process for supervised fine-tuning (SFT) of a base language model on instruction data, aligning it with a human preference method such as DPO, KTO, or GRPO, and evaluating the result.
Description
This workflow covers the most common alignment pipeline in HALOs: starting from a pretrained base model (e.g., Llama-3-8B), performing supervised fine-tuning on instruction-following data to produce a capable instruction model, then applying an offline preference alignment method to improve response quality based on human feedback signals. The pipeline uses Hydra for configuration management, Accelerate with FSDP for distributed training, and supports 11 different alignment losses. The final model is saved and can be evaluated on standard benchmarks.
Goals:
- Produce an aligned language model from a pretrained base
- First stage (SFT) teaches the model to follow instructions
- Second stage (alignment) optimizes the model toward preferred behaviors using human feedback
Scope:
- From a HuggingFace base model to a saved, aligned checkpoint
- Covers data loading, tokenization, model setup, training loop, and checkpoint saving
Strategy:
- Uses FSDP for memory-efficient multi-GPU training
- Supports LoRA for parameter-efficient fine-tuning
- Reference model logprobs can be cached to save GPU memory during alignment
- Modular loss configs allow switching alignment methods via a single YAML parameter
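The single-parameter switch works because each loss has its own config file under config/loss/ that names the trainer and dataloader to use. A hypothetical fragment is sketched below; the field names are illustrative assumptions, not the repository's actual schema (the class names KTOTrainer and UnpairedPreferenceDataLoader are the ones this document references elsewhere):

```yaml
# config/loss/kto.yaml -- illustrative sketch only; real field names may differ
name: kto
trainer: KTOTrainer                        # trainer class selected by this loss
dataloader: UnpairedPreferenceDataLoader   # KTO consumes unpaired binary feedback
beta: 0.1                                  # KL-constraint strength (assumed default)
```

Selecting a different alignment method then reduces to overriding one Hydra group, e.g. loss=dpo instead of loss=kto, with no code changes.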
Usage
Execute this workflow when you have a pretrained base language model and want to create an instruction-following, preference-aligned model. This is the standard two-stage pipeline: first SFT on instruction data (e.g., UltraFeedback binarized), then alignment with a preference loss (e.g., KTO on binary feedback or DPO on pairwise preferences). Use this when you have a static preference dataset and do not need iterative online alignment.
Execution Steps
Step 1: Environment_Setup
Install all required dependencies using the provided installation script. This creates a Conda environment with pinned versions of PyTorch, Transformers, PEFT, Accelerate, vLLM, and evaluation tools. Configure Weights and Biases for experiment tracking.
Key considerations:
- Package versions are pinned for reproducibility; changing them may break the code
- Flash Attention requires a compatible GPU architecture
- Set up wandb with wandb login before training; use wandb offline if GPUs lack internet access
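Because versions are pinned, a quick pre-flight check can catch an accidentally upgraded package before a long training run. A minimal sketch using only the standard library (the pin list shown is illustrative, not the repository's actual requirements):

```python
from importlib import metadata

def check_pins(pins):
    """Compare installed package versions against pinned ones.

    Returns {package: (installed_version_or_None, pinned_version, matches)}.
    """
    report = {}
    for pkg, pinned in pins.items():
        try:
            installed = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            installed = None  # package missing entirely
        report[pkg] = (installed, pinned, installed == pinned)
    return report

# Illustrative pins only -- use the versions from the install script.
for pkg, (installed, pinned, ok) in check_pins(
    {"torch": "2.1.2", "transformers": "4.36.2"}
).items():
    print(f"{pkg}: installed={installed} pinned={pinned} ok={ok}")
```

Running this before launching a job makes version drift visible immediately instead of surfacing as an obscure runtime error mid-training.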
Step 2: Data_Preparation
Select or prepare the training dataset. HALOs provides built-in loaders for 12+ datasets (UltraFeedback, SHP, HH, OASST, etc.) accessible by name. Custom datasets can be provided as JSON files following the binary feedback or pairwise feedback schema. The data is loaded through dataset-specific get_{name} functions in the data module.
Key considerations:
- SFT uses SFTDataLoader which formats examples with the chat template
- Alignment methods use either PairedPreferenceDataLoader (DPO, CDPO, IPO, SimPO, SLiC) or UnpairedPreferenceDataLoader (KTO)
- Datasets can be combined by passing a list of names
- Custom JSON datasets must match the expected schema (see examples/binary_feedback.json or examples/pairwise_feedback.json)
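For a custom binary-feedback dataset, each record pairs a prompt and response with a desirable/undesirable label. The field names below are assumptions for illustration; examples/binary_feedback.json in the repository is the authoritative schema:

```python
import json

# Hypothetical record layout -- field names are illustrative, not the
# repository's schema; check examples/binary_feedback.json for the real one.
records = [
    {
        "prompt": "Explain what FSDP does in one sentence.",
        "output": "FSDP shards model parameters, gradients, and optimizer "
                  "state across GPUs to reduce per-device memory.",
        "label": True,  # True = desirable response, False = undesirable
    },
]

with open("my_binary_feedback.json", "w") as f:
    json.dump(records, f, indent=2)
```

Pairwise feedback differs only in carrying two responses per prompt (a chosen and a rejected one) instead of a single labeled response.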
Step 3: SFT_Training
Fine-tune the base model on instruction-following data using the SFT loss. This loads the pretrained model, applies the tokenizer with a chat template, creates the SFT dataloader, and runs the training loop with a cosine learning rate schedule. The model is distributed across GPUs using FSDP.
What happens:
- Hydra composes the configuration from config/config.yaml, config/loss/sft.yaml, and config/model/{model}.yaml
- Accelerator initializes FSDP with the specified number of GPUs
- Tokenizer loads and applies chat template; special tokens are added if needed
- SFTDataLoader tokenizes and batches examples
- SFTTrainer runs the training loop with cross-entropy loss on assistant tokens
- Model checkpoint is saved to cache_dir/exp_name/FINAL
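The "cross-entropy loss on assistant tokens" detail is typically implemented by masking every non-assistant position out of the label sequence so the loss ignores it. A framework-free sketch (the -100 ignore index follows the common HuggingFace convention; whether HALOs uses exactly this mechanism is an assumption):

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the CE loss

def mask_labels(token_ids, assistant_mask):
    """Keep labels only where the token belongs to an assistant turn."""
    return [
        tok if is_assistant else IGNORE_INDEX
        for tok, is_assistant in zip(token_ids, assistant_mask)
    ]

# Toy sequence: three prompt tokens followed by three assistant tokens.
labels = mask_labels([11, 12, 13, 21, 22, 23],
                     [False, False, False, True, True, True])
print(labels)  # [-100, -100, -100, 21, 22, 23]
```

The effect is that gradient signal comes only from the model's own responses, not from reproducing the user's prompt text.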
Step 4: Alignment_Training
Load the SFT checkpoint and align it with a preference optimization method. The alignment stage loads both a policy model (initialized from SFT) and a reference model (frozen copy of SFT). The loss function is determined by the loss config (e.g., kto.yaml selects KTOTrainer). Training optimizes the policy to increase probability of preferred responses relative to the reference.
What happens:
- The loss config specifies which Trainer and DataLoader classes to use
- Reference model loads from the SFT checkpoint and is frozen
- If cache_reference_logprobs=true, reference logprobs are precomputed and the reference model is freed from GPU memory
- Policy model loads from the SFT checkpoint and is trainable
- Optional LoRA can be applied for parameter-efficient training
- Training runs with the specified alignment loss (DPO, KTO, GRPO, etc.)
- Aligned model is saved to cache_dir/exp_name/FINAL
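As a concrete instance of "increasing the probability of preferred responses relative to the reference", the per-example DPO loss can be written in a few lines. This is a sketch of the published DPO formula, not HALOs' internal implementation; the beta value is illustrative:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * margin).

    The margin is the policy's log-ratio advantage on the chosen vs. rejected
    response, measured relative to the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# If the policy already prefers the chosen response more than the reference
# does, the margin is positive and the loss drops below log(2) ~= 0.693.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

Caching the two reference log-probabilities per example is exactly what makes cache_reference_logprobs=true possible: once they are precomputed, the frozen reference model is no longer needed on the GPU.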
Step 5: Model_Saving
After training completes, the final model weights are saved along with the tokenizer and training metrics. If LoRA was used, the adapter weights are merged back into the base model before saving, producing a standalone model that can be loaded without PEFT.
Key considerations:
- The final checkpoint is saved in cache_dir/exp_name/FINAL
- Intermediate checkpoints can be enabled with intermediate_checkpoints=true
- A config.yaml is saved alongside the model for reproducibility
- Training metrics (including the example counter for resuming) are saved in metrics.json
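The example counter stored in metrics.json is what allows a run to resume where it left off. A minimal reader sketch using only the standard library; the "counter" key name is a hypothetical assumption, so inspect your own metrics.json for the actual field:

```python
import json
import os

def load_example_counter(checkpoint_dir):
    """Return the saved example counter, or 0 for a fresh run.

    The "counter" key is a hypothetical field name for illustration.
    """
    path = os.path.join(checkpoint_dir, "metrics.json")
    if not os.path.exists(path):
        return 0  # no prior run: start from the beginning
    with open(path) as f:
        return json.load(f).get("counter", 0)
```

Skipping the already-seen examples on restart keeps the effective number of epochs consistent across interruptions.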