Workflow: ContextualAI HALOs Online Iterative Alignment
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Alignment, Online_Learning, LLM_Ops |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
Iterative alignment pipeline that alternates between sampling from the current policy, labeling those samples with a reward signal, and retraining the policy on the new feedback data across multiple rounds.
Description
This workflow implements online/iterative alignment, where the model progressively improves through multiple rounds of self-play. Unlike offline alignment (which trains once on a static dataset), online alignment generates new training data from the model itself at each round, labels it with a reward model or API, and then updates the policy. This creates a feedback loop where the training distribution matches the current model's capabilities, leading to more effective alignment. The workflow supports both DPO (pairwise) and KTO (binary/unpaired) alignment losses.
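For reference, the pairwise objective is the standard DPO loss from the DPO literature (shown here for orientation, not as this pipeline's exact implementation): given a prompt x with chosen response y_w and rejected response y_l, the policy is pushed to increase its margin over the frozen reference model:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
```

KTO instead operates on individual examples labeled desirable or undesirable, applying a Kahneman-Tversky-inspired value function to each example's reward margin, which is why it needs binary rather than pairwise feedback.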
Goals:
- Produce a progressively improving aligned model through iterative self-play
- Generate on-policy training data rather than relying on a static dataset
- Use a reward model or API to provide feedback on model-generated responses
Scope:
- From an SFT checkpoint through multiple rounds of sample-label-train
- Covers vLLM sampling, reward model labeling or API labeling, and iterative training
Strategy:
- Each round: sample from current policy using vLLM, label with reward model, train on new data
- Policy resumes from the previous round's checkpoint; reference model stays at the initial SFT checkpoint
- Configurable number of prompts per round and total prompts
- Old checkpoints are cleaned up to save disk space
Usage
Execute this workflow when you want on-policy alignment, where the model trains on its own generated outputs rather than a static dataset. This is appropriate when you have access to a reward model (trained via the Reward Model Training workflow) or an API for scoring, and you want the training distribution to match the current model. Online alignment typically outperforms offline alignment when sufficient compute is available for multiple rounds.
Execution Steps
Step 1: Prerequisite_Setup
Prepare the necessary components before starting the iterative loop. This includes having a trained SFT checkpoint (from the Offline SFT Alignment Pipeline workflow) and a reward model for labeling (from the Reward Model Training workflow) or API credentials for LLM-as-judge scoring. Configure the total number of prompts and prompts per round.
Key considerations:
- The SFT checkpoint serves as both the initial policy and the frozen reference model
- The reward model must be compatible with the labeling script (Bradley-Terry model or ArmoRM)
- Alternatively, OpenAI API can be used for labeling instead of a local reward model
- Total prompts divided by prompts per round determines the number of training rounds
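The round-count arithmetic above can be sketched as a small config helper. This is an illustrative structure, not the actual HALOs config schema; the field names are assumptions:

```python
import math
from dataclasses import dataclass

@dataclass
class OnlineRunConfig:
    """Hypothetical round-budgeting config; field names are illustrative."""
    total_prompts: int
    prompts_per_round: int

    @property
    def num_rounds(self) -> int:
        # Total prompts / prompts per round, rounded up so a partial
        # final batch still gets its own round.
        return math.ceil(self.total_prompts / self.prompts_per_round)

cfg = OnlineRunConfig(total_prompts=20000, prompts_per_round=5000)
print(cfg.num_rounds)  # → 4 rounds of sample -> label -> train
```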
Step 2: Sample_From_Policy
Use vLLM to generate multiple responses per prompt from the current policy model. The sampling script loads the model with tensor parallelism and generates outputs for a batch of prompts from the training set. Multiple samples per prompt are needed to create preference pairs.
What happens:
- vLLM loads the current checkpoint with tensor parallelism across available GPUs
- Prompts are drawn from the specified dataset (e.g., AlpacaEval, UltraFeedback)
- For each prompt, multiple samples are generated (e.g., 4 for DPO, 8 for KTO)
- Samples are written to a JSON file with prompt, output, prompt_id, and sample_id fields
- The num_skip parameter ensures different prompts are used in each round
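The record layout and num_skip offsetting described above can be sketched as follows. The generations are stubbed here (in the real pipeline they come from vLLM with n samples per prompt); only the field names match the bullet list, and the helper name is hypothetical:

```python
import json

def build_sample_records(prompts, generations_per_prompt, num_skip=0):
    """Flatten sampled completions into per-sample JSON records.

    `prompts` is the full prompt list; `num_skip` offsets into it so each
    round draws a fresh slice. `generations_per_prompt[i]` holds the
    sampled completions for prompt `num_skip + i`.
    """
    records = []
    for i, samples in enumerate(generations_per_prompt):
        prompt_id = num_skip + i
        for sample_id, text in enumerate(samples):
            records.append({
                "prompt": prompts[prompt_id],
                "output": text,
                "prompt_id": prompt_id,
                "sample_id": sample_id,
            })
    return records

# Round 2 of a toy run: skip the 2 prompts consumed in round 1.
prompts = ["p0", "p1", "p2", "p3"]
recs = build_sample_records(prompts, [["a", "b"], ["c", "d"]], num_skip=2)
with open("samples.json", "w") as f:
    json.dump(recs, f)
```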
Step 3: Label_Samples
Score the generated samples using either a reward model or an API-based judge. The labeling script assigns a scalar reward to each sample, which is then used to construct preference feedback (pairwise or binary). For pairwise feedback, pairs are formed from samples of the same prompt with different rewards.
What happens:
- If using a reward model: the model scores each (prompt, response) pair using distributed inference via Accelerate
- If using an API: samples are scored asynchronously through the OpenAI API with a quality rubric
- Scored samples are converted to feedback format:
  - Pairwise: pairs of (chosen, rejected) based on reward scores with configurable thresholds and modes (random, max-gap, min-gap)
  - Binary: each sample labeled as desirable/undesirable based on a threshold (mean, median, or fixed)
- Feedback is written to a JSON file compatible with HALOs dataloaders
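The two feedback conversions can be sketched as below: the max-gap pairwise mode and the threshold-based binary mode. Function names and the exact output dict keys are illustrative, not the HALOs feedback schema:

```python
from statistics import mean, median

def pairwise_max_gap(scored, min_margin=0.0):
    """'max-gap' mode: pair the highest- and lowest-reward responses.

    `scored` is a list of (response, reward) for one prompt. Returns None
    when the reward gap is within `min_margin`, i.e. the prompt is skipped.
    """
    best = max(scored, key=lambda s: s[1])
    worst = min(scored, key=lambda s: s[1])
    if best[1] - worst[1] <= min_margin:
        return None
    return {"chosen": best[0], "rejected": worst[0]}

def binary_labels(scored, threshold="mean"):
    """Binary (KTO-style) feedback: desirable iff reward clears the threshold."""
    rewards = [r for _, r in scored]
    if threshold == "mean":
        t = mean(rewards)
    elif threshold == "median":
        t = median(rewards)
    else:
        t = float(threshold)  # fixed numeric threshold
    return [{"output": resp, "desirable": r >= t} for resp, r in scored]

scored = [("resp_a", 1.0), ("resp_b", 3.0), ("resp_c", 2.0)]
print(pairwise_max_gap(scored))  # → {'chosen': 'resp_b', 'rejected': 'resp_a'}
print(binary_labels(scored))
```

The min-gap and random modes differ only in which pair is kept; the thresholding logic for binary feedback is shared regardless of which statistic defines the cutoff.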
Step 4: Train_On_Feedback
Run a single round of alignment training on the newly generated and labeled data. The policy resumes from the previous round's checkpoint, while the reference model remains fixed at the original SFT checkpoint. The online=true flag signals the trainer to handle the iterative training context.
Key considerations:
- First round: model.load_from points to the SFT checkpoint
- Subsequent rounds: model.from_checkpoint resumes the previous policy; model.load_from loads the SFT reference
- The online=true flag adjusts training behavior for the iterative context
- Each round trains only on the data generated in that round
- Optimizer and scheduler states are preserved across rounds via checkpointing
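The first-round vs. later-round checkpoint wiring can be sketched as a builder for Hydra-style overrides. The keys model.load_from, model.from_checkpoint, and online=true come from the bullets above; the helper name and paths are illustrative assumptions:

```python
def round_overrides(round_idx, sft_ckpt, prev_ckpt):
    """Build Hydra-style config overrides for one training round (sketch)."""
    overrides = ["online=true"]
    if round_idx == 0:
        # First round: the SFT checkpoint initializes both policy and reference.
        overrides.append(f"model.load_from={sft_ckpt}")
    else:
        # Later rounds: resume the policy (with optimizer/scheduler state)
        # from the previous round; keep the SFT model as the frozen reference.
        overrides.append(f"model.from_checkpoint={prev_ckpt}")
        overrides.append(f"model.load_from={sft_ckpt}")
    return overrides

print(round_overrides(1, "ckpts/sft", "ckpts/round0"))
```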
Step 5: Iterate_Or_Complete
After each training round, update the current checkpoint pointer and decide whether to continue. If more prompts remain and samples were successfully generated, loop back to Step 2 with the newly trained model. Otherwise, end the iterative process. Old checkpoint directories are cleaned up to conserve disk space.
What happens:
- The current checkpoint pointer is updated to the newly saved model
- The cumulative prompt counter is incremented
- If prompts remain and the last sampling produced outputs, the loop continues
- If no new samples were generated (exit code 1) or all prompts are exhausted, training ends
- Previous round directories (except SFT) are deleted to save space
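The loop-control bullets above can be condensed into one state-update helper. This is a hypothetical sketch (the state dict, helper name, and keep-list are assumptions); only the decision logic — pointer update, prompt counter, exit-code check, cleanup — follows the description:

```python
import shutil
from pathlib import Path

def advance_round(state, new_ckpt, sampler_exit_code, prompts_this_round,
                  total_prompts, keep=("sft",)):
    """Update loop state after one training round; return True to continue.

    Mirrors the bullets above: move the checkpoint pointer, bump the
    cumulative prompt counter, delete the superseded round directory
    (unless protected, e.g. the SFT checkpoint), and stop when the
    sampler produced nothing (exit code 1) or prompts are exhausted.
    """
    old_ckpt = state["current_ckpt"]
    state["current_ckpt"] = new_ckpt
    state["prompts_done"] += prompts_this_round
    if old_ckpt and Path(old_ckpt).name not in keep and Path(old_ckpt).exists():
        shutil.rmtree(old_ckpt)  # reclaim disk space from the stale round
    if sampler_exit_code != 0:
        return False  # no new samples were generated
    return state["prompts_done"] < total_prompts
```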