Workflow: ContextualAI HALOs Online Iterative Alignment
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Alignment, Online_Learning, LLM_Ops |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
Iterative alignment pipeline that alternates between sampling from the current policy, labeling those samples with a reward signal, and retraining the policy on the new feedback data across multiple rounds.
Description
This workflow implements online/iterative alignment, where the model progressively improves through multiple rounds of self-play. Unlike offline alignment (which trains once on a static dataset), online alignment generates new training data from the model itself at each round, labels it with a reward model or API, and then updates the policy. This creates a feedback loop where the training distribution matches the current model's capabilities, leading to more effective alignment. The workflow supports both DPO (pairwise) and KTO (binary/unpaired) alignment losses.
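For reference, the pairwise objective is the standard DPO loss from the DPO literature (shown here for orientation, not as this pipeline's exact implementation): given a prompt x with chosen response y_w and rejected response y_l, the policy is pushed to increase its margin over the frozen reference model:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
```

KTO instead operates on individual examples labeled desirable or undesirable, applying a Kahneman-Tversky-inspired value function to each example's reward margin, which is why it needs binary rather than pairwise feedback.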
Goals:
- Produce a progressively improving aligned model through iterative self-play
- Generate on-policy training data rather than relying on a static dataset
- Use a reward model or API to provide feedback on model-generated responses
Scope:
- From an SFT checkpoint through multiple rounds of sample-label-train
- Covers vLLM sampling, reward model labeling or API labeling, and iterative training
Strategy:
- Each round: sample from current policy using vLLM, label with reward model, train on new data
- Policy resumes from the previous round's checkpoint; reference model stays at the initial SFT checkpoint
- Configurable number of prompts per round and total prompts
- Old checkpoints are cleaned up to save disk space
Usage
Execute this workflow when you want on-policy alignment, where the model trains on its own generated outputs rather than a static dataset. This is appropriate when you have access to a reward model (trained via the Reward Model Training workflow) or an API for scoring, and you want the training distribution to match the current model. Online alignment typically outperforms offline alignment when sufficient compute is available for multiple rounds.
Execution Steps
Step 1: Prerequisite_Setup
Prepare the necessary components before starting the iterative loop. This includes having a trained SFT checkpoint (from the Offline SFT Alignment Pipeline workflow) and a reward model for labeling (from the Reward Model Training workflow) or API credentials for LLM-as-judge scoring. Configure the total number of prompts and prompts per round.
Key considerations:
- The SFT checkpoint serves as both the initial policy and the frozen reference model
- The reward model must be compatible with the labeling script (Bradley-Terry model or ArmoRM)
- Alternatively, OpenAI API can be used for labeling instead of a local reward model
- Total prompts divided by prompts per round determines the number of training rounds
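The round-count arithmetic above can be sketched as a small config helper. This is an illustrative structure, not the actual HALOs config schema; the field names are assumptions:

```python
import math
from dataclasses import dataclass

@dataclass
class OnlineRunConfig:
    """Hypothetical round-budgeting config; field names are illustrative."""
    total_prompts: int
    prompts_per_round: int

    @property
    def num_rounds(self) -> int:
        # Total prompts / prompts per round, rounded up so a partial
        # final batch still gets its own round.
        return math.ceil(self.total_prompts / self.prompts_per_round)

cfg = OnlineRunConfig(total_prompts=20000, prompts_per_round=5000)
print(cfg.num_rounds)  # → 4 rounds of sample -> label -> train
```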
Step 2: Sample_From_Policy
Use vLLM to generate multiple responses per prompt from the current policy model. The sampling script loads the model with tensor parallelism and generates outputs for a batch of prompts from the training set. Multiple samples per prompt are needed to create preference pairs.
What happens:
- vLLM loads the current checkpoint with tensor parallelism across available GPUs
- Prompts are drawn from the specified dataset (e.g., AlpacaEval, UltraFeedback)
- For each prompt, multiple samples are generated (e.g., 4 for DPO, 8 for KTO)
- Samples are written to a JSON file with prompt, output, prompt_id, and sample_id fields
- The num_skip parameter ensures different prompts are used in each round
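The record layout and num_skip offsetting described above can be sketched as follows. The generations are stubbed here (in the real pipeline they come from vLLM with n samples per prompt); only the field names match the bullet list, and the helper name is hypothetical:

```python
import json

def build_sample_records(prompts, generations_per_prompt, num_skip=0):
    """Flatten sampled completions into per-sample JSON records.

    `prompts` is the full prompt list; `num_skip` offsets into it so each
    round draws a fresh slice. `generations_per_prompt[i]` holds the
    sampled completions for prompt `num_skip + i`.
    """
    records = []
    for i, samples in enumerate(generations_per_prompt):
        prompt_id = num_skip + i
        for sample_id, text in enumerate(samples):
            records.append({
                "prompt": prompts[prompt_id],
                "output": text,
                "prompt_id": prompt_id,
                "sample_id": sample_id,
            })
    return records

# Round 2 of a toy run: skip the 2 prompts consumed in round 1.
prompts = ["p0", "p1", "p2", "p3"]
recs = build_sample_records(prompts, [["a", "b"], ["c", "d"]], num_skip=2)
with open("samples.json", "w") as f:
    json.dump(recs, f)
```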
Step 3: Label_Samples
Score the generated samples using either a reward model or an API-based judge. The labeling script assigns a scalar reward to each sample, which is then used to construct preference feedback (pairwise or binary). For pairwise feedback, pairs are formed from samples of the same prompt with different rewards.
What happens:
- If using a reward model: the model scores each (prompt, response) pair using distributed inference via Accelerate
- If using an API: samples are scored asynchronously through the OpenAI API with a quality rubric
- Scored samples are converted to feedback format:
  - Pairwise: pairs of (chosen, rejected) based on reward scores with configurable thresholds and modes (random, max-gap, min-gap)
  - Binary: each sample labeled as desirable/undesirable based on a threshold (mean, median, or fixed)
- Feedback is written to a JSON file compatible with HALOs dataloaders
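The two feedback conversions can be sketched as below: the max-gap pairwise mode and the threshold-based binary mode. Function names and the exact output dict keys are illustrative, not the HALOs feedback schema:

```python
from statistics import mean, median

def pairwise_max_gap(scored, min_margin=0.0):
    """'max-gap' mode: pair the highest- and lowest-reward responses.

    `scored` is a list of (response, reward) for one prompt. Returns None
    when the reward gap is within `min_margin`, i.e. the prompt is skipped.
    """
    best = max(scored, key=lambda s: s[1])
    worst = min(scored, key=lambda s: s[1])
    if best[1] - worst[1] <= min_margin:
        return None
    return {"chosen": best[0], "rejected": worst[0]}

def binary_labels(scored, threshold="mean"):
    """Binary (KTO-style) feedback: desirable iff reward clears the threshold."""
    rewards = [r for _, r in scored]
    if threshold == "mean":
        t = mean(rewards)
    elif threshold == "median":
        t = median(rewards)
    else:
        t = float(threshold)  # fixed numeric threshold
    return [{"output": resp, "desirable": r >= t} for resp, r in scored]

scored = [("resp_a", 1.0), ("resp_b", 3.0), ("resp_c", 2.0)]
print(pairwise_max_gap(scored))  # → {'chosen': 'resp_b', 'rejected': 'resp_a'}
print(binary_labels(scored))
```

The min-gap and random modes differ only in which pair is kept; the thresholding logic for binary feedback is shared regardless of which statistic defines the cutoff.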
Step 4: Train_On_Feedback
Run a single round of alignment training on the newly generated and labeled data. The policy resumes from the previous round's checkpoint, while the reference model remains fixed at the original SFT checkpoint. The online=true flag signals the trainer to handle the iterative training context.
Key considerations:
- First round: model.load_from points to the SFT checkpoint
- Subsequent rounds: model.from_checkpoint resumes the previous policy; model.load_from loads the SFT reference
- The online=true flag adjusts training behavior for the iterative context
- Each round trains only on the data generated in that round
- Optimizer and scheduler states are preserved across rounds via checkpointing
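The first-round vs. later-round checkpoint wiring can be sketched as a builder for Hydra-style overrides. The keys model.load_from, model.from_checkpoint, and online=true come from the bullets above; the helper name and paths are illustrative assumptions:

```python
def round_overrides(round_idx, sft_ckpt, prev_ckpt):
    """Build Hydra-style config overrides for one training round (sketch)."""
    overrides = ["online=true"]
    if round_idx == 0:
        # First round: the SFT checkpoint initializes both policy and reference.
        overrides.append(f"model.load_from={sft_ckpt}")
    else:
        # Later rounds: resume the policy (with optimizer/scheduler state)
        # from the previous round; keep the SFT model as the frozen reference.
        overrides.append(f"model.from_checkpoint={prev_ckpt}")
        overrides.append(f"model.load_from={sft_ckpt}")
    return overrides

print(round_overrides(1, "ckpts/sft", "ckpts/round0"))
```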
Step 5: Iterate_Or_Complete
After each training round, update the current checkpoint pointer and decide whether to continue. If more prompts remain and samples were successfully generated, loop back to Step 2 with the newly trained model. Otherwise, end the iterative process. Old checkpoint directories are cleaned up to conserve disk space.
What happens:
- The current checkpoint pointer is updated to the newly saved model
- The cumulative prompt counter is incremented
- If prompts remain and the last sampling produced outputs, the loop continues
- If no new samples were generated (exit code 1) or all prompts are exhausted, training ends
- Previous round directories (except SFT) are deleted to save space
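The loop-control bullets above can be condensed into one state-update helper. This is a hypothetical sketch (the state dict, helper name, and keep-list are assumptions); only the decision logic — pointer update, prompt counter, exit-code check, cleanup — follows the description:

```python
import shutil
from pathlib import Path

def advance_round(state, new_ckpt, sampler_exit_code, prompts_this_round,
                  total_prompts, keep=("sft",)):
    """Update loop state after one training round; return True to continue.

    Mirrors the bullets above: move the checkpoint pointer, bump the
    cumulative prompt counter, delete the superseded round directory
    (unless protected, e.g. the SFT checkpoint), and stop when the
    sampler produced nothing (exit code 1) or prompts are exhausted.
    """
    old_ckpt = state["current_ckpt"]
    state["current_ckpt"] = new_ckpt
    state["prompts_done"] += prompts_this_round
    if old_ckpt and Path(old_ckpt).name not in keep and Path(old_ckpt).exists():
        shutil.rmtree(old_ckpt)  # reclaim disk space from the stale round
    if sampler_exit_code != 0:
        return False  # no new samples were generated
    return state["prompts_done"] < total_prompts
```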