Principle: ContextualAI HALOs Iterative Alignment Loop
| Knowledge Sources | |
|---|---|
| Domains | NLP, Reinforcement_Learning, Infrastructure |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
An orchestration pattern that cycles through sampling, labeling, and training over multiple rounds, progressively improving a language model via on-policy alignment.
Description
The iterative alignment loop is a multi-round orchestration pattern that chains together the sampling, labeling, and training steps. Each round:
- Sample — Generate completions from the current policy
- Label — Score completions with a reward model
- Train — Run one pass of alignment training on the scored feedback
- Repeat — Use the newly trained model as the policy for the next round
The loop divides a total prompt budget across rounds (e.g., 2048 total prompts / 4 rounds = 512 per round). After each round, old intermediate checkpoints are cleaned up to save disk space. The loop continues until all rounds are complete, producing a final aligned model.
This pattern is implemented as a shell script rather than a Python API because it orchestrates three separate processes (vLLM sampling, Accelerate labeling, Accelerate training) that have incompatible GPU memory requirements.
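A minimal sketch of this orchestration as a bash script is shown below. The entry points (`sample.py`, `label.py`, `train.py`), their flags, and the `checkpoints/` layout are illustrative assumptions, not the actual HALOs scripts:

```bash
#!/usr/bin/env bash
# Hedged sketch of the iterative alignment loop. Script names, flags, and
# paths are illustrative assumptions, not the real HALOs entry points.
set -euo pipefail

TOTAL_PROMPTS=2048
ROUNDS=4
PER_ROUND=$(( TOTAL_PROMPTS / ROUNDS ))  # e.g. 2048 / 4 = 512 prompts per round

POLICY="checkpoints/sft"  # round 1 samples from the SFT checkpoint

for (( round = 1; round <= ROUNDS; round++ )); do
  OUT="checkpoints/round_${round}"

  # 1. Sample: generate completions from the current policy (vLLM process).
  python sample.py --model "$POLICY" --num-prompts "$PER_ROUND" \
    --output "samples_${round}.json"

  # 2. Label: score the completions with the reward model (Accelerate process).
  accelerate launch label.py --input "samples_${round}.json" \
    --output "labeled_${round}.json"

  # 3. Train: one pass of alignment training on the scored feedback
  #    (Accelerate process), starting from the current policy.
  accelerate launch train.py --model "$POLICY" --data "labeled_${round}.json" \
    --output-dir "$OUT"

  # 4. Repeat: the new checkpoint becomes the next round's policy; delete the
  #    previous intermediate checkpoint to save disk space (the SFT checkpoint
  #    is kept, since it remains the reference model throughout).
  if (( round > 1 )); then
    rm -rf "checkpoints/round_$(( round - 1 ))"
  fi
  POLICY="$OUT"
done
```

Running each phase as its own process lets vLLM claim most of GPU memory during sampling and release it entirely before Accelerate spins up for labeling and training, which is exactly the memory incompatibility that rules out a single Python API.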
Usage
Use the iterative alignment loop when on-policy training is expected to outperform offline alignment. It is particularly beneficial when the SFT model's output distribution differs significantly from the distribution of the static preference dataset, or when the model needs to iteratively improve on its own weaknesses.
Theoretical Basis
The iterative loop implements an approximate policy iteration scheme:
- For round $t = 1, \dots, T$:
  - Sample: $y_i \sim \pi_{\theta_t}(\cdot \mid x_i)$ for each prompt $x_i$ in the round's share of the budget
  - Label: $r_i = R(x_i, y_i)$ using reward model $R$
  - Train: $\theta_{t+1} \leftarrow$ one alignment-training pass on $\{(x_i, y_i, r_i)\}$, initialized from $\theta_t$
The reference model $\pi_{\mathrm{ref}}$ remains fixed throughout all rounds (always the SFT checkpoint), preventing catastrophic drift away from it.
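For concreteness, the per-round training step can be viewed as approximately maximizing reward while penalizing divergence from the fixed reference. The KL form and coefficient $\beta$ below are a standard-form assumption for illustration, not a detail stated above:

$$
\theta_{t+1} \;=\; \arg\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D}_t,\; y \sim \pi_{\theta}(\cdot \mid x)}\!\bigl[ R(x, y) \bigr]
\;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\!\bigl[ \pi_{\theta} \,\Vert\, \pi_{\mathrm{ref}} \bigr],
\qquad \pi_{\mathrm{ref}} = \pi_{\mathrm{SFT}}
$$

where $\mathcal{D}_t$ is the set of prompts allocated to round $t$. Because $\pi_{\mathrm{ref}}$ never changes, later rounds are regularized toward the original SFT model rather than toward their own predecessors.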