Principle:ContextualAI HALOs Iterative Alignment Loop

Knowledge Sources	ContextualAI HALOs, Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
Domains	NLP, Reinforcement_Learning, Infrastructure
Last Updated	2026-02-08 03:00 GMT

Overview

An orchestration pattern that repeatedly cycles through sampling, labeling, and training to progressively improve a language model through on-policy alignment.

Description

The iterative alignment loop is a multi-round orchestration pattern that chains together the sampling, labeling, and training steps. Each round:

1. Sample: Generate completions from the current policy
2. Label: Score completions with a reward model
3. Train: Run one pass of alignment training on the scored feedback
4. Repeat: Use the newly trained model as the policy for the next round

The loop divides a total prompt budget across rounds (e.g., 2048 total prompts / 4 rounds = 512 per round). After each round, old intermediate checkpoints are cleaned up to save disk space. The loop continues until all rounds are complete, producing a final aligned model.
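The budget arithmetic above can be sketched in a few lines. This is a minimal illustration, not the script's actual logic: the function names are hypothetical, and both the handling of a budget that does not divide evenly and the rule for which checkpoints are safe to delete are assumptions.

```python
def split_prompt_budget(total_prompts: int, num_rounds: int) -> list[int]:
    """Divide a total prompt budget across rounds as evenly as possible."""
    if num_rounds <= 0:
        raise ValueError("num_rounds must be positive")
    if total_prompts < 0:
        raise ValueError("total_prompts must be non-negative")
    base, remainder = divmod(total_prompts, num_rounds)
    # Assumption: leftover prompts go to the earliest rounds, one each.
    return [base + (1 if i < remainder else 0) for i in range(num_rounds)]


def checkpoints_to_delete(completed_round: int) -> list[int]:
    """Rounds whose checkpoints are no longer needed after `completed_round`.

    Assumption: round 0 is the SFT checkpoint and is never deleted, and only
    the newest trained checkpoint is kept as the next round's policy.
    """
    return list(range(1, completed_round))


print(split_prompt_budget(2048, 4))  # [512, 512, 512, 512]
print(checkpoints_to_delete(3))      # [1, 2]
```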

This pattern is implemented as a shell script rather than a Python API because it orchestrates three separate processes (vLLM sampling, Accelerate labeling, Accelerate training) that have incompatible GPU memory requirements.
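To make the process boundary concrete, here is a rough Python sketch of the same orchestration idea: each stage runs as its own operating-system process, so its GPU memory is released when it exits and before the next stage starts. The script names (`./sample_stage.sh` and so on), argument order, and file names are placeholders invented for illustration; they are not the actual HALOs commands.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], None]


def run_in_subprocess(argv: Sequence[str]) -> None:
    """Run one stage as a separate process and fail loudly if it errors."""
    subprocess.run(list(argv), check=True)


def stage_commands(round_idx: int, policy: str, reference: str,
                   n_prompts: int) -> list[list[str]]:
    """Build the three per-round commands (all names are hypothetical)."""
    samples = f"round{round_idx}_samples.json"
    labeled = f"round{round_idx}_labeled.json"
    new_policy = f"round{round_idx}_policy"
    return [
        ["./sample_stage.sh", policy, str(n_prompts), samples],
        ["./label_stage.sh", samples, labeled],
        ["./train_stage.sh", policy, reference, labeled, new_policy],
    ]


def run_loop(sft_model: str, num_rounds: int, prompts_per_round: int,
             runner: Runner = run_in_subprocess) -> str:
    """Run sample, label, train for each round and return the final policy."""
    policy = sft_model
    for k in range(1, num_rounds + 1):
        for argv in stage_commands(k, policy, sft_model, prompts_per_round):
            runner(argv)  # stages run strictly one after another
        policy = f"round{k}_policy"  # next round samples from the new model
    return policy


# Dry run: record the commands instead of executing them.
calls: list = []
final = run_loop("sft_model", 2, 512, runner=calls.append)
print(final)       # round2_policy
print(len(calls))  # 6
```

Passing a recording function as `runner` gives a dry run of the schedule without touching any GPU, which is a cheap way to check the round wiring before launching real jobs.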

Usage

Use the iterative alignment loop when on-policy training is expected to outperform offline alignment. This is particularly beneficial when the SFT model's output distribution differs significantly from the static preference dataset, or when the model needs to iteratively improve on its own weaknesses.

Theoretical Basis

The iterative loop implements an approximate policy iteration scheme:

$\pi_{0} \leftarrow \text{SFT model}$
For round $k = 1, \dots, K$:
1. Sample: $D_{k} = \{(x, y) : y \sim \pi_{k-1}(\cdot \mid x)\}$
2. Label: $D_{k}^{*} = \{(x, y, r(y)) : (x, y) \in D_{k}\}$ using the reward model $r$
3. Train: $\pi_{k} \leftarrow \mathrm{align}(\pi_{k-1}, D_{k}^{*}, \pi_{0})$

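The reference model $\pi_{0}$ remains fixed throughout all rounds (always the SFT checkpoint). Only the policy changes from round to round, so every round is anchored to the same starting distribution, which guards against catastrophic drift.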
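The scheme can be traced with a toy driver. This is a sketch under stated assumptions, not the HALOs implementation: `iterate_alignment` and the `toy_*` stand-ins are hypothetical, the "policy" is just an integer version counter, and the reward function is given the prompt as well as the completion. The point is only to show that the policy advances each round while the reference passed to `align` stays at $\pi_{0}$.

```python
def iterate_alignment(sft_policy, prompt_batches, sample, reward, align):
    """Approximate policy iteration with a fixed reference model."""
    reference = sft_policy  # pi_0: fixed for every round
    policy = sft_policy     # pi_{k-1}, starting at pi_0
    for prompts in prompt_batches:
        # 1. Sample: y ~ pi_{k-1}(. | x) for each prompt x
        d_k = [(x, sample(policy, x)) for x in prompts]
        # 2. Label: attach a reward to each sampled completion
        d_k_star = [(x, y, reward(x, y)) for x, y in d_k]
        # 3. Train: update the policy against the fixed reference
        policy = align(policy, d_k_star, reference)
    return policy


# Toy stand-ins so the control flow can be executed end to end.
references_seen = []


def toy_sample(policy, x):
    return f"{x}-answer-v{policy}"


def toy_reward(x, y):
    return float(len(y))


def toy_align(policy, data, reference):
    references_seen.append(reference)  # record which reference was used
    return policy + 1                  # pretend training yields a new version


final_policy = iterate_alignment(
    0, [["a", "b"], ["c", "d"], ["e", "f"]], toy_sample, toy_reward, toy_align
)
print(final_policy, references_seen)  # 3 [0, 0, 0]
```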

Related Pages

Implemented By

Implementation:ContextualAI_HALOs_Online_Loop_Script

Uses Heuristic

Heuristic:ContextualAI_HALOs_Online_Round_Budgeting
