
Principle:ContextualAI HALOs Iterative Alignment Loop

From Leeroopedia


Knowledge Sources
Domains: NLP, Reinforcement_Learning, Infrastructure
Last Updated: 2026-02-08 03:00 GMT

Overview

An orchestration pattern that repeatedly cycles through sampling, labeling, and training to progressively improve a language model through on-policy alignment.

Description

The iterative alignment loop is a multi-round orchestration pattern that chains together the sampling, labeling, and training steps. Each round:

  1. Sample — Generate completions from the current policy
  2. Label — Score completions with a reward model
  3. Train — Run one pass of alignment training on the scored feedback
  4. Repeat — Use the newly trained model as the policy for the next round

The loop divides a total prompt budget across rounds (e.g., 2048 total prompts / 4 rounds = 512 per round). After each round, old intermediate checkpoints are cleaned up to save disk space. The loop continues until all rounds are complete, producing a final aligned model.
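The budget split described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's script: the function name `split_prompt_budget` is hypothetical, and spreading any remainder over the earliest rounds is an assumption, since the source only gives the evenly divisible case.

```python
def split_prompt_budget(total_prompts: int, num_rounds: int) -> list[range]:
    """Return one contiguous range of prompt indices per round.

    Assumption: if the budget does not divide evenly, the leftover prompts
    go to the earliest rounds, one extra prompt each.
    """
    if num_rounds <= 0:
        raise ValueError("num_rounds must be positive")
    if total_prompts < 0:
        raise ValueError("total_prompts must be non-negative")

    base, extra = divmod(total_prompts, num_rounds)
    ranges = []
    start = 0
    for k in range(num_rounds):
        size = base + (1 if k < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    # The example from the text: 2048 prompts over 4 rounds.
    for k, r in enumerate(split_prompt_budget(2048, 4), start=1):
        print(f"round {k}: prompts [{r.start}, {r.stop}) -> {len(r)} prompts")
```

Returning index ranges rather than only a per-round count makes it explicit that each round draws a disjoint slice of the prompt set.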

This pattern is implemented as a shell script rather than a Python API because it orchestrates three separate processes (vLLM sampling, Accelerate labeling, Accelerate training) that have incompatible GPU memory requirements.
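The process-per-stage idea can be illustrated with a short Python sketch. It is only an analogy for what the shell script does: the helper names `run_stage` and `run_round` are hypothetical, and the commands are placeholders that print a message, not the actual sampling, labeling, or training invocations. The point it shows is that each stage runs in its own child process and must exit before the next one starts, so a stage's GPU memory is released with its process.

```python
import subprocess
import sys

STAGES = ("sample", "label", "train")


def run_stage(name: str, command: list[str]) -> None:
    """Run one stage as a child process and block until it exits.

    check=True raises subprocess.CalledProcessError on a non-zero exit code,
    which stops the loop instead of training on missing or partial data.
    """
    print(f"[{name}] starting", flush=True)
    subprocess.run(command, check=True)


def run_round(round_idx: int, stage_commands: dict[str, list[str]]) -> list[str]:
    """Run sample, label, and train strictly one after another."""
    completed = []
    for name in STAGES:
        run_stage(f"round {round_idx} / {name}", stage_commands[name])
        completed.append(name)
    return completed


# Placeholder commands (assumption): each one only prints its stage name.
# In a real setup these would be the launcher invocations for each stage.
PLACEHOLDER_COMMANDS = {
    name: [sys.executable, "-c", f"print('{name} stage ran')"] for name in STAGES
}

if __name__ == "__main__":
    for k in range(1, 3):
        run_round(k, PLACEHOLDER_COMMANDS)
```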

Usage

Use the iterative alignment loop when on-policy training is expected to outperform offline alignment. This is particularly beneficial when the SFT model's output distribution differs significantly from the static preference dataset, or when the model needs to iteratively improve on its own weaknesses.

Theoretical Basis

The iterative loop implements an approximate policy iteration scheme:

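  1. $\pi_0 \leftarrow$ SFT model
  2. For round $k = 1, \dots, K$:
    1. Sample: $D_k = \{(x, y) : y \sim \pi_{k-1}(\cdot \mid x)\}$
    2. Label: $D_k^* = \{(x, y, r(y))\}$ using the reward model
    3. Train: $\pi_k \leftarrow \mathrm{align}(\pi_{k-1}, D_k^*, \pi_0)$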

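The reference model π0 remains fixed throughout all rounds (it is always the SFT checkpoint), so every round is regularized toward the same starting point. This guards against the policy drifting catastrophically far from the SFT model as rounds accumulate.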
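The scheme above can be written as a small Python skeleton. This is a sketch under stated assumptions, not the project's implementation: `iterative_alignment` and the three toy stage functions are hypothetical stand-ins, and policies are represented by plain strings (think checkpoint names). What it illustrates is the data flow: the policy is replaced each round, while the reference passed to the training step is always the initial SFT model.

```python
from typing import Any, Callable, Iterable, Sequence


def iterative_alignment(
    pi0: Any,
    prompt_batches: Iterable[Sequence[str]],
    sample: Callable[[Any, Sequence[str]], list],
    label: Callable[[list], list],
    align: Callable[[Any, list, Any], Any],
) -> Any:
    """Run one sample/label/train cycle per prompt batch and return the final policy."""
    policy = pi0  # pi_0 <- SFT model
    for prompts in prompt_batches:
        d_k = sample(policy, prompts)         # D_k: completions from pi_{k-1}
        d_k_star = label(d_k)                 # D_k*: completions plus rewards
        policy = align(policy, d_k_star, pi0)  # reference stays fixed at pi_0
    return policy


# Toy stand-ins (assumptions) so the skeleton can be run end to end.
references_seen = []


def toy_sample(policy, prompts):
    return [(x, f"{policy}:{x}") for x in prompts]


def toy_label(pairs):
    # Hypothetical reward: the length of the completion.
    return [(x, y, float(len(y))) for x, y in pairs]


def toy_align(policy, labeled, reference):
    references_seen.append(reference)
    return f"{policy}>aligned"


if __name__ == "__main__":
    final = iterative_alignment(
        "sft", [["p1", "p2"], ["p3", "p4"]], toy_sample, toy_label, toy_align
    )
    print(final)
    print(references_seen)
```

Passing the stages in as callables keeps the loop independent of any particular sampler, reward model, or alignment loss.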
