
Heuristic:ContextualAI HALOs Online Round Budgeting

From Leeroopedia




Knowledge Sources
Domains LLM_Alignment, Online_Training, Configuration
Last Updated 2026-02-08 03:00 GMT

Overview

Online iterative alignment divides training into fixed-size rounds (e.g., 512 prompts per round, 4 rounds total) with checkpoint management between rounds to balance fresh data generation against compute cost.

Description

The online training loop in HALOs orchestrates a multi-round pipeline: sample from the current policy, label samples with a reward model, and train on the new feedback data. Each round processes a fixed number of prompts (`PROMPTS_PER_ROUND`), with a cumulative offset (`num_skip`) ensuring prompts are not reused across rounds. The total budget is split: `NUM_ROUNDS = TOTAL_PROMPTS / PROMPTS_PER_ROUND`. Between rounds, the previous checkpoint is cleaned up to save disk space, while the SFT checkpoint is preserved as the reference model. The first round loads from the SFT checkpoint; subsequent rounds resume from the previous round's checkpoint with the SFT model as the reference.
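The budgeting and offset arithmetic described above can be sketched in a few lines of shell (variable names mirror the script's; the values are the illustrative defaults from this page):

```shell
# Illustrative round budgeting; each round consumes a fresh, disjoint prompt slice
TOTAL_PROMPTS=2048
PROMPTS_PER_ROUND=512
NUM_ROUNDS=$((TOTAL_PROMPTS / PROMPTS_PER_ROUND))   # 4 rounds

CUMULATIVE_PROMPTS=0
ROUND=1
while [ $ROUND -le $NUM_ROUNDS ]; do
    # num_skip for this round is the cumulative count of prompts already used
    echo "round $ROUND: skip $CUMULATIVE_PROMPTS, use prompts $CUMULATIVE_PROMPTS..$((CUMULATIVE_PROMPTS + PROMPTS_PER_ROUND - 1))"
    CUMULATIVE_PROMPTS=$((CUMULATIVE_PROMPTS + PROMPTS_PER_ROUND))
    ROUND=$((ROUND + 1))
done
```

With these defaults, round 1 skips 0 prompts, round 2 skips 512, and so on, so no prompt is labeled twice across rounds.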

Usage

Apply this heuristic when configuring online iterative alignment experiments. Choose:

  • `TOTAL_PROMPTS`: Total prompt budget for the entire training run (e.g., 2048).
  • `PROMPTS_PER_ROUND`: Prompts per sampling+training cycle (e.g., 512).
  • `num_samples_per_prompt`: Number of completions per prompt for labeling (e.g., 4).
  • The reward model path for labeling.
  • The feedback type (pairwise or binary) and mode (random, max, min).
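One such set of choices might look like the following sketch; the reward-model path is a placeholder, and the feedback settings show the options named above rather than defaults from the script:

```shell
# Hypothetical experiment configuration; the path is a placeholder
TOTAL_PROMPTS=2048                     # total prompt budget for the run
PROMPTS_PER_ROUND=512                  # prompts per sample+label+train cycle
NUM_SAMPLES_PER_PROMPT=4               # completions per prompt for labeling
REWARD_MODEL="/path/to/reward_model"   # placeholder reward model path
FEEDBACK_TYPE="pairwise"               # or "binary"
FEEDBACK_MODE="random"                 # or "max" / "min"
```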

The Insight (Rule of Thumb)

  • Action: Set `TOTAL_PROMPTS` and `PROMPTS_PER_ROUND` based on compute budget. Use `num_samples_per_prompt=4` for pairwise feedback (creates 2 pairs per prompt).
  • Value: Typical config: 2048 total prompts, 512 per round = 4 rounds. With 4 samples per prompt, each round generates 2048 candidate responses for labeling.
  • Trade-off: More rounds (smaller `PROMPTS_PER_ROUND`) means fresher on-policy data but more overhead from sampling and labeling. Fewer rounds (larger `PROMPTS_PER_ROUND`) is more efficient but data becomes more off-policy within each round.
  • Checkpoint strategy: Round 1 uses `model.load_from=SFT_CKPT`. Subsequent rounds use `model.from_checkpoint=PREV_CKPT` (resume optimizer/scheduler state) and `model.load_from=SFT_CKPT` (reference model stays at SFT). Old checkpoints are deleted to save disk space.
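The arithmetic in these bullets, checked numerically (pairs-per-prompt follows this page's stated 2 pairs from 4 samples):

```shell
TOTAL_PROMPTS=2048
PROMPTS_PER_ROUND=512
SAMPLES_PER_PROMPT=4
PAIRS_PER_PROMPT=2   # 4 samples yield 2 preference pairs per prompt

NUM_ROUNDS=$((TOTAL_PROMPTS / PROMPTS_PER_ROUND))                  # 4
CANDIDATES_PER_ROUND=$((PROMPTS_PER_ROUND * SAMPLES_PER_PROMPT))   # 2048
PAIRS_PER_ROUND=$((PROMPTS_PER_ROUND * PAIRS_PER_PROMPT))          # 1024
echo "$NUM_ROUNDS rounds; $CANDIDATES_PER_ROUND candidates/round; $PAIRS_PER_ROUND pairs/round"
```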

Reasoning

On-policy alignment requires the training data to come from the current policy's distribution. As the policy improves, data generated by older versions becomes stale. Splitting training into rounds with fresh sampling ensures the policy trains on its own distribution. The reference model is kept fixed at the SFT checkpoint across all rounds to provide a stable baseline for computing log-ratio rewards. The `num_skip` mechanism prevents prompt reuse by skipping already-processed prompts in subsequent rounds, ensuring diversity.

Code Evidence

Round budgeting configuration in `scripts/launch_llama_dpo_online.sh:5-7`:

TOTAL_PROMPTS=2048
PROMPTS_PER_ROUND=512
NUM_ROUNDS=$(($TOTAL_PROMPTS / $PROMPTS_PER_ROUND))

Cumulative prompt tracking in `scripts/launch_llama_dpo_online.sh:52-67`:

CURRENT_CKPT=$SFT_CKPT
ROUND=1
CUMULATIVE_PROMPTS=0

while [ $ROUND -le ${NUM_ROUNDS} ]; do
    python -m train.sample $CURRENT_CKPT \
        --num_samples_per_prompt 4 \
        --num_prompts ${PROMPTS_PER_ROUND} \
        --num_skip $CUMULATIVE_PROMPTS

Checkpoint evolution strategy in `scripts/launch_llama_dpo_online.sh:88-92`:

if [ $ROUND -eq 1 ]; then
    MODEL_LOAD_ARG="++model.load_from=$CURRENT_CKPT"
else
    MODEL_LOAD_ARG="++model.from_checkpoint=$CURRENT_CKPT ++model.load_from=$SFT_CKPT"
fi

Old checkpoint cleanup in `scripts/launch_llama_dpo_online.sh:112-116`:

if [ $CURRENT_CKPT != $SFT_CKPT ] && [ $SLURM_PROCID -eq 0 ]; then
    OLD_EXP_DIR=$(dirname $CURRENT_CKPT)
    echo "Cleaning up $OLD_EXP_DIR"
    rm -rf $OLD_EXP_DIR
fi
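Combining the excerpts above, the loop's control flow can be condensed into the following sketch. The sampling, labeling, and training commands are elided to comments, and `$SFT_CKPT` is a placeholder path:

```shell
# Condensed control-flow sketch of the online loop (commands elided)
SFT_CKPT="/path/to/sft_checkpoint"   # placeholder
TOTAL_PROMPTS=2048
PROMPTS_PER_ROUND=512
NUM_ROUNDS=$((TOTAL_PROMPTS / PROMPTS_PER_ROUND))

CURRENT_CKPT=$SFT_CKPT
CUMULATIVE_PROMPTS=0
ROUND=1
while [ $ROUND -le $NUM_ROUNDS ]; do
    # 1) sample PROMPTS_PER_ROUND prompts, skipping CUMULATIVE_PROMPTS
    # 2) label the samples with the reward model
    # 3) train: round 1 loads the SFT checkpoint; later rounds resume from
    #    the previous checkpoint while keeping SFT as the fixed reference
    if [ "$ROUND" -eq 1 ]; then
        MODEL_LOAD_ARG="++model.load_from=$CURRENT_CKPT"
    else
        MODEL_LOAD_ARG="++model.from_checkpoint=$CURRENT_CKPT ++model.load_from=$SFT_CKPT"
    fi
    # 4) after training, delete the previous round's checkpoint directory
    #    (never the SFT checkpoint) and advance the prompt offset
    CUMULATIVE_PROMPTS=$((CUMULATIVE_PROMPTS + PROMPTS_PER_ROUND))
    ROUND=$((ROUND + 1))
done
```

This makes the division of labor explicit: `CUMULATIVE_PROMPTS` drives data freshness and diversity, while `MODEL_LOAD_ARG` drives checkpoint evolution.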
