Implementation: ContextualAI HALOs Online Loop Script
| Knowledge Sources | |
|---|---|
| Domains | NLP, Reinforcement_Learning, Infrastructure |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
A concrete tool, provided as shell scripts, for orchestrating the sample-label-train iteration cycle.
Description
The online loop scripts (e.g., launch_llama_dpo_online.sh, launch_llama_instruct_kto_online.sh, launch_llama_instruct_grpo_online.sh) implement the iterative alignment pattern as bash while-loops that orchestrate three separate commands per round:
- Sample: `python -m train.sample` with the current round's model checkpoint
- Label: `accelerate launch -m train.label` to score samples with the reward model
- Train: `accelerate launch launch.py` with `online=true` and the labeled feedback
Each round uses a subset of prompts (PROMPTS_PER_ROUND) and skips previously used prompts via --num_skip. After training completes, old checkpoints are cleaned up.
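The prompt-budget arithmetic described above can be sketched in isolation. This is a minimal illustration, not part of the script itself; the constant values match the script's configuration, and the `echo` line is purely illustrative:

```shell
#!/bin/bash
# Sketch of the prompt-budget arithmetic used by the online loop.
# TOTAL_PROMPTS and PROMPTS_PER_ROUND values are taken from the script.
TOTAL_PROMPTS=2048
PROMPTS_PER_ROUND=512
NUM_ROUNDS=$((TOTAL_PROMPTS / PROMPTS_PER_ROUND))

for (( ROUND=1; ROUND<=NUM_ROUNDS; ROUND++ )); do
    # Each round skips every prompt consumed by earlier rounds,
    # so each prompt in the budget is used exactly once.
    NUM_SKIP=$(( (ROUND - 1) * PROMPTS_PER_ROUND ))
    echo "round=$ROUND skip=$NUM_SKIP"
done
```

With the script's defaults this yields 4 rounds, skipping 0, 512, 1024, and 1536 prompts respectively.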
Usage
Run the appropriate script for your alignment method: bash scripts/launch_llama_dpo_online.sh 0.1 5e-6 where arguments are BETA and LR.
Code Reference
Source Location
- Repository: ContextualAI/HALOs
- File: scripts/launch_llama_dpo_online.sh (DPO), scripts/launch_llama_instruct_kto_online.sh (KTO), scripts/launch_llama_instruct_grpo_online.sh (GRPO)
- Lines: scripts/launch_llama_dpo_online.sh:L54-124 (loop body)
Signature
```bash
#!/bin/bash
# Arguments: BETA LR
BETA=$1
LR=$2

# Configuration
TOTAL_PROMPTS=2048
PROMPTS_PER_ROUND=512
NUM_ROUNDS=$((TOTAL_PROMPTS / PROMPTS_PER_ROUND))

# CURRENT_CKPT is initialized earlier in the script (not shown in this excerpt)
ROUND=1
while [ $ROUND -le $NUM_ROUNDS ]; do
    NUM_SKIP=$(( (ROUND - 1) * PROMPTS_PER_ROUND ))

    # Step 1: Sample from current policy
    python -m train.sample $CURRENT_CKPT \
        --datasets ultrafeedback_armorm \
        --num_samples_per_prompt 4 \
        --num_prompts $PROMPTS_PER_ROUND \
        --num_skip $NUM_SKIP \
        --output_file round${ROUND}_samples.json

    # Step 2: Label with reward model
    accelerate launch -m train.label \
        --reward_model_path $REWARD_CKPT \
        --feedback_type pairwise \
        round${ROUND}_samples.json round${ROUND}_feedback.json

    # Step 3: Train on feedback
    accelerate launch launch.py \
        loss=dpo model=llama \
        train_datasets=[round${ROUND}_feedback.json] \
        ++online=true \
        ++model.load_from=$SFT_CKPT \
        ++model.from_checkpoint=$CURRENT_CKPT

    # Cleanup and advance
    # NEW_CKPT refers to the checkpoint written by the training step above
    CURRENT_CKPT=$NEW_CKPT
    ROUND=$((ROUND + 1))
done
```
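The checkpoint hand-off at the end of each round can be illustrated on its own. The paths below are hypothetical stand-ins; the real script derives `NEW_CKPT` from the training run's output directory:

```shell
#!/bin/bash
# Sketch of the checkpoint chain across rounds (all paths hypothetical).
CURRENT_CKPT="sft_ckpt"            # round 1 samples from the starting model
for ROUND in 1 2 3 4; do
    NEW_CKPT="ckpt_round${ROUND}"  # stand-in for the training step's output dir
    # ... the sample / label / train commands would run here ...
    CURRENT_CKPT=$NEW_CKPT         # next round samples from the freshly trained model
done
```

The point of the hand-off is that each round's sampling distribution tracks the latest policy, which is what makes the loop "online" rather than a single offline pass.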
Import
```bash
bash scripts/launch_llama_dpo_online.sh 0.1 5e-6
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| BETA | float | Yes | KL penalty weight (positional arg 1) |
| LR | float | Yes | Learning rate (positional arg 2) |
| SFT_CKPT | str | Yes | Path to SFT model checkpoint (hardcoded in script) |
| REWARD_CKPT | str | Yes | Path to reward model checkpoint (hardcoded in script) |
| TOTAL_PROMPTS | int | Yes | Total prompt budget across all rounds (hardcoded in script) |
| PROMPTS_PER_ROUND | int | Yes | Prompts sampled per round (hardcoded in script) |
Outputs
| Name | Type | Description |
|---|---|---|
| Final model | Directory | Aligned model after all rounds |
| Per-round samples | JSON | round{N}_samples.json files (cleaned up after each round) |
| Per-round feedback | JSON | round{N}_feedback.json files (cleaned up after each round) |
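The per-round artifact names in the table follow the `round${ROUND}_*.json` pattern from the script, so they can be derived mechanically. A minimal sketch (output directory omitted; in the real script these files are deleted after each round):

```shell
#!/bin/bash
# Sketch: derive the per-round sample/feedback filenames the loop reads and writes.
# The round${N}_*.json naming comes from the script; NUM_ROUNDS matches its default.
NUM_ROUNDS=4
for (( ROUND=1; ROUND<=NUM_ROUNDS; ROUND++ )); do
    SAMPLES="round${ROUND}_samples.json"    # written by train.sample
    FEEDBACK="round${ROUND}_feedback.json"  # written by train.label
    echo "$SAMPLES -> $FEEDBACK"
done
```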
Usage Examples
DPO Online Loop
```bash
# Run 4-round DPO online alignment with beta=0.1, lr=5e-6
bash scripts/launch_llama_dpo_online.sh 0.1 5e-6
```
KTO Online Loop
```bash
# Run KTO online alignment with humanline clamping
bash scripts/launch_llama_instruct_kto_online.sh 0.1 5e-6
```