
Implementation:ContextualAI HALOs Online Loop Script

From Leeroopedia


Knowledge Sources
Domains NLP, Reinforcement_Learning, Infrastructure
Last Updated 2026-02-08 03:00 GMT

Overview

A concrete tool, provided as shell scripts, for orchestrating the sample-label-train iteration cycle.

Description

The online loop scripts (e.g., launch_llama_dpo_online.sh, launch_llama_instruct_kto_online.sh, launch_llama_instruct_grpo_online.sh) implement the iterative alignment pattern as bash while-loops that orchestrate three separate commands per round:

  1. Sample: python -m train.sample with the current round's model checkpoint
  2. Label: accelerate launch -m train.label to score samples with the reward model
  3. Train: accelerate launch launch.py with online=true and the labeled feedback

Each round uses a subset of prompts (PROMPTS_PER_ROUND) and skips previously used prompts via --num_skip. After training completes, old checkpoints are cleaned up.
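The round bookkeeping described above can be sketched in a few lines of shell. This is a minimal illustration of the prompt-budget arithmetic (variable names mirror the script; the echo is for illustration only):

```shell
# How the prompt budget is partitioned across rounds via --num_skip
TOTAL_PROMPTS=2048
PROMPTS_PER_ROUND=512
NUM_ROUNDS=$((TOTAL_PROMPTS / PROMPTS_PER_ROUND))   # 2048 / 512 = 4 rounds

for ROUND in $(seq 1 $NUM_ROUNDS); do
    # Skip every prompt consumed by earlier rounds
    NUM_SKIP=$(( (ROUND - 1) * PROMPTS_PER_ROUND ))
    echo "round $ROUND: prompts $NUM_SKIP..$((NUM_SKIP + PROMPTS_PER_ROUND - 1))"
done
```

Each round therefore sees a disjoint slice of the prompt pool, so no prompt is sampled twice across the loop.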

Usage

Run the script matching your alignment method, passing BETA and LR as positional arguments: bash scripts/launch_llama_dpo_online.sh 0.1 5e-6
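Because the scripts read BETA and LR as bare positional parameters, running them with missing arguments fails silently. A hedged sketch of a validation preamble one could add (this check is not in the upstream scripts; the `set --` line only simulates the command-line arguments):

```shell
set -- 0.1 5e-6   # simulate: bash launch_llama_dpo_online.sh 0.1 5e-6

# ${n:?msg} aborts with msg if the positional argument is unset or empty
BETA=${1:?usage: launch_llama_dpo_online.sh BETA LR}
LR=${2:?usage: launch_llama_dpo_online.sh BETA LR}
echo "beta=$BETA lr=$LR"
```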

Code Reference

Source Location

  • Repository: ContextualAI/HALOs
  • File: scripts/launch_llama_dpo_online.sh (DPO), scripts/launch_llama_instruct_kto_online.sh (KTO), scripts/launch_llama_instruct_grpo_online.sh (GRPO)
  • Lines: scripts/launch_llama_dpo_online.sh:L54-124 (loop body)

Signature

#!/bin/bash
# Arguments: BETA LR
BETA=$1
LR=$2

# Checkpoint paths (hardcoded in the actual script; placeholders shown here)
SFT_CKPT=/path/to/sft_checkpoint
REWARD_CKPT=/path/to/reward_checkpoint
CURRENT_CKPT=$SFT_CKPT   # round 1 samples from the SFT model

# Configuration
TOTAL_PROMPTS=2048
PROMPTS_PER_ROUND=512
NUM_ROUNDS=$((TOTAL_PROMPTS / PROMPTS_PER_ROUND))
ROUND=1

while [ $ROUND -le $NUM_ROUNDS ]; do
    NUM_SKIP=$(( (ROUND - 1) * PROMPTS_PER_ROUND ))

    # Step 1: Sample from current policy
    python -m train.sample $CURRENT_CKPT \
        --datasets ultrafeedback_armorm \
        --num_samples_per_prompt 4 \
        --num_prompts $PROMPTS_PER_ROUND \
        --num_skip $NUM_SKIP \
        --output_file round${ROUND}_samples.json

    # Step 2: Label with reward model
    accelerate launch -m train.label \
        --reward_model_path $REWARD_CKPT \
        --feedback_type pairwise \
        round${ROUND}_samples.json round${ROUND}_feedback.json

    # Step 3: Train on feedback
    accelerate launch launch.py \
        loss=dpo model=llama \
        train_datasets=[round${ROUND}_feedback.json] \
        ++online=true \
        ++model.load_from=$SFT_CKPT \
        ++model.from_checkpoint=$CURRENT_CKPT

    # Cleanup and advance: remove this round's sample/feedback files,
    # then adopt the checkpoint written by this round's training run
    rm -f round${ROUND}_samples.json round${ROUND}_feedback.json
    CURRENT_CKPT=$NEW_CKPT   # NEW_CKPT: output path of the training step above
    ROUND=$((ROUND + 1))
done

Import

bash scripts/launch_llama_dpo_online.sh 0.1 5e-6

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| BETA | float | Yes | KL penalty weight (positional arg 1) |
| LR | float | Yes | Learning rate (positional arg 2) |
| SFT_CKPT | str | Yes | Path to SFT model checkpoint (hardcoded in script) |
| REWARD_CKPT | str | Yes | Path to reward model checkpoint (hardcoded in script) |
| TOTAL_PROMPTS | int | Yes | Total prompt budget across all rounds |
| PROMPTS_PER_ROUND | int | Yes | Prompts sampled per round |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| Final model | Directory | Aligned model after all rounds |
| Per-round samples | JSON | round{N}_samples.json files (cleaned up after each round) |
| Per-round feedback | JSON | round{N}_feedback.json files (cleaned up after each round) |

Usage Examples

DPO Online Loop

# Run 4-round DPO online alignment with beta=0.1, lr=5e-6
bash scripts/launch_llama_dpo_online.sh 0.1 5e-6

KTO Online Loop

# Run KTO online alignment with humanline clamping
bash scripts/launch_llama_instruct_kto_online.sh 0.1 5e-6

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
