Implementation: ContextualAI HALOs Online Loop Script
| Knowledge Sources | |
|---|---|
| Domains | NLP, Reinforcement_Learning, Infrastructure |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
A concrete tool, provided as shell scripts, for orchestrating the sample-label-train iteration cycle.
Description
The online loop scripts (e.g., launch_llama_dpo_online.sh, launch_llama_instruct_kto_online.sh, launch_llama_instruct_grpo_online.sh) implement the iterative alignment pattern as bash while-loops that orchestrate three separate commands per round:
- Sample: `python -m train.sample` with the current round's model checkpoint
- Label: `accelerate launch -m train.label` to score samples with the reward model
- Train: `accelerate launch launch.py` with `online=true` and the labeled feedback
Each round uses a subset of prompts (PROMPTS_PER_ROUND) and skips previously used prompts via --num_skip. After training completes, old checkpoints are cleaned up.
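The prompt-budget arithmetic described above can be sketched in isolation. This is a minimal illustration, not part of the script itself; the constant values match the script's configuration, and the `echo` line is purely illustrative:

```shell
#!/bin/bash
# Sketch of the prompt-budget arithmetic used by the online loop.
# TOTAL_PROMPTS and PROMPTS_PER_ROUND values are taken from the script.
TOTAL_PROMPTS=2048
PROMPTS_PER_ROUND=512
NUM_ROUNDS=$((TOTAL_PROMPTS / PROMPTS_PER_ROUND))

for (( ROUND=1; ROUND<=NUM_ROUNDS; ROUND++ )); do
    # Each round skips every prompt consumed by earlier rounds,
    # so each prompt in the budget is used exactly once.
    NUM_SKIP=$(( (ROUND - 1) * PROMPTS_PER_ROUND ))
    echo "round=$ROUND skip=$NUM_SKIP"
done
```

With the script's defaults this yields 4 rounds, skipping 0, 512, 1024, and 1536 prompts respectively.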
Usage
Run the appropriate script for your alignment method: bash scripts/launch_llama_dpo_online.sh 0.1 5e-6 where arguments are BETA and LR.
Code Reference
Source Location
- Repository: ContextualAI/HALOs
- File: scripts/launch_llama_dpo_online.sh (DPO), scripts/launch_llama_instruct_kto_online.sh (KTO), scripts/launch_llama_instruct_grpo_online.sh (GRPO)
- Lines: scripts/launch_llama_dpo_online.sh:L54-124 (loop body)
Signature
```bash
#!/bin/bash
# Arguments: BETA LR
BETA=$1
LR=$2

# Configuration
TOTAL_PROMPTS=2048
PROMPTS_PER_ROUND=512
NUM_ROUNDS=$((TOTAL_PROMPTS / PROMPTS_PER_ROUND))

# CURRENT_CKPT is initialized earlier in the script (not shown in this excerpt)
ROUND=1
while [ $ROUND -le $NUM_ROUNDS ]; do
    NUM_SKIP=$(( (ROUND - 1) * PROMPTS_PER_ROUND ))

    # Step 1: Sample from current policy
    python -m train.sample $CURRENT_CKPT \
        --datasets ultrafeedback_armorm \
        --num_samples_per_prompt 4 \
        --num_prompts $PROMPTS_PER_ROUND \
        --num_skip $NUM_SKIP \
        --output_file round${ROUND}_samples.json

    # Step 2: Label with reward model
    accelerate launch -m train.label \
        --reward_model_path $REWARD_CKPT \
        --feedback_type pairwise \
        round${ROUND}_samples.json round${ROUND}_feedback.json

    # Step 3: Train on feedback
    accelerate launch launch.py \
        loss=dpo model=llama \
        train_datasets=[round${ROUND}_feedback.json] \
        ++online=true \
        ++model.load_from=$SFT_CKPT \
        ++model.from_checkpoint=$CURRENT_CKPT

    # Cleanup and advance
    # NEW_CKPT refers to the checkpoint written by the training step above
    CURRENT_CKPT=$NEW_CKPT
    ROUND=$((ROUND + 1))
done
```
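The checkpoint hand-off at the end of each round can be illustrated on its own. The paths below are hypothetical stand-ins; the real script derives `NEW_CKPT` from the training run's output directory:

```shell
#!/bin/bash
# Sketch of the checkpoint chain across rounds (all paths hypothetical).
CURRENT_CKPT="sft_ckpt"            # round 1 samples from the starting model
for ROUND in 1 2 3 4; do
    NEW_CKPT="ckpt_round${ROUND}"  # stand-in for the training step's output dir
    # ... the sample / label / train commands would run here ...
    CURRENT_CKPT=$NEW_CKPT         # next round samples from the freshly trained model
done
```

The point of the hand-off is that each round's sampling distribution tracks the latest policy, which is what makes the loop "online" rather than a single offline pass.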
Import
```bash
bash scripts/launch_llama_dpo_online.sh 0.1 5e-6
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| BETA | float | Yes | KL penalty weight (positional arg 1) |
| LR | float | Yes | Learning rate (positional arg 2) |
| SFT_CKPT | str | Yes | Path to SFT model checkpoint (hardcoded in script) |
| REWARD_CKPT | str | Yes | Path to reward model checkpoint (hardcoded in script) |
| TOTAL_PROMPTS | int | Yes | Total prompt budget across all rounds (hardcoded in script) |
| PROMPTS_PER_ROUND | int | Yes | Prompts sampled per round (hardcoded in script) |
Outputs
| Name | Type | Description |
|---|---|---|
| Final model | Directory | Aligned model after all rounds |
| Per-round samples | JSON | round{N}_samples.json files (cleaned up after each round) |
| Per-round feedback | JSON | round{N}_feedback.json files (cleaned up after each round) |
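The per-round artifact names in the table follow the `round${ROUND}_*.json` pattern from the script, so they can be derived mechanically. A minimal sketch (output directory omitted; in the real script these files are deleted after each round):

```shell
#!/bin/bash
# Sketch: derive the per-round sample/feedback filenames the loop reads and writes.
# The round${N}_*.json naming comes from the script; NUM_ROUNDS matches its default.
NUM_ROUNDS=4
for (( ROUND=1; ROUND<=NUM_ROUNDS; ROUND++ )); do
    SAMPLES="round${ROUND}_samples.json"    # written by train.sample
    FEEDBACK="round${ROUND}_feedback.json"  # written by train.label
    echo "$SAMPLES -> $FEEDBACK"
done
```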
Usage Examples
DPO Online Loop
```bash
# Run 4-round DPO online alignment with beta=0.1, lr=5e-6
bash scripts/launch_llama_dpo_online.sh 0.1 5e-6
```
KTO Online Loop
```bash
# Run KTO online alignment with humanline clamping
bash scripts/launch_llama_instruct_kto_online.sh 0.1 5e-6
```