Implementation:ContextualAI HALOs Online Training Main

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, NLP, Reinforcement_Learning
Last Updated 2026-02-08 03:00 GMT

Overview

A concrete tool for training on freshly labeled feedback data, provided by the main function in launch.py when run in online mode.

Description

The online training mode reuses the same main(config) entry point in launch.py but with config.online=true. The key differences from offline training:

  • Data loading: Uses get_feedback() or get_sampled_data() to load JSON files produced by the labeling step, rather than HuggingFace datasets
  • Checkpoint resume: Loads optimizer and scheduler state from a previous round's checkpoint via config.model.from_checkpoint
  • Reference model: Always fixed to the original SFT checkpoint via config.model.load_from
  • Single pass: Typically trained for one epoch per round to prevent overfitting on the small per-round dataset
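The online/offline branch described above can be sketched as follows. This is a minimal illustration, not the actual HALOs code: the function name and config fields mirror the page, but the real main() dispatches to get_feedback()/get_sampled_data() and returns Dataset objects rather than raw JSON.

```python
import json

def load_train_data(config):
    """Choose the data source based on config['online'] (illustrative sketch;
    the real main() uses get_feedback()/get_sampled_data() and HF datasets)."""
    if config.get("online"):
        # Online mode: read the JSON feedback file written by the labeling step.
        with open(config["train_datasets"][0]) as f:
            return json.load(f)
    # Offline mode: would pull a named HuggingFace dataset instead.
    raise NotImplementedError("offline loading elided in this sketch")
```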

Usage

Invoke via:

accelerate launch launch.py loss={method} model=llama \
    train_datasets=[feedback.json] ++online=true \
    ++model.from_checkpoint=/round_N/FINAL ++model.load_from=/sft/FINAL

Code Reference

Source Location

  • Repository: ContextualAI/HALOs
  • File: launch.py (main), train/data.py (get_feedback, get_sampled_data)
  • Lines: launch.py:L42-331 (main), train/data.py:L165-188 (get_sampled_data), train/data.py:L191-284 (get_feedback)

Signature

def main(config: DictConfig) -> None:
    """Main entry point with online=true mode.

    Key config parameters for online mode:
        config.online: bool = True
        config.model.from_checkpoint: str  # Previous round checkpoint (optimizer/scheduler)
        config.model.load_from: str        # SFT checkpoint (reference model)
        train_datasets: List[str]          # Path(s) to feedback JSON file(s)
    """

def get_sampled_data(split: str, ...) -> Dataset:
    """Load sampled data from JSON (output of train.sample)."""

def get_feedback(split: str, ...) -> Dataset:
    """Load labeled feedback from JSON (output of train.label).
    Handles pairwise_feedback, binary_feedback, and scalar_feedback types.
    """

Import

# Run as CLI:
# accelerate launch launch.py loss=dpo model=llama \
#     train_datasets=[feedback.json] ++online=true
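The `++` prefixes in the command above are hydra-style force-add overrides that write dotted keys into the nested config. A simplified sketch of that mapping (real hydra also handles value typing, lists, and interpolation):

```python
def apply_override(config, override):
    """Apply a hydra-style '++key.path=value' override to a nested dict
    (simplified sketch; real hydra handles types, lists, and interpolation)."""
    key, _, value = override.lstrip("+").partition("=")
    parts = key.split(".")
    node = config
    for part in parts[:-1]:
        # Create intermediate dicts as needed for dotted paths.
        node = node.setdefault(part, {})
    node[parts[-1]] = value
    return config

cfg = {}
apply_override(cfg, "++online=true")
apply_override(cfg, "++model.load_from=/sft/FINAL")
```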

I/O Contract

Inputs

Name                          Type       Required  Description
config.online                 bool       Yes       Must be true for online mode
train_datasets                List[str]  Yes       Path(s) to feedback JSON from labeling step
config.model.from_checkpoint  str        No        Previous round checkpoint for optimizer/scheduler resume
config.model.load_from        str        Yes       SFT checkpoint path (reference model stays fixed)
config.loss                   str        Yes       Alignment method (dpo, kto, grpo, etc.)

Outputs

Name              Type       Description
Model checkpoint  Directory  Updated model saved to {cache_dir}/{exp_name}/FINAL/
Optimizer state   File       Saved for next round's checkpoint resume
Training metrics  Dict       Per-step loss and reward metrics

Usage Examples

Online DPO Round

# Train on pairwise feedback from round 1
accelerate launch \
    --config_file accelerate_config/fsdp_4gpu.yaml \
    launch.py \
    loss=dpo \
    model=llama \
    train_datasets=[round1_feedback.json] \
    exp_name=llama3-8B-dpo-round1 \
    ++online=true \
    ++model.load_from=/models/llama3-8B-sft/FINAL \
    ++model.name_or_path=meta-llama/Meta-Llama-3-8B

Online KTO Round with Checkpoint Resume

# Resume from round 1 checkpoint for round 2
accelerate launch \
    --config_file accelerate_config/fsdp_4gpu.yaml \
    launch.py \
    loss=kto \
    model=llama \
    train_datasets=[round2_feedback.json] \
    exp_name=llama3-8B-kto-round2 \
    ++online=true \
    ++model.load_from=/models/llama3-8B-sft/FINAL \
    ++model.from_checkpoint=/models/llama3-8B-kto-round1/FINAL
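Successive rounds like the two above can be chained programmatically. The helper below is an illustrative sketch (the experiment names and checkpoint paths are assumptions carried over from the examples, not part of HALOs) that builds the argument list for round N, adding ++model.from_checkpoint only when a previous round exists:

```python
def round_command(round_num, prev_checkpoint=None):
    """Build the launch arguments for one online round (illustrative helper,
    not part of HALOs; paths and names are assumptions)."""
    args = [
        "accelerate", "launch", "launch.py",
        "loss=kto", "model=llama",
        f"train_datasets=[round{round_num}_feedback.json]",
        f"exp_name=llama3-8B-kto-round{round_num}",
        "++online=true",
        "++model.load_from=/models/llama3-8B-sft/FINAL",
    ]
    if prev_checkpoint:
        # Resume optimizer/scheduler state from the previous round's checkpoint.
        args.append(f"++model.from_checkpoint={prev_checkpoint}")
    return args
```

Round 1 omits from_checkpoint (there is nothing to resume from); each later round passes the previous round's FINAL directory.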

