Implementation:ContextualAI HALOs Online Training Main
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, NLP, Reinforcement_Learning |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
Concrete tool for training on freshly labeled feedback data, provided by the main function in launch.py when run in online mode.
Description
The online training mode reuses the same main(config) entry point in launch.py, but with config.online=true. The key differences from offline training are:
- Data loading: uses get_feedback() or get_sampled_data() to load the JSON files produced by the labeling step, rather than HuggingFace datasets
- Checkpoint resume: loads optimizer and scheduler state from a previous round's checkpoint via config.model.from_checkpoint
- Reference model: always fixed to the original SFT checkpoint via config.model.load_from
- Single pass: typically trained for one epoch per round to prevent overfitting on the small per-round dataset
Usage
Invoke via `accelerate launch launch.py loss={method} model=llama train_datasets=[feedback.json] ++online=true ++model.from_checkpoint=/round_N/FINAL ++model.load_from=/sft/FINAL`.
Code Reference
Source Location
- Repository: ContextualAI/HALOs
- File: launch.py (main), train/data.py (get_feedback, get_sampled_data)
- Lines: launch.py:L42-331 (main), train/data.py:L165-188 (get_sampled_data), train/data.py:L191-284 (get_feedback)
Signature
def main(config: DictConfig) -> None:
    """Main entry point with online=true mode.

    Key config parameters for online mode:
        config.online: bool = True
        config.model.from_checkpoint: str  # Previous round checkpoint (optimizer/scheduler)
        config.model.load_from: str        # SFT checkpoint (reference model)
        train_datasets: List[str]          # Path to feedback JSON file
    """

def get_sampled_data(split: str, ...) -> Dataset:
    """Load sampled data from JSON (output of train.sample)."""

def get_feedback(split: str, ...) -> Dataset:
    """Load labeled feedback from JSON (output of train.label).

    Handles pairwise_feedback, binary_feedback, and scalar_feedback types.
    """
Import
# Run as CLI:
# accelerate launch launch.py loss=dpo model=llama \
# train_datasets=[feedback.json] ++online=true
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config.online | bool | Yes | Must be true for online mode |
| train_datasets | List[str] | Yes | Path(s) to feedback JSON from labeling step |
| config.model.from_checkpoint | str | No | Previous round checkpoint for optimizer/scheduler resume |
| config.model.load_from | str | Yes | SFT checkpoint path (reference model stays fixed) |
| config.loss | str | Yes | Alignment method (dpo, kto, grpo, etc.) |
Outputs
| Name | Type | Description |
|---|---|---|
| Model checkpoint | Directory | Updated model saved to {cache_dir}/{exp_name}/FINAL/ |
| Optimizer state | File | Saved for next round's checkpoint resume |
| Training metrics | Dict | Per-step loss and reward metrics |
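To make the train_datasets input concrete, here is a hypothetical pairwise-feedback record. The field names below are illustrative assumptions; the authoritative schema is whatever train.label actually writes:

```python
import json

# Hypothetical pairwise-feedback record (field names are assumptions,
# not the verified train.label schema).
feedback = [
    {
        "type": "pairwise_feedback",
        "prompt": "Explain KTO in one sentence.",
        "chosen": "KTO is a human-aware loss trained on desirability signals.",
        "rejected": "KTO stands for knockout.",
    }
]

# Serialized to a file and passed as train_datasets=[round1_feedback.json]
serialized = json.dumps(feedback, indent=2)
```

get_feedback() would branch on the "type" field to handle pairwise_feedback, binary_feedback, and scalar_feedback records.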
Usage Examples
Online DPO Round
# Train on pairwise feedback from round 1
accelerate launch \
--config_file accelerate_config/fsdp_4gpu.yaml \
launch.py \
loss=dpo \
model=llama \
train_datasets=[round1_feedback.json] \
exp_name=llama3-8B-dpo-round1 \
++online=true \
++model.load_from=/models/llama3-8B-sft/FINAL \
++model.name_or_path=meta-llama/Meta-Llama-3-8B
Online KTO Round with Checkpoint Resume
# Resume from round 1 checkpoint for round 2
accelerate launch \
--config_file accelerate_config/fsdp_4gpu.yaml \
launch.py \
loss=kto \
model=llama \
train_datasets=[round2_feedback.json] \
exp_name=llama3-8B-kto-round2 \
++online=true \
++model.load_from=/models/llama3-8B-sft/FINAL \
++model.from_checkpoint=/models/llama3-8B-kto-round1/FINAL
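The two examples above differ only in whether ++model.from_checkpoint is set. A small helper can make that round-to-round pattern explicit. This is a sketch that assembles the command shown in the Usage section (the flag names come from this page; round_command and the /models/... paths are placeholders, not part of HALOs):

```python
def round_command(loss: str, round_n: int, sft_path: str) -> list[str]:
    """Assemble the per-round accelerate launch command (sketch).

    Round 1 trains from the SFT model alone; later rounds additionally
    resume optimizer/scheduler state from the previous round's checkpoint.
    """
    cmd = [
        "accelerate", "launch", "launch.py",
        f"loss={loss}", "model=llama",
        f"train_datasets=[round{round_n}_feedback.json]",
        f"exp_name=llama3-8B-{loss}-round{round_n}",
        "++online=true",
        f"++model.load_from={sft_path}",  # reference model stays fixed
    ]
    if round_n > 1:
        # Resume from the previous round's FINAL checkpoint
        prev = f"/models/llama3-8B-{loss}-round{round_n - 1}/FINAL"
        cmd.append(f"++model.from_checkpoint={prev}")
    return cmd
```

Calling round_command("kto", 2, "/models/llama3-8B-sft/FINAL") reproduces the flags of the round-2 KTO example above.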