Implementation:ContextualAI HALOs Online Training Main
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, NLP, Reinforcement_Learning |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
Concrete tool for training on freshly labeled feedback data, provided by the main function in launch.py when run in online mode.
Description
The online training mode reuses the same main(config) entry point in launch.py, but with config.online=true. The key differences from offline training are:
- Data loading: uses get_feedback() or get_sampled_data() to load the JSON files produced by the labeling step, rather than HuggingFace datasets
- Checkpoint resume: loads optimizer and scheduler state from a previous round's checkpoint via config.model.from_checkpoint
- Reference model: always fixed to the original SFT checkpoint via config.model.load_from
- Single pass: typically trained for one epoch per round to prevent overfitting on the small per-round dataset
Usage
Invoke via `accelerate launch launch.py loss={method} model=llama train_datasets=[feedback.json] ++online=true ++model.from_checkpoint=/round_N/FINAL ++model.load_from=/sft/FINAL`.
Code Reference
Source Location
- Repository: ContextualAI/HALOs
- File: launch.py (main), train/data.py (get_feedback, get_sampled_data)
- Lines: launch.py:L42-331 (main), train/data.py:L165-188 (get_sampled_data), train/data.py:L191-284 (get_feedback)
Signature
def main(config: DictConfig) -> None:
    """Main entry point with online=true mode.

    Key config parameters for online mode:
        config.online: bool = True
        config.model.from_checkpoint: str  # Previous round checkpoint (optimizer/scheduler)
        config.model.load_from: str        # SFT checkpoint (reference model)
        train_datasets: List[str]          # Path to feedback JSON file
    """

def get_sampled_data(split: str, ...) -> Dataset:
    """Load sampled data from JSON (output of train.sample)."""

def get_feedback(split: str, ...) -> Dataset:
    """Load labeled feedback from JSON (output of train.label).

    Handles pairwise_feedback, binary_feedback, and scalar_feedback types.
    """
Import
# Run as CLI:
# accelerate launch launch.py loss=dpo model=llama \
# train_datasets=[feedback.json] ++online=true
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config.online | bool | Yes | Must be true for online mode |
| train_datasets | List[str] | Yes | Path(s) to feedback JSON from labeling step |
| config.model.from_checkpoint | str | No | Previous round checkpoint for optimizer/scheduler resume |
| config.model.load_from | str | Yes | SFT checkpoint path (reference model stays fixed) |
| config.loss | str | Yes | Alignment method (dpo, kto, grpo, etc.) |
Outputs
| Name | Type | Description |
|---|---|---|
| Model checkpoint | Directory | Updated model saved to {cache_dir}/{exp_name}/FINAL/ |
| Optimizer state | File | Saved for next round's checkpoint resume |
| Training metrics | Dict | Per-step loss and reward metrics |
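To make the train_datasets input concrete, here is a hypothetical pairwise-feedback record. The field names below are illustrative assumptions; the authoritative schema is whatever train.label actually writes:

```python
import json

# Hypothetical pairwise-feedback record (field names are assumptions,
# not the verified train.label schema).
feedback = [
    {
        "type": "pairwise_feedback",
        "prompt": "Explain KTO in one sentence.",
        "chosen": "KTO is a human-aware loss trained on desirability signals.",
        "rejected": "KTO stands for knockout.",
    }
]

# Serialized to a file and passed as train_datasets=[round1_feedback.json]
serialized = json.dumps(feedback, indent=2)
```

get_feedback() would branch on the "type" field to handle pairwise_feedback, binary_feedback, and scalar_feedback records.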
Usage Examples
Online DPO Round
# Train on pairwise feedback from round 1
accelerate launch \
--config_file accelerate_config/fsdp_4gpu.yaml \
launch.py \
loss=dpo \
model=llama \
train_datasets=[round1_feedback.json] \
exp_name=llama3-8B-dpo-round1 \
++online=true \
++model.load_from=/models/llama3-8B-sft/FINAL \
++model.name_or_path=meta-llama/Meta-Llama-3-8B
Online KTO Round with Checkpoint Resume
# Resume from round 1 checkpoint for round 2
accelerate launch \
--config_file accelerate_config/fsdp_4gpu.yaml \
launch.py \
loss=kto \
model=llama \
train_datasets=[round2_feedback.json] \
exp_name=llama3-8B-kto-round2 \
++online=true \
++model.load_from=/models/llama3-8B-sft/FINAL \
++model.from_checkpoint=/models/llama3-8B-kto-round1/FINAL
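The two examples above differ only in whether ++model.from_checkpoint is set. A small helper can make that round-to-round pattern explicit. This is a sketch that assembles the command shown in the Usage section (the flag names come from this page; round_command and the /models/... paths are placeholders, not part of HALOs):

```python
def round_command(loss: str, round_n: int, sft_path: str) -> list[str]:
    """Assemble the per-round accelerate launch command (sketch).

    Round 1 trains from the SFT model alone; later rounds additionally
    resume optimizer/scheduler state from the previous round's checkpoint.
    """
    cmd = [
        "accelerate", "launch", "launch.py",
        f"loss={loss}", "model=llama",
        f"train_datasets=[round{round_n}_feedback.json]",
        f"exp_name=llama3-8B-{loss}-round{round_n}",
        "++online=true",
        f"++model.load_from={sft_path}",  # reference model stays fixed
    ]
    if round_n > 1:
        # Resume from the previous round's FINAL checkpoint
        prev = f"/models/llama3-8B-{loss}-round{round_n - 1}/FINAL"
        cmd.append(f"++model.from_checkpoint={prev}")
    return cmd
```

Calling round_command("kto", 2, "/models/llama3-8B-sft/FINAL") reproduces the flags of the round-2 KTO example above.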