Principle: ContextualAI HALOs Online Feedback Training
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, NLP, Reinforcement_Learning |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
A training mode that uses freshly generated and labeled feedback data (from the current policy's own outputs) rather than a static offline dataset.
Description
Online feedback training is a variant of preference alignment in which the training data comes from the model's own recent outputs rather than a pre-existing static dataset. Each round proceeds as follows:
- The current policy generates completions (via sampling)
- A reward model or API scores these completions
- The scores are converted to preference or binary feedback
- The model trains on this fresh feedback for one pass
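The four steps above can be sketched as a single loop. This is a minimal illustration only; every name here (`sample_completions`, `one_round`, and so on) is a hypothetical stand-in, not the actual HALOs API.

```python
# Hypothetical sketch of one online feedback round; names are
# illustrative stand-ins, not the actual HALOs API.

def sample_completions(policy, prompts, k=2):
    # Step 1: the current policy generates k completions per prompt.
    return [(p, policy(p)) for p in prompts for _ in range(k)]

def score_completions(reward_model, pairs):
    # Step 2: a reward model (or API) scores each completion.
    return [(p, c, reward_model(p, c)) for p, c in pairs]

def to_binary_feedback(scored, threshold=0.0):
    # Step 3: scores are thresholded into binary feedback.
    return [{"prompt": p, "completion": c, "desirable": s >= threshold}
            for p, c, s in scored]

def one_round(policy, reward_model, prompts, train_step):
    # Step 4: one training pass over the fresh feedback.
    feedback = to_binary_feedback(
        score_completions(reward_model, sample_completions(policy, prompts)))
    for example in feedback:
        train_step(example)
    return feedback
```

In practice `policy` and `reward_model` would be model calls; here they can be any callables, which keeps the round structure visible.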
This differs from standard offline alignment in two key ways:
- Data source: Online data from the current policy vs. static dataset from a fixed policy
- Checkpoint resume: the optimizer and scheduler state from the previous round are loaded via config.model.from_checkpoint, while the reference model remains fixed to the original SFT checkpoint
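The checkpoint-resume behavior can be sketched as follows. The checkpoint layout here is an assumed example keyed off the documented config.model.from_checkpoint setting, not the actual HALOs checkpoint format.

```python
# Hedged sketch of per-round checkpoint resume; the dict layout is an
# assumption, not the actual HALOs checkpoint format.

def resume_round(prev_checkpoint, sft_checkpoint):
    """Carry policy/optimizer/scheduler state across rounds while
    pinning the reference model to the original SFT weights."""
    return {
        "policy": prev_checkpoint["policy"],
        "optimizer": prev_checkpoint["optimizer"],
        "scheduler": prev_checkpoint["scheduler"],
        # The reference model never advances between rounds:
        "reference": sft_checkpoint["policy"],
    }
```

The key design point is the asymmetry: the trainable state rolls forward every round, but the reference model used for the alignment loss is always the original SFT checkpoint.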
The get_feedback() and get_sampled_data() data loading functions handle the JSON files produced by the sampling and labeling steps, converting them into the standard Example/Dataset format.
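A loader in the spirit of get_feedback() might look like the sketch below. The JSON schema is an assumed example for illustration; the actual files produced by the sampling and labeling steps may differ.

```python
import json

# Hypothetical get_feedback()-style loader; the record schema
# ("prompt"/"completion"/"label") is an assumption, not the actual
# format written by the sampling and labeling steps.

def load_feedback(json_text):
    """Parse labeled-feedback JSON into flat Example-style dicts."""
    examples = []
    for record in json.loads(json_text):
        examples.append({
            "prompt": record["prompt"],
            "completion": record["completion"],
            "desirable": bool(record["label"]),
        })
    return examples
```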
Usage
Use online feedback training as Step 4 of the iterative alignment loop. Invoke with config.online=true and point train_datasets to the feedback JSON file from the labeling step.
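The dispatch implied by config.online can be sketched as below. The config object and loader names are illustrative; only the `online` and `train_datasets` settings come from the description above.

```python
# Hedged sketch of online/offline data dispatch; the config shape and
# loader callables are illustrative, not the actual HALOs entry point.

def select_train_data(config, load_offline, load_online_feedback):
    if getattr(config, "online", False):
        # Online mode: read the feedback JSON from the labeling step.
        return load_online_feedback(config.train_datasets)
    # Offline mode: read the static preference dataset.
    return load_offline(config.train_datasets)
```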
Theoretical Basis
Online training addresses the distribution shift problem in offline alignment: a model trained on data generated by a different policy may behave unpredictably on its own output distribution. By training on the model's own generations, the feedback signal is on-distribution, leading to more stable and effective alignment.
The theoretical benefit is formalized in the on-policy vs. off-policy distinction:
- Off-policy: y ∼ π_b(· | x), data generated by a fixed behavioral policy π_b
- On-policy: y ∼ π_θ(· | x), data generated by the current policy π_θ
On-policy training provides unbiased gradient estimates but requires re-sampling each round.
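The bias correction can be checked numerically in a toy bandit-style setting (this is an illustration of the on-/off-policy distinction, not the HALOs training objective): on-policy samples estimate an expectation under the current policy directly, while off-policy samples need importance weights π_θ/π_b.

```python
import random

# Toy two-action example: estimate E_{pi_theta}[r] = 0.8 from samples.
random.seed(0)
actions = [0, 1]
reward = {0: 0.0, 1: 1.0}
pi_theta = {0: 0.2, 1: 0.8}   # current policy
pi_b = {0: 0.8, 1: 0.2}       # fixed behavioral policy
n = 100_000

def draw(policy, n):
    return random.choices(actions, weights=[policy[a] for a in actions], k=n)

# On-policy: the plain sample average is already unbiased.
on_est = sum(reward[a] for a in draw(pi_theta, n)) / n

# Off-policy: reweight each sample by pi_theta/pi_b to correct the
# distribution mismatch; a naive average would converge to ~0.2 instead.
off_est = sum(pi_theta[a] / pi_b[a] * reward[a] for a in draw(pi_b, n)) / n
```

Both estimates converge to 0.8, but the off-policy estimator pays for the correction with higher variance, which is one reason re-sampling fresh on-policy data each round is attractive despite its cost.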