Principle: ContextualAI HALOs Online Feedback Training
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, NLP, Reinforcement_Learning |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
A training mode that uses freshly generated and labeled feedback data (from the current policy's own outputs) rather than a static offline dataset.
Description
Online feedback training is a variant of preference alignment in which the training data comes from the model's own recent outputs rather than a pre-existing static dataset. Each round proceeds as follows:
- The current policy generates completions (via sampling)
- A reward model or API scores these completions
- The scores are converted to preference or binary feedback
- The model trains on this fresh feedback for one pass
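The four steps above can be sketched as a single loop. This is a minimal illustration only; every name here (`sample_completions`, `one_round`, and so on) is a hypothetical stand-in, not the actual HALOs API.

```python
# Hypothetical sketch of one online feedback round; names are
# illustrative stand-ins, not the actual HALOs API.

def sample_completions(policy, prompts, k=2):
    # Step 1: the current policy generates k completions per prompt.
    return [(p, policy(p)) for p in prompts for _ in range(k)]

def score_completions(reward_model, pairs):
    # Step 2: a reward model (or API) scores each completion.
    return [(p, c, reward_model(p, c)) for p, c in pairs]

def to_binary_feedback(scored, threshold=0.0):
    # Step 3: scores are thresholded into binary feedback.
    return [{"prompt": p, "completion": c, "desirable": s >= threshold}
            for p, c, s in scored]

def one_round(policy, reward_model, prompts, train_step):
    # Step 4: one training pass over the fresh feedback.
    feedback = to_binary_feedback(
        score_completions(reward_model, sample_completions(policy, prompts)))
    for example in feedback:
        train_step(example)
    return feedback
```

In practice `policy` and `reward_model` would be model calls; here they can be any callables, which keeps the round structure visible.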
This differs from standard offline alignment in two key ways:
- Data source: Online data from the current policy vs. static dataset from a fixed policy
- Checkpoint resume: the optimizer and scheduler state from the previous round are loaded via config.model.from_checkpoint, while the reference model remains fixed to the original SFT checkpoint
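The checkpoint-resume behavior can be sketched as follows. The checkpoint layout here is an assumed example keyed off the documented config.model.from_checkpoint setting, not the actual HALOs checkpoint format.

```python
# Hedged sketch of per-round checkpoint resume; the dict layout is an
# assumption, not the actual HALOs checkpoint format.

def resume_round(prev_checkpoint, sft_checkpoint):
    """Carry policy/optimizer/scheduler state across rounds while
    pinning the reference model to the original SFT weights."""
    return {
        "policy": prev_checkpoint["policy"],
        "optimizer": prev_checkpoint["optimizer"],
        "scheduler": prev_checkpoint["scheduler"],
        # The reference model never advances between rounds:
        "reference": sft_checkpoint["policy"],
    }
```

The key design point is the asymmetry: the trainable state rolls forward every round, but the reference model used for the alignment loss is always the original SFT checkpoint.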
The get_feedback() and get_sampled_data() data loading functions handle the JSON files produced by the sampling and labeling steps, converting them into the standard Example/Dataset format.
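A loader in the spirit of get_feedback() might look like the sketch below. The JSON schema is an assumed example for illustration; the actual files produced by the sampling and labeling steps may differ.

```python
import json

# Hypothetical get_feedback()-style loader; the record schema
# ("prompt"/"completion"/"label") is an assumption, not the actual
# format written by the sampling and labeling steps.

def load_feedback(json_text):
    """Parse labeled-feedback JSON into flat Example-style dicts."""
    examples = []
    for record in json.loads(json_text):
        examples.append({
            "prompt": record["prompt"],
            "completion": record["completion"],
            "desirable": bool(record["label"]),
        })
    return examples
```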
Usage
Use online feedback training as Step 4 of the iterative alignment loop. Invoke with config.online=true and point train_datasets to the feedback JSON file from the labeling step.
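The dispatch implied by config.online can be sketched as below. The config object and loader names are illustrative; only the `online` and `train_datasets` settings come from the description above.

```python
# Hedged sketch of online/offline data dispatch; the config shape and
# loader callables are illustrative, not the actual HALOs entry point.

def select_train_data(config, load_offline, load_online_feedback):
    if getattr(config, "online", False):
        # Online mode: read the feedback JSON from the labeling step.
        return load_online_feedback(config.train_datasets)
    # Offline mode: read the static preference dataset.
    return load_offline(config.train_datasets)
```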
Theoretical Basis
Online training addresses the distribution shift problem in offline alignment: a model trained on data generated by a different policy may behave unpredictably on its own output distribution. By training on the model's own generations, the feedback signal is on-distribution, leading to more stable and effective alignment.
The theoretical benefit is formalized in the on-policy vs. off-policy distinction:
- Off-policy: y ∼ π_b(· | x), data generated by a fixed behavioral policy π_b
- On-policy: y ∼ π_θ(· | x), data generated by the current policy π_θ
On-policy training provides unbiased gradient estimates but requires re-sampling each round.
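The bias correction can be checked numerically in a toy bandit-style setting (this is an illustration of the on-/off-policy distinction, not the HALOs training objective): on-policy samples estimate an expectation under the current policy directly, while off-policy samples need importance weights π_θ/π_b.

```python
import random

# Toy two-action example: estimate E_{pi_theta}[r] = 0.8 from samples.
random.seed(0)
actions = [0, 1]
reward = {0: 0.0, 1: 1.0}
pi_theta = {0: 0.2, 1: 0.8}   # current policy
pi_b = {0: 0.8, 1: 0.2}       # fixed behavioral policy
n = 100_000

def draw(policy, n):
    return random.choices(actions, weights=[policy[a] for a in actions], k=n)

# On-policy: the plain sample average is already unbiased.
on_est = sum(reward[a] for a in draw(pi_theta, n)) / n

# Off-policy: reweight each sample by pi_theta/pi_b to correct the
# distribution mismatch; a naive average would converge to ~0.2 instead.
off_est = sum(pi_theta[a] / pi_b[a] * reward[a] for a in draw(pi_b, n)) / n
```

Both estimates converge to 0.8, but the off-policy estimator pays for the correction with higher variance, which is one reason re-sampling fresh on-policy data each round is attractive despite its cost.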