Principle:ContextualAI HALOs Online Feedback Training

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, NLP, Reinforcement_Learning
Last Updated 2026-02-08 03:00 GMT

Overview

A training mode that uses freshly generated and labeled feedback data (from the current policy's own outputs) rather than a static offline dataset.

Description

Online feedback training is a variant of preference alignment in which the training data comes from the model's own recent outputs rather than from a pre-existing static dataset. Each round proceeds as follows:

  1. The current policy samples completions for a batch of prompts
  2. A reward model or scoring API scores these completions
  3. The scores are converted to preference or binary feedback
  4. The model trains on this fresh feedback for one pass

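The four steps above can be sketched as a single round. This is a minimal illustration, not the HALOs API: `policy.generate`, `reward_model.score`, and the mean-score threshold for binarizing feedback are all hypothetical stand-ins.

```python
import json

def online_feedback_round(policy, reward_model, prompts, round_idx):
    """One round of the online feedback loop (hypothetical helper names)."""
    # 1. Sample completions from the current policy.
    completions = [policy.generate(p) for p in prompts]

    # 2. Score each completion with a reward model or scoring API.
    scores = [reward_model.score(p, c) for p, c in zip(prompts, completions)]

    # 3. Convert scores to binary feedback; thresholding at the batch mean
    #    is one simple choice (the real labeling step may differ).
    threshold = sum(scores) / len(scores)
    feedback = [
        {"prompt": p, "completion": c, "label": int(s >= threshold)}
        for p, c, s in zip(prompts, completions, scores)
    ]

    # 4. Persist the fresh feedback for a single training pass.
    with open(f"feedback_round_{round_idx}.json", "w") as f:
        json.dump(feedback, f)
    return feedback
```

Because the feedback is regenerated every round, each training pass sees data drawn from the policy as it currently behaves.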
This differs from standard offline alignment in two key ways:

  • Data source: Online data from the current policy vs. static dataset from a fixed policy
  • Checkpoint resume: The optimizer and scheduler state from the previous round are loaded via config.model.from_checkpoint, while the reference model remains fixed to the original SFT checkpoint
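The checkpoint-resume behavior can be sketched as follows. The object interfaces here are hypothetical (the real trainer restores PyTorch state dicts via config.model.from_checkpoint); the point is which state advances across rounds and which stays pinned.

```python
def resume_round(policy, optimizer, scheduler, round_checkpoint, sft_checkpoint):
    """Hypothetical sketch: policy, optimizer, and scheduler state carry over
    from the previous round, while the reference model stays fixed to the
    original SFT weights and is never updated."""
    policy.load_state(round_checkpoint["model"])
    optimizer.load_state(round_checkpoint["optimizer"])
    scheduler.load_state(round_checkpoint["scheduler"])
    reference_weights = sft_checkpoint["model"]  # pinned across all rounds
    return reference_weights
```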

The get_feedback() and get_sampled_data() data loading functions handle the JSON files produced by the sampling and labeling steps, converting them into the standard Example/Dataset format.
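A minimal sketch of what such a loader might do, assuming simple per-record field names (`prompt`, `completion`, `label`); the real get_feedback() in the HALOs codebase builds its Example/Dataset objects with its own schema.

```python
import json

def load_feedback(path):
    """Group labeled completions by prompt, one example per unique prompt
    (assumed field names; illustrative only)."""
    with open(path) as f:
        records = json.load(f)

    dataset = {}
    for rec in records:
        example = dataset.setdefault(
            rec["prompt"],
            {"prompt": rec["prompt"], "generations": [], "labels": []},
        )
        example["generations"].append(rec["completion"])
        example["labels"].append(rec["label"])
    return dataset
```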

Usage

Use online feedback training as Step 4 of the iterative alignment loop. Invoke with config.online=true and point train_datasets to the feedback JSON file from the labeling step.

Theoretical Basis

Online training addresses the distribution shift problem in offline alignment: a model trained on data generated by a different policy may behave unpredictably on its own output distribution. By training on the model's own generations, the feedback signal is on-distribution, leading to more stable and effective alignment.

The theoretical benefit is formalized in the on-policy vs. off-policy distinction:

  • Off-policy: 𝔼_{y∼π_data}[ℓ(θ)] — data drawn from a fixed behavioral policy π_data
  • On-policy: 𝔼_{y∼π_θ}[ℓ(θ)] — data drawn from the current policy π_θ

On-policy training yields gradient estimates that are unbiased for the current policy's objective, but it requires re-sampling completions every round.
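A toy Monte Carlo illustration of the distinction, using Bernoulli "policies" as stand-ins for language models: estimating the expected loss under a stale data policy answers a different question than estimating it under the current policy.

```python
import random

random.seed(0)

def loss(y):
    # Toy per-sample loss; in alignment this would depend on theta
    # through the policy's log-probabilities.
    return float(y)

def expected_loss(p_one, n=100_000):
    """Monte Carlo estimate of E_{y~pi}[loss(y)] for a Bernoulli policy
    emitting y=1 with probability p_one (purely illustrative)."""
    total = sum(loss(1 if random.random() < p_one else 0) for _ in range(n))
    return total / n

# Current policy puts mass 0.8 on y=1; a stale data policy puts 0.3.
on_policy = expected_loss(0.8)   # estimates the on-policy expectation
off_policy = expected_loss(0.3)  # estimates a different quantity entirely
```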
