
Principle:ContextualAI HALOs Feedback Labeling

From Leeroopedia


Knowledge Sources
Domains NLP, Reinforcement_Learning, Data_Engineering
Last Updated 2026-02-08 03:00 GMT

Overview

A scoring and feedback conversion pipeline that assigns reward scores to model completions and transforms them into training-ready preference or binary feedback formats.

Description

Feedback labeling is the bridge between model sampling and alignment training in the online iterative loop. Given a set of model-generated completions, the labeling step produces structured feedback suitable for training.

The HALOs framework supports two labeling backends:

  • Reward model scoring — A trained Bradley-Terry reward model scores each completion; scoring is distributed across GPUs via Accelerate.
  • API-based scoring — An external LLM (e.g., GPT-4) evaluates completions via async API calls.
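The two backends can be sketched as follows. This is an illustrative outline, not the HALOs API: the function names, signatures, and the `reward_model`/`judge` callables are assumptions, and the plain loop stands in for the batched, Accelerate-sharded inference used in practice.

```python
import asyncio

def score_with_reward_model(completions, reward_model):
    # A trained Bradley-Terry reward model emits one scalar reward per
    # completion. In the real pipeline this loop is batched and sharded
    # across GPUs via Accelerate; a plain loop keeps the sketch runnable.
    return [reward_model(text) for text in completions]

async def score_with_api(completions, judge):
    # An external LLM judge (e.g., GPT-4) is queried with concurrent
    # async API calls, one per completion.
    return list(await asyncio.gather(*(judge(text) for text in completions)))
```

Either backend returns one scalar reward per completion, so the downstream feedback-conversion step is agnostic to which was used.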

After scoring, the raw reward values are converted into one of three feedback formats:

  • Pairwise feedback — Pairs of completions for the same prompt, with a label indicating which is preferred. Three pairing modes: random (shuffle and pair), max (best vs. worst), min (the pair whose reward difference is smallest while still above a threshold).
  • Binary feedback — Each completion labeled as desirable or undesirable based on a threshold (mean, median, or numeric).
  • Scalar feedback — Raw reward scores passed through without conversion.

Usage

Use feedback labeling in the online iterative alignment loop (Step 3) to convert sampled completions into training data. The feedback format should match the target alignment method: pairwise for DPO, binary for KTO, scalar for GRPO.
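The format-to-method pairing above can be captured in a small lookup. The dict below is illustrative only, not an actual HALOs configuration object:

```python
# Illustrative mapping from alignment method to the feedback format it
# trains on, per the text above (an assumption, not a HALOs config).
FEEDBACK_FORMAT = {
    "dpo": "pairwise",   # DPO trains on preference pairs
    "kto": "binary",     # KTO trains on desirable/undesirable labels
    "grpo": "scalar",    # GRPO consumes raw reward scores
}

def feedback_format_for(method: str) -> str:
    # Look up the required feedback format for an alignment method.
    return FEEDBACK_FORMAT[method.lower()]
```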

Theoretical Basis

Pairwise Feedback Construction

Given samples {(x_i, y_{i,1}, y_{i,2}, ..., y_{i,k})} with rewards r_{i,j}, construct pairs:

  • Max mode: For each prompt, pair the highest and lowest scoring completions
  • Random mode: Randomly pair completions from the same prompt
  • Min mode: Find the pair with the smallest reward difference above the threshold

The preference label is l = 𝟙[r_A > r_B], with a 50% random swap of positions to remove position bias.
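The three pairing modes and the position-bias swap can be sketched in one function. The name `make_pairs` and its signature are hypothetical; rewards are passed as (completion, reward) tuples for a single prompt:

```python
import random

def make_pairs(rewards, mode="random", threshold=0.0, seed=0):
    # Sketch of pairwise feedback construction (illustrative, not the
    # HALOs API). `rewards` is a list of (completion, reward) tuples.
    rng = random.Random(seed)
    items = list(rewards)
    if mode == "max":
        # Pair the highest- and lowest-scoring completions.
        best = max(items, key=lambda x: x[1])
        worst = min(items, key=lambda x: x[1])
        pairs = [(best, worst)]
    elif mode == "random":
        # Shuffle, then pair adjacent completions.
        rng.shuffle(items)
        pairs = [(items[i], items[i + 1]) for i in range(0, len(items) - 1, 2)]
    elif mode == "min":
        # Find the pair with the smallest reward gap above the threshold.
        candidates = [(a, b)
                      for i, a in enumerate(items)
                      for b in items[i + 1:]
                      if abs(a[1] - b[1]) > threshold]
        pairs = ([min(candidates, key=lambda p: abs(p[0][1] - p[1][1]))]
                 if candidates else [])
    labeled = []
    for a, b in pairs:
        # Label l = 1[r_A > r_B]; a 50% random swap removes position bias.
        if rng.random() < 0.5:
            a, b = b, a
        labeled.append((a[0], b[0], int(a[1] > b[1])))
    return labeled
```

Each output triple is (completion_A, completion_B, label), so a label of 1 means position A won regardless of which pairing mode produced the pair.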

Binary Feedback Construction

Given a threshold τ (mean, median, or fixed): label(y) = 𝟙[r(y) ≥ τ]
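The thresholding rule above can be sketched directly; the function name and the (completion, reward) input layout are illustrative assumptions:

```python
from statistics import mean, median

def binarize(rewards, threshold="mean"):
    # Label each completion desirable (1) iff its reward meets the
    # threshold: label(y) = 1[r(y) >= tau]. `threshold` may be "mean",
    # "median", or a fixed number. Illustrative sketch, not the HALOs API.
    values = [r for _, r in rewards]
    if threshold == "mean":
        tau = mean(values)
    elif threshold == "median":
        tau = median(values)
    else:
        tau = float(threshold)
    return [(y, int(r >= tau)) for y, r in rewards]
```

With the mean threshold, at least one completion per prompt is always labeled desirable, which keeps KTO-style training data from collapsing to all-negative batches.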

Related Pages

Implemented By
