
Principle:ContextualAI HALOs Feedback Labeling

From Leeroopedia


Knowledge Sources
Domains NLP, Reinforcement_Learning, Data_Engineering
Last Updated 2026-02-08 03:00 GMT

Overview

A scoring and feedback conversion pipeline that assigns reward scores to model completions and transforms them into training-ready preference or binary feedback formats.

Description

Feedback labeling is the bridge between model sampling and alignment training in the online iterative loop. Given a set of model-generated completions, the labeling step produces structured feedback suitable for training.

The HALOs framework supports two labeling backends:

  • Reward model scoring — A trained Bradley-Terry reward model scores each completion; scoring is distributed across GPUs via Accelerate.
  • API-based scoring — An external LLM (e.g., GPT-4) evaluates completions via async API calls.
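The two backends can be sketched as follows. This is an illustrative outline, not the HALOs API: the function names, signatures, and the `reward_model`/`judge` callables are assumptions, and the plain loop stands in for the batched, Accelerate-sharded inference used in practice.

```python
import asyncio

def score_with_reward_model(completions, reward_model):
    # A trained Bradley-Terry reward model emits one scalar reward per
    # completion. In the real pipeline this loop is batched and sharded
    # across GPUs via Accelerate; a plain loop keeps the sketch runnable.
    return [reward_model(text) for text in completions]

async def score_with_api(completions, judge):
    # An external LLM judge (e.g., GPT-4) is queried with concurrent
    # async API calls, one per completion.
    return list(await asyncio.gather(*(judge(text) for text in completions)))
```

Either backend returns one scalar reward per completion, so the downstream feedback-conversion step is agnostic to which was used.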

After scoring, the raw reward values are converted into one of three feedback formats:

  • Pairwise feedback — Pairs of completions for the same prompt, with a label indicating which is preferred. Three pairing modes: random (shuffle and pair), max (best vs. worst), min (the pair whose reward difference is smallest while still above a threshold).
  • Binary feedback — Each completion labeled as desirable or undesirable based on a threshold (mean, median, or numeric).
  • Scalar feedback — Raw reward scores passed through without conversion.

Usage

Use feedback labeling in the online iterative alignment loop (Step 3) to convert sampled completions into training data. The feedback format should match the target alignment method: pairwise for DPO, binary for KTO, scalar for GRPO.
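The format-to-method pairing above can be captured in a small lookup. The dict below is illustrative only, not an actual HALOs configuration object:

```python
# Illustrative mapping from alignment method to the feedback format it
# trains on, per the text above (an assumption, not a HALOs config).
FEEDBACK_FORMAT = {
    "dpo": "pairwise",   # DPO trains on preference pairs
    "kto": "binary",     # KTO trains on desirable/undesirable labels
    "grpo": "scalar",    # GRPO consumes raw reward scores
}

def feedback_format_for(method: str) -> str:
    # Look up the required feedback format for an alignment method.
    return FEEDBACK_FORMAT[method.lower()]
```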

Theoretical Basis

Pairwise Feedback Construction

Given samples {(x_i, y_{i,1}, y_{i,2}, ..., y_{i,k})} with rewards r_{i,j}, construct pairs:

  • Max mode: For each prompt, pair the highest and lowest scoring completions
  • Random mode: Randomly pair completions from the same prompt
  • Min mode: Find the pair with the smallest reward difference above the threshold

The preference label is l = 𝟙[r_A > r_B], with a 50% random swap of positions to remove position bias.
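The three pairing modes and the position-bias swap can be sketched in one function. The name `make_pairs` and its signature are hypothetical; rewards are passed as (completion, reward) tuples for a single prompt:

```python
import random

def make_pairs(rewards, mode="random", threshold=0.0, seed=0):
    # Sketch of pairwise feedback construction (illustrative, not the
    # HALOs API). `rewards` is a list of (completion, reward) tuples.
    rng = random.Random(seed)
    items = list(rewards)
    if mode == "max":
        # Pair the highest- and lowest-scoring completions.
        best = max(items, key=lambda x: x[1])
        worst = min(items, key=lambda x: x[1])
        pairs = [(best, worst)]
    elif mode == "random":
        # Shuffle, then pair adjacent completions.
        rng.shuffle(items)
        pairs = [(items[i], items[i + 1]) for i in range(0, len(items) - 1, 2)]
    elif mode == "min":
        # Find the pair with the smallest reward gap above the threshold.
        candidates = [(a, b)
                      for i, a in enumerate(items)
                      for b in items[i + 1:]
                      if abs(a[1] - b[1]) > threshold]
        pairs = ([min(candidates, key=lambda p: abs(p[0][1] - p[1][1]))]
                 if candidates else [])
    labeled = []
    for a, b in pairs:
        # Label l = 1[r_A > r_B]; a 50% random swap removes position bias.
        if rng.random() < 0.5:
            a, b = b, a
        labeled.append((a[0], b[0], int(a[1] > b[1])))
    return labeled
```

Each output triple is (completion_A, completion_B, label), so a label of 1 means position A won regardless of which pairing mode produced the pair.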

Binary Feedback Construction

Given a threshold τ (mean, median, or fixed): label(y) = 𝟙[r(y) ≥ τ]
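The thresholding rule above can be sketched directly; the function name and the (completion, reward) input layout are illustrative assumptions:

```python
from statistics import mean, median

def binarize(rewards, threshold="mean"):
    # Label each completion desirable (1) iff its reward meets the
    # threshold: label(y) = 1[r(y) >= tau]. `threshold` may be "mean",
    # "median", or a fixed number. Illustrative sketch, not the HALOs API.
    values = [r for _, r in rewards]
    if threshold == "mean":
        tau = mean(values)
    elif threshold == "median":
        tau = median(values)
    else:
        tau = float(threshold)
    return [(y, int(r >= tau)) for y, r in rewards]
```

With the mean threshold, at least one completion per prompt is always labeled desirable, which keeps KTO-style training data from collapsing to all-negative batches.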

Related Pages

Implemented By
