Principle:CarperAI Trlx Rejection Fine Tuning

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, NLP, Fine_Tuning
Last Updated 2026-02-07 16:00 GMT

Overview

Rejection Fine-Tuning is a training method that generates multiple completions per prompt, scores them, and fine-tunes the model on the highest-scoring subset, using a quality threshold that rises progressively over training.

Description

Rejection Fine-Tuning (RFT) is an alternative to policy gradient methods such as PPO for aligning language models with reward signals. Instead of using rewards inside a policy-gradient update, RFT uses them only to filter samples: it generates N completions per prompt, scores each with a reward function, keeps those at or above a percentile threshold, and trains on the kept completions with standard supervised learning (cross-entropy loss). The percentile threshold increases progressively over training, raising the quality bar as the model improves.
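
The data-collection stage described above can be sketched as follows. This is a minimal illustration and not the trlx implementation: the helper name `collect_rft_dataset`, the pre-sampled completions, and the word-count reward are all hypothetical. It assumes the threshold is computed separately for each prompt, as in the pseudo-code later on this page.

```python
import numpy as np

def collect_rft_dataset(generations, reward_fn, percentile):
    """Keep, per prompt, the completions scoring at or above the per-prompt percentile.

    generations: dict mapping prompt -> list of sampled completions (assumed given)
    reward_fn:   callable (prompt, completion) -> float (hypothetical signature)
    percentile:  fraction in [0, 1]
    """
    dataset = []
    for prompt, completions in generations.items():
        scores = np.array([reward_fn(prompt, c) for c in completions], dtype=float)
        threshold = np.quantile(scores, percentile)  # np.quantile takes a 0-1 fraction
        dataset.extend((prompt, c) for c, s in zip(completions, scores) if s >= threshold)
    return dataset

def toy_reward(prompt, completion):
    # Toy stand-in for a real reward model: longer completions score higher.
    return float(len(completion.split()))

# Hypothetical pre-sampled completions for two prompts.
generations = {
    "Describe RFT:": ["ok", "it filters samples", "filter then fine tune the model", "no"],
    "Say hi:": ["hi", "hello there friend"],
}
pairs = collect_rft_dataset(generations, toy_reward, percentile=0.5)
```

The resulting `(prompt, completion)` pairs form an ordinary supervised fine-tuning dataset, so the training stage needs no reward calls.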

Usage

Use this principle when a simpler alternative to PPO is desired, or when the reward function is expensive to evaluate: RFT scores completions only during data collection, and the supervised updates that follow need no further reward calls. It is particularly effective when the model already performs reasonably well and needs refinement.

Theoretical Basis

RFT optimizes a filtered maximum likelihood objective:

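$$\mathcal{L}_{\text{RFT}} = -\mathbb{E}_{x \sim D}\left[\sum_{y \in \text{Top}_k(G(x))} \log \pi_\theta(y \mid x)\right]$$

where $G(x) = \{y_1, \ldots, y_N\} \sim \pi_\theta(\cdot \mid x)$ are $N$ sampled completions and $\text{Top}_k$ selects those above the score percentile threshold.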
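
For a single prompt, the filtered objective amounts to a negative log-likelihood summed over the kept completions. The sketch below assumes the per-completion sequence log-probabilities under the current policy are already available. The function name `filtered_nll` and all numbers are hypothetical, and averaging the result over prompts would approximate the expectation over x.

```python
import numpy as np

def filtered_nll(seq_logprobs, scores, percentile):
    """Negative log-likelihood over completions at or above the score percentile.

    seq_logprobs: log pi_theta(y | x) for each sampled completion y (assumed given)
    scores:       reward score of each completion
    percentile:   fraction in [0, 1]
    Returns (loss, boolean mask of kept completions).
    """
    seq_logprobs = np.asarray(seq_logprobs, dtype=float)
    scores = np.asarray(scores, dtype=float)
    threshold = np.quantile(scores, percentile)
    keep = scores >= threshold
    return -seq_logprobs[keep].sum(), keep

# Four hypothetical completions for one prompt.
logprobs = [-2.0, -1.0, -3.0, -0.5]
scores = [0.1, 0.9, 0.4, 0.7]
loss, keep = filtered_nll(logprobs, scores, percentile=0.5)
# The median score is 0.55, so the completions scoring 0.9 and 0.7 are kept
# and the loss is -(-1.0 + -0.5) = 1.5.
```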

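Progressive Thresholding: $p_t = p_{\text{start}} + \frac{t}{T}\,(p_{\text{end}} - p_{\text{start}})$

where $p_t$ is the percentile threshold at step $t$ and $T$ is the total number of improvement steps.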
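
The linear schedule is easy to tabulate. The sketch below uses a hypothetical helper `percentile_schedule` and illustrative values (start 0.7, end 0.95, T = 4) that are assumptions, not settings taken from this page.

```python
def percentile_schedule(step, total_steps, start, end):
    """Linear interpolation: p_t = p_start + (t / T) * (p_end - p_start)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return start + (step / total_steps) * (end - start)

# Evaluate at t = 0..T inclusive; the values are illustrative only.
schedule = [percentile_schedule(t, 4, 0.7, 0.95) for t in range(5)]
# Approximately [0.7, 0.7625, 0.825, 0.8875, 0.95]
```

A loop that runs `t` over `range(T)` stops one increment short of the end value, so whether the final threshold is ever reached depends on how the steps are indexed.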

Pseudo-code Logic:

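# Abstract algorithm (NOT the real implementation)
# generate, reward_fn, and train_supervised are placeholders.
import numpy as np

for step in range(n_improve_steps):
    # Progressive thresholding: linearly interpolate the percentile
    percentile = start_percentile + step / n_improve_steps * (end_percentile - start_percentile)
    for prompt in prompts:
        completions = generate(prompt, n=n_generations)      # N samples from the current policy
        scores = reward_fn(completions)                      # one score per completion
        threshold = np.percentile(scores, percentile * 100)  # np.percentile expects 0-100
        best = [c for c, s in zip(completions, scores) if s >= threshold]
        train_supervised(model, prompt, best)                # cross-entropy on kept completions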
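
To see the loop run end to end, here is a toy sketch under strong simplifying assumptions: the "policy" is a softmax over five fixed completions for a single prompt, the reward of completion i is simply i, and the supervised step is plain gradient ascent on the log-likelihood of the kept samples. All names and hyperparameters are hypothetical and unrelated to trlx.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def run_toy_rft(n_improve_steps=4, n_generations=64, start_percentile=0.7,
                end_percentile=0.95, lr=0.5, sgd_steps=10, seed=0):
    """Return the policy's expected reward before training and after each improve step."""
    rng = np.random.default_rng(seed)
    rewards = np.arange(5, dtype=float)  # toy reward: completion i scores i
    logits = np.zeros(5)                 # uniform initial policy
    history = [float(softmax(logits) @ rewards)]
    for step in range(n_improve_steps):
        percentile = start_percentile + step / n_improve_steps * (end_percentile - start_percentile)
        # Data collection: sample completions from the current policy and score them.
        samples = rng.choice(5, size=n_generations, p=softmax(logits))
        scores = rewards[samples]
        threshold = np.percentile(scores, percentile * 100)
        selected = samples[scores >= threshold]  # never empty: the max score always passes
        # Supervised step: the gradient of the mean log-likelihood of the kept
        # samples with respect to the logits is (empirical distribution - policy).
        target = np.bincount(selected, minlength=5) / len(selected)
        for _ in range(sgd_steps):
            logits += lr * (target - softmax(logits))
        history.append(float(softmax(logits) @ rewards))
    return history

history = run_toy_rft()
```

The expected reward starts at 2.0 under the uniform policy and rises as probability mass shifts toward the kept, higher-scoring completions. A real language model would replace the categorical policy with autoregressive sampling and the logit update with cross-entropy training.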
