Principle:CarperAI Trlx Rejection Fine Tuning
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, NLP, Fine_Tuning |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
Rejection Fine-Tuning is a training method that generates multiple completions per prompt, scores them with a reward function, and fine-tunes the model on the highest-scoring subset, raising the selection threshold progressively over training.
Description
Rejection Fine-Tuning (RFT) is an alternative to policy gradient methods such as PPO for aligning language models with reward signals. Rather than using rewards to weight a policy-gradient update, RFT generates N completions per prompt, scores each with a reward function, keeps those at or above a percentile threshold, and trains on the kept completions with standard supervised learning (cross-entropy loss). The percentile threshold increases over training, raising the quality bar as the model improves.
Usage
Use this principle when a simpler alternative to PPO is desired, or when the reward function is expensive to evaluate, since RFT calls the reward function only while collecting data and never inside the supervised update. It is particularly effective when the model already performs reasonably well and needs refinement, because selection can only reinforce completions the model is already able to sample.
Theoretical Basis
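RFT optimizes a filtered maximum likelihood objective:
$$\mathcal{L}_{\text{RFT}}(\theta) = -\,\mathbb{E}_{x \sim \mathcal{D}}\left[\sum_{y \in \operatorname{top}_{p}\left(\{y_i\}_{i=1}^{N}\right)} \log \pi_\theta(y \mid x)\right]$$
where $\{y_i\}_{i=1}^{N}$ are N sampled completions for prompt $x$ and $\operatorname{top}_{p}$ selects those whose score is at or above the score percentile threshold $p$.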
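To make the filtered objective concrete, the sketch below evaluates it for a single prompt. It assumes sequence-level log-probabilities are already available and sums them over the kept completions. The helper name `filtered_nll` and all numbers are hypothetical, and a real trainer would more likely compute a token-level cross-entropy averaged over a batch.

```python
import math

def filtered_nll(log_probs, scores, threshold):
    """Return (loss, n_kept): negative log-likelihood summed over completions
    whose score is at or above the threshold, and how many were kept."""
    kept = [lp for lp, s in zip(log_probs, scores) if s >= threshold]
    return -sum(kept), len(kept)

# Hypothetical sequence log-probabilities log pi_theta(y_i | x) and reward scores
log_probs = [math.log(0.5), math.log(0.25), math.log(0.125)]
scores = [0.9, 0.2, 0.7]

loss, n_kept = filtered_nll(log_probs, scores, threshold=0.7)
print(n_kept, round(loss, 4))  # 2 2.7726
```

Only the first and third completions pass the filter, so the loss is -(log 0.5 + log 0.125) = log 16. The rejected completion contributes nothing to the gradient.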
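Progressive Thresholding:
$$p_t = p_{\text{start}} + \frac{t}{T}\left(p_{\text{end}} - p_{\text{start}}\right)$$
where $p_t$ is the percentile threshold at step $t$, $T$ is the number of improvement steps, and $p_{\text{start}}$ and $p_{\text{end}}$ are the initial and final percentiles.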
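A small worked example of the linear threshold schedule, following the interpolation used in the pseudo-code on this page. The values 0.70, 0.95, and 4 steps are illustrative assumptions, and `percentile_at` is a hypothetical helper name.

```python
def percentile_at(step, n_improve_steps, start_percentile, end_percentile):
    """Linearly interpolate the selection percentile for a given improvement step."""
    return start_percentile + step / n_improve_steps * (end_percentile - start_percentile)

# Illustrative settings (assumed, not taken from any specific configuration)
schedule = [round(percentile_at(s, 4, 0.70, 0.95), 4) for s in range(4)]
print(schedule)  # [0.7, 0.7625, 0.825, 0.8875]
```

Because the step index runs from 0 to `n_improve_steps - 1`, this form of the schedule stops one increment short of `end_percentile`. Dividing by `n_improve_steps - 1` instead would make the final step land on it exactly.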
Pseudo-code Logic:
# Abstract algorithm (NOT real implementation)
import numpy as np

for step in range(n_improve_steps):
    # Linearly interpolate the selection percentile for this improvement step
    percentile = start_percentile + step / n_improve_steps * (end_percentile - start_percentile)
    for prompt in prompts:
        completions = generate(prompt, n=n_generations)
        scores = reward_fn(completions)
        # np.percentile expects a value in [0, 100]
        threshold = np.percentile(scores, percentile * 100)
        # Keep completions scoring at or above the per-prompt threshold
        best = [c for c, s in zip(completions, scores) if s >= threshold]
        train_supervised(model, prompt, best)
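The abstract algorithm above can be exercised end to end with stand-in components. The following is a minimal sketch using only the standard library: `generate` and `reward_fn` are toy stubs (random digit strings scored by digit sum), `percentile` is a hand-written helper using linear interpolation, and the supervised update is replaced by simply recording the kept samples. None of this reflects a real trainer's API.

```python
import random

def percentile(values, q):
    """Linearly interpolated percentile for q in [0, 1]."""
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (rank - lo) * (ordered[hi] - ordered[lo])

def generate(prompt, n, rng):
    # Stand-in for sampling from a language model: append four random digits.
    return [prompt + "".join(rng.choice("0123456789") for _ in range(4)) for _ in range(n)]

def reward_fn(completions):
    # Stand-in reward: sum of the four generated digits, scaled to [0, 1].
    return [sum(int(ch) for ch in c[-4:]) / 36 for c in completions]

def collect(prompts, n_generations, pct, rng):
    """Return (prompt, completion, score, threshold) for completions that pass the filter."""
    kept = []
    for prompt in prompts:
        completions = generate(prompt, n_generations, rng)
        scores = reward_fn(completions)
        threshold = percentile(scores, pct)  # threshold is computed per prompt
        kept += [(prompt, c, s, threshold) for c, s in zip(completions, scores) if s >= threshold]
    return kept

rng = random.Random(0)
n_improve_steps, start_percentile, end_percentile = 4, 0.70, 0.95
kept_per_step = []
for step in range(n_improve_steps):
    pct = start_percentile + step / n_improve_steps * (end_percentile - start_percentile)
    kept = collect(["a:", "b:"], 32, pct, rng)
    kept_per_step.append(kept)  # a real trainer would fine-tune on `kept` here
    print(f"step {step}: percentile={pct:.4f}, kept {len(kept)} of 64")
```

Since the toy generator never changes, the scores do not improve between steps. The run only shows how a rising percentile shrinks the set of completions that would be passed to supervised training.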