Principle:NVIDIA NeMo Aligner Rejection Sampling Training
| Knowledge Sources | |
|---|---|
| Domains | NLP, Alignment |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Rejection Sampling (RS) Training is an alignment approach where the policy model generates multiple candidate responses per prompt, a reward model scores each candidate, and only the highest-scoring responses are selected for supervised fine-tuning.
Description
Rejection Sampling Training combines Best-of-N sampling with Supervised Fine-Tuning (SFT). The training loop operates in two alternating phases:
- Generation phase: For each prompt in the training set, the policy model produces `num_rollouts_per_prompt` candidate responses. A reward model (served as a remote service via `RemoteGPTRMClient`) scores every generated response.
- Selection and training phase: From the N candidates per prompt, the top-K responses (controlled by `top_n_rollouts`) are selected using the `select_topk` utility. The model is then fine-tuned on these selected high-quality responses using a standard language modeling loss (negative mean log-probability on response tokens).
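The selection step can be illustrated with a minimal pure-Python sketch. The function `select_top_k_per_prompt` below is hypothetical and is not the library's `select_topk` utility; it only assumes that rewards arrive as a flat list in which the candidates for each prompt are contiguous.

```python
from typing import List, Sequence


def select_top_k_per_prompt(
    rewards: Sequence[float], num_rollouts_per_prompt: int, top_n_rollouts: int
) -> List[int]:
    """Return flat indices of the best `top_n_rollouts` candidates for each prompt.

    Hypothetical helper: assumes `rewards` is flat and that the candidates
    belonging to one prompt occupy a contiguous block of the list.
    """
    if len(rewards) % num_rollouts_per_prompt != 0:
        raise ValueError("len(rewards) must be a multiple of num_rollouts_per_prompt")
    if not 0 < top_n_rollouts <= num_rollouts_per_prompt:
        raise ValueError("top_n_rollouts must be in [1, num_rollouts_per_prompt]")

    selected: List[int] = []
    for start in range(0, len(rewards), num_rollouts_per_prompt):
        group = range(start, start + num_rollouts_per_prompt)
        # Rank this prompt's candidates by reward, highest first, and keep the top K.
        best = sorted(group, key=lambda i: rewards[i], reverse=True)[:top_n_rollouts]
        selected.extend(best)
    return selected


rewards = [0.1, 0.9, 0.4, 0.7,   # candidates for prompt 0
           0.3, 0.2, 0.8, 0.5]   # candidates for prompt 1
kept = select_top_k_per_prompt(rewards, num_rollouts_per_prompt=4, top_n_rollouts=2)
print(kept)  # [1, 3, 6, 7]
```

Only the responses at the returned indices would be passed on to the fine-tuning step; the rest are discarded.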
This cycle repeats at every global step: generate rollouts, select the best, train on them, then move to the next batch of prompts. The SFT loss is the negative masked mean of the log-probabilities, where a mask built from the prompt and response lengths excludes prompt tokens so that only response tokens contribute.
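The masked loss can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the library's code: the helper names are hypothetical, `total_length` is assumed to mean the prompt plus response length, and the mean is taken over all response tokens in the batch (a per-sequence mean would differ when response lengths vary).

```python
from typing import List


def build_response_mask(prompt_length: int, total_length: int, padded_length: int) -> List[int]:
    """Hypothetical helper: 1 for response tokens, 0 for prompt and padding tokens.

    Assumes `total_length` counts prompt and response tokens together.
    """
    return [1 if prompt_length <= t < total_length else 0 for t in range(padded_length)]


def masked_nll(log_probs: List[List[float]], masks: List[List[int]]) -> float:
    """Negative mean of the log-probabilities over all positions where the mask is 1."""
    total = 0.0
    count = 0
    for lp_row, mask_row in zip(log_probs, masks):
        for lp, m in zip(lp_row, mask_row):
            total += lp * m
            count += m
    if count == 0:
        raise ValueError("mask selects no tokens")
    return -total / count


# One sequence: 2 prompt tokens, 2 response tokens, 1 padding token.
mask = build_response_mask(prompt_length=2, total_length=4, padded_length=5)
log_probs = [[-0.5, -1.0, -2.0, -1.0, 0.0]]
loss = masked_nll(log_probs, [mask])
print(mask, loss)  # [0, 0, 1, 1, 0] 1.5
```

The prompt log-probabilities (-0.5 and -1.0) do not affect the result, which is the point of the mask.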
Usage
Rejection Sampling Training is appropriate when:
- You have access to a reliable reward model that can score generated responses.
- You want a simpler alternative to PPO-based RLHF that avoids computing advantages, value functions, and policy gradients.
- You prefer an approach that only trains on high-quality outputs rather than using reward signals to adjust policy gradients directly.
- You want to iteratively improve the model by bootstrapping from its own best generations.
Theoretical Basis
Rejection Sampling Training is grounded in the Best-of-N sampling strategy. Given a prompt, the model samples N responses independently. The reward model assigns a scalar score to each response, and the top-K (where K <= N) responses are retained. The model is then updated via supervised fine-tuning on these selected responses.
Formally, for a prompt x, the model generates responses y_1, y_2, ..., y_N ~ pi(.|x). The reward model computes r(x, y_i) for each response. The training set is then:
S = { (x, y_i) : y_i in top-K({ y_1, ..., y_N }, r) }
The SFT loss on the selected responses is:
L = -E_{(x,y) in S} [ (1/|y|) * sum_{t} log pi(y_t | x, y_{<t}) ]
This approach can be viewed as weighted SFT on the model's own samples, where the weight is an indicator function: 1 for a response among the top-K for its prompt and 0 otherwise. Over multiple iterations, the policy distribution shifts toward generating responses that score highly under the reward model.
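The indicator-weight view can be written out explicitly. The following is a sketch, assuming the sampled responses are treated as fixed data (no gradient flows through sampling) and that the loss is normalized by K per prompt; the symbols w_i and theta are introduced here for illustration.

```latex
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{x}\!\left[
      \frac{1}{K} \sum_{i=1}^{N} w_i \cdot \frac{1}{|y_i|}
      \sum_{t} \log \pi_\theta\!\left(y_{i,t} \mid x,\, y_{i,<t}\right)
    \right],
\qquad
w_i = \mathbb{1}\!\left[\, y_i \in \text{top-}K\!\left(\{y_1, \dots, y_N\},\, r\right) \right]
```

Because exactly K of the weights equal 1 for each prompt, this reduces to the SFT loss over the selected set S given above.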
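The key advantage over PPO is simplicity: there is no need for a critic network, advantage estimation, or clipping. The key trade-off is sample efficiency, since only K of the N responses generated per prompt are used for training. With N = 8 and K = 1, for example, seven of every eight generated responses are discarded.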