Principle:NVIDIA NeMo Aligner REINFORCE Training
| Principle: REINFORCE Training | |
|---|---|
| Type | Principle |
| Project | NVIDIA NeMo Aligner |
| Domains | Reinforcement_Learning, NLP |
| Related | Implementation:NVIDIA_NeMo_Aligner_ReinforceTrainer_Fit |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
REINFORCE-based policy gradient training loop for language model alignment without a critic network.
Description
REINFORCE training is a simpler alternative to PPO for RLHF that uses policy gradients without a learned value function. The training loop follows these steps:
- Generate responses to prompts using the actor model.
- Score responses using the remote reward model.
- Apply a KL penalty between the current and reference policies to the rewards.
- Compute advantages using the RLOO baseline (leave-one-out mean of the rewards for the other responses to the same prompt).
- Update actor weights using the REINFORCE loss (advantage-weighted log probabilities).
Without a critic to train, the infrastructure is simpler:
- Only a reward model server is needed (no critic server).
- Each training step involves fewer network roundtrips (no critic inference or training calls).
- Fewer hyperparameters to tune (no critic learning rate, GAE lambda, or value function coefficient).
The RLOO variant generates multiple responses per prompt and uses the leave-one-out mean reward as a per-sample baseline, providing effective variance reduction without any learned baseline function.
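The leave-one-out baseline described above can be sketched in a few lines of plain Python (the function name is illustrative, not part of NeMo Aligner's API). Because each sample's baseline is the mean of the other samples' rewards, the resulting advantages for a prompt always sum to zero, which is the variance-reduction effect at work:

```python
def rloo_advantages(rewards):
    """Advantage_i = r_i minus the leave-one-out mean of the other
    rewards for responses to the same prompt (illustrative sketch)."""
    n = len(rewards)
    total = sum(rewards)
    # (total - r) / (n - 1) is the mean of all rewards except r
    return [r - (total - r) / (n - 1) for r in rewards]

# Four responses to one prompt, scored by a reward model
advs = rloo_advantages([1.0, 0.0, 0.5, 0.5])
```

Note that no parameters are learned: the baseline comes entirely from the other samples in the same rollout, which is why no critic is needed.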
Usage
Use as a simpler alternative to PPO when you want online RL alignment but with reduced infrastructure complexity.
- REINFORCE/RLOO achieves results competitive with PPO on many tasks while being easier to tune.
- Requires a trained reward model server (inference-only).
- No critic server or value function training needed.
- Particularly effective when combined with RLOO (multiple samples per prompt) for variance reduction.
- Trade-off: higher variance gradients than PPO (no value function baseline), but RLOO mitigates this.
Theoretical Basis
REINFORCE loss:
L = -E[ log pi_theta(y|x) * (r(x, y) - beta * KL - baseline) ]
RLOO baseline for sample i:
b_i = (1 / (n - 1)) * sum over j != i of r(x, y_j)
where:
n = number of responses generated per prompt
r(x, y_j) = reward for the j-th response
The total reward includes the KL penalty:
r_total = r_external - beta * KL(pi_theta || pi_ref)
Updates use standard policy gradient ascent. The RLOO baseline is an unbiased estimator that reduces variance by leveraging the correlation among rewards for responses to the same prompt.
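The formulas above can be checked end to end on toy numbers (all values below are illustrative, not from a real run): the per-sample KL estimate is the actor/reference log-prob difference, it is subtracted from the external reward with weight beta, the RLOO baseline is removed, and the loss averages the advantage-weighted log probabilities:

```python
# Toy numbers for three responses to one prompt (illustrative only)
log_probs_actor = [-2.0, -3.0, -2.5]   # log pi_theta(y_i | x)
log_probs_ref   = [-2.2, -2.8, -2.5]   # log pi_ref(y_i | x)
rewards_ext     = [1.0, 0.0, 0.5]      # reward model scores
beta = 0.1                             # KL penalty weight

# r_total = r_external - beta * KL, with KL estimated per sample
# as the log-prob ratio log pi_theta - log pi_ref
kl = [a - r for a, r in zip(log_probs_actor, log_probs_ref)]
r_total = [r - beta * k for r, k in zip(rewards_ext, kl)]

# RLOO baseline: leave-one-out mean of the other samples' rewards
n = len(r_total)
total = sum(r_total)
advs = [r - (total - r) / (n - 1) for r in r_total]

# REINFORCE loss: negative mean of advantage-weighted log probs
loss = -sum(lp * a for lp, a in zip(log_probs_actor, advs)) / n
```

Working through the numbers: kl = [0.2, -0.2, 0.0], r_total = [0.98, 0.02, 0.5], advs = [0.72, -0.72, 0.0], loss = -0.24.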
Pseudo-code
FUNCTION reinforce_training_loop(actor, rm_client, ref_policy, dataloader, config):
    FOR each training step:
        # --- Rollout Phase ---
        prompts = sample_batch(dataloader)

        # Generate multiple responses per prompt for RLOO
        all_responses = []
        FOR each prompt in prompts:
            FOR i in range(config.num_responses_per_prompt):
                response = actor.generate(prompt)
                all_responses.append(response)

        # Score with the remote reward model
        rewards = rm_client.infer(prompts, all_responses)

        # KL penalty between current and reference policy
        actor_log_probs = actor.compute_log_probs(prompts, all_responses)
        ref_log_probs = ref_policy.compute_log_probs(prompts, all_responses)
        kl_penalty = actor_log_probs - ref_log_probs
        adjusted_rewards = rewards - config.beta * kl_penalty

        # RLOO baseline: leave-one-out mean over responses to the same prompt
        advantages = []
        FOR each sample i:
            baseline = mean(adjusted_rewards[j] FOR j != i, same prompt)
            advantages.append(adjusted_rewards[i] - baseline)

        # --- Update Phase ---
        loss = -mean(actor_log_probs * advantages)
        update_actor(actor, loss)
    RETURN actor
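The loop above can be instantiated end to end on a toy problem. The sketch below (all names and the bandit-style setup are invented for illustration; a real run uses NeMo Aligner's actor, reward model server, and reference policy) trains a softmax "policy" over four candidate responses, where only response 2 is rewarded, using sampled rollouts, a log-prob-ratio KL penalty against a frozen reference, the RLOO baseline, and a manual policy-gradient update:

```python
import math
import random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy setup: 4 candidate "responses", response 2 gets reward 1.0
logits = [0.0, 0.0, 0.0, 0.0]
ref_probs = softmax(logits)      # frozen reference policy
beta, lr, n = 0.05, 0.5, 4       # KL weight, step size, responses per prompt

def reward(a):
    return 1.0 if a == 2 else 0.0

for step in range(200):
    probs = softmax(logits)
    # Rollout: n sampled responses for the (single) prompt
    actions = [random.choices(range(4), weights=probs)[0] for _ in range(n)]
    # KL-adjusted reward, with KL estimated as the log-prob ratio
    adj = [reward(a) - beta * (math.log(probs[a]) - math.log(ref_probs[a]))
           for a in actions]
    total = sum(adj)
    for a, r in zip(actions, adj):
        baseline = (total - r) / (n - 1)          # leave-one-out mean
        adv = r - baseline
        # REINFORCE update: grad of log softmax wrt logits is onehot(a) - probs
        for k in range(4):
            logits[k] += lr * adv * ((1.0 if k == a else 0.0) - probs[k])

final_probs = softmax(logits)
```

After training, the policy concentrates its mass on the rewarded response while the KL term keeps it from collapsing to a point mass instantly; this mirrors, in miniature, how the actor is pulled toward high-reward responses while staying anchored to the reference policy.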
Related Pages
- Implementation:NVIDIA_NeMo_Aligner_ReinforceTrainer_Fit
- Heuristic:NVIDIA_NeMo_Aligner_Higher_Stability_Log_Probs
- Heuristic:NVIDIA_NeMo_Aligner_Adam_State_Offloading_Tip
- Heuristic:NVIDIA_NeMo_Aligner_PPO_NCCL_Algorithm_Setting