Principle:NVIDIA NeMo Aligner REINFORCE Training

From Leeroopedia


Principle: REINFORCE Training
Type Principle
Project NVIDIA NeMo Aligner
Domains Reinforcement_Learning, NLP
Related Implementation:NVIDIA_NeMo_Aligner_ReinforceTrainer_Fit
Last Updated 2026-02-07 00:00 GMT

Overview

REINFORCE-based policy gradient training loop for language model alignment without a critic network.

Description

REINFORCE training is a simpler alternative to PPO for RLHF that uses policy gradients without a learned value function. The training loop follows these steps:

  1. Generate responses to prompts using the actor model.
  2. Score responses using the remote reward model.
  3. Compute advantages using the RLOO baseline (for each response, the leave-one-out mean of rewards over the other responses to the same prompt).
  4. Apply KL penalty between current and reference policy to the rewards.
  5. Update actor weights using the REINFORCE loss (reward-weighted log probabilities).

Without a critic to train, the infrastructure is simpler:

  • Only a reward model server is needed (no critic server).
  • Each training step involves fewer network roundtrips (no critic inference or training calls).
  • Fewer hyperparameters to tune (no critic learning rate, GAE lambda, or value function coefficient).

The RLOO variant generates multiple responses per prompt and uses the leave-one-out mean reward as a per-sample baseline, providing effective variance reduction without any learned baseline function.
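For example, the leave-one-out baseline and resulting advantages for one prompt can be computed as follows (the reward values are made up for illustration):

```python
import numpy as np

# Hypothetical KL-adjusted rewards for k = 4 responses to the same prompt
rewards = np.array([1.2, 0.8, 0.5, 1.5])
k = len(rewards)

# RLOO baseline: mean reward of the other k - 1 responses
baseline = (rewards.sum() - rewards) / (k - 1)
advantages = rewards - baseline
```

A useful sanity check: the advantages always sum to zero within a prompt group, so the baseline centers the rewards without needing any learned function.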

Usage

Use as a simpler alternative to PPO when you want online RL alignment but with reduced infrastructure complexity.

  • REINFORCE/RLOO achieves results competitive with PPO on many tasks while being easier to tune.
  • Requires a trained reward model server (inference-only).
  • No critic server or value function training needed.
  • Particularly effective when combined with RLOO (multiple samples per prompt) for variance reduction.
  • Trade-off: higher variance gradients than PPO (no value function baseline), but RLOO mitigates this.

Theoretical Basis

REINFORCE loss:

L = -E[ log pi_theta(y|x) * (r(x, y) - beta * KL - baseline) ]

RLOO baseline for sample i:

b_i = (1 / (n - 1)) * sum over j != i of r(x, y_j)

where:
  n = number of responses generated per prompt
  r(x, y_j) = reward for the j-th response

The total reward includes the KL penalty:

r_total = r_external - beta * KL(pi_theta || pi_ref)
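A small numeric sketch of the KL-adjusted reward, using the per-sample KL estimate log pi_theta(y|x) - log pi_ref(y|x) (all values below are illustrative, not outputs of NeMo Aligner):

```python
import numpy as np

# Hypothetical sequence-level quantities for 3 responses to one prompt
r_external = np.array([0.9, 0.4, 0.7])        # reward model scores
actor_logp = np.array([-12.0, -15.0, -13.5])  # log pi_theta(y|x)
ref_logp = np.array([-13.0, -14.5, -13.5])    # log pi_ref(y|x)
beta = 0.1                                    # KL penalty weight

# Per-sample estimate of KL(pi_theta || pi_ref)
kl_est = actor_logp - ref_logp
r_total = r_external - beta * kl_est          # [0.8, 0.45, 0.7]
```

Responses that drift above the reference policy's probability are penalized; the third response matches the reference exactly, so its reward is unchanged.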

Updates use standard policy gradient ascent. The RLOO baseline is an unbiased estimator that reduces variance by leveraging the correlation among rewards for responses to the same prompt.
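The unbiasedness claim follows in one line (a standard score-function identity, sketched here rather than quoted from the source): because the n responses are sampled independently given the prompt, b_i is independent of y_i, so

```latex
\mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(y_i \mid x)\, b_i\right]
  = \mathbb{E}[b_i]\;\mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(y_i \mid x)\right]
  = \mathbb{E}[b_i] \cdot 0
  = 0
```

Subtracting b_i therefore leaves the expected policy gradient unchanged while reducing its variance.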

Pseudo-code

FUNCTION reinforce_training_loop(actor, rm_client, ref_policy, dataloader, config):
    FOR each training step:
        # --- Rollout Phase ---
        prompts = sample_batch(dataloader)

        # Generate multiple responses per prompt for RLOO
        all_responses = []
        FOR each prompt in prompts:
            FOR i in range(config.num_responses_per_prompt):
                response = actor.generate(prompt)
                all_responses.append(response)

        # Score with reward model
        rewards = rm_client.infer(prompts, all_responses)

        # Compute KL penalty
        actor_log_probs = actor.compute_log_probs(prompts, all_responses)
        ref_log_probs = ref_policy.compute_log_probs(prompts, all_responses)
        kl_penalty = actor_log_probs - ref_log_probs
        adjusted_rewards = rewards - config.beta * kl_penalty

        # Compute RLOO baselines and advantages
        advantages = []
        FOR each sample i:
            baseline_i = mean(adjusted_rewards[j] for j != i, same prompt)
            advantages.append(adjusted_rewards[i] - baseline_i)

        # --- Update Phase ---
        loss = -mean(actor_log_probs * advantages)
        update_actor(actor, loss)

    RETURN actor
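The loop above can be exercised end to end with a toy stand-in for the actor: a softmax policy over four candidate responses to a single fixed prompt, with the reward model replaced by a fixed reward table. Everything here is illustrative (no NeMo Aligner APIs); it only demonstrates that the RLOO-baselined REINFORCE update concentrates probability on the highest-reward response.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "actor": softmax over 4 candidate responses; response 3 is best.
true_rewards = np.array([0.0, 0.2, 0.5, 1.0])
logits = np.zeros(4)
ref_log_probs = np.log(np.full(4, 0.25))   # uniform reference policy

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

beta, lr, k = 0.05, 0.5, 8   # KL weight, step size, responses per prompt

for step in range(300):
    probs = softmax(logits)
    samples = rng.choice(4, size=k, p=probs)     # rollout phase
    log_p = np.log(probs[samples])
    kl_est = log_p - ref_log_probs[samples]      # per-sample KL estimate
    r = true_rewards[samples] - beta * kl_est    # KL-adjusted reward

    baseline = (r.sum() - r) / (k - 1)           # RLOO baseline
    adv = r - baseline

    # REINFORCE gradient of mean(log_p * adv) for a softmax policy:
    # each sample contributes adv * (one_hot(sample) - probs)
    grad = np.zeros(4)
    for s, a in zip(samples, adv):
        grad[s] += a
        grad -= a * probs
    logits += lr * grad / k

final_probs = softmax(logits)
```

After training, `final_probs` should peak on response 3; note that when all k samples agree, the RLOO advantages are exactly zero and the update vanishes, which is the variance reduction at work.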
