Principle:NVIDIA NeMo Aligner REINFORCE Training

From Leeroopedia


Principle: REINFORCE Training
Type Principle
Project NVIDIA NeMo Aligner
Domains Reinforcement_Learning, NLP
Related Implementation:NVIDIA_NeMo_Aligner_ReinforceTrainer_Fit
Last Updated 2026-02-07 00:00 GMT

Overview

REINFORCE-based policy gradient training loop for language model alignment without a critic network.

Description

REINFORCE training is a simpler alternative to PPO for RLHF that uses policy gradients without a learned value function. The training loop follows these steps:

  1. Generate responses to prompts using the actor model.
  2. Score responses using the remote reward model.
  3. Compute advantages using the RLOO baseline (for each response, the leave-one-out mean of rewards over the other responses to the same prompt).
  4. Apply KL penalty between current and reference policy to the rewards.
  5. Update actor weights using the REINFORCE loss (reward-weighted log probabilities).

Without a critic to train, the infrastructure is simpler:

  • Only a reward model server is needed (no critic server).
  • Each training step involves fewer network roundtrips (no critic inference or training calls).
  • Fewer hyperparameters to tune (no critic learning rate, GAE lambda, or value function coefficient).

The RLOO variant generates multiple responses per prompt and uses the leave-one-out mean reward as a per-sample baseline, providing effective variance reduction without any learned baseline function.
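For example, the leave-one-out baseline and resulting advantages for one prompt can be computed as follows (the reward values are made up for illustration):

```python
import numpy as np

# Hypothetical KL-adjusted rewards for k = 4 responses to the same prompt
rewards = np.array([1.2, 0.8, 0.5, 1.5])
k = len(rewards)

# RLOO baseline: mean reward of the other k - 1 responses
baseline = (rewards.sum() - rewards) / (k - 1)
advantages = rewards - baseline
```

A useful sanity check: the advantages always sum to zero within a prompt group, so the baseline centers the rewards without needing any learned function.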

Usage

Use as a simpler alternative to PPO when you want online RL alignment but with reduced infrastructure complexity.

  • REINFORCE/RLOO achieves results competitive with PPO on many tasks while being easier to tune.
  • Requires a trained reward model server (inference-only).
  • No critic server or value function training needed.
  • Particularly effective when combined with RLOO (multiple samples per prompt) for variance reduction.
  • Trade-off: higher variance gradients than PPO (no value function baseline), but RLOO mitigates this.

Theoretical Basis

REINFORCE loss:

L = -E[ log pi_theta(y|x) * (r(x, y) - beta * KL - baseline) ]

RLOO baseline for sample i:

b_i = (1 / (n - 1)) * sum over j != i of r(x, y_j)

where:
  n = number of responses generated per prompt
  r(x, y_j) = reward for the j-th response

The total reward includes the KL penalty:

r_total = r_external - beta * KL(pi_theta || pi_ref)
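A small numeric sketch of the KL-adjusted reward, using the per-sample KL estimate log pi_theta(y|x) - log pi_ref(y|x) (all values below are illustrative, not outputs of NeMo Aligner):

```python
import numpy as np

# Hypothetical sequence-level quantities for 3 responses to one prompt
r_external = np.array([0.9, 0.4, 0.7])        # reward model scores
actor_logp = np.array([-12.0, -15.0, -13.5])  # log pi_theta(y|x)
ref_logp = np.array([-13.0, -14.5, -13.5])    # log pi_ref(y|x)
beta = 0.1                                    # KL penalty weight

# Per-sample estimate of KL(pi_theta || pi_ref)
kl_est = actor_logp - ref_logp
r_total = r_external - beta * kl_est          # [0.8, 0.45, 0.7]
```

Responses that drift above the reference policy's probability are penalized; the third response matches the reference exactly, so its reward is unchanged.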

Updates use standard policy gradient ascent. The RLOO baseline is an unbiased estimator that reduces variance by leveraging the correlation among rewards for responses to the same prompt.
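The unbiasedness claim follows in one line (a standard score-function identity, sketched here rather than quoted from the source): because the n responses are sampled independently given the prompt, b_i is independent of y_i, so

```latex
\mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(y_i \mid x)\, b_i\right]
  = \mathbb{E}[b_i]\;\mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(y_i \mid x)\right]
  = \mathbb{E}[b_i] \cdot 0
  = 0
```

Subtracting b_i therefore leaves the expected policy gradient unchanged while reducing its variance.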

Pseudo-code

FUNCTION reinforce_training_loop(actor, rm_client, ref_policy, dataloader, config):
    FOR each training step:
        # --- Rollout Phase ---
        prompts = sample_batch(dataloader)

        # Generate multiple responses per prompt for RLOO
        all_responses = []
        FOR each prompt in prompts:
            FOR i in range(config.num_responses_per_prompt):
                response = actor.generate(prompt)
                all_responses.append(response)

        # Score with reward model
        rewards = rm_client.infer(prompts, all_responses)

        # Compute KL penalty
        actor_log_probs = actor.compute_log_probs(prompts, all_responses)
        ref_log_probs = ref_policy.compute_log_probs(prompts, all_responses)
        kl_penalty = actor_log_probs - ref_log_probs
        adjusted_rewards = rewards - config.beta * kl_penalty

        # Compute RLOO baselines and advantages
        advantages = []
        FOR each sample i:
            baseline_i = mean(adjusted_rewards[j] for j != i, same prompt)
            advantages.append(adjusted_rewards[i] - baseline_i)

        # --- Update Phase ---
        loss = -mean(actor_log_probs * advantages)
        update_actor(actor, loss)

    RETURN actor
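The loop above can be exercised end to end with a toy stand-in for the actor: a softmax policy over four candidate responses to a single fixed prompt, with the reward model replaced by a fixed reward table. Everything here is illustrative (no NeMo Aligner APIs); it only demonstrates that the RLOO-baselined REINFORCE update concentrates probability on the highest-reward response.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "actor": softmax over 4 candidate responses; response 3 is best.
true_rewards = np.array([0.0, 0.2, 0.5, 1.0])
logits = np.zeros(4)
ref_log_probs = np.log(np.full(4, 0.25))   # uniform reference policy

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

beta, lr, k = 0.05, 0.5, 8   # KL weight, step size, responses per prompt

for step in range(300):
    probs = softmax(logits)
    samples = rng.choice(4, size=k, p=probs)     # rollout phase
    log_p = np.log(probs[samples])
    kl_est = log_p - ref_log_probs[samples]      # per-sample KL estimate
    r = true_rewards[samples] - beta * kl_est    # KL-adjusted reward

    baseline = (r.sum() - r) / (k - 1)           # RLOO baseline
    adv = r - baseline

    # REINFORCE gradient of mean(log_p * adv) for a softmax policy:
    # each sample contributes adv * (one_hot(sample) - probs)
    grad = np.zeros(4)
    for s, a in zip(samples, adv):
        grad[s] += a
        grad -= a * probs
    logits += lr * grad / k

final_probs = softmax(logits)
```

After training, `final_probs` should peak on response 3; note that when all k samples agree, the RLOO advantages are exactly zero and the update vanishes, which is the variance reduction at work.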
