
Principle:NVIDIA NeMo Aligner REINFORCE Actor Setup

From Leeroopedia


Principle: REINFORCE Actor Setup
Type Principle
Project NVIDIA NeMo Aligner
Domains Reinforcement_Learning, NLP
Related Implementation:NVIDIA_NeMo_Aligner_MegatronGPT_Reinforce_Actor_And_RM_Client
Last Updated 2026-02-07 00:00 GMT

Overview

Initialization of the REINFORCE actor model and remote reward model client for critic-free RLHF training.

Description

REINFORCE training uses a simpler architecture than PPO by eliminating the critic network entirely. The actor model extends a pretrained GPT model with:

  • Generation capabilities -- sampling completions given prompts
  • REINFORCE policy gradient loss -- reward-weighted log probabilities instead of PPO's clipped ratio objective

The REINFORCE actor computes the loss as:

loss = -log_prob * (reward - baseline)

rather than PPO's clipped importance-ratio objective.
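This loss can be sketched in a few lines of plain Python. The function name and the per-token averaging are illustrative choices, not the exact NeMo Aligner implementation:

```python
def reinforce_loss(log_probs, reward, baseline):
    """REINFORCE loss for one sampled completion.

    log_probs: per-token log-probabilities of the generated tokens
    reward:    scalar reward from the reward model
    baseline:  scalar baseline for variance reduction
    """
    advantage = reward - baseline
    # Negative sign: minimizing this loss performs gradient ascent on reward.
    return sum(-lp * advantage for lp in log_probs) / len(log_probs)
```

Note that, unlike PPO, no importance ratio or clipping appears: the sampled log-probabilities are weighted directly by the centered reward.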

The remote client communicates only with a reward model server (no critic server needed). This simplifies the infrastructure requirements compared to PPO, which needs both a critic server and a reward server.

A reference policy is maintained for KL penalty computation, identical to the PPO setup -- the frozen initial weights are stored in CPU memory and used to compute the KL divergence term in the total reward.

Usage

Use when setting up REINFORCE/RLOO training as a simpler alternative to PPO.

  • Requires a running reward model server but no critic server.
  • The actor supports optional TRT-LLM acceleration for faster generation during rollouts.
  • RLOO (REINFORCE Leave-One-Out) uses other samples in the batch as baseline instead of a learned value function, requiring no additional model infrastructure.
  • Fewer hyperparameters to tune compared to PPO (no critic learning rate, no GAE lambda, no value function coefficient).

Theoretical Basis

REINFORCE policy gradient:

gradient of J = E[ gradient of log pi_theta(a|s) * (R - b) ]

where:
  pi_theta = policy (actor model)
  R = reward from reward model
  b = baseline for variance reduction

RLOO (Leave-One-Out) baseline:

b_i = (1 / (n - 1)) * sum over j != i of R_j

where:
  n = number of samples in the batch
  R_j = reward for sample j

This eliminates critic training while maintaining variance reduction through the leave-one-out baseline. The baseline is computed purely from rewards of other samples in the same batch, requiring no learned parameters.
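The leave-one-out baseline above can be computed directly from the batch rewards (plain-Python sketch; names are illustrative):

```python
def rloo_baselines(rewards):
    """Leave-one-out baseline: mean reward of the *other* samples in the batch.

    b_i = (sum of all rewards - R_i) / (n - 1)
    """
    n = len(rewards)
    total = sum(rewards)
    return [(total - r) / (n - 1) for r in rewards]
```

Each sample's baseline excludes its own reward, so the baseline is independent of the sample's action and the gradient estimate remains unbiased.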

Pseudo-code

FUNCTION setup_reinforce_actor_and_rm_client(pretrained_model, rm_server_url):
    # Load the actor model with generation and REINFORCE capabilities
    actor = load_pretrained_model(pretrained_model)
    actor = extend_with_reinforce_loss(actor)

    # Optionally enable TRT-LLM for accelerated generation
    IF config.use_trt_llm:
        actor.init_trt_llm_generation()

    # Store reference policy weights on CPU
    reference_policy = copy_state_dict_to_cpu(actor)

    # Connect to remote reward model server (no critic needed)
    rm_client = create_http_client(rm_server_url)
    rm_client.register_endpoints(["infer"])

    RETURN actor, rm_client, reference_policy
