Principle:NVIDIA NeMo Aligner REINFORCE Actor Setup
| Principle: REINFORCE Actor Setup | |
|---|---|
| Type | Principle |
| Project | NVIDIA NeMo Aligner |
| Domains | Reinforcement_Learning, NLP |
| Related | Implementation:NVIDIA_NeMo_Aligner_MegatronGPT_Reinforce_Actor_And_RM_Client |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Initialization of the REINFORCE actor model and remote reward model client for critic-free RLHF training.
Description
REINFORCE training uses a simpler architecture than PPO by eliminating the critic network entirely. The actor model extends a pretrained GPT model with:
- Generation capabilities -- sampling completions given prompts
- REINFORCE policy gradient loss -- reward-weighted log probabilities instead of PPO's clipped ratio objective
Unlike PPO's clipped surrogate objective, the REINFORCE actor computes the loss directly as:
loss = -log_prob * (reward - baseline)
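The loss above can be sketched in a few lines. This is a minimal illustration with plain floats; in actual training these quantities are per-token tensors and the advantage is detached from the gradient graph:

```python
def reinforce_loss(log_probs, rewards, baselines):
    """Batch-mean REINFORCE loss: -log_prob * (reward - baseline).

    Plain floats keep the sketch minimal; real implementations operate
    on per-token log-prob tensors and mask out prompt tokens.
    """
    terms = [-lp * (r - b) for lp, r, b in zip(log_probs, rewards, baselines)]
    return sum(terms) / len(terms)
```

Note that the baseline only reduces variance; it does not change the expected gradient, since E[grad log pi * b] = 0 for any baseline independent of the action.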
The remote client communicates only with a reward model server (no critic server needed). This simplifies the infrastructure requirements compared to PPO, which needs both a critic server and a reward server.
A reference policy is maintained for KL penalty computation, identical to the PPO setup -- the frozen initial weights are stored in CPU memory and used to compute the KL divergence term in the total reward.
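The KL-penalized total reward can be sketched as follows. The coefficient name `kl_coef` and the simple log-prob-difference KL estimator are illustrative assumptions, not NeMo Aligner's exact config key or estimator:

```python
def kl_penalized_reward(rm_reward, actor_logprobs, ref_logprobs, kl_coef=0.01):
    """Total reward = RM reward minus a KL penalty against the frozen
    reference policy.

    The KL term is estimated from the per-token log-prob gap; `kl_coef`
    is a hypothetical name for the penalty coefficient.
    """
    kl_estimate = sum(a - r for a, r in zip(actor_logprobs, ref_logprobs))
    return rm_reward - kl_coef * kl_estimate
```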
Usage
Use when setting up REINFORCE/RLOO training as a simpler alternative to PPO.
- Requires a running reward model server but no critic server.
- The actor supports optional TRT-LLM acceleration for faster generation during rollouts.
- RLOO (REINFORCE Leave-One-Out) uses other samples in the batch as baseline instead of a learned value function, requiring no additional model infrastructure.
- Fewer hyperparameters to tune compared to PPO (no critic learning rate, no GAE lambda, no value function coefficient).
Theoretical Basis
REINFORCE policy gradient:
grad_theta J = E[ grad_theta log pi_theta(a|s) * (R - b) ]
where:
pi_theta = policy (actor model)
R = reward from reward model
b = baseline for variance reduction
RLOO (Leave-One-Out) baseline:
b_i = (1 / (n - 1)) * sum over j != i of R_j
where:
n = number of samples in the batch
R_j = reward for sample j
This eliminates critic training while maintaining variance reduction through the leave-one-out baseline. The baseline is computed purely from rewards of other samples in the same batch, requiring no learned parameters.
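The leave-one-out baseline formula above reduces to a one-pass computation over the batch rewards, a sketch of which is:

```python
def rloo_baselines(rewards):
    """Leave-one-out baseline b_i = mean reward of the other n-1 samples.

    Computed from the batch total so each baseline is O(1):
    b_i = (sum - R_i) / (n - 1).
    """
    n = len(rewards)
    total = sum(rewards)
    return [(total - r) / (n - 1) for r in rewards]
```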
Pseudo-code
FUNCTION setup_reinforce_actor_and_rm_client(pretrained_model, rm_server_url):
    # Load the actor model with generation and REINFORCE capabilities
    actor = load_pretrained_model(pretrained_model)
    actor = extend_with_reinforce_loss(actor)

    # Optionally enable TRT-LLM for accelerated generation
    IF config.use_trt_llm:
        actor.init_trt_llm_generation()

    # Store reference policy weights on CPU for KL computation
    reference_policy = copy_state_dict_to_cpu(actor)

    # Connect to the remote reward model server (no critic needed)
    rm_client = create_http_client(rm_server_url)
    rm_client.register_endpoints(["infer"])

    RETURN actor, rm_client, reference_policy
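The reward model client from the pseudo-code can be sketched as a thin HTTP wrapper. The `infer` endpoint name comes from the pseudo-code above, but the JSON request/response schema here is an illustrative assumption, not NeMo Aligner's actual wire format:

```python
import json
import urllib.request


def create_rm_client(rm_server_url):
    """Minimal reward-model HTTP client sketch.

    Assumes the server exposes POST <url>/infer; the payload/response
    keys ("sentences", "rewards") are hypothetical placeholders.
    """
    def infer(texts):
        payload = json.dumps({"sentences": texts}).encode("utf-8")
        req = urllib.request.Request(
            f"{rm_server_url}/infer",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["rewards"]
    return infer
```

Because only one server is involved, this client is all the remote infrastructure REINFORCE needs, versus PPO's separate critic and reward connections.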