Principle:NVIDIA NeMo Aligner REINFORCE Actor Setup
| Principle: REINFORCE Actor Setup | |
|---|---|
| Type | Principle |
| Project | NVIDIA NeMo Aligner |
| Domains | Reinforcement_Learning, NLP |
| Related | Implementation:NVIDIA_NeMo_Aligner_MegatronGPT_Reinforce_Actor_And_RM_Client |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Initialization of the REINFORCE actor model and remote reward model client for critic-free RLHF training.
Description
REINFORCE training uses a simpler architecture than PPO by eliminating the critic network entirely. The actor model extends a pretrained GPT model with:
- Generation capabilities -- sampling completions given prompts
- REINFORCE policy gradient loss -- reward-weighted log probabilities instead of PPO's clipped ratio objective
Unlike PPO's clipped surrogate objective, the REINFORCE actor computes the loss directly as:
loss = -log_prob * (reward - baseline)
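The loss above can be sketched in a few lines. This is a minimal illustration with plain floats; in actual training these quantities are per-token tensors and the advantage is detached from the gradient graph:

```python
def reinforce_loss(log_probs, rewards, baselines):
    """Batch-mean REINFORCE loss: -log_prob * (reward - baseline).

    Plain floats keep the sketch minimal; real implementations operate
    on per-token log-prob tensors and mask out prompt tokens.
    """
    terms = [-lp * (r - b) for lp, r, b in zip(log_probs, rewards, baselines)]
    return sum(terms) / len(terms)
```

Note that the baseline only reduces variance; it does not change the expected gradient, since E[grad log pi * b] = 0 for any baseline independent of the action.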
The remote client communicates only with a reward model server (no critic server needed). This simplifies the infrastructure requirements compared to PPO, which needs both a critic server and a reward server.
A reference policy is maintained for KL penalty computation, identical to the PPO setup -- the frozen initial weights are stored in CPU memory and used to compute the KL divergence term in the total reward.
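The KL-penalized total reward can be sketched as follows. The coefficient name `kl_coef` and the simple log-prob-difference KL estimator are illustrative assumptions, not NeMo Aligner's exact config key or estimator:

```python
def kl_penalized_reward(rm_reward, actor_logprobs, ref_logprobs, kl_coef=0.01):
    """Total reward = RM reward minus a KL penalty against the frozen
    reference policy.

    The KL term is estimated from the per-token log-prob gap; `kl_coef`
    is a hypothetical name for the penalty coefficient.
    """
    kl_estimate = sum(a - r for a, r in zip(actor_logprobs, ref_logprobs))
    return rm_reward - kl_coef * kl_estimate
```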
Usage
Use when setting up REINFORCE/RLOO training as a simpler alternative to PPO.
- Requires a running reward model server but no critic server.
- The actor supports optional TRT-LLM acceleration for faster generation during rollouts.
- RLOO (REINFORCE Leave-One-Out) uses other samples in the batch as baseline instead of a learned value function, requiring no additional model infrastructure.
- Fewer hyperparameters to tune compared to PPO (no critic learning rate, no GAE lambda, no value function coefficient).
Theoretical Basis
REINFORCE policy gradient:
grad_theta J = E[ grad_theta log pi_theta(a|s) * (R - b) ]
where:
pi_theta = policy (actor model)
R = reward from reward model
b = baseline for variance reduction
RLOO (Leave-One-Out) baseline:
b_i = (1 / (n - 1)) * sum over j != i of R_j
where:
n = number of samples in the batch
R_j = reward for sample j
This eliminates critic training while maintaining variance reduction through the leave-one-out baseline. The baseline is computed purely from rewards of other samples in the same batch, requiring no learned parameters.
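The leave-one-out baseline formula above reduces to a one-pass computation over the batch rewards, a sketch of which is:

```python
def rloo_baselines(rewards):
    """Leave-one-out baseline b_i = mean reward of the other n-1 samples.

    Computed from the batch total so each baseline is O(1):
    b_i = (sum - R_i) / (n - 1).
    """
    n = len(rewards)
    total = sum(rewards)
    return [(total - r) / (n - 1) for r in rewards]
```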
Pseudo-code
FUNCTION setup_reinforce_actor_and_rm_client(pretrained_model, rm_server_url):
    # Load the actor model with generation and REINFORCE capabilities
    actor = load_pretrained_model(pretrained_model)
    actor = extend_with_reinforce_loss(actor)

    # Optionally enable TRT-LLM for accelerated generation
    IF config.use_trt_llm:
        actor.init_trt_llm_generation()

    # Store reference policy weights on CPU for KL computation
    reference_policy = copy_state_dict_to_cpu(actor)

    # Connect to the remote reward model server (no critic needed)
    rm_client = create_http_client(rm_server_url)
    rm_client.register_endpoints(["infer"])

    RETURN actor, rm_client, reference_policy
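The reward model client from the pseudo-code can be sketched as a thin HTTP wrapper. The `infer` endpoint name comes from the pseudo-code above, but the JSON request/response schema here is an illustrative assumption, not NeMo Aligner's actual wire format:

```python
import json
import urllib.request


def create_rm_client(rm_server_url):
    """Minimal reward-model HTTP client sketch.

    Assumes the server exposes POST <url>/infer; the payload/response
    keys ("sentences", "rewards") are hypothetical placeholders.
    """
    def infer(texts):
        payload = json.dumps({"sentences": texts}).encode("utf-8")
        req = urllib.request.Request(
            f"{rm_server_url}/infer",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["rewards"]
    return infer
```

Because only one server is involved, this client is all the remote infrastructure REINFORCE needs, versus PPO's separate critic and reward connections.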