
Implementation:NVIDIA NeMo Aligner MegatronGPT Reinforce Actor And RM Client

From Leeroopedia


Implementation Details
Name MegatronGPT_Reinforce_Actor_And_RM_Client
Type API Doc
Implements Principle REINFORCE_Actor_Setup
Module nemo_aligner.models.nlp.gpt
Repository NeMo Aligner
Last Updated 2026-02-07 00:00 GMT

Overview

The NeMo Aligner models module provides concrete tools for REINFORCE actor model initialization and for HTTP communication with a remote reward model server.

Description

MegatronGPTReinforceActorModel extends MegatronGPTModel with REINFORCE-specific capabilities: text generation, log-probability computation, reference policy log-prob retrieval, and the REINFORCE policy gradient loss (-log_prob * (reward - baseline)). RemoteGPTRMClient communicates with a reward model server over HTTP and exposes a simpler interface than the PPO critic client (inference only, no training endpoint). The actor supports optional TRT-LLM acceleration for generation.
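The loss form above can be illustrated with a minimal NumPy sketch. This is not the NeMo Aligner implementation (which operates on token-level log-probs inside a Megatron pipeline); it only shows the -log_prob * (reward - baseline) computation with a batch-mean baseline:

```python
import numpy as np

def reinforce_loss(log_probs, rewards, baseline):
    """Per-sample REINFORCE policy-gradient loss: -log_prob * (reward - baseline).

    log_probs: summed log-probability of each generated response, shape (batch,)
    rewards:   scalar reward per response, shape (batch,)
    baseline:  variance-reducing baseline (here, the batch mean reward)
    """
    advantage = rewards - baseline
    return -(log_probs * advantage)

log_probs = np.array([-2.0, -1.5, -3.0])
rewards = np.array([1.0, 0.5, 0.0])
loss = reinforce_loss(log_probs, rewards, rewards.mean())
# baseline = 0.5, so advantages are [0.5, 0.0, -0.5]
```

Responses with above-baseline reward get their log-probability pushed up (positive loss contribution on the negative log-prob), below-baseline responses are pushed down, and the baseline term reduces gradient variance without biasing the estimate.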

Usage

Used in REINFORCE training scripts. The actor is loaded from a pretrained checkpoint. The RM client connects to a running reward model server.

Code Reference

Source Location

  • Repository: NeMo Aligner
  • File: nemo_aligner/models/nlp/gpt/megatron_gpt_reinforce_actor.py (L64-394), nemo_aligner/models/nlp/gpt/reward_critic_clients.py (L185-219)

Signature

class MegatronGPTReinforceActorModel(NLPAdapterModelMixin, MegatronGPTModel, AlignableGenerativeInterface):
    def __init__(self, cfg: DictConfig, trainer: Trainer):
        ...
    def infer(self, inference_batch: dict) -> dict:
        """Generate responses."""
    def get_init_policy_logprobs(self, response_tokens: Tensor) -> Tensor:
        """Compute reference policy log-probs for KL penalty."""

class RemoteGPTRMClient:
    def __init__(self, cfg: DictConfig):
        ...
    def infer_rm(self, rollout_batch: dict) -> RMFutureResult:
        """Get reward scores from remote server."""

Import

from nemo_aligner.models.nlp.gpt.megatron_gpt_reinforce_actor import MegatronGPTReinforceActorModel
from nemo_aligner.models.nlp.gpt.reward_critic_clients import RemoteGPTRMClient

I/O Contract

Inputs (MegatronGPTReinforceActorModel.infer)

Name Type Required Description
inference_batch dict Yes Dict with prompt token tensors

Outputs (MegatronGPTReinforceActorModel.infer)

Name Type Description
response_tokens Tensor Generated sequences
response_lengths Tensor Sequence lengths
prompt_lengths Tensor Prompt lengths
is_end Tensor EOS flags
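The output contract above can be checked mechanically. The helper below is hypothetical (not part of NeMo Aligner) and uses NumPy arrays as stand-ins for the tensors; it verifies that a rollout dict carries the four documented keys with a consistent batch dimension:

```python
import numpy as np

# Keys from the infer() output contract documented above.
EXPECTED_KEYS = {"response_tokens", "response_lengths", "prompt_lengths", "is_end"}

def check_rollout_batch(rollout: dict) -> dict:
    """Hypothetical sanity check: ensure a rollout dict matches infer()'s contract."""
    missing = EXPECTED_KEYS - rollout.keys()
    if missing:
        raise KeyError(f"rollout batch missing keys: {sorted(missing)}")
    batch = rollout["response_tokens"].shape[0]
    for key in ("response_lengths", "prompt_lengths", "is_end"):
        if rollout[key].shape[0] != batch:
            raise ValueError(f"{key} batch dim {rollout[key].shape[0]} != {batch}")
    return rollout

rollout = {
    "response_tokens": np.zeros((2, 8), dtype=np.int64),
    "response_lengths": np.array([5, 6]),
    "prompt_lengths": np.array([2, 3]),
    "is_end": np.array([1, 0]),
}
checked = check_rollout_batch(rollout)
```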

Inputs (RemoteGPTRMClient)

Name Type Required Description
cfg DictConfig Yes RM server connection config

Outputs (RemoteGPTRMClient.infer_rm)

Name Type Description
rewards np.ndarray Reward scores

Usage Examples

from nemo_aligner.models.nlp.gpt.megatron_gpt_reinforce_actor import MegatronGPTReinforceActorModel
from nemo_aligner.models.nlp.gpt.reward_critic_clients import RemoteGPTRMClient

actor = load_from_nemo(MegatronGPTReinforceActorModel, model_cfg, trainer, restore_path=path)
rm_client = RemoteGPTRMClient(cfg.remote_rm)

rollout = actor.infer(prompt_batch)
result = rm_client.infer_rm(rollout)
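Note that infer_rm returns a future-style result rather than raw scores, so generation of the next batch can overlap with remote scoring. The mock below is an illustration only, assuming a .result()-style resolution API; the real client issues an HTTP request to the RM server instead of the toy scoring function used here:

```python
from concurrent.futures import ThreadPoolExecutor

class MockRMClient:
    """Illustration only: mimics the future-style API of RemoteGPTRMClient.

    The real client posts the rollout to a reward model server over HTTP;
    here a thread pool stands in for the asynchronous remote call.
    """
    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)

    def infer_rm(self, rollout_batch):
        # Returns immediately; scoring runs in the background.
        return self._pool.submit(self._score, rollout_batch)

    def _score(self, rollout_batch):
        # Stand-in for the remote reward model: reward = response length.
        return [len(toks) for toks in rollout_batch["response_tokens"]]

client = MockRMClient()
future = client.infer_rm({"response_tokens": [[1, 2, 3], [4, 5]]})
rewards = future.result()  # blocks until the "server" responds
```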

Related Pages

Knowledge Sources

Reinforcement_Learning, NLP
