
Implementation:NVIDIA NeMo Aligner MegatronGPT Reinforce Actor And RM Client

From Leeroopedia


Implementation Details
Name MegatronGPT_Reinforce_Actor_And_RM_Client
Type API Doc
Implements Principle REINFORCE_Actor_Setup
Module nemo_aligner.models.nlp.gpt
Repository NeMo Aligner
Last Updated 2026-02-07 00:00 GMT

Overview

The NeMo Aligner models module provides concrete tools for REINFORCE actor model initialization and for HTTP communication with a remote reward model server.

Description

MegatronGPTReinforceActorModel extends MegatronGPTModel with REINFORCE-specific capabilities: text generation, log-probability computation, reference policy log-prob retrieval, and the REINFORCE policy gradient loss (-log_prob * (reward - baseline)). RemoteGPTRMClient communicates with a reward model server over HTTP and exposes a simpler interface than the PPO critic client (inference only, no training endpoint). The actor supports optional TRT-LLM acceleration for generation.
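The loss form above can be illustrated with a minimal NumPy sketch. This is not the NeMo Aligner implementation (which operates on token-level log-probs inside a Megatron pipeline); it only shows the -log_prob * (reward - baseline) computation with a batch-mean baseline:

```python
import numpy as np

def reinforce_loss(log_probs, rewards, baseline):
    """Per-sample REINFORCE policy-gradient loss: -log_prob * (reward - baseline).

    log_probs: summed log-probability of each generated response, shape (batch,)
    rewards:   scalar reward per response, shape (batch,)
    baseline:  variance-reducing baseline (here, the batch mean reward)
    """
    advantage = rewards - baseline
    return -(log_probs * advantage)

log_probs = np.array([-2.0, -1.5, -3.0])
rewards = np.array([1.0, 0.5, 0.0])
loss = reinforce_loss(log_probs, rewards, rewards.mean())
# baseline = 0.5, so advantages are [0.5, 0.0, -0.5]
```

Responses with above-baseline reward get their log-probability pushed up (positive loss contribution on the negative log-prob), below-baseline responses are pushed down, and the baseline term reduces gradient variance without biasing the estimate.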

Usage

Used in REINFORCE training scripts. The actor is loaded from a pretrained checkpoint. The RM client connects to a running reward model server.

Code Reference

Source Location

  • Repository: NeMo Aligner
  • File: nemo_aligner/models/nlp/gpt/megatron_gpt_reinforce_actor.py (L64-394), nemo_aligner/models/nlp/gpt/reward_critic_clients.py (L185-219)

Signature

class MegatronGPTReinforceActorModel(NLPAdapterModelMixin, MegatronGPTModel, AlignableGenerativeInterface):
    def __init__(self, cfg: DictConfig, trainer: Trainer):
        ...
    def infer(self, inference_batch: dict) -> dict:
        """Generate responses."""
    def get_init_policy_logprobs(self, response_tokens: Tensor) -> Tensor:
        """Compute reference policy log-probs for KL penalty."""

class RemoteGPTRMClient:
    def __init__(self, cfg: DictConfig):
        ...
    def infer_rm(self, rollout_batch: dict) -> RMFutureResult:
        """Get reward scores from remote server."""

Import

from nemo_aligner.models.nlp.gpt.megatron_gpt_reinforce_actor import MegatronGPTReinforceActorModel
from nemo_aligner.models.nlp.gpt.reward_critic_clients import RemoteGPTRMClient

I/O Contract

Inputs (MegatronGPTReinforceActorModel.infer)

Name Type Required Description
inference_batch dict Yes Dict with prompt token tensors

Outputs (MegatronGPTReinforceActorModel.infer)

Name Type Description
response_tokens Tensor Generated sequences
response_lengths Tensor Sequence lengths
prompt_lengths Tensor Prompt lengths
is_end Tensor EOS flags
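The output contract above can be checked mechanically. The helper below is hypothetical (not part of NeMo Aligner) and uses NumPy arrays as stand-ins for the tensors; it verifies that a rollout dict carries the four documented keys with a consistent batch dimension:

```python
import numpy as np

# Keys from the infer() output contract documented above.
EXPECTED_KEYS = {"response_tokens", "response_lengths", "prompt_lengths", "is_end"}

def check_rollout_batch(rollout: dict) -> dict:
    """Hypothetical sanity check: ensure a rollout dict matches infer()'s contract."""
    missing = EXPECTED_KEYS - rollout.keys()
    if missing:
        raise KeyError(f"rollout batch missing keys: {sorted(missing)}")
    batch = rollout["response_tokens"].shape[0]
    for key in ("response_lengths", "prompt_lengths", "is_end"):
        if rollout[key].shape[0] != batch:
            raise ValueError(f"{key} batch dim {rollout[key].shape[0]} != {batch}")
    return rollout

rollout = {
    "response_tokens": np.zeros((2, 8), dtype=np.int64),
    "response_lengths": np.array([5, 6]),
    "prompt_lengths": np.array([2, 3]),
    "is_end": np.array([1, 0]),
}
checked = check_rollout_batch(rollout)
```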

Inputs (RemoteGPTRMClient)

Name Type Required Description
cfg DictConfig Yes RM server connection config

Outputs (RemoteGPTRMClient.infer_rm)

Name Type Description
rewards np.ndarray Reward scores

Usage Examples

from nemo_aligner.models.nlp.gpt.megatron_gpt_reinforce_actor import MegatronGPTReinforceActorModel
from nemo_aligner.models.nlp.gpt.reward_critic_clients import RemoteGPTRMClient

actor = load_from_nemo(MegatronGPTReinforceActorModel, model_cfg, trainer, restore_path=path)
rm_client = RemoteGPTRMClient(cfg.remote_rm)

rollout = actor.infer(prompt_batch)
result = rm_client.infer_rm(rollout)
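Note that infer_rm returns a future-style result rather than raw scores, so generation of the next batch can overlap with remote scoring. The mock below is an illustration only, assuming a .result()-style resolution API; the real client issues an HTTP request to the RM server instead of the toy scoring function used here:

```python
from concurrent.futures import ThreadPoolExecutor

class MockRMClient:
    """Illustration only: mimics the future-style API of RemoteGPTRMClient.

    The real client posts the rollout to a reward model server over HTTP;
    here a thread pool stands in for the asynchronous remote call.
    """
    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)

    def infer_rm(self, rollout_batch):
        # Returns immediately; scoring runs in the background.
        return self._pool.submit(self._score, rollout_batch)

    def _score(self, rollout_batch):
        # Stand-in for the remote reward model: reward = response length.
        return [len(toks) for toks in rollout_batch["response_tokens"]]

client = MockRMClient()
future = client.infer_rm({"response_tokens": [[1, 2, 3], [4, 5]]})
rewards = future.result()  # blocks until the "server" responds
```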

Related Pages

Knowledge Sources

Reinforcement_Learning, NLP
