Implementation Details

| Field | Value |
|---|---|
| Name | MegatronGPT_Reinforce_Actor_And_RM_Client |
| Type | API Doc |
| Implements Principle | REINFORCE_Actor_Setup |
| Module | nemo_aligner.models.nlp.gpt |
| Repository | NeMo Aligner |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tooling, provided by the NeMo Aligner models module, for initializing a REINFORCE actor model and communicating with a remote reward model over HTTP.
Description
MegatronGPTReinforceActorModel extends MegatronGPTModel with REINFORCE-specific capabilities: text generation, log-probability computation, reference policy log-prob retrieval, and the REINFORCE policy gradient loss (-log_prob * (reward - baseline)). RemoteGPTRMClient communicates with a reward model server over HTTP and exposes a simpler interface than the PPO critic client (inference only, no training endpoint). The actor supports optional TRT-LLM acceleration for generation.
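The policy gradient loss described here can be sketched in a few lines. The following is a minimal NumPy illustration of the -log_prob * (reward - baseline) form with token masking, not the actual Megatron implementation (which operates on distributed tensors inside the training pipeline); all names are hypothetical:

```python
import numpy as np

def reinforce_loss(logprobs, rewards, baseline, mask):
    """Masked-mean REINFORCE loss: -log_prob * (reward - baseline).

    logprobs: (batch, seq) log-probs of the generated tokens
    rewards:  (batch,) scalar reward per response
    baseline: (batch,) variance-reduction baseline per response
    mask:     (batch, seq) 1.0 for response tokens, 0.0 for prompt/padding
    """
    advantage = (rewards - baseline)[:, None]   # broadcast advantage over tokens
    per_token = -logprobs * advantage * mask    # zero out non-response tokens
    return per_token.sum() / mask.sum()         # masked mean over valid tokens

logprobs = np.array([[-0.1, -0.2], [-0.3, -0.4]])
rewards = np.array([1.0, 0.0])
baseline = np.array([0.5, 0.5])
mask = np.ones((2, 2))
loss = reinforce_loss(logprobs, rewards, baseline, mask)
```

Responses with above-baseline reward have their token log-probs pushed up; below-baseline responses are pushed down.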
Usage
Used in REINFORCE training scripts. The actor is loaded from a pretrained checkpoint. The RM client connects to a running reward model server.
Code Reference
Source Location
- Repository: NeMo Aligner
- Files:
  - nemo_aligner/models/nlp/gpt/megatron_gpt_reinforce_actor.py (L64-394)
  - nemo_aligner/models/nlp/gpt/reward_critic_clients.py (L185-219)
Signature
```python
class MegatronGPTReinforceActorModel(NLPAdapterModelMixin, MegatronGPTModel, AlignableGenerativeInterface):
    def __init__(self, cfg: DictConfig, trainer: Trainer):
        ...

    def infer(self, inference_batch: dict) -> dict:
        """Generate responses."""

    def get_init_policy_logprobs(self, response_tokens: Tensor) -> Tensor:
        """Compute reference policy log-probs for KL penalty."""

class RemoteGPTRMClient:
    def __init__(self, cfg: DictConfig):
        ...

    def infer_rm(self, rollout_batch: dict) -> RMFutureResult:
        """Get reward scores from remote server."""
```
Import
```python
from nemo_aligner.models.nlp.gpt.megatron_gpt_reinforce_actor import MegatronGPTReinforceActorModel
from nemo_aligner.models.nlp.gpt.reward_critic_clients import RemoteGPTRMClient
```
I/O Contract
Inputs (MegatronGPTReinforceActorModel.infer)

| Name | Type | Required | Description |
|---|---|---|---|
| inference_batch | dict | Yes | Dict with prompt token tensors |

Outputs (MegatronGPTReinforceActorModel.infer)

| Name | Type | Description |
|---|---|---|
| response_tokens | Tensor | Generated sequences |
| response_lengths | Tensor | Sequence lengths |
| prompt_lengths | Tensor | Prompt lengths |
| is_end | Tensor | EOS flags |

Inputs (RemoteGPTRMClient)

| Name | Type | Required | Description |
|---|---|---|---|
| cfg | DictConfig | Yes | RM server connection config |

Outputs (RemoteGPTRMClient.infer_rm)

| Name | Type | Description |
|---|---|---|
| rewards | np.ndarray | Reward scores |
Usage Examples
```python
from nemo_aligner.models.nlp.gpt.megatron_gpt_reinforce_actor import MegatronGPTReinforceActorModel
from nemo_aligner.models.nlp.gpt.reward_critic_clients import RemoteGPTRMClient

# Load the actor from a pretrained checkpoint.
actor = load_from_nemo(MegatronGPTReinforceActorModel, model_cfg, trainer, restore_path=path)

# Connect to a running reward model server.
rm_client = RemoteGPTRMClient(cfg.remote_rm)

# Generate responses, then score them on the remote reward model.
rollout = actor.infer(prompt_batch)
result = rm_client.infer_rm(rollout)
```
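The reference policy log-probs returned by get_init_policy_logprobs are used for a KL penalty against the frozen initial policy. A hedged NumPy sketch of that step follows; the function name and the kl_beta coefficient are illustrative, not NeMo Aligner defaults:

```python
import numpy as np

def kl_penalized_rewards(rewards, logprobs, init_logprobs, mask, kl_beta=0.02):
    """Subtract a KL penalty toward the frozen initial policy from the rewards.

    Illustrative sketch; kl_beta is an assumed coefficient.
    """
    # Per-token KL estimate: actor log-prob minus reference policy log-prob.
    kl = (logprobs - init_logprobs) * mask
    # Penalize each response's scalar reward by its summed KL.
    return rewards - kl_beta * kl.sum(axis=-1)

rewards = np.array([1.0])
logprobs = np.array([[-0.1, -0.2]])       # actor log-probs
init_logprobs = np.array([[-0.3, -0.1]])  # reference policy log-probs
mask = np.ones((1, 2))
penalized = kl_penalized_rewards(rewards, logprobs, init_logprobs, mask)
```

The penalty discourages the actor from drifting far from the initial policy, which keeps reward-model scores in-distribution during training.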
Related Pages
Knowledge Sources
Reinforcement_Learning, NLP