
Implementation:NVIDIA NeMo Aligner RewardModelServer Run

From Leeroopedia


Implementation Details

  • Name: RewardModelServer_Run
  • Type: API Doc
  • Implements: Reward_Model_Serving
  • Repository: NeMo Aligner
  • Primary File: nemo_aligner/algorithms/reward_server.py
  • Domains: Serving, Distributed_Systems
  • Last Updated: 2026-02-07 00:00 GMT

Overview

A concrete tool, provided by the NeMo Aligner algorithms module, for serving reward models as HTTP inference endpoints via PyTriton.

Description

The RewardModelServer dataclass wraps a frozen reward model behind a PyTriton HTTP server. It configures dynamic batching, thread-safe inference, and distributed broadcasting for multi-GPU model parallelism. Rank 0 runs the PyTriton server, while the other ranks run a subscriber loop that listens for broadcast signals and executes synchronized inference. The server accepts batches of tokens or sentences and returns float32 reward scores.
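The micro-batching behaviour described above can be illustrated with a small sketch. This helper is hypothetical (not from the NeMo Aligner source): it splits a full request batch into chunks of `model_forward_micro_batch_size`, runs a forward function per chunk, and concatenates the per-chunk rewards.

```python
import numpy as np

def run_in_micro_batches(tokens: np.ndarray, micro_batch_size: int, forward_fn):
    """Split a batch along dim 0, run forward_fn on each chunk, and
    concatenate the per-chunk rewards into one float32 array."""
    chunks = [
        tokens[i : i + micro_batch_size]
        for i in range(0, len(tokens), micro_batch_size)
    ]
    rewards = [forward_fn(chunk) for chunk in chunks]
    return np.concatenate(rewards).astype(np.float32)

# Toy forward pass: one scalar "reward" per sequence (mean token id).
toy_forward = lambda batch: batch.mean(axis=1)

batch = np.arange(24, dtype=np.float32).reshape(6, 4)  # 6 sequences, length 4
out = run_in_micro_batches(batch, micro_batch_size=4, forward_fn=toy_forward)
print(out.shape, out.dtype)  # (6,) float32
```

Regardless of how the batch is chunked internally, the caller sees a single reward vector with one entry per input sequence.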

Usage

Import when deploying a trained reward model for RLHF. Used in serve_reward_model.py. The server is accessed by REINFORCE actors via RemoteGPTRMClient or by PPO critics via combined inference.
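Since PyTriton speaks Triton's standard HTTP/REST inference protocol (KServe v2), a client can reach the server with a plain HTTP POST. The sketch below builds the request path and JSON body only; the input tensor name `sentences` is an assumption for illustration, not confirmed from the NeMo Aligner source.

```python
import json

def build_triton_infer_payload(model_name: str, sentences: list) -> tuple:
    """Build the URL path and JSON body for Triton's HTTP v2 infer endpoint."""
    path = f"/v2/models/{model_name}/infer"
    body = {
        "inputs": [
            {
                "name": "sentences",           # assumed input tensor name
                "shape": [len(sentences), 1],
                "datatype": "BYTES",           # string inputs use BYTES in the v2 protocol
                "data": sentences,
            }
        ]
    }
    return path, body

path, body = build_triton_infer_payload("reward_model", ["Hello", "World"])
print(path)  # /v2/models/reward_model/infer
print(json.dumps(body)[:60])
```

In practice the body would be POSTed to `http://<host>:<port><path>`; the response carries the `rewards` output tensor in the same v2 JSON format.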

Code Reference

Source Location

  • Repository: NeMo Aligner
  • File: nemo_aligner/algorithms/reward_server.py
  • Lines: L40-135

Signature

@dataclass
class RewardModelServer:
    infer_fn: Callable                        # Model's inference function
    tokenize_func: Callable                   # Tokenization closure
    model_name: str                           # Triton model name
    port: int                                 # Server port
    inference_micro_batch_size: Union[int, List]  # Batch size(s) for inference
    model_forward_micro_batch_size: int       # Micro batch for model forward
    strip_sequence_length_to_multiple: Optional[int]  # Sequence length alignment
    max_queue_delay_microseconds: float = 2000  # Dynamic batching delay

    def infer(self, **inputs: np.ndarray) -> Dict[str, np.ndarray]:
        """Process inference request. Returns {"rewards": np.float32 array}."""

    def run_server(self) -> None:
        """Start PyTriton server on rank 0, subscriber loop on other ranks."""

Import

from nemo_aligner.algorithms.reward_server import RewardModelServer

I/O Contract

Inputs (Constructor)

Name | Type | Required | Description
infer_fn | Callable | Yes | Model's inference function that takes token tensors
tokenize_func | Callable | Yes | Function to tokenize string inputs
model_name | str | Yes | Name for Triton model registration
port | int | Yes | HTTP server port (e.g. 5555)
inference_micro_batch_size | Union[int, List] | Yes | Preferred batch size(s) for dynamic batching
model_forward_micro_batch_size | int | Yes | Micro batch size for the model forward pass
strip_sequence_length_to_multiple | Optional[int] | No | Sequence length alignment
max_queue_delay_microseconds | float | No | Dynamic batching delay (default 2000)
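One plausible reading of `strip_sequence_length_to_multiple` (an interpretation, not confirmed from the source) is that a heavily padded batch is trimmed so its sequence dimension becomes the smallest multiple that still covers the longest real sequence, which keeps tensor shapes hardware-friendly without wasting compute on padding:

```python
import numpy as np

def strip_to_multiple(tokens: np.ndarray, lengths: np.ndarray, multiple: int) -> np.ndarray:
    """Trim trailing padding so the sequence dimension is the smallest
    multiple of `multiple` that still covers the longest real sequence."""
    max_len = int(lengths.max())
    target = ((max_len + multiple - 1) // multiple) * multiple
    target = min(target, tokens.shape[1])  # never grow beyond the batch
    return tokens[:, :target]

batch = np.zeros((2, 64), dtype=np.int64)  # padded out to length 64
lengths = np.array([10, 13])               # actual sequence lengths
stripped = strip_to_multiple(batch, lengths, multiple=16)
print(stripped.shape)  # (2, 16)
```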

Outputs (infer method)

Name | Type | Description
rewards | np.ndarray (float32) | Scalar reward score per input sequence
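The output contract above can be captured by a tiny mock of the infer function (hypothetical, for illustration only): whatever the model internals, the response is a dict with a `rewards` key holding one float32 scalar per input row.

```python
import numpy as np

def mock_infer(tokens: np.ndarray) -> dict:
    """Mock of the infer contract: one float32 scalar reward per input row."""
    rewards = np.tanh(tokens.sum(axis=1)).astype(np.float32)
    return {"rewards": rewards}

out = mock_infer(np.ones((3, 8), dtype=np.float32))
print(out["rewards"].shape, out["rewards"].dtype)  # (3,) float32
```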

Usage Examples

from nemo_aligner.algorithms.reward_server import RewardModelServer

server = RewardModelServer(
    infer_fn=model.infer,
    tokenize_func=tokenize_func,
    model_name="reward_model",
    port=5555,
    inference_micro_batch_size=[4, 8],
    model_forward_micro_batch_size=4,
    strip_sequence_length_to_multiple=16,
)
server.run_server()  # Blocks: rank 0 serves, others run subscriber loop

Related Pages

Knowledge Sources

Serving | Distributed_Systems

