Implementation:NVIDIA NeMo Aligner RewardModelServer Run
| Implementation Details | |
|---|---|
| Name | RewardModelServer_Run |
| Type | API Doc |
| Implements | Reward_Model_Serving |
| Repository | NeMo Aligner |
| Primary File | nemo_aligner/algorithms/reward_server.py |
| Domains | Serving, Distributed_Systems |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A concrete tool, provided by the NeMo Aligner algorithms module, for serving reward models as HTTP inference endpoints via PyTriton.
Description
The RewardModelServer dataclass wraps a frozen reward model behind a PyTriton HTTP server. It configures dynamic batching, thread-safe inference, and distributed broadcasting for multi-GPU model parallelism. Rank 0 runs the PyTriton server, while the other ranks run a subscriber loop that listens for broadcast signals to execute synchronized inference. The server accepts batches of tokens or sentences and returns float32 reward scores.
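The rank-conditional dispatch described above can be sketched in plain Python. This is an illustrative mock, not NeMo Aligner code: `run_server` here returns a role label instead of blocking, and `events` stands in for the broadcast signals that real subscriber ranks wait on.

```python
# Illustrative sketch (assumed semantics, not the real implementation):
# rank 0 hosts the PyTriton HTTP endpoint and broadcasts each incoming
# batch; every other rank loops, consuming broadcast signals and running
# its model-parallel shard of the forward pass.

def run_server(rank: int, events: list) -> str:
    """Return the role this rank plays; the real method blocks instead."""
    if rank == 0:
        # Rank 0: bind the HTTP endpoint, then broadcast requests
        # to the other ranks (e.g. via torch.distributed).
        return "serve"
    # Ranks 1..N-1: subscriber loop over broadcast inference signals.
    for _ in events:
        pass  # the real loop would run synchronized inference here
    return "subscribe"

roles = [run_server(r, ["batch0", "batch1"]) for r in range(4)]
```

Only rank 0 is reachable over HTTP; the remaining ranks never accept requests directly, which is why all ranks must call `run_server` together.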
Usage
Import this class when deploying a trained reward model for RLHF; it is used by serve_reward_model.py. The server is queried by REINFORCE actors via RemoteGPTRMClient or by PPO critics via combined inference.
Code Reference
Source Location
- Repository: NeMo Aligner
- File: nemo_aligner/algorithms/reward_server.py
- Lines: 40-135
Signature
@dataclass
class RewardModelServer:
    infer_fn: Callable  # Model's inference function
    tokenize_func: Callable  # Tokenization closure
    model_name: str  # Triton model name
    port: int  # Server port
    inference_micro_batch_size: Union[int, List]  # Batch size(s) for inference
    model_forward_micro_batch_size: int  # Micro batch for model forward
    strip_sequence_length_to_multiple: Optional[int]  # Sequence length alignment
    max_queue_delay_microseconds: float = 2000  # Dynamic batching delay

    def infer(self, **inputs: np.ndarray) -> Dict[str, np.ndarray]:
        """Process inference request. Returns {"rewards": np.float32 array}."""

    def run_server(self) -> None:
        """Start PyTriton server on rank 0, subscriber loop on other ranks."""
Import
from nemo_aligner.algorithms.reward_server import RewardModelServer
I/O Contract
Inputs (Constructor)
| Name | Type | Required | Description |
|---|---|---|---|
| infer_fn | Callable | Yes | Model's inference function that takes token tensors |
| tokenize_func | Callable | Yes | Function to tokenize string inputs |
| model_name | str | Yes | Name for Triton model registration |
| port | int | Yes | HTTP server port (typically 5555) |
| inference_micro_batch_size | Union[int, List] | Yes | Preferred batch sizes for dynamic batching |
| model_forward_micro_batch_size | int | Yes | Micro batch size for model forward pass |
| strip_sequence_length_to_multiple | Optional[int] | No | Sequence length alignment |
| max_queue_delay_microseconds | float | No | Dynamic batching delay (default 2000) |
Outputs (infer method)
| Name | Type | Description |
|---|---|---|
| rewards | np.ndarray (float32) | Scalar reward score per input sequence |
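To make the contract concrete, here is a hedged, self-contained sketch of how the server might prepare inputs and produce outputs. `pad_to_multiple` and `fake_infer` are hypothetical helpers (not NeMo Aligner APIs): the first illustrates the assumed role of `strip_sequence_length_to_multiple`, and the second stands in for the model's real inference function.

```python
import numpy as np

def pad_to_multiple(tokens: np.ndarray, multiple: int, pad_id: int = 0) -> np.ndarray:
    """Pad a (batch, seq_len) token array so seq_len is a multiple of `multiple`."""
    seq_len = tokens.shape[1]
    target = -(-seq_len // multiple) * multiple  # ceil to the nearest multiple
    if target == seq_len:
        return tokens
    pad = np.full((tokens.shape[0], target - seq_len), pad_id, dtype=tokens.dtype)
    return np.concatenate([tokens, pad], axis=1)

def fake_infer(tokens: np.ndarray) -> dict:
    """Stand-in model: one float32 scalar reward per input sequence."""
    rewards = tokens.astype(np.float32).mean(axis=1)
    return {"rewards": rewards.astype(np.float32)}

batch = np.ones((2, 13), dtype=np.int64)   # 2 sequences of length 13
padded = pad_to_multiple(batch, 16)        # aligned to a multiple of 16
out = fake_infer(padded)                   # {"rewards": float32 array, shape (2,)}
```

The key point of the contract is the shape discipline: whatever the input form (tokens or sentences), the response is a float32 array with exactly one reward per sequence in the batch.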
Usage Examples
from nemo_aligner.algorithms.reward_server import RewardModelServer

server = RewardModelServer(
    infer_fn=model.infer,
    tokenize_func=tokenize_func,
    model_name="reward_model",
    port=5555,
    inference_micro_batch_size=[4, 8],
    model_forward_micro_batch_size=4,
    strip_sequence_length_to_multiple=16,
)
server.run_server()  # Blocks: rank 0 serves, others run subscriber loop
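The example above sets two distinct batch-size knobs. Their assumed relationship can be sketched as follows: the dynamic batcher groups incoming requests into batches (guided by `inference_micro_batch_size`), and the forward pass then consumes each batch in chunks of `model_forward_micro_batch_size`. The `chunked_forward` helper below is illustrative only, with a trivial stand-in for the model.

```python
import numpy as np

def chunked_forward(tokens: np.ndarray, forward_mbs: int) -> np.ndarray:
    """Split a dynamically batched request into forward-pass micro batches."""
    rewards = []
    for start in range(0, tokens.shape[0], forward_mbs):
        chunk = tokens[start:start + forward_mbs]
        # Stand-in for the real model forward: one scalar per sequence.
        rewards.append(chunk.astype(np.float32).mean(axis=1))
    return np.concatenate(rewards)

batch = np.arange(32, dtype=np.int64).reshape(8, 4)  # dynamic batch of 8 sequences
scores = chunked_forward(batch, forward_mbs=4)       # two forward passes of 4 each
```

Keeping the forward micro batch smaller than the dynamic batch bounds peak activation memory while still letting PyTriton amortize HTTP overhead across requests.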
Related Pages
- Principle:NVIDIA_NeMo_Aligner_Reward_Model_Serving
- Environment:NVIDIA_NeMo_Aligner_NeMo_Framework_GPU_Environment
- Environment:NVIDIA_NeMo_Aligner_PyTriton_Serving_Environment