Principle:NVIDIA NeMo Aligner Critic Server Deployment
| Principle: Critic Server Deployment | |
|---|---|
| Type | Principle |
| Project | NVIDIA NeMo Aligner |
| Domains | Reinforcement_Learning, Distributed_Systems |
| Related | Implementation:NVIDIA_NeMo_Aligner_CriticServerTrainer_Run |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Deployment pattern for serving a combined critic and reward model as a trainable HTTP service in PPO-based RLHF.
Description
PPO requires a critic (value function) that estimates future rewards for each token position. Unlike the reward model server (inference-only), the critic server supports both inference AND training operations. During PPO rollouts, it provides value estimates and rewards. After rollouts, the actor sends computed returns and advantages back to the critic for training updates.
The server exposes three PyTriton endpoints:
- infer -- value estimates plus optional rewards
- train -- critic weight update using computed returns and advantages
- save -- checkpoint persistence to disk
The critic model typically co-locates the reward model head for combined inference, allowing a single server process to produce both value predictions V(s) and reward scores R(s,a) in one forward pass.
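A minimal sketch of what such a combined head can look like, assuming a shared transformer trunk with a per-token value head and a per-sequence reward head (the class and layer names are illustrative, not NeMo Aligner's actual modules):

```python
import torch
import torch.nn as nn

class CriticWithRewardHead(nn.Module):
    """Illustrative combined critic/reward model: one trunk, two heads, one forward pass."""

    def __init__(self, trunk: nn.Module, hidden_size: int):
        super().__init__()
        self.trunk = trunk                            # shared LM backbone returning [batch, seq, hidden]
        self.value_head = nn.Linear(hidden_size, 1)   # V(s) at every token position
        self.reward_head = nn.Linear(hidden_size, 1)  # R(s, a) for the whole response

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor):
        hidden = self.trunk(input_ids, attention_mask)        # [batch, seq, hidden]
        values = self.value_head(hidden).squeeze(-1)          # [batch, seq]
        # Read the reward from the hidden state of the last non-padded token.
        last_idx = attention_mask.sum(dim=1) - 1
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        rewards = self.reward_head(last_hidden).squeeze(-1)   # [batch]
        return values, rewards
```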
Usage
Use exclusively in PPO training. The critic server runs as serve_ppo_critic.py and handles both value estimation and critic training.
- Not needed for REINFORCE (which uses only a reward model, no critic).
- Not needed for DPO/SFT (offline methods with no online value estimation).
The server must be launched before the PPO actor training process begins, as the actor connects to it via HTTP during initialization.
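As a rough sketch of that handshake from the actor side, using PyTriton's ModelClient; the address, endpoint names, and tensor keys below are assumptions for illustration rather than the exact interface of serve_ppo_critic.py:

```python
import numpy as np
from pytriton.client import ModelClient

CRITIC_URL = "localhost:5555"  # hypothetical address; in practice set via the training config

# Dummy rollout batch for illustration: [batch, seq] token ids.
rollout_tokens = np.zeros((4, 128), dtype=np.int64)

# ModelClient blocks for up to init_timeout_s waiting for the server to come up,
# which is why the critic must be launched before the actor initializes.
with ModelClient(CRITIC_URL, "infer", init_timeout_s=600) as client:
    result = client.infer_batch(tokens=rollout_tokens)
    values, rewards = result["values"], result["rewards"]  # assumed output names

# After GAE, the actor sends returns back so the critic can take a training step.
returns = np.zeros((4, 128), dtype=np.float32)  # placeholder for computed returns
with ModelClient(CRITIC_URL, "train", init_timeout_s=600) as client:
    client.infer_batch(tokens=rollout_tokens, returns=returns)
```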
Theoretical Basis
In actor-critic reinforcement learning, the critic V(s) estimates the state value function. PPO uses the critic to compute advantages via Generalized Advantage Estimation (GAE). The per-step TD error is
delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
and GAE forms the advantage as an exponentially weighted sum of these TD errors:
A_t = sum over l >= 0 of (gamma * lambda)^l * delta_{t+l}
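A small worked sketch of this computation over one trajectory, written with the usual backward recursion A_t = delta_t + gamma * lambda * A_{t+1} (function and variable names are illustrative, not NeMo Aligner's):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE for one trajectory.

    rewards: r_t for t = 0..T-1
    values:  V(s_t) for t = 0..T (the last entry is the bootstrap value)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Walk backwards so each step reuses the already-accumulated tail sum.
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error delta_t
        gae = delta + gamma * lam * gae                         # A_t = delta_t + gamma*lambda*A_{t+1}
        advantages[t] = gae
    return advantages

# Example: sparse terminal reward over three steps.
adv = gae_advantages(rewards=np.array([0.0, 0.0, 1.0]),
                     values=np.array([0.1, 0.2, 0.3, 0.0]))
returns = adv + np.array([0.1, 0.2, 0.3])  # returns = A_t + V(s_t), the critic's regression target
```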
The critic must be updated to track the changing policy -- hence the need for a trainable server rather than a static inference server. As the actor policy shifts during training, the value function must adapt to accurately estimate expected returns under the new policy distribution.
Pseudo-code
```
FUNCTION serve_critic(model, config):
    # Initialize critic model with reward head
    critic = load_critic_with_reward_head(model, config)

    # Register PyTriton endpoints
    register_endpoint("infer", critic.compute_values_and_rewards)
    register_endpoint("train", critic.update_weights)
    register_endpoint("save", critic.save_checkpoint)

    # Start HTTP server and block
    start_server_and_wait()
```
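As a rough, runnable counterpart to the pseudo-code, binding the endpoints with PyTriton could look roughly like this; the tensor names, shapes, and placeholder handler bodies are assumptions rather than the actual serve_ppo_critic.py implementation, and the save endpoint would be bound the same way:

```python
import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

@batch
def infer_fn(tokens):
    # Placeholder: a real server would run the critic forward pass here.
    values = np.zeros(tokens.shape, dtype=np.float32)           # V(s) per token
    rewards = np.zeros((tokens.shape[0], 1), dtype=np.float32)  # R(s, a) per sequence
    return {"values": values, "rewards": rewards}

@batch
def train_fn(tokens, returns):
    # Placeholder: a real server would take a critic optimizer step on (tokens, returns).
    loss = np.zeros((tokens.shape[0], 1), dtype=np.float32)
    return {"loss": loss}

with Triton() as triton:
    triton.bind(model_name="infer", infer_func=infer_fn,
                inputs=[Tensor(name="tokens", dtype=np.int64, shape=(-1,))],
                outputs=[Tensor(name="values", dtype=np.float32, shape=(-1,)),
                         Tensor(name="rewards", dtype=np.float32, shape=(1,))],
                config=ModelConfig(max_batch_size=8))
    triton.bind(model_name="train", infer_func=train_fn,
                inputs=[Tensor(name="tokens", dtype=np.int64, shape=(-1,)),
                        Tensor(name="returns", dtype=np.float32, shape=(-1,))],
                outputs=[Tensor(name="loss", dtype=np.float32, shape=(1,))],
                config=ModelConfig(max_batch_size=8))
    triton.serve()  # blocks, serving requests until interrupted
```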