
Principle:NVIDIA NeMo Aligner Critic Server Deployment

From Leeroopedia


Principle: Critic Server Deployment
Type Principle
Project NVIDIA NeMo Aligner
Domains Reinforcement_Learning, Distributed_Systems
Related Implementation:NVIDIA_NeMo_Aligner_CriticServerTrainer_Run
Last Updated 2026-02-07 00:00 GMT

Overview

Deployment pattern for serving a combined critic and reward model as a trainable HTTP service in PPO-based RLHF.

Description

PPO requires a critic (value function) that estimates future rewards for each token position. Unlike the reward model server (inference-only), the critic server supports both inference AND training operations. During PPO rollouts, it provides value estimates and rewards. After rollouts, the actor sends computed returns and advantages back to the critic for training updates.

The server exposes three PyTriton endpoints:

  • infer -- value estimates plus optional rewards
  • train -- critic weight update using computed returns and advantages
  • save -- checkpoint persistence to disk

The critic model typically co-locates the reward model head for combined inference, allowing a single server process to produce both value predictions V(s) and reward scores R(s,a) in one forward pass.
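The combined forward pass can be illustrated with a toy, pure-Python sketch: one pass over the per-token hidden states produces a value V(s_t) for every position and a sequence-level reward from the final position. The function names and linear heads here are hypothetical stand-ins for the learned value and reward heads; the real server runs a full model forward.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def infer(hidden_states, value_head, reward_head):
    """Toy combined critic/reward inference (illustrative only).

    hidden_states: list of per-token hidden vectors from a shared trunk.
    value_head / reward_head: weight vectors standing in for learned heads.
    """
    values = [dot(h, value_head) for h in hidden_states]  # V(s_t) per token
    reward = dot(hidden_states[-1], reward_head)          # R from last token
    return values, reward
```

Because both heads read the same hidden states, serving them from one process avoids a second forward pass per rollout.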

Usage

Use exclusively in PPO training. The critic server runs as serve_ppo_critic.py and handles both value estimation and critic training.

  • Not needed for REINFORCE (which uses only a reward model, no critic).
  • Not needed for DPO/SFT (offline methods with no online value estimation).

The server must be launched before the PPO actor training process begins, as the actor connects to it via HTTP during initialization.
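The launch-ordering requirement can be sketched as a readiness poll on the actor side. This helper is hypothetical (NeMo Aligner performs its own handshake internally); it only illustrates waiting until the critic server accepts connections before the actor proceeds.

```python
import socket
import time

def wait_for_critic(host, port, timeout=300.0, interval=2.0):
    """Block until the critic server accepts TCP connections, or time out.

    Hypothetical helper illustrating the launch-ordering requirement:
    the PPO actor must not start rollouts before the critic is reachable.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True   # server is up and accepting connections
        except OSError:
            time.sleep(interval)
    return False
```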

Theoretical Basis

In actor-critic reinforcement learning, the critic estimates the state value function V(s). PPO uses the critic to compute advantages; the simplest one-step estimate is:

A(s, a) = r + gamma * V(s') - V(s)

Generalized Advantage Estimation (GAE) extends this to an exponentially weighted sum of TD errors:

  A_t = sum_{l=0}^{inf} (gamma * lambda)^l * delta_{t+l}
  where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)

The critic must be updated to track the changing policy -- hence the need for a trainable server rather than a static inference server. As the actor policy shifts during training, the value function must adapt to accurately estimate expected returns under the new policy distribution.
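The GAE formulas above can be written as a short backward pass over one rollout. This is a hypothetical reference implementation, not NeMo Aligner's code: `rewards[t]` is r_t, `values[t]` is V(s_t), and `last_value` bootstraps V at the final state. The returns (advantage plus value) are what the actor sends back to the critic's train endpoint as regression targets.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Compute GAE advantages and returns for a single rollout (illustrative)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error delta_t
        running = delta + gamma * lam * running              # discounted sum of deltas
        advantages[t] = running
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]    # critic training targets
    return advantages, returns
```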

Pseudo-code

FUNCTION serve_critic(model, config):
    # Initialize critic model with reward head
    critic = load_critic_with_reward_head(model, config)

    # Register PyTriton endpoints
    register_endpoint("infer", critic.compute_values_and_rewards)
    register_endpoint("train", critic.update_weights)
    register_endpoint("save", critic.save_checkpoint)

    # Start HTTP server and block
    start_server_and_wait()
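The pseudo-code above can be fleshed out as a plain-Python sketch, with a dict-based registry standing in for PyTriton's endpoint binding (the real PyTriton API is not reproduced here). The endpoint names mirror the real server; the critic internals are illustrative placeholders.

```python
class CriticServer:
    """Minimal stand-in for the PyTriton server: a name -> handler registry."""

    def __init__(self):
        self._endpoints = {}

    def register_endpoint(self, name, handler):
        self._endpoints[name] = handler

    def handle(self, name, payload):
        """Dispatch a request to the named endpoint (real server: HTTP)."""
        return self._endpoints[name](payload)


class ToyCritic:
    """Placeholder critic; the real one wraps a model with value + reward heads."""

    def __init__(self):
        self.steps = 0

    def compute_values_and_rewards(self, batch):
        # Real server: model forward pass; here, zeros of matching length.
        return {"values": [0.0] * len(batch["tokens"]), "reward": 0.0}

    def update_weights(self, batch):
        self.steps += 1  # real server: optimizer step on returns/advantages
        return {"step": self.steps}

    def save_checkpoint(self, request):
        return {"saved": True}  # real server: persist weights to disk


critic = ToyCritic()
server = CriticServer()
server.register_endpoint("infer", critic.compute_values_and_rewards)
server.register_endpoint("train", critic.update_weights)
server.register_endpoint("save", critic.save_checkpoint)
```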

Related Pages

Knowledge Sources
