Heuristic: OpenRLHF Off-Policy IS Correction Tip
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Reinforcement_Learning, LLMs |
| Last Updated | 2026-02-07 10:00 GMT |
Overview
Use importance sampling correction (TIS/ICEPOP/seq-mask-TIS) with the default threshold [0.5, 1.5] to correct the off-policy gap between vLLM rollouts and the training policy, enabling off-policy reuse of samples for higher sample efficiency.
Description
In standard PPO training, the vLLM generation engine may produce samples using stale policy weights (before the latest training update), creating an off-policy gap. OpenRLHF implements three importance sampling correction methods to address this: TIS (token-level clamping), ICEPOP (token-level filtering), and seq-mask-TIS (sequence-level geometric mean filtering with token-level clamping). These methods compute importance weights as the ratio of current policy to rollout policy probabilities, then apply corrections to the policy loss.
Usage
Use this heuristic when running PPO training with vLLM where the generation policy may lag behind the training policy. Enable with `--enable_vllm_is_correction`. Choose the correction type with `--vllm_is_correction_type` (default: "tis"). Adjust the threshold range with `--vllm_is_truncated_threshold` (default: [0.5, 1.5]).
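A minimal launch sketch combining the three flags. The entry point and the model arguments are placeholders for illustration; only the last three flags are the ones this tip describes:

```shell
# Hypothetical invocation; model/reward paths are placeholders.
python -m openrlhf.cli.train_ppo_ray \
  --pretrain my-org/my-base-model \
  --reward_pretrain my-org/my-reward-model \
  --enable_vllm_is_correction \
  --vllm_is_correction_type seq-mask-tis \
  --vllm_is_truncated_threshold 0.5 1.5
```

Note that `--vllm_is_truncated_threshold` takes the low and high bounds as two space-separated values.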
The Insight (Rule of Thumb)
- Action: Add `--enable_vllm_is_correction` with a correction type:
- `--vllm_is_correction_type tis` (default): Token-level importance weight clamping within [low, high]
- `--vllm_is_correction_type icepop`: Token-level filtering; sets weights outside range to 0
- `--vllm_is_correction_type seq-mask-tis`: Sequence-level geometric mean for filtering, token-level TIS for correction
- Value: Default threshold `--vllm_is_truncated_threshold 0.5 1.5` works well empirically.
- Trade-off: Adds minor computation overhead per training step. Improves sample efficiency by enabling off-policy reuse of generated samples.
Reasoning
Efficient RL frameworks like OpenRLHF secretly introduce off-policy data because the generation engine (vLLM) serves weights that are one or more updates older than the training engine's. Without correction, this off-policy gap can lead to training instability or suboptimal convergence. Importance sampling re-weights the loss by the ratio of the current (training) policy's token probability to the rollout policy's, i.e. exp(log_prob_train - log_prob_rollout) per token. TIS clamps extreme ratios into the threshold interval; ICEPOP zeros them out; seq-mask-TIS combines sequence-level filtering (via the geometric mean of the token ratios) with token-level clamping for a balanced approach. For example, under the default [0.5, 1.5] threshold, a token ratio of 2.0 is clamped to 1.5 by TIS but zeroed by ICEPOP.
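The arithmetic of the three corrections can be sketched without the framework. This is a hedged, stdlib-only illustration of per-token importance weights, not OpenRLHF's actual implementation (which operates on batched torch tensors inside the PPO loss):

```python
import math

LOW, HIGH = 0.5, 1.5  # default --vllm_is_truncated_threshold

def tis(weights):
    """Token-level clamp of importance weights into [LOW, HIGH]."""
    return [min(max(w, LOW), HIGH) for w in weights]

def icepop(weights):
    """Token-level filter: weights outside [LOW, HIGH] are zeroed."""
    return [w if LOW <= w <= HIGH else 0.0 for w in weights]

def seq_mask_tis(weights):
    """Filter the whole sequence by the geometric mean of its weights,
    then apply the token-level TIS clamp to the survivors."""
    geo_mean = math.exp(sum(math.log(w) for w in weights) / len(weights))
    if not (LOW <= geo_mean <= HIGH):
        return [0.0] * len(weights)
    return tis(weights)

# Importance weights: exp(log pi_train - log pi_rollout) per token.
weights = [0.8, 2.0, 1.1]
print(tis(weights))           # [0.8, 1.5, 1.1]
print(icepop(weights))        # [0.8, 0.0, 1.1]
print(seq_mask_tis(weights))  # geometric mean ~1.21, in range -> clamp
```

The corrected weights multiply the per-token PPO loss; a fully filtered sequence contributes nothing to the gradient.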
Code evidence from `openrlhf/models/loss.py:150-170`:

```python
# Your Efficient RL Framework Secretly Brings You Off-Policy RL Training
if self.enable_vllm_is_correction and self.policy_loss_type == "ppo":
    low_threshold, high_threshold = self.vllm_is_truncated_threshold
    log_ratio = old_log_probs - rollout_log_probs
    if self.vllm_is_correction_type == "icepop":
        # ICEPOP: token-level filtering (set coefficients outside the interval to 0)
        vllm_is = torch.exp(log_ratio).detach()
        mask = (vllm_is >= low_threshold) & (vllm_is <= high_threshold)
        vllm_is = vllm_is * mask
        loss = vllm_is * loss
    elif self.vllm_is_correction_type == "seq-mask-tis":
        # seq-mask-tis: use sequence-level geometric mean only for filtering,
        # correction coefficients still use TIS (token-level clamp)
        seq_log_ratio = masked_mean(log_ratio, action_mask, dim=-1)
        seq_is = torch.exp(seq_log_ratio)
        seq_mask = (seq_is >= low_threshold) & (seq_is <= high_threshold)
        vllm_is = torch.exp(log_ratio).detach()
        loss = seq_mask.unsqueeze(-1) * vllm_is * loss
```