Heuristic: OpenRLHF Off-Policy IS Correction Tip
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Reinforcement_Learning, LLMs |
| Last Updated | 2026-02-07 10:00 GMT |
Overview
Use importance sampling correction (TIS/ICEPOP/seq-mask-TIS) with the default threshold [0.5, 1.5] to correct the off-policy gap between vLLM rollouts and the training policy, enabling off-policy reuse of samples for higher sample efficiency.
Description
In standard PPO training, the vLLM generation engine may produce samples using stale policy weights (before the latest training update), creating an off-policy gap. OpenRLHF implements three importance sampling correction methods to address this: TIS (token-level clamping), ICEPOP (token-level filtering), and seq-mask-TIS (sequence-level geometric mean filtering with token-level clamping). These methods compute importance weights as the ratio of current policy to rollout policy probabilities, then apply corrections to the policy loss.
Usage
Use this heuristic when running PPO training with vLLM where the generation policy may lag behind the training policy. Enable with `--enable_vllm_is_correction`. Choose the correction type with `--vllm_is_correction_type` (default: "tis"). Adjust the threshold range with `--vllm_is_truncated_threshold` (default: [0.5, 1.5]).
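A minimal launch sketch combining the three flags. The entry point and the model arguments are placeholders for illustration; only the last three flags are the ones this tip describes:

```shell
# Hypothetical invocation; model/reward paths are placeholders.
python -m openrlhf.cli.train_ppo_ray \
  --pretrain my-org/my-base-model \
  --reward_pretrain my-org/my-reward-model \
  --enable_vllm_is_correction \
  --vllm_is_correction_type seq-mask-tis \
  --vllm_is_truncated_threshold 0.5 1.5
```

Note that `--vllm_is_truncated_threshold` takes the low and high bounds as two space-separated values.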
The Insight (Rule of Thumb)
- Action: Add `--enable_vllm_is_correction` with a correction type:
- `--vllm_is_correction_type tis` (default): Token-level importance weight clamping within [low, high]
- `--vllm_is_correction_type icepop`: Token-level filtering; sets weights outside range to 0
- `--vllm_is_correction_type seq-mask-tis`: Sequence-level geometric mean for filtering, token-level TIS for correction
- Value: Default threshold `--vllm_is_truncated_threshold 0.5 1.5` works well empirically.
- Trade-off: Adds minor computation overhead per training step. Improves sample efficiency by enabling off-policy reuse of generated samples.
Reasoning
Efficient RL frameworks like OpenRLHF secretly introduce off-policy data because the generation engine (vLLM) serves weights that are one or more updates older than the training engine's. Without correction, this off-policy gap can lead to training instability or suboptimal convergence. Importance sampling re-weights the loss by the ratio of the current (training) policy's token probability to the rollout policy's, i.e. exp(log_prob_train - log_prob_rollout) per token. TIS clamps extreme ratios into the threshold interval; ICEPOP zeros them out; seq-mask-TIS combines sequence-level filtering (via the geometric mean of the token ratios) with token-level clamping for a balanced approach. For example, under the default [0.5, 1.5] threshold, a token ratio of 2.0 is clamped to 1.5 by TIS but zeroed by ICEPOP.
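The arithmetic of the three corrections can be sketched without the framework. This is a hedged, stdlib-only illustration of per-token importance weights, not OpenRLHF's actual implementation (which operates on batched torch tensors inside the PPO loss):

```python
import math

LOW, HIGH = 0.5, 1.5  # default --vllm_is_truncated_threshold

def tis(weights):
    """Token-level clamp of importance weights into [LOW, HIGH]."""
    return [min(max(w, LOW), HIGH) for w in weights]

def icepop(weights):
    """Token-level filter: weights outside [LOW, HIGH] are zeroed."""
    return [w if LOW <= w <= HIGH else 0.0 for w in weights]

def seq_mask_tis(weights):
    """Filter the whole sequence by the geometric mean of its weights,
    then apply the token-level TIS clamp to the survivors."""
    geo_mean = math.exp(sum(math.log(w) for w in weights) / len(weights))
    if not (LOW <= geo_mean <= HIGH):
        return [0.0] * len(weights)
    return tis(weights)

# Importance weights: exp(log pi_train - log pi_rollout) per token.
weights = [0.8, 2.0, 1.1]
print(tis(weights))           # [0.8, 1.5, 1.1]
print(icepop(weights))        # [0.8, 0.0, 1.1]
print(seq_mask_tis(weights))  # geometric mean ~1.21, in range -> clamp
```

The corrected weights multiply the per-token PPO loss; a fully filtered sequence contributes nothing to the gradient.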
Code evidence from `openrlhf/models/loss.py:150-170`:

```python
# Your Efficient RL Framework Secretly Brings You Off-Policy RL Training
if self.enable_vllm_is_correction and self.policy_loss_type == "ppo":
    low_threshold, high_threshold = self.vllm_is_truncated_threshold
    log_ratio = old_log_probs - rollout_log_probs
    if self.vllm_is_correction_type == "icepop":
        # ICEPOP: token-level filtering (set coefficients outside the interval to 0)
        vllm_is = torch.exp(log_ratio).detach()
        mask = (vllm_is >= low_threshold) & (vllm_is <= high_threshold)
        vllm_is = vllm_is * mask
        loss = vllm_is * loss
    elif self.vllm_is_correction_type == "seq-mask-tis":
        # seq-mask-tis: use sequence-level geometric mean only for filtering,
        # correction coefficients still use TIS (token-level clamp)
        seq_log_ratio = masked_mean(log_ratio, action_mask, dim=-1)
        seq_is = torch.exp(seq_log_ratio)
        seq_mask = (seq_is >= low_threshold) & (seq_is <= high_threshold)
        vllm_is = torch.exp(log_ratio).detach()
        loss = seq_mask.unsqueeze(-1) * vllm_is * loss
```