Implementation:OpenRLHF OpenRLHF Rejection sampling processor

From Leeroopedia


Knowledge Sources
Domains: Alignment, Data_Processing
Last Updated: 2026-02-07 00:00 GMT

Overview

A concrete tool, provided by OpenRLHF, for selecting the best of N sampled responses per prompt via rejection sampling.

Description

The rejection_sampling_processor function takes a list of scored generation objects (input, output, reward), groups them by input prompt, and keeps only the highest-reward response for each prompt. The result is a filtered SFT-compatible dataset.
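
The group-and-select step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the OpenRLHF source: the helper name select_best_per_prompt is hypothetical, and keeping the first response seen when rewards tie is an assumption.

```python
from typing import Dict, List


def select_best_per_prompt(objs: List[Dict]) -> List[Dict]:
    """Keep the highest-reward response for each unique prompt (hypothetical helper)."""
    best: Dict[str, Dict] = {}
    for obj in objs:
        prompt = obj["input"]
        reward = float(obj["reward"])  # rewards may arrive as strings or numbers
        # Assumption: on a tie, the first response seen for a prompt is kept.
        if prompt not in best or reward > best[prompt]["reward"]:
            best[prompt] = {"input": prompt, "output": obj["output"], "reward": reward}
    # Dicts preserve insertion order, so prompts come back in first-seen order.
    return list(best.values())


scored = [
    {"input": "2+2?", "output": "5", "reward": 0.1},
    {"input": "2+2?", "output": "4", "reward": 0.9},
    {"input": "Capital of France?", "output": "Paris", "reward": 0.8},
]
filtered = select_best_per_prompt(scored)
```

Here three scored generations over two prompts reduce to two records, one per prompt. A single pass with a dictionary keyed by prompt avoids sorting and keeps memory proportional to the number of unique prompts.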

Usage

Call this processor after batch vLLM generation and batch reward model inference have produced scored generations. Its output serves as a new SFT dataset for retraining the model.
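
One way to persist the filtered records for a later SFT run is to write one JSON object per line. The sketch below assumes a JSON Lines layout; the helper write_sft_jsonl and the output file name are hypothetical and do not come from OpenRLHF.

```python
import json
import os
import tempfile
from typing import Dict, List


def write_sft_jsonl(records: List[Dict], path: str) -> int:
    """Write records as JSON Lines and return the number written (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            # ensure_ascii=False keeps non-ASCII prompts readable in the file
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(records)


# Stand-in for the processor output; real data would come from the processor call.
filtered_data = [{"input": "2+2?", "output": "4", "reward": 0.9}]
out_path = os.path.join(tempfile.gettempdir(), "rs_sft_example.jsonl")  # hypothetical path
n_written = write_sft_jsonl(filtered_data, out_path)
```

Whether the downstream SFT loader expects these exact key names depends on its configuration, so the keys may need to be mapped before training.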

Code Reference

Source Location

  • Repository: OpenRLHF
  • File: openrlhf/utils/processor.py
  • Lines: L40-53

Signature

def rejection_sampling_processor(args, objs):
    """
    Select best response per prompt by reward score.

    Args:
        args: CLI arguments (unused in this processor)
        objs: List of dicts with keys: "input", "output", "reward"

    Returns:
        List of dicts: [{"input": str, "output": str, "reward": float}]
            One entry per unique prompt with the highest-reward response.
    """

Import

from openrlhf.utils.processor import rejection_sampling_processor
# or
from openrlhf.utils.processor import get_processor
processor = get_processor("rs")

I/O Contract

Inputs

Name | Type | Required | Description
args | Namespace | Yes | CLI arguments (unused by this processor)
objs | List[Dict] | Yes | Scored generations: [{input, output, reward}, ...]

Outputs

Name | Type | Description
filtered | List[Dict] | Best response per prompt: [{input, output, reward}, ...]
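
The input contract above can be checked before calling the processor. The following is a minimal sketch, assuming that every record must carry all three keys and a reward convertible to float; validate_scored_generations is a hypothetical helper, not part of OpenRLHF.

```python
from typing import Dict, List

REQUIRED_KEYS = ("input", "output", "reward")


def validate_scored_generations(objs: List[Dict]) -> List[str]:
    """Return human-readable problems; an empty list means the input looks valid."""
    problems = []
    for i, obj in enumerate(objs):
        missing = [k for k in REQUIRED_KEYS if k not in obj]
        if missing:
            problems.append(f"record {i}: missing keys {missing}")
            continue
        try:
            float(obj["reward"])  # reward must be numeric or a numeric string
        except (TypeError, ValueError):
            problems.append(f"record {i}: reward {obj['reward']!r} is not numeric")
    return problems


good = [{"input": "q", "output": "a", "reward": 0.5}]
bad = [
    {"input": "q", "output": "a"},                    # no reward key
    {"input": "q", "output": "a", "reward": "high"},  # non-numeric reward
]
good_problems = validate_scored_generations(good)
bad_problems = validate_scored_generations(bad)
```

Collecting problems instead of raising on the first one makes it easier to audit a large generation batch in a single pass.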

Usage Examples

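from argparse import Namespace
from openrlhf.utils.processor import get_processor

# Illustrative inputs: args is unused by this processor, and
# scored_generations would normally come from reward model inference.
args = Namespace()
scored_generations = [{"input": "q", "output": "a", "reward": 0.9}]

processor = get_processor("rs")
filtered_data = processor(args, scored_generations)
# filtered_data contains one best response per unique prompt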

Related Pages

Implements Principle
