Implementation:Allenai Open instruct StreamingDataLoaderConfig

From Leeroopedia


Type Dataclass
Source open_instruct/data_loader.py:L297-437
Dependencies dataclasses, vllm, datasets, transformers
Last Updated 2026-02-07 00:00 GMT

Overview

A configuration dataclass that controls streaming generation, reward computation, and batch preparation in the GRPO training pipeline of the Open Instruct library.

Description

StreamingDataLoaderConfig is a Python dataclass that centralizes all configuration for the generation side of GRPO training. It includes parameters for:

  • Data loading and packing: Maximum prompt/response lengths and pack length.
  • Batching: Number of unique prompts per rollout, samples per prompt, and async steps.
  • GRPO sampling/filtering: Active sampling, zero-std filtering, advantage normalization type, completion masking.
  • Dataset specification: Dataset mixer lists, splits, transform functions, and caching modes.
  • Generation: Temperature, stop strings, inflight weight updates.
  • Reward: Verifiable reward toggles, R1-style format rewards, LLM judge configuration, code verifier settings, non-stop penalties.
  • Rollout saving: Whether to save rollout traces to disk for analysis.

The __post_init__ method enforces invariants and computes derived fields such as max_possible_score.
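As a rough illustration of the derived-field computation, here is a minimal sketch of how __post_init__ could sum the enabled reward components into max_possible_score. This is a hypothetical reconstruction using a subset of the fields shown below, not the library's actual code:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: field names mirror the signature below, but the
# actual open_instruct __post_init__ logic may differ.
@dataclass
class RewardConfigSketch:
    apply_verifiable_reward: bool = True
    verification_reward: float = 10.0
    apply_r1_style_format_reward: bool = False
    r1_style_format_reward: float = 1.0
    max_possible_score: float = field(init=False, default=0.0)

    def __post_init__(self):
        # Sum only the reward components that are switched on.
        if self.apply_verifiable_reward:
            self.max_possible_score += self.verification_reward
        if self.apply_r1_style_format_reward:
            self.max_possible_score += self.r1_style_format_reward

cfg = RewardConfigSketch(apply_r1_style_format_reward=True)
print(cfg.max_possible_score)  # 11.0
```

With both toggles on, the maximum achievable per-completion reward is the sum of the two component values (10.0 + 1.0).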

Usage

This dataclass is typically populated from command-line arguments and passed to the GRPO main function. It is consumed by the DataPreparationActor, build_all_verifiers(), and the generation engine configuration.

Code Reference

Source Location

Signature

@dataclass
class StreamingDataLoaderConfig:
    # Data loading/packing
    max_prompt_token_length: int = 256
    response_length: int = 256
    pack_length: int = 512

    # Batching
    async_steps: int = 1
    num_samples_per_prompt_rollout: int = 4
    num_unique_prompts_rollout: int = 16

    # GRPO sampling/filtering
    active_sampling: bool = False
    filter_zero_std_samples: bool = True
    no_resampling_pass_rate: float | None = None
    advantage_normalization_type: str = "standard"
    mask_truncated_completions: bool = False
    mask_tool_use: bool = True

    # Dataset
    dataset_mixer_list: list[str] = ...
    dataset_mixer_eval_list: list[str] = ...
    dataset_transform_fn: list[str] = ...

    # Generation
    temperature: float = 0.7
    stop_strings: list[str] | None = None
    inflight_updates: bool = False

    # Reward - Verifiable reward
    apply_verifiable_reward: bool = True
    verification_reward: float = 10.0

    # Reward - R1 style format reward
    apply_r1_style_format_reward: bool = False
    r1_style_format_reward: float = 1.0

    # ... additional reward fields (LLM judge, code verifier, etc.)

Import

from open_instruct.data_loader import StreamingDataLoaderConfig

I/O Contract

Key Fields

Field Type Default Description
num_unique_prompts_rollout int 16 Number of unique prompts per generation rollout.
num_samples_per_prompt_rollout int 4 Number of completions to sample per prompt (GRPO group size).
response_length int 256 Maximum response token length.
temperature float 0.7 Sampling temperature for generation.
pack_length int 512 Maximum length of packed sequences for training.
async_steps int 1 Number of generation batches queued ahead of the trainer.
filter_zero_std_samples bool True Filter prompts where all completions get the same reward.
stop_strings list[str] | None None Stop strings for early generation termination.
verification_reward float 10.0 Reward value for correct verifiable answers.
advantage_normalization_type str "standard" "standard" (z-score) or "centered" (mean subtraction only).
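The last two rows above can be sketched in a few lines: a group whose completions all receive the same reward has zero standard deviation (and therefore zero advantage everywhere), so it is dropped; surviving groups are normalized either by z-score ("standard") or by mean subtraction only ("centered"). This is an illustrative sketch of the concepts, not the library's implementation:

```python
import statistics

def normalize_advantages(rewards, mode="standard"):
    """Per-group advantage normalization: 'standard' is a z-score,
    'centered' subtracts the group mean only."""
    mean = statistics.fmean(rewards)
    centered = [r - mean for r in rewards]
    if mode == "centered":
        return centered
    std = statistics.pstdev(rewards)
    return [c / (std + 1e-8) for c in centered]

def filter_zero_std_groups(groups):
    """Drop prompt groups where every completion got the same reward:
    zero std means zero advantage and hence no learning signal."""
    return [g for g in groups if statistics.pstdev(g) > 0]

groups = [[1.0, 1.0, 1.0, 1.0], [0.0, 10.0, 0.0, 10.0]]
kept = filter_zero_std_groups(groups)
print(len(kept))  # 1
print(normalize_advantages(kept[0], mode="centered"))  # [-5.0, 5.0, -5.0, 5.0]
```

Dividing by std + 1e-8 is a common numerical-stability guard; the exact epsilon (if any) used by open_instruct is not shown here.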

Computed Fields

Field Description
max_possible_score Sum of all enabled reward components; computed in __post_init__.

Key Method

Method Description
build_dataloader(...) Constructs a StreamingDataLoader that pulls pre-prepared data from a DataPreparationActor.

Usage Examples

from open_instruct.data_loader import StreamingDataLoaderConfig

config = StreamingDataLoaderConfig(
    num_unique_prompts_rollout=32,
    num_samples_per_prompt_rollout=8,
    response_length=1024,
    temperature=0.8,
    pack_length=2048,
    max_prompt_token_length=512,
    async_steps=2,
    dataset_mixer_list=["ai2-adapt-dev/rlvr_gsm8k_zs", "0.5",
                        "ai2-adapt-dev/rlvr_math_zs", "0.5"],
    filter_zero_std_samples=True,
    apply_verifiable_reward=True,
    verification_reward=10.0,
)

# Total completions per step: 32 * 8 = 256
# Total tokens per step (max): 256 * 2048 = 524288
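The dataset_mixer_list in the example above is a flat list alternating dataset names and sampling weights. A plausible interpretation can be sketched as follows; the real open_instruct parsing may differ (for example, weights can sometimes denote absolute sample counts rather than fractions):

```python
def parse_mixer_list(mixer):
    """Interpret a flat [name, weight, name, weight, ...] list as a
    mapping from dataset name to sampling weight. Illustrative sketch."""
    if len(mixer) % 2 != 0:
        raise ValueError("mixer list must alternate dataset names and weights")
    return {name: float(weight) for name, weight in zip(mixer[::2], mixer[1::2])}

mix = parse_mixer_list(["ai2-adapt-dev/rlvr_gsm8k_zs", "0.5",
                        "ai2-adapt-dev/rlvr_math_zs", "0.5"])
print(mix)  # {'ai2-adapt-dev/rlvr_gsm8k_zs': 0.5, 'ai2-adapt-dev/rlvr_math_zs': 0.5}
```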

Related Pages

Implements Principle
