Implementation:Allenai Open instruct StreamingDataLoaderConfig

From Leeroopedia


Type Dataclass
Source open_instruct/data_loader.py:L297-437
Dependencies dataclasses, vllm, datasets, transformers
Last Updated 2026-02-07 00:00 GMT

Overview

A configuration dataclass that controls streaming generation, reward computation, and batch preparation in the GRPO training pipeline of the Open Instruct library.

Description

StreamingDataLoaderConfig is a Python dataclass that centralizes all configuration for the generation side of GRPO training. It includes parameters for:

  • Data loading and packing: Maximum prompt/response lengths and pack length.
  • Batching: Number of unique prompts per rollout, samples per prompt, and async steps.
  • GRPO sampling/filtering: Active sampling, zero-std filtering, advantage normalization type, completion masking.
  • Dataset specification: Dataset mixer lists, splits, transform functions, and caching modes.
  • Generation: Temperature, stop strings, inflight weight updates.
  • Reward: Verifiable reward toggles, R1-style format rewards, LLM judge configuration, code verifier settings, non-stop penalties.
  • Rollout saving: Whether to save rollout traces to disk for analysis.

The __post_init__ method enforces invariants and computes derived fields such as max_possible_score.
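As a rough illustration of the derived-field computation, here is a minimal sketch of how __post_init__ could sum the enabled reward components into max_possible_score. This is a hypothetical reconstruction using a subset of the fields shown below, not the library's actual code:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: field names mirror the signature below, but the
# actual open_instruct __post_init__ logic may differ.
@dataclass
class RewardConfigSketch:
    apply_verifiable_reward: bool = True
    verification_reward: float = 10.0
    apply_r1_style_format_reward: bool = False
    r1_style_format_reward: float = 1.0
    max_possible_score: float = field(init=False, default=0.0)

    def __post_init__(self):
        # Sum only the reward components that are switched on.
        if self.apply_verifiable_reward:
            self.max_possible_score += self.verification_reward
        if self.apply_r1_style_format_reward:
            self.max_possible_score += self.r1_style_format_reward

cfg = RewardConfigSketch(apply_r1_style_format_reward=True)
print(cfg.max_possible_score)  # 11.0
```

With both toggles on, the maximum achievable per-completion reward is the sum of the two component values (10.0 + 1.0).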

Usage

This dataclass is typically populated from command-line arguments and passed to the GRPO main function. It is consumed by the DataPreparationActor, build_all_verifiers(), and the generation engine configuration.

Code Reference

Source Location

Signature

@dataclass
class StreamingDataLoaderConfig:
    # Data loading/packing
    max_prompt_token_length: int = 256
    response_length: int = 256
    pack_length: int = 512

    # Batching
    async_steps: int = 1
    num_samples_per_prompt_rollout: int = 4
    num_unique_prompts_rollout: int = 16

    # GRPO sampling/filtering
    active_sampling: bool = False
    filter_zero_std_samples: bool = True
    no_resampling_pass_rate: float | None = None
    advantage_normalization_type: str = "standard"
    mask_truncated_completions: bool = False
    mask_tool_use: bool = True

    # Dataset
    dataset_mixer_list: list[str] = ...
    dataset_mixer_eval_list: list[str] = ...
    dataset_transform_fn: list[str] = ...

    # Generation
    temperature: float = 0.7
    stop_strings: list[str] | None = None
    inflight_updates: bool = False

    # Reward - Verifiable reward
    apply_verifiable_reward: bool = True
    verification_reward: float = 10.0

    # Reward - R1 style format reward
    apply_r1_style_format_reward: bool = False
    r1_style_format_reward: float = 1.0

    # ... additional reward fields (LLM judge, code verifier, etc.)

Import

from open_instruct.data_loader import StreamingDataLoaderConfig

I/O Contract

Key Fields

Field Type Default Description
num_unique_prompts_rollout int 16 Number of unique prompts per generation rollout.
num_samples_per_prompt_rollout int 4 Number of completions to sample per prompt (GRPO group size).
response_length int 256 Maximum response token length.
temperature float 0.7 Sampling temperature for generation.
pack_length int 512 Maximum length of packed sequences for training.
async_steps int 1 Number of generation batches queued ahead of the trainer.
filter_zero_std_samples bool True Filter prompts where all completions get the same reward.
stop_strings list[str] | None None Stop strings for early generation termination.
verification_reward float 10.0 Reward value for correct verifiable answers.
advantage_normalization_type str "standard" "standard" (z-score) or "centered" (mean subtraction only).
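The last two rows above can be sketched in a few lines: a group whose completions all receive the same reward has zero standard deviation (and therefore zero advantage everywhere), so it is dropped; surviving groups are normalized either by z-score ("standard") or by mean subtraction only ("centered"). This is an illustrative sketch of the concepts, not the library's implementation:

```python
import statistics

def normalize_advantages(rewards, mode="standard"):
    """Per-group advantage normalization: 'standard' is a z-score,
    'centered' subtracts the group mean only."""
    mean = statistics.fmean(rewards)
    centered = [r - mean for r in rewards]
    if mode == "centered":
        return centered
    std = statistics.pstdev(rewards)
    return [c / (std + 1e-8) for c in centered]

def filter_zero_std_groups(groups):
    """Drop prompt groups where every completion got the same reward:
    zero std means zero advantage and hence no learning signal."""
    return [g for g in groups if statistics.pstdev(g) > 0]

groups = [[1.0, 1.0, 1.0, 1.0], [0.0, 10.0, 0.0, 10.0]]
kept = filter_zero_std_groups(groups)
print(len(kept))  # 1
print(normalize_advantages(kept[0], mode="centered"))  # [-5.0, 5.0, -5.0, 5.0]
```

Dividing by std + 1e-8 is a common numerical-stability guard; the exact epsilon (if any) used by open_instruct is not shown here.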

Computed Fields

Field Description
max_possible_score Sum of all enabled reward components; computed in __post_init__.

Key Method

Method Description
build_dataloader(...) Constructs a StreamingDataLoader that pulls pre-prepared data from a DataPreparationActor.

Usage Examples

from open_instruct.data_loader import StreamingDataLoaderConfig

config = StreamingDataLoaderConfig(
    num_unique_prompts_rollout=32,
    num_samples_per_prompt_rollout=8,
    response_length=1024,
    temperature=0.8,
    pack_length=2048,
    max_prompt_token_length=512,
    async_steps=2,
    dataset_mixer_list=["ai2-adapt-dev/rlvr_gsm8k_zs", "0.5",
                        "ai2-adapt-dev/rlvr_math_zs", "0.5"],
    filter_zero_std_samples=True,
    apply_verifiable_reward=True,
    verification_reward=10.0,
)

# Total completions per step: 32 * 8 = 256
# Total tokens per step (max): 256 * 2048 = 524288
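The dataset_mixer_list in the example above is a flat list alternating dataset names and sampling weights. A plausible interpretation can be sketched as follows; the real open_instruct parsing may differ (for example, weights can sometimes denote absolute sample counts rather than fractions):

```python
def parse_mixer_list(mixer):
    """Interpret a flat [name, weight, name, weight, ...] list as a
    mapping from dataset name to sampling weight. Illustrative sketch."""
    if len(mixer) % 2 != 0:
        raise ValueError("mixer list must alternate dataset names and weights")
    return {name: float(weight) for name, weight in zip(mixer[::2], mixer[1::2])}

mix = parse_mixer_list(["ai2-adapt-dev/rlvr_gsm8k_zs", "0.5",
                        "ai2-adapt-dev/rlvr_math_zs", "0.5"])
print(mix)  # {'ai2-adapt-dev/rlvr_gsm8k_zs': 0.5, 'ai2-adapt-dev/rlvr_math_zs': 0.5}
```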

Related Pages

Implements Principle
