Principle: Allenai open-instruct Streaming Generation Configuration
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning, Configuration Management |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Streaming generation configuration is the practice of specifying the hyperparameters that control how an RL training pipeline generates rollouts, computes rewards, and prepares batches for policy updates.
Description
In GRPO training, the generation phase is the most computationally expensive component. The streaming generation configuration governs every aspect of this phase:
- Batch sizing: How many unique prompts to generate per rollout (num_unique_prompts_rollout) and how many completions to sample per prompt (num_samples_per_prompt_rollout). Together these determine the total number of completions per training step.
- Response parameters: Maximum response length (response_length), maximum prompt length (max_prompt_token_length), and the total pack length (pack_length) for sequence packing.
- Sampling parameters: Temperature for generation diversity, stop strings for early termination, and top-p filtering.
- Asynchronous generation: The number of async steps (async_steps) controls how many batches of prompts are queued ahead of the training loop, enabling overlap between generation and training.
- Reward configuration: Settings for verifiable rewards (correctness checking), R1-style format rewards, LLM judge rewards, and code execution rewards.
- Filtering: Options to filter prompts where all completions have the same reward (zero standard deviation), mask truncated completions, and exclude mastered prompts via active sampling.
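The zero-standard-deviation filter in the last bullet can be sketched as a one-line predicate; this is an illustrative helper (the function name `keep_prompt` is hypothetical, not an open-instruct API):

```python
import statistics

def keep_prompt(rewards, eps=1e-8):
    """Zero-std filtering: drop a prompt whose completions all received the
    same reward, since every group-relative advantage would then be zero and
    the prompt would contribute no learning signal."""
    return statistics.pstdev(rewards) > eps

# A mix of correct and incorrect completions is kept; a uniform group is dropped.
assert keep_prompt([1, 0, 1, 0])
assert not keep_prompt([1, 1, 1, 1])
```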
The configuration must satisfy invariants: pack_length >= max_prompt_token_length + response_length, and at least one reward type must be enabled.
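Both invariants can be enforced at construction time. A minimal sketch, assuming a dataclass-style config; the field names follow the text, but the default values and the two reward flags shown are illustrative assumptions, not the library's actual defaults:

```python
from dataclasses import dataclass

@dataclass
class StreamingGenerationConfig:
    """Illustrative generation config; defaults are placeholder values."""
    num_unique_prompts_rollout: int = 64
    num_samples_per_prompt_rollout: int = 8
    response_length: int = 1024
    max_prompt_token_length: int = 512
    pack_length: int = 2048
    temperature: float = 1.0
    async_steps: int = 1
    apply_verifiable_reward: bool = True    # hypothetical flag name
    apply_format_reward: bool = False       # hypothetical flag name

    def __post_init__(self):
        # Invariant 1: a packed sequence must fit the longest prompt + response.
        assert self.pack_length >= self.max_prompt_token_length + self.response_length, \
            "pack_length must be >= max_prompt_token_length + response_length"
        # Invariant 2: at least one reward type must be enabled.
        assert self.apply_verifiable_reward or self.apply_format_reward, \
            "at least one reward type must be enabled"
```

Failing either check raises immediately at startup, rather than mid-run after expensive generation has already begun.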
Usage
This configuration is instantiated once per training run and threaded through the entire GRPO pipeline. It is used by the data preparation actor, the generation engines, and the reward computation logic. Modifying these parameters is the primary way to tune the generation-training tradeoff in GRPO.
Theoretical Basis
The key theoretical considerations behind the configuration parameters:
Group size (num_samples_per_prompt_rollout): In GRPO, advantages are computed relative to the group of completions for each prompt. A group size of 1 reduces GRPO to REINFORCE with a baseline of zero. Larger groups provide more stable advantage estimates but require more generation compute. Typical values range from 4 to 16.
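The group-relative centering can be seen in a few lines. A minimal numeric sketch (the helper name is hypothetical): with a group of several completions, correct answers are pushed above the group mean and incorrect ones below it, while a group of one has a mean equal to its single score, so centering would zero out the signal entirely — which is why that case falls back to the raw score, i.e. REINFORCE with a zero baseline as described above:

```python
import statistics

def centered_advantages(scores):
    """Center each completion's score on its group's mean (GRPO-style)."""
    mean = statistics.mean(scores)
    return [s - mean for s in scores]

# Group of 8: two correct (reward 1), six incorrect (reward 0).
advs = centered_advantages([1, 1, 0, 0, 0, 0, 0, 0])
assert all(a > 0 for a in advs[:2]) and all(a < 0 for a in advs[2:])

# Group of 1: the centered advantage is always zero.
assert centered_advantages([1]) == [0.0]
```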
Temperature: Controls the entropy of the sampling distribution. Higher temperatures increase diversity among group members, which is important for advantage estimation:
P(token_i) = softmax(logit_i / temperature)
Typical values are 0.7-1.0. Too low a temperature leads to degenerate groups where all completions are identical.
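The formula above can be checked numerically. A small sketch, assuming nothing beyond the standard softmax: low temperature concentrates mass on the top logit (near-identical completions), while temperature 1.0 leaves a flatter distribution:

```python
import math

def sample_distribution(logits, temperature):
    """P(i) = softmax(logit_i / temperature), computed stably."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max before exp to avoid overflow
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]
low = sample_distribution(logits, 0.1)   # near one-hot: degenerate groups
high = sample_distribution(logits, 1.0)  # flatter: more diverse completions
assert low[0] > high[0]
```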
Async steps: The number of batches queued ahead enables the pipeline to overlap generation (GPU-bound on vLLM engines) with training (GPU-bound on learner GPUs). With async_steps=1, the system is fully synchronous. Higher values improve throughput but increase the staleness of the policy used for generation.
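The overlap mechanism can be sketched with a bounded queue: the generator (standing in for the vLLM engines) runs ahead of the trainer, but a queue of size async_steps blocks it once that many batches are pending, which is exactly what caps policy staleness. This is a toy illustration, not the open-instruct implementation:

```python
import queue
import threading

ASYNC_STEPS = 2   # bound on batches queued ahead of the trainer
NUM_STEPS = 5

prompt_q = queue.Queue()
batch_q = queue.Queue(maxsize=ASYNC_STEPS)  # full queue blocks the generator

def generate():
    # Stand-in for the generation engines producing rollout batches.
    for _ in range(NUM_STEPS):
        prompts = prompt_q.get()
        batch_q.put([p + " -> completion" for p in prompts])

def train():
    # Stand-in for the learner consuming one batch per policy update.
    consumed = []
    for _ in range(NUM_STEPS):
        consumed.append(batch_q.get())
    return consumed

for step in range(NUM_STEPS):
    prompt_q.put([f"prompt-{step}"])

t = threading.Thread(target=generate)
t.start()
batches = train()
t.join()
assert len(batches) == NUM_STEPS
```

With maxsize=1 the generator and trainer proceed nearly in lockstep (the synchronous case); larger bounds let generation run further ahead at the cost of generating from an older policy.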
Advantage normalization: Two modes are supported:
- Standard: (score - mean) / (std + epsilon) -- normalizes advantages to zero mean and unit variance.
- Centered: score - mean -- only centers, without scaling.
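Both modes can be expressed as one helper; a minimal sketch (the function name is hypothetical, and epsilon's default is illustrative):

```python
import statistics

def normalize_advantages(scores, mode="standard", eps=1e-6):
    """Turn one group's raw scores into advantages under the two modes above."""
    mean = statistics.mean(scores)
    if mode == "standard":
        std = statistics.pstdev(scores)
        return [(s - mean) / (std + eps) for s in scores]
    if mode == "centered":
        return [s - mean for s in scores]
    raise ValueError(f"unknown mode: {mode}")

scores = [1.0, 1.0, 0.0, 0.0]
standard = normalize_advantages(scores, "standard")   # rescaled to ~unit variance
centered = normalize_advantages(scores, "centered")   # keeps the raw reward scale
assert centered == [0.5, 0.5, -0.5, -0.5]
```

The epsilon in standard mode guards against division by zero when a group's rewards are identical, the same degenerate case targeted by zero-std filtering.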