# Implementation: Allenai Open Instruct `setup_datasets`
| Type | Function |
|---|---|
| Source | `open_instruct/grpo_fast.py:L1087-1163` |
| Dependencies | `datasets`, `transformers`, `open_instruct.dataset_transformation` |
| Last Updated | 2026-02-07 00:00 GMT |
## Overview
Concrete tool for loading and preparing prompt-only datasets for RLVR-based GRPO training, provided by the Open Instruct library.
## Description

The `setup_datasets()` function is the main entry point for preparing training and evaluation datasets in the GRPO pipeline. It performs the following operations:
- Optionally loads a system prompt override from a file.
- Constructs transform function arguments that include system prompts, tool definitions, and maximum prompt length filters.
- Calls `get_cached_dataset_tulu()` to load, transform, and cache the training dataset from the configured dataset mixer.
- Validates that any per-sample tool definitions in the dataset match the configured tool call names.
- Shuffles the training dataset with the experiment seed for reproducibility.
- If an evaluation dataset mixer is configured, loads and optionally shuffles the evaluation dataset.
- Visualizes the first tokenized example for debugging purposes.
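The core preparation steps (length filtering and seeded shuffling) can be sketched in plain Python. This is an illustrative stand-in, not the open_instruct implementation; the helper name, toy tokenizer, and length limit below are all hypothetical:

```python
import random

def prepare_prompts(examples, tokenize, max_prompt_token_length, seed):
    """Illustrative sketch: tokenize prompts, drop over-length ones, shuffle with a fixed seed."""
    tokenized = []
    for ex in examples:
        input_ids = tokenize(ex["prompt"])  # stand-in for the real chat-template tokenization
        if len(input_ids) <= max_prompt_token_length:
            tokenized.append({"input_ids_prompt": input_ids, "raw_prompt": ex["prompt"]})
    # Shuffling with the experiment seed makes the training order reproducible across runs.
    random.Random(seed).shuffle(tokenized)
    return tokenized

# Toy tokenizer: one token per character.
toy_tokenize = lambda text: list(text)
data = [{"prompt": "short"}, {"prompt": "a much longer prompt that exceeds the limit"}]
kept = prepare_prompts(data, toy_tokenize, max_prompt_token_length=10, seed=42)
# Over-length prompts are dropped; the survivors are in a seed-determined order.
```

Because the shuffle is seeded, two calls with the same seed yield identical ordering, which is what makes resumed or repeated runs reproducible.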
## Usage
Import and call this function during GRPO training initialization, before creating the data preparation actor or any generation engines. It should be called exactly once per training run.
## Code Reference

### Source Location

- Repository: Open Instruct
- File: `open_instruct/grpo_fast.py`
### Signature

```python
def setup_datasets(
    args: grpo_utils.ExperimentConfig,
    tc: TokenizerConfig,
    tokenizer: PreTrainedTokenizer,
    streaming_config: data_loader_lib.StreamingDataLoaderConfig,
    tool_definitions: list[dict[str, Any]],
    pass_tools_to_chat_template: bool,
    configured_tool_call_names: list[str] | None = None,
) -> tuple[Dataset, Dataset | None]:
```
### Import

```python
from open_instruct.grpo_fast import setup_datasets
```
## I/O Contract

### Inputs

| Name | Type | Description |
|---|---|---|
| `args` | `ExperimentConfig` | Experiment configuration containing the seed, `hf_entity`, etc. |
| `tc` | `TokenizerConfig` | Tokenizer configuration specifying the chat template and special tokens. |
| `tokenizer` | `PreTrainedTokenizer` | The tokenizer instance for encoding prompts. |
| `streaming_config` | `StreamingDataLoaderConfig` | Configuration for dataset mixers, transforms, caching, and prompt length limits. |
| `tool_definitions` | `list[dict[str, Any]]` | OpenAI-format tool definitions to include in chat templates. |
| `pass_tools_to_chat_template` | `bool` | Whether to inject tool definitions into the chat template. |
| `configured_tool_call_names` | `list[str] \| None` | Optional list of valid tool call names for validation against per-sample tools. |
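The per-sample tool validation that `configured_tool_call_names` enables can be sketched as follows. The helper below is hypothetical, not an open_instruct API; it only assumes the OpenAI function-calling tool format named above:

```python
def validate_tool_names(dataset_tools, configured_tool_call_names):
    """Raise if a per-sample tool definition names a tool the run is not configured for.

    `dataset_tools` is a list of OpenAI-format tool definitions, e.g.
    {"type": "function", "function": {"name": "calculator", ...}}.
    """
    if configured_tool_call_names is None:
        return  # no validation requested
    allowed = set(configured_tool_call_names)
    for tool in dataset_tools:
        name = tool["function"]["name"]
        if name not in allowed:
            raise ValueError(
                f"Dataset tool {name!r} is not among configured tools {sorted(allowed)}"
            )

tools = [{"type": "function", "function": {"name": "calculator", "parameters": {}}}]
validate_tool_names(tools, ["calculator", "search"])  # passes silently
```

Failing fast here is preferable to discovering mid-training that generation requests reference a tool no executor is running.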
### Outputs

| Name | Type | Description |
|---|---|---|
| `train_dataset` | `Dataset` | Shuffled HuggingFace `Dataset` with tokenized prompts, ground truths, and verifier source metadata. Each example contains `input_ids_prompt`, `ground_truths`, `verifier_source`, and `raw_prompt`. |
| `eval_dataset` | `Dataset \| None` | Evaluation dataset (`None` if no eval mixer is configured); same schema as the training dataset. |
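Assuming the schema documented above, one returned example looks roughly like the dict below; the field values are illustrative, not taken from a real dataset:

```python
# Illustrative example record matching the documented output schema.
example = {
    "input_ids_prompt": [101, 2054, 2003, 1016, 1009, 1016, 102],  # token ids of the prompt
    "ground_truths": ["4"],          # reference answers for the verifier
    "verifier_source": "gsm8k",      # which verifier scores this example
    "raw_prompt": "What is 2 + 2?",  # untokenized prompt text
}

prompt_len = len(example["input_ids_prompt"])
answer = example["ground_truths"][0]
```

Downstream GRPO code typically consumes `input_ids_prompt` for generation and routes completions to the verifier named by `verifier_source` together with `ground_truths`.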
## Usage Examples

A minimal invocation sketch. The `TokenizerConfig` import path and its field name are assumptions that may differ across open_instruct versions:

```python
from open_instruct.grpo_fast import setup_datasets
from open_instruct.grpo_utils import ExperimentConfig
from open_instruct.data_loader import StreamingDataLoaderConfig
from open_instruct.dataset_transformation import TokenizerConfig  # assumed import path
from transformers import AutoTokenizer

args = ExperimentConfig(seed=42, hf_entity="allenai")
tc = TokenizerConfig(tokenizer_name_or_path="allenai/OLMo-2-7B")  # field names may vary by version
streaming_config = StreamingDataLoaderConfig(
    dataset_mixer_list=["ai2-adapt-dev/rlvr_gsm8k_zs", "1.0"],
    dataset_mixer_eval_list=["ai2-adapt-dev/rlvr_gsm8k_zs", "1.0"],
    max_prompt_token_length=256,
)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-7B")

train_dataset, eval_dataset = setup_datasets(
    args=args,
    tc=tc,
    tokenizer=tokenizer,
    streaming_config=streaming_config,
    tool_definitions=[],
    pass_tools_to_chat_template=False,
)
print(f"Train: {len(train_dataset)} examples")
if eval_dataset is not None:  # eval_dataset is None when no eval mixer is configured
    print(f"Eval: {len(eval_dataset)} examples")
```
## Related Pages

- Implements Principle