
Implementation:Allenai Open instruct Setup Datasets



Type: Function
Source: open_instruct/grpo_fast.py, lines 1087-1163
Dependencies: datasets, transformers, open_instruct.dataset_transformation
Last Updated: 2026-02-07 00:00 GMT

Overview

A concrete function, provided by the Open Instruct library, for loading and preparing prompt-only datasets for RLVR-based GRPO training.

Description

The setup_datasets() function is the main entry point for preparing training and evaluation datasets in the GRPO pipeline. It performs the following operations:

  1. Optionally loads a system prompt override from a file.
  2. Constructs transform function arguments that include system prompts, tool definitions, and maximum prompt length filters.
  3. Calls get_cached_dataset_tulu() to load, transform, and cache the training dataset from the configured dataset mixer.
  4. Validates that any per-sample tool definitions in the dataset match the configured tool call names.
  5. Shuffles the training dataset with the experiment seed for reproducibility.
  6. If an evaluation dataset mixer is configured, loads and optionally shuffles the evaluation dataset.
  7. Visualizes the first tokenized example for debugging purposes.
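Step 5 is what makes training runs reproducible: the same experiment seed always yields the same dataset order. The real code shuffles a HuggingFace `Dataset` via `Dataset.shuffle(seed=...)`; the pure-Python analogy below (the function name `seeded_shuffle` is ours, not Open Instruct's) shows the same determinism property.

```python
import random


def seeded_shuffle(examples, seed):
    """Deterministically shuffle examples, mirroring the behavior of
    Dataset.shuffle(seed=...) from the HuggingFace datasets library."""
    rng = random.Random(seed)  # seeded RNG -> identical order on every run
    shuffled = list(examples)  # copy so the input order is preserved
    rng.shuffle(shuffled)
    return shuffled


prompts = [f"prompt-{i}" for i in range(5)]
# Same seed -> same permutation, so a restarted run sees the same data order.
assert seeded_shuffle(prompts, seed=42) == seeded_shuffle(prompts, seed=42)
```

Passing a different seed generally yields a different order, which is why the experiment seed must be fixed in the configuration rather than drawn at runtime.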

Usage

Import and call this function during GRPO training initialization, before creating the data preparation actor or any generation engines. It should be called exactly once per training run.

Code Reference

Source Location: open_instruct/grpo_fast.py, lines 1087-1163

Signature

def setup_datasets(
    args: grpo_utils.ExperimentConfig,
    tc: TokenizerConfig,
    tokenizer: PreTrainedTokenizer,
    streaming_config: data_loader_lib.StreamingDataLoaderConfig,
    tool_definitions: list[dict[str, Any]],
    pass_tools_to_chat_template: bool,
    configured_tool_call_names: list[str] | None = None,
) -> tuple[Dataset, Dataset | None]:

Import

from open_instruct.grpo_fast import setup_datasets

I/O Contract

Inputs

Name Type Description
args ExperimentConfig Experiment configuration containing seed, hf_entity, etc.
tc TokenizerConfig Tokenizer configuration specifying chat template and special tokens.
tokenizer PreTrainedTokenizer The tokenizer instance for encoding prompts.
streaming_config StreamingDataLoaderConfig Configuration for dataset mixers, transforms, caching, and prompt length limits.
tool_definitions list[dict[str, Any]] OpenAI-format tool definitions to include in chat templates.
pass_tools_to_chat_template bool Whether to inject tool definitions into the chat template.
configured_tool_call_names list[str] | None Optional list of valid tool call names for validation against per-sample tools.
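The tool-related inputs above can be illustrated with a small sketch: an OpenAI-format tool definition, plus a hypothetical helper (`validate_tool_names` is our name, not part of Open Instruct's API) that mimics step 4 of the description, checking per-sample tools against `configured_tool_call_names`.

```python
# Illustrative OpenAI-format tool definition; the "search" tool here is a
# made-up example, not a tool shipped with Open Instruct.
tool_definitions = [
    {
        "type": "function",
        "function": {
            "name": "search",
            "description": "Search a corpus for relevant passages.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }
]


def validate_tool_names(per_sample_tools, configured_tool_call_names):
    """Sketch of the step-4 validation: every tool a sample declares must
    appear in the configured tool call names, otherwise raise."""
    configured = set(configured_tool_call_names or [])
    for tool in per_sample_tools:
        name = tool["function"]["name"]
        if name not in configured:
            raise ValueError(f"Tool {name!r} not in {sorted(configured)}")


# Passes: "search" is among the configured tool call names.
validate_tool_names(tool_definitions, configured_tool_call_names=["search"])
```

A mismatch (for example, a sample declaring a tool name that was never configured) raises immediately, surfacing dataset/configuration drift before training starts.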

Outputs

Name Type Description
train_dataset Dataset Shuffled HuggingFace Dataset with tokenized prompts, ground truths, and verifier source metadata. Each example contains input_ids_prompt, ground_truths, verifier_source, and raw_prompt.
eval_dataset Dataset | None Evaluation dataset; None if no eval mixer is configured. Same schema as the training dataset.
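To make the output schema concrete, here is a hypothetical training example with the four documented fields. The token ids and values are illustrative only, not drawn from a real dataset.

```python
# Hypothetical example matching the documented output schema.
example = {
    "input_ids_prompt": [101, 2054, 2003, 1016, 1009, 1016, 102],  # tokenized prompt
    "ground_truths": ["4"],            # answers the verifier checks completions against
    "verifier_source": "gsm8k",        # which verifier scores this sample
    "raw_prompt": "What is 2 + 2?",    # untokenized prompt text, kept for debugging
}

expected_keys = {"input_ids_prompt", "ground_truths", "verifier_source", "raw_prompt"}
assert expected_keys <= example.keys()
```

Downstream consumers (the data preparation actor and generation engines) read `input_ids_prompt` for generation and pass `ground_truths` plus `verifier_source` to the reward verifier.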

Usage Examples

from open_instruct.grpo_fast import setup_datasets
from open_instruct.grpo_utils import ExperimentConfig
from open_instruct.data_loader import StreamingDataLoaderConfig
from open_instruct.dataset_transformation import TokenizerConfig
from transformers import AutoTokenizer

args = ExperimentConfig(seed=42, hf_entity="allenai")
tokenizer_config = TokenizerConfig()  # set chat template / special tokens as needed
streaming_config = StreamingDataLoaderConfig(
    dataset_mixer_list=["ai2-adapt-dev/rlvr_gsm8k_zs", "1.0"],
    dataset_mixer_eval_list=["ai2-adapt-dev/rlvr_gsm8k_zs", "1.0"],
    max_prompt_token_length=256,
)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-7B")

train_dataset, eval_dataset = setup_datasets(
    args=args,
    tc=tokenizer_config,
    tokenizer=tokenizer,
    streaming_config=streaming_config,
    tool_definitions=[],
    pass_tools_to_chat_template=False,
)
print(f"Train: {len(train_dataset)} examples, Eval: {len(eval_dataset)} examples")

