# Implementation: Allenai Open Instruct `setup_datasets`
| Type | Function |
|---|---|
| Source | `open_instruct/grpo_fast.py:L1087-1163` |
| Dependencies | `datasets`, `transformers`, `open_instruct.dataset_transformation` |
| Last Updated | 2026-02-07 00:00 GMT |
## Overview
Concrete tool for loading and preparing prompt-only datasets for RLVR-based GRPO training, provided by the Open Instruct library.
## Description

The `setup_datasets()` function is the main entry point for preparing training and evaluation datasets in the GRPO pipeline. It performs the following operations:
- Optionally loads a system prompt override from a file.
- Constructs transform function arguments that include system prompts, tool definitions, and maximum prompt length filters.
- Calls `get_cached_dataset_tulu()` to load, transform, and cache the training dataset from the configured dataset mixer.
- Validates that any per-sample tool definitions in the dataset match the configured tool call names.
- Shuffles the training dataset with the experiment seed for reproducibility.
- If an evaluation dataset mixer is configured, loads and optionally shuffles the evaluation dataset.
- Visualizes the first tokenized example for debugging purposes.
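The core preparation steps (length filtering and seeded shuffling) can be sketched in plain Python. This is an illustrative stand-in, not the open_instruct implementation; the helper name, toy tokenizer, and length limit below are all hypothetical:

```python
import random

def prepare_prompts(examples, tokenize, max_prompt_token_length, seed):
    """Illustrative sketch: tokenize prompts, drop over-length ones, shuffle with a fixed seed."""
    tokenized = []
    for ex in examples:
        input_ids = tokenize(ex["prompt"])  # stand-in for the real chat-template tokenization
        if len(input_ids) <= max_prompt_token_length:
            tokenized.append({"input_ids_prompt": input_ids, "raw_prompt": ex["prompt"]})
    # Shuffling with the experiment seed makes the training order reproducible across runs.
    random.Random(seed).shuffle(tokenized)
    return tokenized

# Toy tokenizer: one token per character.
toy_tokenize = lambda text: list(text)
data = [{"prompt": "short"}, {"prompt": "a much longer prompt that exceeds the limit"}]
kept = prepare_prompts(data, toy_tokenize, max_prompt_token_length=10, seed=42)
# Over-length prompts are dropped; the survivors are in a seed-determined order.
```

Because the shuffle is seeded, two calls with the same seed yield identical ordering, which is what makes resumed or repeated runs reproducible.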
## Usage
Import and call this function during GRPO training initialization, before creating the data preparation actor or any generation engines. It should be called exactly once per training run.
## Code Reference

### Source Location

- Repository: Open Instruct
- File: `open_instruct/grpo_fast.py`
### Signature

```python
def setup_datasets(
    args: grpo_utils.ExperimentConfig,
    tc: TokenizerConfig,
    tokenizer: PreTrainedTokenizer,
    streaming_config: data_loader_lib.StreamingDataLoaderConfig,
    tool_definitions: list[dict[str, Any]],
    pass_tools_to_chat_template: bool,
    configured_tool_call_names: list[str] | None = None,
) -> tuple[Dataset, Dataset | None]:
```
### Import

```python
from open_instruct.grpo_fast import setup_datasets
```
## I/O Contract

### Inputs

| Name | Type | Description |
|---|---|---|
| `args` | `ExperimentConfig` | Experiment configuration containing the seed, `hf_entity`, etc. |
| `tc` | `TokenizerConfig` | Tokenizer configuration specifying the chat template and special tokens. |
| `tokenizer` | `PreTrainedTokenizer` | The tokenizer instance for encoding prompts. |
| `streaming_config` | `StreamingDataLoaderConfig` | Configuration for dataset mixers, transforms, caching, and prompt length limits. |
| `tool_definitions` | `list[dict[str, Any]]` | OpenAI-format tool definitions to include in chat templates. |
| `pass_tools_to_chat_template` | `bool` | Whether to inject tool definitions into the chat template. |
| `configured_tool_call_names` | `list[str] \| None` | Optional list of valid tool call names for validation against per-sample tools. |
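The per-sample tool validation that `configured_tool_call_names` enables can be sketched as follows. The helper below is hypothetical, not an open_instruct API; it only assumes the OpenAI function-calling tool format named above:

```python
def validate_tool_names(dataset_tools, configured_tool_call_names):
    """Raise if a per-sample tool definition names a tool the run is not configured for.

    `dataset_tools` is a list of OpenAI-format tool definitions, e.g.
    {"type": "function", "function": {"name": "calculator", ...}}.
    """
    if configured_tool_call_names is None:
        return  # no validation requested
    allowed = set(configured_tool_call_names)
    for tool in dataset_tools:
        name = tool["function"]["name"]
        if name not in allowed:
            raise ValueError(
                f"Dataset tool {name!r} is not among configured tools {sorted(allowed)}"
            )

tools = [{"type": "function", "function": {"name": "calculator", "parameters": {}}}]
validate_tool_names(tools, ["calculator", "search"])  # passes silently
```

Failing fast here is preferable to discovering mid-training that generation requests reference a tool no executor is running.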
### Outputs

| Name | Type | Description |
|---|---|---|
| `train_dataset` | `Dataset` | Shuffled HuggingFace `Dataset` with tokenized prompts, ground truths, and verifier source metadata. Each example contains `input_ids_prompt`, `ground_truths`, `verifier_source`, and `raw_prompt`. |
| `eval_dataset` | `Dataset \| None` | Evaluation dataset (`None` if no eval mixer is configured); same schema as the training dataset. |
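Assuming the schema documented above, one returned example looks roughly like the dict below; the field values are illustrative, not taken from a real dataset:

```python
# Illustrative example record matching the documented output schema.
example = {
    "input_ids_prompt": [101, 2054, 2003, 1016, 1009, 1016, 102],  # token ids of the prompt
    "ground_truths": ["4"],          # reference answers for the verifier
    "verifier_source": "gsm8k",      # which verifier scores this example
    "raw_prompt": "What is 2 + 2?",  # untokenized prompt text
}

prompt_len = len(example["input_ids_prompt"])
answer = example["ground_truths"][0]
```

Downstream GRPO code typically consumes `input_ids_prompt` for generation and routes completions to the verifier named by `verifier_source` together with `ground_truths`.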
## Usage Examples

A minimal invocation sketch. The `TokenizerConfig` import path and its field name are assumptions that may differ across open_instruct versions:

```python
from open_instruct.grpo_fast import setup_datasets
from open_instruct.grpo_utils import ExperimentConfig
from open_instruct.data_loader import StreamingDataLoaderConfig
from open_instruct.dataset_transformation import TokenizerConfig  # assumed import path
from transformers import AutoTokenizer

args = ExperimentConfig(seed=42, hf_entity="allenai")
tc = TokenizerConfig(tokenizer_name_or_path="allenai/OLMo-2-7B")  # field names may vary by version
streaming_config = StreamingDataLoaderConfig(
    dataset_mixer_list=["ai2-adapt-dev/rlvr_gsm8k_zs", "1.0"],
    dataset_mixer_eval_list=["ai2-adapt-dev/rlvr_gsm8k_zs", "1.0"],
    max_prompt_token_length=256,
)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-7B")

train_dataset, eval_dataset = setup_datasets(
    args=args,
    tc=tc,
    tokenizer=tokenizer,
    streaming_config=streaming_config,
    tool_definitions=[],
    pass_tools_to_chat_template=False,
)
print(f"Train: {len(train_dataset)} examples")
if eval_dataset is not None:  # eval_dataset is None when no eval mixer is configured
    print(f"Eval: {len(eval_dataset)} examples")
```
## Related Pages

- Implements Principle