Implementation: AllenAI Open Instruct get_cached_dataset_tulu
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Data Engineering, Natural Language Processing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A concrete tool, provided by the Open Instruct library, for loading, mixing, tokenizing, and caching instruction-tuning datasets.
Description
The get_cached_dataset_tulu() function is the primary entry point for preparing SFT training data. It orchestrates the full dataset pipeline: loading datasets from the HuggingFace Hub or local paths, mixing them according to specified ratios, applying transformation functions (tokenization, truncation, filtering), and caching the results using a SHA-based configuration hash. Internally, it delegates to get_cached_dataset_tulu_with_statistics() and returns only the dataset (dropping the statistics). The cache can be stored locally on disk or pushed to the HuggingFace Hub for shared access across machines.
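The caching scheme can be sketched in isolation: serialize the full dataset configuration deterministically, hash it, and use the digest as the cache key, so any change to the mixer list, splits, or transform arguments yields a fresh cache entry. The helper below is a hypothetical, standard-library-only illustration of this idea, not the actual open_instruct implementation.

```python
import hashlib
import json

def config_hash(dataset_mixer_list, splits, transform_fns, transform_fn_args):
    """Hypothetical sketch of a SHA-based cache key: any change to the
    configuration produces a different hash, so stale caches are never reused."""
    payload = json.dumps(
        {
            "mixer": dataset_mixer_list,
            "splits": splits,
            "transforms": transform_fns,
            "transform_args": transform_fn_args,
        },
        sort_keys=True,  # deterministic serialization, independent of dict order
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

# Same config -> same key; changing max_seq_length -> different key.
key_a = config_hash(["ds_a", "1.0"], ["train"], ["tokenize_v1"], [{"max_seq_length": 2048}])
key_b = config_hash(["ds_a", "1.0"], ["train"], ["tokenize_v1"], [{"max_seq_length": 4096}])
```

Because the key is derived purely from the configuration, a cache hit guarantees the stored dataset was produced with identical settings.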
Usage
Import and call this function when setting up training data for SFT. It is typically invoked in the main() function of finetune.py inside an accelerator.main_process_first() context, so that only the main process performs the data transformation while the other processes load the result from the cache.
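The coordination pattern can be illustrated without a training cluster: one process materializes the cache while the others wait and then read it. The context manager below is a toy stand-in for accelerator.main_process_first(), and prepare_dataset is a hypothetical wrapper, not code from finetune.py.

```python
from contextlib import contextmanager

CACHE = {}  # stands in for the on-disk dataset cache

@contextmanager
def main_process_first(is_main):
    """Toy stand-in for accelerator.main_process_first(): in real training,
    non-main processes block here until the main process exits the context."""
    yield is_main

def prepare_dataset(is_main, config_hash):
    # Hypothetical wrapper mirroring the documented call pattern.
    with main_process_first(is_main):
        if is_main and config_hash not in CACHE:
            CACHE[config_hash] = ["tokenized example"]  # the expensive transform
        return CACHE[config_hash]  # every process reads the same cached result

main_ds = prepare_dataset(True, "abc123")    # main process builds the cache
worker_ds = prepare_dataset(False, "abc123")  # workers only load it
```

In real multi-GPU runs this avoids every rank re-tokenizing the data: the transform runs once, and all ranks see an identical cached dataset.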
Code Reference
Source Location
- Repository: Open Instruct
- File: open_instruct/dataset_transformation.py
- Lines: 2072-2102
Signature
```python
def get_cached_dataset_tulu(
    dataset_mixer_list: list[str],
    dataset_mixer_list_splits: list[str],
    tc: TokenizerConfig,
    dataset_transform_fn: list[str],
    transform_fn_args: list[dict[str, Any]],
    target_columns: list[str] | None = None,
    dataset_cache_mode: Literal["hf", "local"] = "local",
    dataset_config_hash: str | None = None,
    hf_entity: str | None = None,
    dataset_local_cache_dir: str = "local_dataset_cache",
    dataset_skip_cache: bool = False,
    dataset_config_seed: int = 42,
    system_prompt_override: str | None = None,
) -> Dataset:
```
Import
```python
from open_instruct.dataset_transformation import get_cached_dataset_tulu
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset_mixer_list | list[str] | Yes | Alternating list of dataset names/paths and their mixing ratios (e.g., ["dataset_name", "1.0", "other_dataset", "0.5"]). |
| dataset_mixer_list_splits | list[str] | Yes | The dataset splits to use (e.g., ["train"]). Applied cyclically to the datasets in the mixer list. |
| tc | TokenizerConfig | Yes | Tokenizer configuration containing the tokenizer path, chat template name, and other tokenizer settings. |
| dataset_transform_fn | list[str] | Yes | List of transformation function names to apply sequentially (e.g., ["sft_tulu_tokenize_and_truncate_v1", "sft_tulu_filter_v1"]). |
| transform_fn_args | list[dict[str, Any]] | Yes | Arguments for each transform function. Must have the same length as dataset_transform_fn. |
| target_columns | list[str] or None | No | Columns to keep in the final dataset. If None, all columns are kept. |
| dataset_cache_mode | Literal["hf", "local"] | No | Where to store the cache: "local" (disk) or "hf" (HuggingFace Hub). Defaults to "local". |
| dataset_config_hash | str or None | No | Pre-computed config hash. If None, the hash is computed from the configuration. |
| hf_entity | str or None | No | HuggingFace entity for Hub caching. Required if dataset_cache_mode="hf". |
| dataset_local_cache_dir | str | No | Directory for local caching. Defaults to "local_dataset_cache". |
| dataset_skip_cache | bool | No | If True, bypasses the cache and reprocesses from scratch. Defaults to False. |
| dataset_config_seed | int | No | Random seed for dataset shuffling and sampling. Defaults to 42. |
| system_prompt_override | str or None | No | If set, overrides the system prompt in the chat messages. |
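To make the alternating name/ratio convention and the cyclic split assignment concrete, here is a small stand-alone sketch of how such arguments could be paired up; parse_mixer_list is a hypothetical helper written for illustration, not part of open_instruct.

```python
from itertools import cycle

def parse_mixer_list(mixer_list, splits):
    """Pair alternating name/ratio entries and assign splits cyclically
    (a hypothetical illustration of the documented input conventions)."""
    if len(mixer_list) % 2 != 0:
        raise ValueError("mixer list must alternate dataset names and ratios")
    names = mixer_list[0::2]                    # even positions: dataset names
    ratios = [float(r) for r in mixer_list[1::2]]  # odd positions: ratios as strings
    return list(zip(names, ratios, cycle(splits)))

pairs = parse_mixer_list(
    ["allenai/tulu-3-sft-personas-algebra", "0.5",
     "allenai/tulu-3-sft-personas-code", "0.3"],
    ["train"],
)
# each entry: (dataset_name, sampling_ratio, split)
```

With a single-element splits list, cycling simply applies "train" to every dataset; a longer list would be distributed round-robin across the mixer entries.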
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | A HuggingFace Dataset object with columns: input_ids (list[int]), attention_mask (list[int]), labels (list[int]). Non-assistant tokens in labels are masked with -100. |
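The labels convention can be demonstrated with plain Python: PyTorch's cross-entropy loss skips positions labeled -100, so only assistant tokens contribute to the SFT loss. The token IDs and role spans below are invented for illustration; mask_labels is a minimal sketch of the convention, not open_instruct's tokenizer code.

```python
IGNORE_INDEX = -100  # positions with this label are ignored by the loss

def mask_labels(input_ids, assistant_mask):
    """Copy input_ids into labels, replacing non-assistant positions with -100
    so the model is only trained to predict the assistant's tokens."""
    return [tok if is_assistant else IGNORE_INDEX
            for tok, is_assistant in zip(input_ids, assistant_mask)]

input_ids = [101, 202, 303, 404, 505]
assistant_mask = [False, False, True, True, True]  # last 3 tokens: assistant reply
labels = mask_labels(input_ids, assistant_mask)
```

In the cached dataset, prompt and system-message tokens get -100 while assistant-response tokens keep their IDs, which is exactly what a causal-LM loss with ignore_index=-100 expects.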
Usage Examples
Basic Usage
```python
from open_instruct.dataset_transformation import get_cached_dataset_tulu, TokenizerConfig

tc = TokenizerConfig(
    tokenizer_name_or_path="allenai/Llama-3.1-Tulu-3-8B",
    chat_template_name="tulu",
)
dataset = get_cached_dataset_tulu(
    dataset_mixer_list=["allenai/tulu-3-sft-personas-algebra", "1.0"],
    dataset_mixer_list_splits=["train"],
    tc=tc,
    dataset_transform_fn=["sft_tulu_tokenize_and_truncate_v1", "sft_tulu_filter_v1"],
    transform_fn_args=[{"max_seq_length": 2048}, {}],
    target_columns=["input_ids", "attention_mask", "labels"],
)

# dataset is ready for training
print(dataset.column_names)  # ['input_ids', 'attention_mask', 'labels']
print(len(dataset))
```
Multi-Dataset Mixing
```python
dataset = get_cached_dataset_tulu(
    dataset_mixer_list=[
        "allenai/tulu-3-sft-personas-algebra", "0.5",
        "allenai/tulu-3-sft-personas-code", "0.3",
        "allenai/tulu-3-sft-personas-general", "0.2",
    ],
    dataset_mixer_list_splits=["train"],
    tc=tc,
    dataset_transform_fn=["sft_tulu_tokenize_and_truncate_v1", "sft_tulu_filter_v1"],
    transform_fn_args=[{"max_seq_length": 4096}, {}],
)
```