
Implementation:Allenai Open instruct Get Cached Dataset Tulu

From Leeroopedia


Knowledge Sources
Domains Machine Learning, Data Engineering, Natural Language Processing
Last Updated 2026-02-07 00:00 GMT

Overview

A concrete tool from the Open Instruct library for loading, mixing, tokenizing, and caching instruction-tuning datasets.

Description

The get_cached_dataset_tulu() function is the primary entry point for preparing SFT training data. It orchestrates the full dataset pipeline: loading datasets from the HuggingFace Hub or local paths, mixing them according to specified ratios, applying transformation functions (tokenization, truncation, filtering), and caching the results using a SHA-based configuration hash. Internally, it delegates to get_cached_dataset_tulu_with_statistics() and returns only the dataset (dropping the statistics). The cache can be stored locally on disk or pushed to the HuggingFace Hub for shared access across machines.
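The cache-key mechanism described above can be sketched with a small, self-contained example. This is an illustrative reconstruction, not the library's actual code: the helper `compute_config_hash` is hypothetical, and shows only how a deterministic SHA-based key could be derived from the mixer list, splits, transform names, and arguments.

```python
# Illustrative sketch of SHA-based cache keying (hypothetical helper,
# not the open_instruct implementation): serialize the dataset
# configuration deterministically and hash it, so identical configs
# resolve to the same cache entry.
import hashlib
import json
from typing import Any

def compute_config_hash(
    dataset_mixer_list: list[str],
    dataset_mixer_list_splits: list[str],
    dataset_transform_fn: list[str],
    transform_fn_args: list[dict[str, Any]],
    seed: int = 42,
) -> str:
    config = {
        "mixer": dataset_mixer_list,
        "splits": dataset_mixer_list_splits,
        "transforms": dataset_transform_fn,
        "transform_args": transform_fn_args,
        "seed": seed,
    }
    # sort_keys makes dict serialization order-independent, so
    # semantically identical configs produce identical hashes.
    blob = json.dumps(config, sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:10]

h = compute_config_hash(
    ["allenai/tulu-3-sft-personas-algebra", "1.0"],
    ["train"],
    ["sft_tulu_tokenize_and_truncate_v1", "sft_tulu_filter_v1"],
    [{"max_seq_length": 2048}, {}],
)
print(h)  # short hex digest, usable as a cache directory name
```

Any change to the mixer list, splits, transforms, or seed yields a different digest, which is what lets the real function decide between reusing a cached dataset and reprocessing from scratch.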

Usage

Import and call this function when setting up training data for SFT. It is typically invoked in the main() function of finetune.py, inside an accelerator.main_process_first() context, so that only the main process performs the data transformation while the other processes load from the cache.
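The ordering guarantee of that pattern can be illustrated with a runnable sketch. `DummyAccelerator` below is a simplified stand-in for accelerate's Accelerator, so the example runs without a distributed setup; the real accelerator.main_process_first() barriers non-main ranks until the main process leaves the block.

```python
# Hedged sketch of the finetune.py call pattern. DummyAccelerator is a
# stand-in for accelerate.Accelerator so the example runs standalone;
# the real main_process_first() makes non-main ranks wait at a barrier.
from contextlib import contextmanager

class DummyAccelerator:
    is_main_process = True

    @contextmanager
    def main_process_first(self):
        # Real implementation: non-main ranks block here, the main
        # process runs the body (tokenize + write cache), then the
        # others proceed and hit the cache instead of re-tokenizing.
        yield

accelerator = DummyAccelerator()
events = []
with accelerator.main_process_first():
    # On the main process this would call get_cached_dataset_tulu(...)
    # and populate the cache; other ranks reach it afterwards.
    events.append("prepare_dataset")
events.append("training_loop")
print(events)  # ['prepare_dataset', 'training_loop']
```

The point of the pattern is purely ordering: dataset preparation happens exactly once before any rank enters the training loop.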

Code Reference

Source Location

  • Repository: Open Instruct
  • File: open_instruct/dataset_transformation.py
  • Lines: L2072-2102

Signature

def get_cached_dataset_tulu(
    dataset_mixer_list: list[str],
    dataset_mixer_list_splits: list[str],
    tc: TokenizerConfig,
    dataset_transform_fn: list[str],
    transform_fn_args: list[dict[str, Any]],
    target_columns: list[str] | None = None,
    dataset_cache_mode: Literal["hf", "local"] = "local",
    dataset_config_hash: str | None = None,
    hf_entity: str | None = None,
    dataset_local_cache_dir: str = "local_dataset_cache",
    dataset_skip_cache: bool = False,
    dataset_config_seed: int = 42,
    system_prompt_override: str | None = None,
) -> Dataset:

Import

from open_instruct.dataset_transformation import get_cached_dataset_tulu

I/O Contract

Inputs

Name Type Required Description
dataset_mixer_list list[str] Yes Alternating list of dataset names/paths and their mixing ratios (e.g., ["dataset_name", "1.0", "other_dataset", "0.5"]).
dataset_mixer_list_splits list[str] Yes The dataset splits to use (e.g., ["train"]). Applied cyclically to the datasets in the mixer list.
tc TokenizerConfig Yes Tokenizer configuration containing the tokenizer path, chat template name, and other tokenizer settings.
dataset_transform_fn list[str] Yes List of transformation function names to apply sequentially (e.g., ["sft_tulu_tokenize_and_truncate_v1", "sft_tulu_filter_v1"]).
transform_fn_args list[dict[str, Any]] Yes Arguments for each transform function. Must have the same length as dataset_transform_fn.
target_columns list[str] or None No Columns to keep in the final dataset. If None, all columns are kept.
dataset_cache_mode Literal["hf", "local"] No Where to store the cache: "local" (disk) or "hf" (HuggingFace Hub). Defaults to "local".
dataset_config_hash str or None No Pre-computed config hash. If None, the hash is computed from the configuration.
hf_entity str or None No HuggingFace entity for Hub caching. Required if dataset_cache_mode="hf".
dataset_local_cache_dir str No Directory for local caching. Defaults to "local_dataset_cache".
dataset_skip_cache bool No If True, bypasses the cache and reprocesses from scratch. Defaults to False.
dataset_config_seed int No Random seed for dataset shuffling and sampling. Defaults to 42.
system_prompt_override str or None No If set, overrides the system prompt in the chat messages.
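The alternating name/ratio convention of dataset_mixer_list and the cyclic application of dataset_mixer_list_splits can be made concrete with a short stdlib sketch. The helper names here are hypothetical, not part of the library:

```python
# Hypothetical helpers illustrating the dataset_mixer_list and
# dataset_mixer_list_splits conventions described in the table above.
from itertools import cycle

def parse_mixer_list(mixer: list[str]) -> list[tuple[str, float]]:
    # The list alternates dataset name and mixing ratio:
    # ["ds_a", "1.0", "ds_b", "0.5"] -> [("ds_a", 1.0), ("ds_b", 0.5)]
    if len(mixer) % 2 != 0:
        raise ValueError("mixer list must alternate name, ratio")
    return [(mixer[i], float(mixer[i + 1])) for i in range(0, len(mixer), 2)]

def assign_splits(datasets, splits):
    # Splits are applied cyclically: a single ["train"] covers every
    # dataset; ["train", "validation"] would alternate between them.
    return [(name, ratio, split)
            for (name, ratio), split in zip(datasets, cycle(splits))]

pairs = parse_mixer_list(["ds_a", "1.0", "ds_b", "0.5", "ds_c", "0.2"])
print(assign_splits(pairs, ["train"]))
```

This also shows why a single-element splits list is sufficient for the common case of training on the "train" split of every mixed dataset.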

Outputs

Name Type Description
dataset Dataset A HuggingFace Dataset object with columns: input_ids (list[int]), attention_mask (list[int]), labels (list[int]). The labels have non-assistant tokens masked with -100.
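The -100 convention in the labels column matters for training: PyTorch's cross-entropy loss ignores targets equal to -100 by default, so only assistant tokens contribute to the loss. A minimal sketch of the masking (with made-up token IDs and roles, not the library's tokenizer output):

```python
# Sketch of the labels convention: non-assistant positions are set to
# -100, the default ignore_index of torch.nn.CrossEntropyLoss.
IGNORE_INDEX = -100

def mask_labels(input_ids: list[int], assistant_mask: list[bool]) -> list[int]:
    # Copy input_ids into labels, replacing every non-assistant
    # position with IGNORE_INDEX so it is excluded from the loss.
    return [tok if is_assistant else IGNORE_INDEX
            for tok, is_assistant in zip(input_ids, assistant_mask)]

input_ids = [11, 22, 33, 44, 55]
assistant_mask = [False, False, True, True, True]  # last 3 = assistant reply
labels = mask_labels(input_ids, assistant_mask)
print(labels)  # [-100, -100, 33, 44, 55]
```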

Usage Examples

Basic Usage

from open_instruct.dataset_transformation import get_cached_dataset_tulu, TokenizerConfig

tc = TokenizerConfig(
    tokenizer_name_or_path="allenai/Llama-3.1-Tulu-3-8B",
    chat_template_name="tulu",
)

dataset = get_cached_dataset_tulu(
    dataset_mixer_list=["allenai/tulu-3-sft-personas-algebra", "1.0"],
    dataset_mixer_list_splits=["train"],
    tc=tc,
    dataset_transform_fn=["sft_tulu_tokenize_and_truncate_v1", "sft_tulu_filter_v1"],
    transform_fn_args=[{"max_seq_length": 2048}, {}],
    target_columns=["input_ids", "attention_mask", "labels"],
)

# dataset is ready for training
print(dataset.column_names)  # ['input_ids', 'attention_mask', 'labels']
print(len(dataset))

Multi-Dataset Mixing

dataset = get_cached_dataset_tulu(
    dataset_mixer_list=[
        "allenai/tulu-3-sft-personas-algebra", "0.5",
        "allenai/tulu-3-sft-personas-code", "0.3",
        "allenai/tulu-3-sft-personas-general", "0.2",
    ],
    dataset_mixer_list_splits=["train"],
    tc=tc,
    dataset_transform_fn=["sft_tulu_tokenize_and_truncate_v1", "sft_tulu_filter_v1"],
    transform_fn_args=[{"max_seq_length": 4096}, {}],
)
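The ratios in the example above control how much of each source dataset enters the mix. A hedged sketch of how a fractional ratio could deterministically downsample a dataset (the exact semantics, e.g. how ratios above 1.0 upsample, are defined by the library):

```python
# Illustrative downsampling by mixing ratio (hypothetical helper, not
# the library's mixing code): shuffle with a fixed seed, then truncate
# to ratio * len(examples) so the subsample is deterministic.
import random

def downsample(examples: list, ratio: float, seed: int = 42) -> list:
    rng = random.Random(seed)
    shuffled = examples[:]          # copy; leave the input untouched
    rng.shuffle(shuffled)
    return shuffled[: int(len(shuffled) * ratio)]

# Mixing three 100-example datasets at ratios 0.5 / 0.3 / 0.2:
mixed = (
    downsample(list(range(100)), 0.5)
    + downsample(list(range(100, 200)), 0.3)
    + downsample(list(range(200, 300)), 0.2)
)
print(len(mixed))  # 50 + 30 + 20 = 100
```

Using a fixed seed is what makes the mix reproducible across runs, which is also why dataset_config_seed participates in the cache key.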
