Implementation: AllenAI Open Instruct get_cached_dataset_tulu
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Data Engineering, Natural Language Processing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A concrete tool, provided by the Open Instruct library, for loading, mixing, tokenizing, and caching instruction-tuning datasets.
Description
The get_cached_dataset_tulu() function is the primary entry point for preparing SFT training data. It orchestrates the full dataset pipeline: loading datasets from the HuggingFace Hub or local paths, mixing them according to specified ratios, applying transformation functions (tokenization, truncation, filtering), and caching the results using a SHA-based configuration hash. Internally, it delegates to get_cached_dataset_tulu_with_statistics() and returns only the dataset (dropping the statistics). The cache can be stored locally on disk or pushed to the HuggingFace Hub for shared access across machines.
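The caching scheme can be sketched in isolation: serialize the full dataset configuration deterministically, hash it, and use the digest as the cache key, so any change to the mixer list, splits, or transform arguments yields a fresh cache entry. The helper below is a hypothetical, standard-library-only illustration of this idea, not the actual open_instruct implementation.

```python
import hashlib
import json

def config_hash(dataset_mixer_list, splits, transform_fns, transform_fn_args):
    """Hypothetical sketch of a SHA-based cache key: any change to the
    configuration produces a different hash, so stale caches are never reused."""
    payload = json.dumps(
        {
            "mixer": dataset_mixer_list,
            "splits": splits,
            "transforms": transform_fns,
            "transform_args": transform_fn_args,
        },
        sort_keys=True,  # deterministic serialization, independent of dict order
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

# Same config -> same key; changing max_seq_length -> different key.
key_a = config_hash(["ds_a", "1.0"], ["train"], ["tokenize_v1"], [{"max_seq_length": 2048}])
key_b = config_hash(["ds_a", "1.0"], ["train"], ["tokenize_v1"], [{"max_seq_length": 4096}])
```

Because the key is derived purely from the configuration, a cache hit guarantees the stored dataset was produced with identical settings.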
Usage
Import and call this function when setting up training data for SFT. It is typically invoked in the main() function of finetune.py inside an accelerator.main_process_first() context, so that only the main process performs the data transformation while the other processes load the result from the cache.
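The coordination pattern can be illustrated without a training cluster: one process materializes the cache while the others wait and then read it. The context manager below is a toy stand-in for accelerator.main_process_first(), and prepare_dataset is a hypothetical wrapper, not code from finetune.py.

```python
from contextlib import contextmanager

CACHE = {}  # stands in for the on-disk dataset cache

@contextmanager
def main_process_first(is_main):
    """Toy stand-in for accelerator.main_process_first(): in real training,
    non-main processes block here until the main process exits the context."""
    yield is_main

def prepare_dataset(is_main, config_hash):
    # Hypothetical wrapper mirroring the documented call pattern.
    with main_process_first(is_main):
        if is_main and config_hash not in CACHE:
            CACHE[config_hash] = ["tokenized example"]  # the expensive transform
        return CACHE[config_hash]  # every process reads the same cached result

main_ds = prepare_dataset(True, "abc123")    # main process builds the cache
worker_ds = prepare_dataset(False, "abc123")  # workers only load it
```

In real multi-GPU runs this avoids every rank re-tokenizing the data: the transform runs once, and all ranks see an identical cached dataset.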
Code Reference
Source Location
- Repository: Open Instruct
- File: open_instruct/dataset_transformation.py
- Lines: 2072-2102
Signature
```python
def get_cached_dataset_tulu(
    dataset_mixer_list: list[str],
    dataset_mixer_list_splits: list[str],
    tc: TokenizerConfig,
    dataset_transform_fn: list[str],
    transform_fn_args: list[dict[str, Any]],
    target_columns: list[str] | None = None,
    dataset_cache_mode: Literal["hf", "local"] = "local",
    dataset_config_hash: str | None = None,
    hf_entity: str | None = None,
    dataset_local_cache_dir: str = "local_dataset_cache",
    dataset_skip_cache: bool = False,
    dataset_config_seed: int = 42,
    system_prompt_override: str | None = None,
) -> Dataset:
```
Import
```python
from open_instruct.dataset_transformation import get_cached_dataset_tulu
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset_mixer_list | list[str] | Yes | Alternating list of dataset names/paths and their mixing ratios (e.g., ["dataset_name", "1.0", "other_dataset", "0.5"]). |
| dataset_mixer_list_splits | list[str] | Yes | The dataset splits to use (e.g., ["train"]). Applied cyclically to the datasets in the mixer list. |
| tc | TokenizerConfig | Yes | Tokenizer configuration containing the tokenizer path, chat template name, and other tokenizer settings. |
| dataset_transform_fn | list[str] | Yes | List of transformation function names to apply sequentially (e.g., ["sft_tulu_tokenize_and_truncate_v1", "sft_tulu_filter_v1"]). |
| transform_fn_args | list[dict[str, Any]] | Yes | Arguments for each transform function. Must have the same length as dataset_transform_fn. |
| target_columns | list[str] or None | No | Columns to keep in the final dataset. If None, all columns are kept. |
| dataset_cache_mode | Literal["hf", "local"] | No | Where to store the cache: "local" (disk) or "hf" (HuggingFace Hub). Defaults to "local". |
| dataset_config_hash | str or None | No | Pre-computed config hash. If None, the hash is computed from the configuration. |
| hf_entity | str or None | No | HuggingFace entity for Hub caching. Required if dataset_cache_mode="hf". |
| dataset_local_cache_dir | str | No | Directory for local caching. Defaults to "local_dataset_cache". |
| dataset_skip_cache | bool | No | If True, bypasses the cache and reprocesses from scratch. Defaults to False. |
| dataset_config_seed | int | No | Random seed for dataset shuffling and sampling. Defaults to 42. |
| system_prompt_override | str or None | No | If set, overrides the system prompt in the chat messages. |
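To make the alternating name/ratio convention and the cyclic split assignment concrete, here is a small stand-alone sketch of how such arguments could be paired up; parse_mixer_list is a hypothetical helper written for illustration, not part of open_instruct.

```python
from itertools import cycle

def parse_mixer_list(mixer_list, splits):
    """Pair alternating name/ratio entries and assign splits cyclically
    (a hypothetical illustration of the documented input conventions)."""
    if len(mixer_list) % 2 != 0:
        raise ValueError("mixer list must alternate dataset names and ratios")
    names = mixer_list[0::2]                    # even positions: dataset names
    ratios = [float(r) for r in mixer_list[1::2]]  # odd positions: ratios as strings
    return list(zip(names, ratios, cycle(splits)))

pairs = parse_mixer_list(
    ["allenai/tulu-3-sft-personas-algebra", "0.5",
     "allenai/tulu-3-sft-personas-code", "0.3"],
    ["train"],
)
# each entry: (dataset_name, sampling_ratio, split)
```

With a single-element splits list, cycling simply applies "train" to every dataset; a longer list would be distributed round-robin across the mixer entries.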
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | A HuggingFace Dataset object with columns: input_ids (list[int]), attention_mask (list[int]), labels (list[int]). Non-assistant tokens in labels are masked with -100. |
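The labels convention can be demonstrated with plain Python: PyTorch's cross-entropy loss skips positions labeled -100, so only assistant tokens contribute to the SFT loss. The token IDs and role spans below are invented for illustration; mask_labels is a minimal sketch of the convention, not open_instruct's tokenizer code.

```python
IGNORE_INDEX = -100  # positions with this label are ignored by the loss

def mask_labels(input_ids, assistant_mask):
    """Copy input_ids into labels, replacing non-assistant positions with -100
    so the model is only trained to predict the assistant's tokens."""
    return [tok if is_assistant else IGNORE_INDEX
            for tok, is_assistant in zip(input_ids, assistant_mask)]

input_ids = [101, 202, 303, 404, 505]
assistant_mask = [False, False, True, True, True]  # last 3 tokens: assistant reply
labels = mask_labels(input_ids, assistant_mask)
```

In the cached dataset, prompt and system-message tokens get -100 while assistant-response tokens keep their IDs, which is exactly what a causal-LM loss with ignore_index=-100 expects.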
Usage Examples
Basic Usage
```python
from open_instruct.dataset_transformation import get_cached_dataset_tulu, TokenizerConfig

tc = TokenizerConfig(
    tokenizer_name_or_path="allenai/Llama-3.1-Tulu-3-8B",
    chat_template_name="tulu",
)
dataset = get_cached_dataset_tulu(
    dataset_mixer_list=["allenai/tulu-3-sft-personas-algebra", "1.0"],
    dataset_mixer_list_splits=["train"],
    tc=tc,
    dataset_transform_fn=["sft_tulu_tokenize_and_truncate_v1", "sft_tulu_filter_v1"],
    transform_fn_args=[{"max_seq_length": 2048}, {}],
    target_columns=["input_ids", "attention_mask", "labels"],
)

# dataset is ready for training
print(dataset.column_names)  # ['input_ids', 'attention_mask', 'labels']
print(len(dataset))
```
Multi-Dataset Mixing
```python
dataset = get_cached_dataset_tulu(
    dataset_mixer_list=[
        "allenai/tulu-3-sft-personas-algebra", "0.5",
        "allenai/tulu-3-sft-personas-code", "0.3",
        "allenai/tulu-3-sft-personas-general", "0.2",
    ],
    dataset_mixer_list_splits=["train"],
    tc=tc,
    dataset_transform_fn=["sft_tulu_tokenize_and_truncate_v1", "sft_tulu_filter_v1"],
    transform_fn_args=[{"max_seq_length": 4096}, {}],
)
```