Principle: AllenAI open-instruct SFT Dataset Preparation
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Natural Language Processing, Data Engineering |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
SFT dataset preparation is the process of loading, mixing, tokenizing, and caching instruction-tuning datasets so that they are ready for supervised fine-tuning of language models.
Description
Before a language model can be fine-tuned on instruction-following data, the raw datasets must undergo several transformation stages. This principle covers the end-to-end pipeline of preparing training data for SFT:
Dataset mixing allows combining multiple instruction-tuning datasets in specified ratios. Each dataset may contribute a different proportion of the final training set, controlled by mixing ratios. For example, a mix might include 50% of a math reasoning dataset, 30% of a general instruction-following set, and 20% of a code dataset. Mixing ratios can be specified as fractions (proportions of the original dataset) or absolute sample counts.
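The mixing step can be sketched in plain Python. The helper below is hypothetical (open-instruct operates on HuggingFace `Dataset` objects, not lists), but it shows the core rule: a float ratio is treated as a proportion of the source dataset, an int as an absolute sample count.

```python
import random

def mix_datasets(datasets, ratios, seed=42):
    """Sample from each dataset according to its mixing ratio.

    A float ratio is a proportion of that dataset; an int is an
    absolute sample count. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    mixed = []
    for examples, ratio in zip(datasets, ratios):
        n = int(ratio * len(examples)) if isinstance(ratio, float) else ratio
        mixed.extend(rng.sample(examples, n))
    rng.shuffle(mixed)  # interleave sources so batches are not blocked by dataset
    return mixed

math_ds = [{"src": "math", "id": i} for i in range(100)]
code_ds = [{"src": "code", "id": i} for i in range(50)]
mixed = mix_datasets([math_ds, code_ds], [0.5, 20])
# 0.5 * 100 = 50 math examples plus 20 code examples -> 70 total
```

Sampling with a seeded `random.Random` keeps the mix deterministic, which matters later when the configuration is hashed for caching.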
Tokenization converts the raw text conversations into token IDs suitable for the model. This involves applying a chat template to structure the conversation into the model's expected format (e.g., adding special tokens for user/assistant turns), then encoding the result into integer token sequences. The tokenizer must be properly configured with the correct chat template, BOS/EOS tokens, and special tokens for the target model.
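As a minimal sketch of the templating half of this step (real pipelines call `tokenizer.apply_chat_template` and then encode to IDs), a Tulu-style renderer might look like:

```python
def apply_chat_template(messages, bos="<s>", eos="</s>"):
    """Render a conversation in a Tulu-style template.

    Sketch only: the role markers, BOS/EOS strings, and newline
    placement here are illustrative, not the exact open-instruct format.
    """
    parts = [bos]
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}\n")
    # EOS terminates the final assistant turn
    return "".join(parts).rstrip("\n") + eos

conv = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]
text = apply_chat_template(conv)
# "<s><|user|>\nWhat is 2 + 2?\n<|assistant|>\n4</s>"
```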
Caching avoids redundant reprocessing. A SHA-based hash is computed from the dataset configuration (dataset names, mixing ratios, tokenizer settings, transform functions) to create a deterministic cache key. If a cached version with the same hash exists, it is loaded directly. Caching can be performed locally on disk or via the HuggingFace Hub.
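The load-or-build logic can be sketched as follows. This is illustrative only: open-instruct caches tokenized HF `Dataset` objects (locally or on the Hub), not JSON files, but the control flow is the same.

```python
import hashlib
import json
import os
import tempfile

def load_or_build(cache_dir, config, build_fn):
    """Return the cached dataset if the config hash matches,
    otherwise build it with build_fn and store it.

    Returns (data, cache_hit). Hypothetical helper for illustration.
    """
    key = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()
    path = os.path.join(cache_dir, f"{key}.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f), True    # cache hit: skip reprocessing
    data = build_fn()
    with open(path, "w") as f:
        json.dump(data, f)
    return data, False                   # cache miss: built and stored

cache = tempfile.mkdtemp()
cfg = {"datasets": ["tulu"], "ratios": [1.0], "seed": 42}
_, hit1 = load_or_build(cache, cfg, lambda: [[1, 2, 3]])
_, hit2 = load_or_build(cache, cfg, lambda: [[1, 2, 3]])
# hit1 is False (first call builds), hit2 is True (second call loads)
```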
Usage
Use this technique whenever preparing training data for SFT. It is especially important when:
- Training with multiple datasets that need to be combined in specific ratios
- Iterating on experiments where re-tokenizing the same data would waste compute
- Working in distributed training environments where data consistency across workers is critical
- Reproducing prior training runs, since the SHA-based cache key ensures identical data preparation
Theoretical Basis
The dataset preparation pipeline follows a functional transformation model:
raw_datasets -> mix(ratios) -> tokenize(chat_template) -> filter(max_seq_length) -> cache(SHA_hash) -> Dataset
The filter stage removes examples whose tokenized length exceeds max_seq_length, so every cached example fits in the model's context window.
Mixing ratios: Given datasets D_1, D_2, ..., D_k with mixing ratios r_1, r_2, ..., r_k, the final dataset D is formed by sampling n_i = r_i * |D_i| examples from each dataset (when r_i is a float proportion) or n_i = r_i examples (when r_i is an integer count).
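A worked instance of this formula, with made-up dataset sizes:

```python
sizes  = [100_000, 60_000, 40_000]   # |D_1|, |D_2|, |D_3|
ratios = [0.5, 30_000, 1.0]          # float proportion, int count, float proportion

# n_i = r_i * |D_i| for float ratios, n_i = r_i for int counts
n = [int(r * s) if isinstance(r, float) else r
     for r, s in zip(ratios, sizes)]
# n == [50000, 30000, 40000]
```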
Cache key computation: The config hash is computed as:
hash = SHA256(
dataset_names || mixing_ratios || split_names ||
tokenizer_files_hash || chat_template ||
transform_fn_names || transform_fn_args ||
seed
)
This ensures that any change to the dataset configuration produces a different cache key, preventing stale data from being used.
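The key idea can be shown with `hashlib`. This is a sketch, not the exact open-instruct implementation (which also folds in tokenizer file hashes and transform-function details): serializing the config with sorted keys makes the hash deterministic, and any field change flips it.

```python
import hashlib
import json

def config_hash(config: dict) -> str:
    """Deterministic cache key from a dataset configuration.

    sort_keys=True makes equal configs serialize identically,
    so the SHA-256 digest is stable across runs.
    """
    blob = json.dumps(config, sort_keys=True, ensure_ascii=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

base = {"datasets": ["math", "code"], "ratios": [0.5, 0.5],
        "chat_template": "tulu", "seed": 42}
h1 = config_hash(base)
h2 = config_hash({**base, "seed": 43})
# h1 != h2: changing any field (here, the seed) yields a new cache key
```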
Tokenization with chat templates: Each conversation is formatted as:
[BOS] <|user|>\n{user_message}\n<|assistant|>\n{assistant_message}[EOS]
The exact format depends on the chat template (e.g., tulu, zephyr, chatml). Labels are constructed by copying the input_ids and masking non-assistant tokens with -100, so the loss is computed only on the assistant's responses.
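The masking step can be sketched as follows. The `(start, end)` span representation is a hypothetical simplification; real code derives assistant-token positions from the chat template while tokenizing.

```python
def build_labels(input_ids, assistant_spans, ignore_index=-100):
    """Copy input_ids, then mask every token outside the assistant
    spans with -100 so the loss covers only assistant responses.

    assistant_spans: list of (start, end) index pairs, end exclusive.
    """
    labels = [ignore_index] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels

ids = [1, 15, 16, 17, 28, 40, 41, 2]   # toy token IDs: BOS, user turn, assistant turn, EOS
labels = build_labels(ids, [(5, 8)])   # assistant turn = last three tokens
# labels == [-100, -100, -100, -100, -100, 40, 41, 2]
```

Because PyTorch's cross-entropy loss ignores targets equal to -100 by default, the prompt tokens contribute nothing to the gradient.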