Principle: AllenAI open-instruct SFT Dataset Preparation
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Natural Language Processing, Data Engineering |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
SFT dataset preparation is the process of loading, mixing, tokenizing, and caching instruction-tuning datasets so that they are ready for supervised fine-tuning of language models.
Description
Before a language model can be fine-tuned on instruction-following data, the raw datasets must undergo several transformation stages. This principle covers the end-to-end pipeline of preparing training data for SFT:
Dataset mixing allows combining multiple instruction-tuning datasets in specified ratios. Each dataset may contribute a different proportion of the final training set, controlled by mixing ratios. For example, a mix might include 50% of a math reasoning dataset, 30% of a general instruction-following set, and 20% of a code dataset. Mixing ratios can be specified as fractions (proportions of the original dataset) or absolute sample counts.
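The mixing step can be sketched in plain Python. The helper below is hypothetical (open-instruct operates on HuggingFace `Dataset` objects, not lists), but it shows the core rule: a float ratio is treated as a proportion of the source dataset, an int as an absolute sample count.

```python
import random

def mix_datasets(datasets, ratios, seed=42):
    """Sample from each dataset according to its mixing ratio.

    A float ratio is a proportion of that dataset; an int is an
    absolute sample count. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    mixed = []
    for examples, ratio in zip(datasets, ratios):
        n = int(ratio * len(examples)) if isinstance(ratio, float) else ratio
        mixed.extend(rng.sample(examples, n))
    rng.shuffle(mixed)  # interleave sources so batches are not blocked by dataset
    return mixed

math_ds = [{"src": "math", "id": i} for i in range(100)]
code_ds = [{"src": "code", "id": i} for i in range(50)]
mixed = mix_datasets([math_ds, code_ds], [0.5, 20])
# 0.5 * 100 = 50 math examples plus 20 code examples -> 70 total
```

Sampling with a seeded `random.Random` keeps the mix deterministic, which matters later when the configuration is hashed for caching.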
Tokenization converts the raw text conversations into token IDs suitable for the model. This involves applying a chat template to structure the conversation into the model's expected format (e.g., adding special tokens for user/assistant turns), then encoding the result into integer token sequences. The tokenizer must be properly configured with the correct chat template, BOS/EOS tokens, and special tokens for the target model.
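As a minimal sketch of the templating half of this step (real pipelines call `tokenizer.apply_chat_template` and then encode to IDs), a Tulu-style renderer might look like:

```python
def apply_chat_template(messages, bos="<s>", eos="</s>"):
    """Render a conversation in a Tulu-style template.

    Sketch only: the role markers, BOS/EOS strings, and newline
    placement here are illustrative, not the exact open-instruct format.
    """
    parts = [bos]
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}\n")
    # EOS terminates the final assistant turn
    return "".join(parts).rstrip("\n") + eos

conv = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]
text = apply_chat_template(conv)
# "<s><|user|>\nWhat is 2 + 2?\n<|assistant|>\n4</s>"
```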
Caching avoids redundant reprocessing. A SHA-based hash is computed from the dataset configuration (dataset names, mixing ratios, tokenizer settings, transform functions) to create a deterministic cache key. If a cached version with the same hash exists, it is loaded directly. Caching can be performed locally on disk or via the HuggingFace Hub.
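The load-or-build logic can be sketched as follows. This is illustrative only: open-instruct caches tokenized HF `Dataset` objects (locally or on the Hub), not JSON files, but the control flow is the same.

```python
import hashlib
import json
import os
import tempfile

def load_or_build(cache_dir, config, build_fn):
    """Return the cached dataset if the config hash matches,
    otherwise build it with build_fn and store it.

    Returns (data, cache_hit). Hypothetical helper for illustration.
    """
    key = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()
    path = os.path.join(cache_dir, f"{key}.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f), True    # cache hit: skip reprocessing
    data = build_fn()
    with open(path, "w") as f:
        json.dump(data, f)
    return data, False                   # cache miss: built and stored

cache = tempfile.mkdtemp()
cfg = {"datasets": ["tulu"], "ratios": [1.0], "seed": 42}
_, hit1 = load_or_build(cache, cfg, lambda: [[1, 2, 3]])
_, hit2 = load_or_build(cache, cfg, lambda: [[1, 2, 3]])
# hit1 is False (first call builds), hit2 is True (second call loads)
```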
Usage
Use this technique whenever preparing training data for SFT. It is especially important when:
- Training with multiple datasets that need to be combined in specific ratios
- Iterating on experiments where re-tokenizing the same data would waste compute
- Working in distributed training environments where data consistency across workers is critical
- Reproducing prior training runs, since the SHA-based cache key ensures identical data preparation
Theoretical Basis
The dataset preparation pipeline follows a functional transformation model:
raw_datasets -> mix(ratios) -> tokenize(chat_template) -> filter(max_seq_length) -> cache(SHA_hash) -> Dataset
The filter stage removes examples whose tokenized length exceeds max_seq_length, so every cached example fits in the model's context window.
Mixing ratios: Given datasets D_1, D_2, ..., D_k with mixing ratios r_1, r_2, ..., r_k, the final dataset D is formed by sampling n_i = r_i * |D_i| examples from each dataset (when r_i is a float proportion) or n_i = r_i examples (when r_i is an integer count).
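A worked instance of this formula, with made-up dataset sizes:

```python
sizes  = [100_000, 60_000, 40_000]   # |D_1|, |D_2|, |D_3|
ratios = [0.5, 30_000, 1.0]          # float proportion, int count, float proportion

# n_i = r_i * |D_i| for float ratios, n_i = r_i for int counts
n = [int(r * s) if isinstance(r, float) else r
     for r, s in zip(ratios, sizes)]
# n == [50000, 30000, 40000]
```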
Cache key computation: The config hash is computed as:
hash = SHA256(
dataset_names || mixing_ratios || split_names ||
tokenizer_files_hash || chat_template ||
transform_fn_names || transform_fn_args ||
seed
)
This ensures that any change to the dataset configuration produces a different cache key, preventing stale data from being used.
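The key idea can be shown with `hashlib`. This is a sketch, not the exact open-instruct implementation (which also folds in tokenizer file hashes and transform-function details): serializing the config with sorted keys makes the hash deterministic, and any field change flips it.

```python
import hashlib
import json

def config_hash(config: dict) -> str:
    """Deterministic cache key from a dataset configuration.

    sort_keys=True makes equal configs serialize identically,
    so the SHA-256 digest is stable across runs.
    """
    blob = json.dumps(config, sort_keys=True, ensure_ascii=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

base = {"datasets": ["math", "code"], "ratios": [0.5, 0.5],
        "chat_template": "tulu", "seed": 42}
h1 = config_hash(base)
h2 = config_hash({**base, "seed": 43})
# h1 != h2: changing any field (here, the seed) yields a new cache key
```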
Tokenization with chat templates: Each conversation is formatted as:
[BOS] <|user|>\n{user_message}\n<|assistant|>\n{assistant_message}[EOS]
The exact format depends on the chat template (e.g., tulu, zephyr, chatml). Labels are constructed by copying the input_ids and masking non-assistant tokens with -100, so the loss is computed only on the assistant's responses.
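The masking step can be sketched as follows. The `(start, end)` span representation is a hypothetical simplification; real code derives assistant-token positions from the chat template while tokenizing.

```python
def build_labels(input_ids, assistant_spans, ignore_index=-100):
    """Copy input_ids, then mask every token outside the assistant
    spans with -100 so the loss covers only assistant responses.

    assistant_spans: list of (start, end) index pairs, end exclusive.
    """
    labels = [ignore_index] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels

ids = [1, 15, 16, 17, 28, 40, 41, 2]   # toy token IDs: BOS, user turn, assistant turn, EOS
labels = build_labels(ids, [(5, 8)])   # assistant turn = last three tokens
# labels == [-100, -100, -100, -100, -100, 40, 41, 2]
```

Because PyTorch's cross-entropy loss ignores targets equal to -100 by default, the prompt tokens contribute nothing to the gradient.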