Principle:Microsoft DeepSpeedExamples Instruction Dataset Preparation
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Instruction_Dataset_Preparation |
| Repository | Microsoft/DeepSpeedExamples |
| Sources | Paper: Alpaca https://crfm.stanford.edu/2023/03/13/alpaca.html |
| Domains | NLP, Data_Processing, Fine_Tuning |
| Status | Active |
| Related Implementation | Implementation:Microsoft_DeepSpeedExamples_Load_And_Preprocess_Dataset |
Overview
A data processing technique that formats instruction-following datasets using structured templates for causal language model fine-tuning.
Description
Instruction datasets use a standardized template consisting of three fields:
- instruction -- The task description telling the model what to do.
- input -- Optional additional context or input for the task.
- output (response) -- The expected model response.
The Alpaca format wraps these fields in a structured prompt template. Each example is formatted as:
### Instruction:
{instruction}
### Input:
{input}
### Response:
{output}
When the input field is empty or absent, the ### Input: section is omitted entirely.
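The formatting rule can be sketched as a small helper. This is a minimal illustration, not the repository's code: the three template strings are the ones listed under Template Constants, while the function name `format_alpaca_example` and the dictionary-based record are assumptions made for this example.

```python
# Template strings as listed under "Template Constants".
ALPACA_INSTRUCTION_TEMPLATE = "### Instruction:\n{instruction}\n\n"
ALPACA_INPUT_TEMPLATE = "### Input:\n{input}\n\n"
ALPACA_RESPONSE_TEMPLATE = "### Response:\n{output}"


def format_alpaca_example(example: dict) -> str:
    """Hypothetical helper: render one record as a single training string."""
    text = ALPACA_INSTRUCTION_TEMPLATE.format(instruction=example["instruction"])
    # Skip the "### Input:" section when the field is empty or absent.
    if example.get("input"):
        text += ALPACA_INPUT_TEMPLATE.format(input=example["input"])
    text += ALPACA_RESPONSE_TEMPLATE.format(output=example["output"])
    return text
```

Calling the helper on a record with an empty input yields only the instruction and response sections, separated by a blank line.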
The preprocessing pipeline performs the following steps:
- Template application -- Each raw example is formatted according to the Alpaca template using string formatting.
- Tokenization -- The formatted text is converted to model input IDs using the model's tokenizer, with truncation to max_length and padding to max_length.
- Label assignment -- The input_ids are copied to labels for causal language model training, where the model predicts each next token.
- Dataset subsetting -- An optional percentage parameter allows using only a fraction of the full dataset for benchmarking or quick experiments.
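The truncation, padding, label-copy, and subsetting steps can be sketched without any library, using plain lists of token IDs in place of a real tokenizer. The helper names `take_percentage` and `encode_fixed_length` are hypothetical, and two details are assumptions rather than facts from the source: that padding is appended on the right, and that the subset is taken from the start of the dataset.

```python
def take_percentage(examples: list, percentage: float) -> list:
    """Keep the first `percentage` percent of the examples (assumed policy)."""
    keep = int(len(examples) * percentage / 100)
    return examples[:keep]


def encode_fixed_length(token_ids: list, max_length: int, pad_id: int) -> dict:
    """Truncate to max_length, right-pad to max_length, copy ids into labels."""
    input_ids = token_ids[:max_length]          # truncation
    attention_mask = [1] * len(input_ids)       # 1 marks real tokens
    n_pad = max_length - len(input_ids)
    input_ids = input_ids + [pad_id] * n_pad    # padding (right side assumed)
    attention_mask = attention_mask + [0] * n_pad
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": list(input_ids),              # independent copy for the CLM loss
    }
```

Every encoded example has exactly max_length entries, which is what lets examples be stacked into a batch tensor without further collation logic.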
Theoretical Basis
For causal language model (CLM) fine-tuning, the labels are set equal to the input_ids. The model internally shifts the labels by one position so that at each position i, it predicts token i+1 given tokens 0 through i. This is the standard autoregressive training objective:
L = -sum_{i=1}^{N} log P(token_i | token_0, ..., token_{i-1})
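To make the label shift concrete, the sketch below evaluates the objective for a toy sequence. It assumes the model's per-position output is available as a dictionary of token probabilities, which is a simplification for illustration; the function name `causal_lm_loss` is hypothetical. It returns the sum written in the formula, whereas training libraries commonly report the mean over positions.

```python
import math


def causal_lm_loss(step_probs: list, labels: list) -> float:
    """Summed negative log-likelihood with a one-position label shift.

    step_probs[i] maps token id -> probability of that token appearing at
    position i + 1, as predicted from tokens 0..i.
    """
    targets = labels[1:]  # the prediction made at position i is scored against token i + 1
    return -sum(math.log(step_probs[i][t]) for i, t in enumerate(targets))


# Toy sequence of three tokens: two predictions are scored.
labels = [7, 3, 9]
step_probs = [{3: 0.5, 9: 0.5}, {3: 0.75, 9: 0.25}]
loss = causal_lm_loss(step_probs, labels)  # -(log 0.5 + log 0.25) = log 8
```

The first token has no preceding context, so it contributes no term to the loss.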
The template structure ensures the model learns to follow instructions by establishing a consistent pattern:
### Instruction:
{instruction}
### Input:
{input}
### Response:
{response}
During inference, the model is prompted with the instruction and input sections, and it generates the response section by continuing the pattern it learned during fine-tuning.
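A prompt for inference can be assembled the same way as a training example, stopping right after the response header so the model continues from there. This is a sketch under that assumption; `build_inference_prompt` is a hypothetical name, and the section strings mirror the template shown above.

```python
def build_inference_prompt(instruction: str, input_text: str = "") -> str:
    """Hypothetical helper: build a prompt that ends at the response header."""
    prompt = "### Instruction:\n" + instruction + "\n\n"
    if input_text:  # same omission rule as in training
        prompt += "### Input:\n" + input_text + "\n\n"
    prompt += "### Response:\n"  # the model generates what follows
    return prompt
```

Keeping the prompt byte-for-byte consistent with the training template matters, since the model has only seen the response follow this exact header.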
Key design decisions:
- Truncation to max_length -- Prevents out-of-memory errors and ensures uniform tensor shapes. Default is 2048 tokens.
- Padding to max_length -- Creates uniform-length tensors for efficient batching. The pad_token is set to eos_token if not defined by the tokenizer.
- Labels = input_ids -- The entire sequence (instruction + input + response) is used as both input and target. This means the model learns to predict the instruction tokens as well, not just the response. This is a simpler but effective approach compared to masking instruction tokens in the labels.
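The difference between the two labeling strategies can be shown side by side. The sketch below is illustrative only: the approach described here corresponds to `full_sequence_labels`, while `response_only_labels` shows the masking alternative that is mentioned but not used. Both function names and the `prompt_length` parameter are hypothetical. The value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so positions carrying it contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def full_sequence_labels(input_ids: list) -> list:
    """Approach described here: every token is a training target."""
    return list(input_ids)


def response_only_labels(input_ids: list, prompt_length: int) -> list:
    """Alternative (not used here): mask the prompt so only the response is scored."""
    n = min(prompt_length, len(input_ids))
    return [IGNORE_INDEX] * n + list(input_ids[n:])
```

With the full-sequence variant no prompt length needs to be tracked per example, which is part of why it is the simpler option.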
Dataset Format
The Alpaca dataset (tatsu-lab/alpaca) contains 52,000 instruction-following examples with the following schema:
| Field | Type | Description | Example |
|---|---|---|---|
| instruction | string | Task description | "Give three tips for staying healthy." |
| input | string (optional) | Additional context | "" (often empty) |
| output | string | Expected response | "1. Eat a balanced diet..." |
Template Constants
The following template constants are defined in finetune_zero3.py:
ALPACA_INSTRUCTION_TEMPLATE = "### Instruction:\n{instruction}\n\n"
ALPACA_INPUT_TEMPLATE = "### Input:\n{input}\n\n"
ALPACA_RESPONSE_TEMPLATE = "### Response:\n{output}"
Usage Pattern
- Load a HuggingFace dataset by name (e.g., tatsu-lab/alpaca).
- Optionally select a percentage subset for benchmarking.
- Apply the Alpaca template to each example.
- Tokenize with the model's tokenizer using truncation and padding.
- Set labels = input_ids.copy() for causal LM training.
- Wrap in a PyTorch DataLoader with batch_size=1 and shuffle=True.