Principle:Microsoft DeepSpeedExamples Instruction Dataset Preparation

Metadata

Field	Value
Page Type	Principle
Title	Instruction_Dataset_Preparation
Repository	Microsoft/DeepSpeedExamples
Sources	Paper: Alpaca https://crfm.stanford.edu/2023/03/13/alpaca.html
Domains	NLP, Data_Processing, Fine_Tuning
Status	Active
Related Implementation	Implementation:Microsoft_DeepSpeedExamples_Load_And_Preprocess_Dataset

Overview

A data processing technique that formats instruction-following datasets using structured templates for causal language model fine-tuning.

Description

Instruction datasets use a standardized template consisting of three fields:

instruction -- The task description telling the model what to do.
input -- Optional additional context or input for the task.
output (response) -- The expected model response.

The Alpaca format wraps these fields in a structured prompt template. Each example is formatted as:

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}

When the input field is empty or absent, the ### Input: section is omitted entirely.

The preprocessing pipeline performs the following steps:

Template application -- Each raw example is formatted according to the Alpaca template using string formatting.
Tokenization -- The formatted text is converted to input IDs with the model's tokenizer. Sequences longer than max_length are truncated, and shorter ones are padded up to max_length.
Label assignment -- The input_ids are copied to labels for causal language model training, where the model predicts each next token.
Dataset subsetting -- An optional percentage parameter allows using only a fraction of the full dataset for benchmarking or quick experiments.
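The truncation, padding, and label-assignment steps above can be illustrated without a real tokenizer. The following is a minimal sketch operating on plain lists of token IDs; the helper names pad_or_truncate and build_features are hypothetical, and right-side padding is an assumption (the padding side depends on the tokenizer configuration).

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Clip to max_length, then right-pad with pad_id up to max_length."""
    clipped = list(token_ids[:max_length])
    return clipped + [pad_id] * (max_length - len(clipped))


def build_features(token_ids, max_length, pad_id):
    """Return fixed-length features with labels copied from input_ids."""
    input_ids = pad_or_truncate(token_ids, max_length, pad_id)
    n_real = min(len(token_ids), max_length)
    return {
        "input_ids": input_ids,
        # 1 marks a real token, 0 marks padding.
        "attention_mask": [1] * n_real + [0] * (max_length - n_real),
        # A copy, not an alias, so later edits to one do not affect the other.
        "labels": list(input_ids),
    }


short_example = build_features([5, 6, 7], max_length=5, pad_id=0)
long_example = build_features([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=0)
```

Both results have exactly max_length entries, which is what makes uniform batching possible.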

Theoretical Basis

For causal language model (CLM) fine-tuning, the labels are set equal to the input_ids. The model internally shifts the labels by one position so that at each position i, it predicts token i+1 given tokens 0 through i. This is the standard autoregressive training objective:

L = -sum_{i=1}^{N} log P(token_i | token_0, ..., token_{i-1})
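To make the one-position shift concrete, here is a small worked sketch of the summed objective in pure Python. The function name causal_lm_loss and the toy probability tables are illustrative assumptions, not part of the repository; a real model produces a full distribution over the vocabulary at each position, and library implementations commonly average the loss over tokens rather than summing it.

```python
import math


def causal_lm_loss(step_probs, labels):
    """Summed negative log-likelihood with labels shifted by one.

    step_probs[i] maps a token ID to P(next token | labels[0..i]).
    """
    targets = labels[1:]  # the prediction made at position i is scored against token i+1
    total = 0.0
    # zip stops at the shorter sequence, so the last position's prediction is unused.
    for probs, target in zip(step_probs, targets):
        total -= math.log(probs[target])
    return total


labels = [10, 11, 12]
step_probs = [
    {11: 0.5, 12: 0.5},    # after token 10, the model gives token 11 probability 0.5
    {12: 0.25, 10: 0.75},  # after tokens 10, 11, it gives token 12 probability 0.25
    {10: 1.0},             # prediction after the final token has no target
]
loss = causal_lm_loss(step_probs, labels)  # -(log 0.5 + log 0.25) = log 8
```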

The template structure ensures the model learns to follow instructions by establishing a consistent pattern:

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}

During inference, the model is prompted with the instruction and input sections, and it generates the response section by continuing the pattern it learned during fine-tuning.
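A minimal sketch of how such an inference prompt might be assembled: the prompt stops right after the "### Response:" header so the model continues from there. The function name build_inference_prompt is hypothetical and does not come from the repository.

```python
def build_inference_prompt(instruction, input_text=""):
    """Build the prompt up to (and including) the response header."""
    prompt = "### Instruction:\n" + instruction + "\n\n"
    if input_text:  # the Input section is omitted when there is no input
        prompt += "### Input:\n" + input_text + "\n\n"
    prompt += "### Response:\n"  # left open for the model to complete
    return prompt


prompt = build_inference_prompt("Give three tips for staying healthy.")
```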

Key design decisions:

Truncation to max_length -- Prevents out-of-memory errors and ensures uniform tensor shapes. Default is 2048 tokens.
Padding to max_length -- Creates uniform-length tensors for efficient batching. The pad_token is set to eos_token if not defined by the tokenizer.
Labels = input_ids -- The entire sequence (instruction + input + response) is used as both input and target, so the model also learns to predict the instruction tokens, not just the response. This is simpler than masking the instruction tokens in the labels and remains effective.
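The difference between the two labeling strategies can be sketched on plain token ID lists. The first function reflects the approach described on this page; the second shows the masking alternative only for contrast. Both function names are hypothetical, and the value -100 is the default ignore_index of torch.nn.CrossEntropyLoss, which is assumed here as the masking convention.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def full_sequence_labels(input_ids):
    """Approach described here: every token is a training target."""
    return list(input_ids)


def response_only_labels(input_ids, prompt_length):
    """Alternative for contrast: prompt tokens are excluded from the loss."""
    n = min(prompt_length, len(input_ids))
    return [IGNORE_INDEX] * n + list(input_ids[n:])


input_ids = [101, 102, 103, 201, 202]  # first three IDs stand in for the prompt
full = full_sequence_labels(input_ids)
masked = response_only_labels(input_ids, prompt_length=3)
```

The full-sequence variant needs no knowledge of where the prompt ends, which is the simplicity the design decision refers to.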

Dataset Format

The Alpaca dataset (tatsu-lab/alpaca) contains roughly 52,000 instruction-following examples with the following schema:

Field	Type	Description	Example
`instruction`	string	Task description	"Give three tips for staying healthy."
`input`	string (optional)	Additional context	"" (often empty)
`output`	string	Expected response	"1. Eat a balanced diet..."

Template Constants

The following template constants are defined in finetune_zero3.py:

ALPACA_INSTRUCTION_TEMPLATE = "### Instruction:\n{instruction}\n\n"
ALPACA_INPUT_TEMPLATE = "### Input:\n{input}\n\n"
ALPACA_RESPONSE_TEMPLATE = "### Response:\n{output}"
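One way these constants could be combined into a single training string is sketched below. The function name format_example is an assumption for illustration and may not match the helper used in finetune_zero3.py; the sketch only follows the rule stated earlier that the Input section is dropped when the input field is empty or absent.

```python
ALPACA_INSTRUCTION_TEMPLATE = "### Instruction:\n{instruction}\n\n"
ALPACA_INPUT_TEMPLATE = "### Input:\n{input}\n\n"
ALPACA_RESPONSE_TEMPLATE = "### Response:\n{output}"


def format_example(example):
    """Render one raw record as Alpaca-style training text."""
    text = ALPACA_INSTRUCTION_TEMPLATE.format(instruction=example["instruction"])
    if example.get("input"):  # skip the section for "" or a missing key
        text += ALPACA_INPUT_TEMPLATE.format(input=example["input"])
    text += ALPACA_RESPONSE_TEMPLATE.format(output=example["output"])
    return text


record = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet...",
}
formatted = format_example(record)
```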

Usage Pattern

Load a HuggingFace dataset by name (e.g., tatsu-lab/alpaca).
Optionally select a percentage subset for benchmarking.
Apply the Alpaca template to each example.
Tokenize with the model's tokenizer using truncation and padding.
Set labels = input_ids.copy() for causal LM training.
Wrap in a PyTorch DataLoader with batch_size=1 and shuffle=True.
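The steps above can be sketched end to end as follows. This is an illustrative outline under stated assumptions, not the repository's implementation: the names subset_size and build_dataloader and their parameters are hypothetical, the subset is assumed to be the first N examples, and the third-party imports (datasets, transformers, torch) are kept inside the function so the small helper can be used without them.

```python
def subset_size(total, percentage):
    """Number of examples to keep for a percentage in (0, 100]."""
    if not 0 < percentage <= 100:
        raise ValueError("percentage must be in (0, 100]")
    return max(1, int(total * percentage / 100))


def build_dataloader(model_name, dataset_name="tatsu-lab/alpaca",
                     percentage=100.0, max_length=2048):
    """Hypothetical end-to-end sketch of the usage pattern."""
    from datasets import load_dataset
    from torch.utils.data import DataLoader
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # fall back to EOS for padding

    dataset = load_dataset(dataset_name, split="train")
    dataset = dataset.select(range(subset_size(len(dataset), percentage)))

    def preprocess(example):
        # Apply the Alpaca template, omitting the Input section when empty.
        text = "### Instruction:\n" + example["instruction"] + "\n\n"
        if example["input"]:
            text += "### Input:\n" + example["input"] + "\n\n"
        text += "### Response:\n" + example["output"]
        enc = tokenizer(text, truncation=True, max_length=max_length,
                        padding="max_length")
        return {
            "input_ids": enc["input_ids"],
            "attention_mask": enc["attention_mask"],
            "labels": list(enc["input_ids"]),  # labels are a copy of input_ids
        }

    dataset = dataset.map(preprocess, remove_columns=dataset.column_names)
    dataset.set_format(type="torch")  # return tensors instead of Python lists
    return DataLoader(dataset, batch_size=1, shuffle=True)
```

Calling build_dataloader downloads the dataset and tokenizer, so it needs network access and the three libraries installed; subset_size can be checked on its own.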
