Principle:Microsoft DeepSpeedExamples Instruction Dataset Preparation

From Leeroopedia


Metadata

Field | Value
Page Type | Principle
Title | Instruction_Dataset_Preparation
Repository | Microsoft/DeepSpeedExamples
Sources | Paper: Alpaca https://crfm.stanford.edu/2023/03/13/alpaca.html
Domains | NLP, Data_Processing, Fine_Tuning
Status | Active
Related Implementation | Implementation:Microsoft_DeepSpeedExamples_Load_And_Preprocess_Dataset

Overview

A data processing technique that formats instruction-following datasets using structured templates for causal language model fine-tuning.

Description

Instruction datasets use a standardized template consisting of three fields:

  • instruction -- The task description telling the model what to do.
  • input -- Optional additional context or input for the task.
  • output (response) -- The expected model response.

The Alpaca format wraps these fields in a structured prompt template. Each example is formatted as:

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}

When the input field is empty or absent, the ### Input: section is omitted entirely.

The preprocessing pipeline performs the following steps:

  1. Template application -- Each raw example is formatted according to the Alpaca template using string formatting.
  2. Tokenization -- The formatted text is converted to input IDs with the model's tokenizer, truncating longer sequences to max_length and padding shorter ones up to max_length.
  3. Label assignment -- The input_ids are copied to labels for causal language model training, where the model predicts each next token.
  4. Dataset subsetting -- An optional percentage parameter allows using only a fraction of the full dataset for benchmarking or quick experiments.
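
Steps 2 through 4 can be sketched without any dependencies. The sketch below is illustrative only: toy_tokenize is a hypothetical whitespace tokenizer standing in for the model's tokenizer, PAD_ID and the helper names are assumptions, and taking the first rows of the dataset is an assumed selection rule for the percentage parameter.

```python
PAD_ID = 0  # assumption: id reserved for padding in this toy vocabulary


def toy_tokenize(text, vocab):
    """Hypothetical whitespace tokenizer standing in for the model's tokenizer."""
    # New words get ids 1, 2, 3, ... so that 0 stays free for padding.
    return [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]


def encode(text, vocab, max_length):
    """Steps 2 and 3: truncate, pad to max_length, then copy input_ids to labels."""
    ids = toy_tokenize(text, vocab)[:max_length]
    input_ids = ids + [PAD_ID] * (max_length - len(ids))
    return {"input_ids": input_ids, "labels": list(input_ids)}  # labels is an independent copy


def take_percentage(examples, percentage):
    """Step 4: keep the first `percentage` percent of the examples (assumed rule)."""
    count = max(1, int(len(examples) * percentage / 100))
    return examples[:count]


formatted = [
    "### Instruction:\nSay hi\n\n### Response:\nhi",
    "### Instruction:\nCount to three\n\n### Response:\n1 2 3",
    "### Instruction:\nName a color\n\n### Response:\nblue",
    "### Instruction:\nAdd 2 and 2\n\n### Response:\n4",
]
vocab = {}
subset = take_percentage(formatted, 50)
batch = [encode(text, vocab, max_length=8) for text in subset]
for row in batch:
    print(row["input_ids"], row["labels"] == row["input_ids"])
```

Every row has exactly max_length entries, which is what lets rows be stacked into a single tensor later.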

Theoretical Basis

For causal language model (CLM) fine-tuning, the labels are set equal to the input_ids. The model internally shifts the labels by one position so that at each position i, it predicts token i+1 given tokens 0 through i. This is the standard autoregressive training objective:

L = -sum_{i=1}^{N} log P(token_i | token_0, ..., token_{i-1})
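
As a worked illustration of the shift, the sketch below evaluates this sum for a three-token sequence. The probability table is made up for the example, and causal_lm_loss is a hypothetical helper, not a function from the repository. It sums the terms to match the formula; many training frameworks report the mean over positions instead.

```python
import math


def causal_lm_loss(probs, input_ids):
    """Sum of -log P(token_i | token_0..token_{i-1}) for i = 1..N.

    probs[i] is the (toy) predicted distribution over the next token
    after the model has seen tokens 0 through i.
    """
    labels = list(input_ids)  # labels start as a copy of input_ids
    total = 0.0
    # Shift by one: the prediction made at position i is scored against labels[i + 1].
    for i in range(len(labels) - 1):
        total -= math.log(probs[i][labels[i + 1]])
    return total


input_ids = [0, 2, 1]
probs = [
    [0.1, 0.2, 0.7],    # after token_0: P(next = 2) = 0.7
    [0.25, 0.5, 0.25],  # after token_0, token_1: P(next = 1) = 0.5
    [0.3, 0.3, 0.4],    # prediction after the last token has no label, so it is unused
]
loss = causal_lm_loss(probs, input_ids)
print(loss)  # equals -(log 0.7 + log 0.5)
```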

The template structure ensures the model learns to follow instructions by establishing a consistent pattern:

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}

During inference, the model is prompted with the instruction and input sections, and it generates the response section by continuing the pattern it learned during fine-tuning.

Key design decisions:

  • Truncation to max_length -- Caps the sequence length, which bounds memory use per example and guards against out-of-memory errors on long examples. Default is 2048 tokens.
  • Padding to max_length -- Creates uniform-length tensors for efficient batching. The pad_token is set to eos_token if not defined by the tokenizer.
  • Labels = input_ids -- The entire sequence (instruction + input + response) serves as both input and target, so the model also learns to predict the instruction tokens, not just the response. This is simpler than masking the instruction tokens in the labels and is still effective.
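
To make the last trade-off concrete, the sketch below contrasts the two labelling choices on toy token ids. Both helper names are hypothetical, and the masked variant is shown only for comparison; it is not what this page describes. The value -100 is the index that PyTorch's CrossEntropyLoss ignores by default.

```python
IGNORE_INDEX = -100  # PyTorch's CrossEntropyLoss skips targets with this value by default


def full_sequence_labels(input_ids):
    """Approach described here: every token is a training target."""
    return list(input_ids)


def response_only_labels(input_ids, prompt_length):
    """Alternative for comparison: hide the prompt tokens from the loss."""
    return [IGNORE_INDEX] * prompt_length + list(input_ids[prompt_length:])


# Toy ids: the first three stand for the instruction/input part, the last two for the response.
input_ids = [11, 12, 13, 21, 22]
full = full_sequence_labels(input_ids)
masked = response_only_labels(input_ids, prompt_length=3)

print(full)
print(masked)
print(sum(label != IGNORE_INDEX for label in masked), "of", len(full), "positions supervised when masking")
```

The full-sequence variant needs no knowledge of where the prompt ends, which is why it is the simpler of the two.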

Dataset Format

The Alpaca dataset (tatsu-lab/alpaca) contains 52,000 instruction-following examples with the following schema:

Field | Type | Description | Example
instruction | string | Task description | "Give three tips for staying healthy."
input | string (optional) | Additional context | "" (often empty)
output | string | Expected response | "1. Eat a balanced diet..."

Template Constants

The following template constants are defined in finetune_zero3.py:

ALPACA_INSTRUCTION_TEMPLATE = "### Instruction:\n{instruction}\n\n"
ALPACA_INPUT_TEMPLATE = "### Input:\n{input}\n\n"
ALPACA_RESPONSE_TEMPLATE = "### Response:\n{output}"
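
A minimal sketch of how these constants could be combined into one training string, dropping the Input section when the field is empty or absent. The helper name format_alpaca_example and the sample rows are assumptions for illustration, not taken from finetune_zero3.py.

```python
ALPACA_INSTRUCTION_TEMPLATE = "### Instruction:\n{instruction}\n\n"
ALPACA_INPUT_TEMPLATE = "### Input:\n{input}\n\n"
ALPACA_RESPONSE_TEMPLATE = "### Response:\n{output}"


def format_alpaca_example(example):
    """Hypothetical helper: build one training string from a raw example."""
    text = ALPACA_INSTRUCTION_TEMPLATE.format(instruction=example["instruction"])
    # Skip the Input section when the field is missing, None, or blank.
    if (example.get("input") or "").strip():
        text += ALPACA_INPUT_TEMPLATE.format(input=example["input"])
    text += ALPACA_RESPONSE_TEMPLATE.format(output=example["output"])
    return text


with_input = format_alpaca_example(
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
)
without_input = format_alpaca_example(
    {
        "instruction": "Give three tips for staying healthy.",
        "input": "",
        "output": "1. Eat a balanced diet...",
    }
)
print(with_input)
print("---")
print(without_input)
```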

Usage Pattern

  1. Load a HuggingFace dataset by name (e.g., tatsu-lab/alpaca).
  2. Optionally select a percentage subset for benchmarking.
  3. Apply the Alpaca template to each example.
  4. Tokenize with the model's tokenizer using truncation and padding.
  5. Set labels = input_ids.copy() for causal LM training.
  6. Wrap in a PyTorch DataLoader with batch_size=1 and shuffle=True.
