Implementation:Microsoft DeepSpeedExamples Load And Preprocess Dataset

Metadata

Field	Value
Page Type	Implementation
Title	Load_And_Preprocess_Dataset
Repository	Microsoft/DeepSpeedExamples
Type	Direct Function
Code Reference	File: `training/DeepSpeed-SuperOffload/finetune_zero3.py`, Lines 171-205
Import	Direct functions in `finetune_zero3.py`
Related Principle	Principle:Microsoft_DeepSpeedExamples_Instruction_Dataset_Preparation

Overview

Concrete tool for loading and tokenizing Alpaca-format instruction datasets for SuperOffload fine-tuning. It provides two functions: preprocess_alpaca_example, which formats and tokenizes a single example, and load_and_preprocess_dataset, which loads the dataset end to end and builds the DataLoader.

Function: preprocess_alpaca_example

Signature

def preprocess_alpaca_example(
    example: Dict[str, str],
    tokenizer: AutoTokenizer,
    max_length: int = 2048
) -> Dict[str, Any]:

Code Reference: File: training/DeepSpeed-SuperOffload/finetune_zero3.py, Lines 81-103

Description

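Formats a single Alpaca-format example with the instruction/input/response template (the input section is added only when the input field is non-empty), tokenizes it with truncation and padding to max_length, and sets labels to a copy of input_ids for causal LM training.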

Implementation

def preprocess_alpaca_example(
    example: Dict[str, str],
    tokenizer: AutoTokenizer,
    max_length: int = 2048
) -> Dict[str, Any]:
    prompt = ALPACA_INSTRUCTION_TEMPLATE.format(instruction=example['instruction'])

    if example.get("input", "").strip():
        prompt += ALPACA_INPUT_TEMPLATE.format(input=example['input'])

    prompt += ALPACA_RESPONSE_TEMPLATE.format(output=example['output'])

    tokenized = tokenizer(
        prompt,
        truncation=True,
        max_length=max_length,
        padding="max_length",
        return_tensors=None
    )

    tokenized["labels"] = tokenized["input_ids"].copy()

    return tokenized

I/O Contract

Parameter	Type	Description	Default
`example`	`Dict[str, str]`	A dictionary with keys `instruction`, `input` (optional), `output`	(required)
`tokenizer`	`AutoTokenizer`	HuggingFace tokenizer for the target model	(required)
`max_length`	`int`	Maximum sequence length for truncation and padding	2048

Returns: Dict[str, Any] containing:

input_ids -- List of token IDs, length = max_length
attention_mask -- List of 0/1 values indicating real vs. padding tokens
labels -- Copy of input_ids for causal LM loss computation (padding positions are copied as-is, not masked)
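
Because labels is a plain copy of input_ids, padded positions carry the pad token ID as their label. The sketch below shows one common variation, not what finetune_zero3.py does: replacing labels at padded positions with -100, the default ignore_index of PyTorch's cross-entropy loss. The helper name mask_padding_labels and the token IDs are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding_labels(input_ids: List[int], attention_mask: List[int]) -> List[int]:
    """Hypothetical helper: keep labels for real tokens, ignore padded positions."""
    return [
        token_id if keep == 1 else IGNORE_INDEX
        for token_id, keep in zip(input_ids, attention_mask)
    ]


# Illustrative tokenizer output: three real tokens followed by two pad tokens.
tokenized: Dict[str, List[int]] = {
    "input_ids": [101, 7592, 2088, 0, 0],
    "attention_mask": [1, 1, 1, 0, 0],
}
tokenized["labels"] = mask_padding_labels(
    tokenized["input_ids"], tokenized["attention_mask"]
)
print(tokenized["labels"])
```

Masking by attention_mask rather than by pad token ID matters when the pad token is set to the EOS token, as in the usage example on this page: a real EOS token with attention_mask 1 keeps its label.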

Function: load_and_preprocess_dataset

Signature

def load_and_preprocess_dataset(
    dataset_name: str,
    dataset_percentage: float,
    tokenizer: AutoTokenizer,
    max_length: int,
    logger: logging.Logger
) -> Tuple[Any, DataLoader]:

Code Reference: File: training/DeepSpeed-SuperOffload/finetune_zero3.py, Lines 171-205

Description

Loads a HuggingFace dataset by name, optionally keeps only the first dataset_percentage percent of the train split (a prefix, not a random sample), tokenizes each example with preprocess_alpaca_example, and wraps the result in a PyTorch DataLoader.

Implementation

def load_and_preprocess_dataset(
    dataset_name: str,
    dataset_percentage: float,
    tokenizer: AutoTokenizer,
    max_length: int,
    logger: logging.Logger
) -> Tuple[Any, DataLoader]:
    logger.debug(f"Loading dataset: {dataset_name}")

    dataset = load_dataset(dataset_name)
    original_size = len(dataset["train"])

    if dataset_percentage < 100.0:
        subset_size = int(original_size * dataset_percentage / 100.0)
        dataset["train"] = dataset["train"].select(range(subset_size))
        logger.debug(f"Using {dataset_percentage}% of dataset: "
                     f"{subset_size}/{original_size} examples")
    else:
        logger.debug(f"Using full dataset: {original_size} examples")

    logger.debug("Tokenizing dataset...")

    tokenized_dataset = dataset["train"].map(
        lambda x: preprocess_alpaca_example(x, tokenizer, max_length),
        batched=False,
        desc="Tokenizing"
    )

    train_dataloader = DataLoader(
        tokenized_dataset,
        batch_size=1,
        collate_fn=default_data_collator,
        shuffle=True
    )

    return tokenized_dataset, train_dataloader

I/O Contract

Parameter	Type	Description	Default
`dataset_name`	`str`	HuggingFace dataset identifier (e.g., `"tatsu-lab/alpaca"`)	(required)
`dataset_percentage`	`float`	Percentage of the training split to use (1.0-100.0)	(required)
`tokenizer`	`AutoTokenizer`	HuggingFace tokenizer for the target model	(required)
`max_length`	`int`	Maximum sequence length for tokenization	(required)
`logger`	`logging.Logger`	Logger instance for debug output	(required)

Returns: Tuple[Any, DataLoader] containing:

Element 0 -- The tokenized HuggingFace Dataset object
Element 1 -- A PyTorch DataLoader with batch_size=1, shuffle=True, and default_data_collator
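
The subset arithmetic in the implementation can be checked in isolation. The following is a minimal sketch that reproduces only the size computation; the function name subset_size and the example split sizes are hypothetical and used purely for illustration.

```python
def subset_size(original_size: int, dataset_percentage: float) -> int:
    """Hypothetical helper: number of examples kept for a given percentage."""
    if dataset_percentage < 100.0:
        # int() truncates toward zero, so fractional examples are dropped.
        return int(original_size * dataset_percentage / 100.0)
    # 100.0 or more keeps the whole split.
    return original_size


# Illustrative split sizes, not taken from any particular dataset card.
print(subset_size(52002, 10.0))
print(subset_size(7, 50.0))
print(subset_size(7, 100.0))
```

Since the selection is range(subset_size), the kept examples are always the leading ones in the split; shuffling happens only later, inside the DataLoader.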

Template Constants

These constants are defined at the module level in finetune_zero3.py (Lines 55-57):

ALPACA_INSTRUCTION_TEMPLATE = "### Instruction:\n{instruction}\n\n"
ALPACA_INPUT_TEMPLATE = "### Input:\n{input}\n\n"
ALPACA_RESPONSE_TEMPLATE = "### Response:\n{output}"
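
To see what text actually reaches the tokenizer, the sketch below assembles prompts from these three constants using the same steps as preprocess_alpaca_example, without any tokenization. The helper name build_prompt and the sample examples are hypothetical.

```python
from typing import Dict

# Template constants as listed on this page.
ALPACA_INSTRUCTION_TEMPLATE = "### Instruction:\n{instruction}\n\n"
ALPACA_INPUT_TEMPLATE = "### Input:\n{input}\n\n"
ALPACA_RESPONSE_TEMPLATE = "### Response:\n{output}"


def build_prompt(example: Dict[str, str]) -> str:
    """Hypothetical helper: prompt assembly only, no tokenization."""
    prompt = ALPACA_INSTRUCTION_TEMPLATE.format(instruction=example["instruction"])
    # The input section is skipped when the field is missing or blank.
    if example.get("input", "").strip():
        prompt += ALPACA_INPUT_TEMPLATE.format(input=example["input"])
    prompt += ALPACA_RESPONSE_TEMPLATE.format(output=example["output"])
    return prompt


with_input = build_prompt(
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
)
without_input = build_prompt(
    {"instruction": "Name a prime number.", "input": "", "output": "7"}
)
print(with_input)
print("---")
print(without_input)
```

Note that the response template ends directly with the output text, so no end-of-sequence marker is part of the prompt string itself.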

Usage Example

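import logging

from transformers import AutoTokenizer

# Assumes load_and_preprocess_dataset is already in scope,
# e.g. this code runs inside finetune_zero3.py.
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)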

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load and preprocess dataset (10% subset)
tokenized_dataset, train_dataloader = load_and_preprocess_dataset(
    dataset_name="tatsu-lab/alpaca",
    dataset_percentage=10.0,
    tokenizer=tokenizer,
    max_length=4096,
    logger=logger
)

# Iterate over batches
for batch in train_dataloader:
    # batch contains: input_ids, attention_mask, labels
    # Each tensor has shape [1, max_length]
    pass

Invocation in Main Script

In the main() function (Lines 248-250):

tokenized_dataset, train_dataloader = load_and_preprocess_dataset(
    args.dataset_name, args.dataset_percentage, tokenizer, args.max_length, logger
)
