Implementation:Microsoft DeepSpeedExamples Load And Preprocess Dataset

From Leeroopedia


Metadata

Field | Value
Page Type | Implementation
Title | Load_And_Preprocess_Dataset
Repository | Microsoft/DeepSpeedExamples
Type | Direct Function
Code Reference | File: training/DeepSpeed-SuperOffload/finetune_zero3.py, Lines 171-205
Import | Direct functions in finetune_zero3.py
Related Principle | Principle:Microsoft_DeepSpeedExamples_Instruction_Dataset_Preparation

Overview

Concrete tool for loading and tokenizing Alpaca-format instruction datasets for SuperOffload fine-tuning. It provides two functions: preprocess_alpaca_example, which formats one example with the prompt template and tokenizes it, and load_and_preprocess_dataset, which loads the dataset, tokenizes every example, and builds the DataLoader.

Function: preprocess_alpaca_example

Signature

def preprocess_alpaca_example(
    example: Dict[str, str],
    tokenizer: AutoTokenizer,
    max_length: int = 2048
) -> Dict[str, Any]:

Code Reference: File: training/DeepSpeed-SuperOffload/finetune_zero3.py, Lines 81-103

Description

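Formats a single Alpaca-format example using the instruction/input/response template, tokenizes it with truncation and padding to max_length, and sets labels to a copy of input_ids for causal LM training. The input section is added only when the example's input field is non-empty after stripping whitespace. Because labels are a plain copy, this function does not mask padded positions out of the labels.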

Implementation

def preprocess_alpaca_example(
    example: Dict[str, str],
    tokenizer: AutoTokenizer,
    max_length: int = 2048
) -> Dict[str, Any]:
    prompt = ALPACA_INSTRUCTION_TEMPLATE.format(instruction=example['instruction'])

    if example.get("input", "").strip():
        prompt += ALPACA_INPUT_TEMPLATE.format(input=example['input'])

    prompt += ALPACA_RESPONSE_TEMPLATE.format(output=example['output'])

    tokenized = tokenizer(
        prompt,
        truncation=True,
        max_length=max_length,
        padding="max_length",
        return_tensors=None
    )

    tokenized["labels"] = tokenized["input_ids"].copy()

    return tokenized

I/O Contract

Parameter | Type | Description | Default
example | Dict[str, str] | A dictionary with keys instruction, input (optional), and output | (required)
tokenizer | AutoTokenizer | HuggingFace tokenizer for the target model | (required)
max_length | int | Maximum sequence length for truncation and padding | 2048

Returns: Dict[str, Any] containing:

  • input_ids -- List of token IDs, length = max_length
  • attention_mask -- List of 0/1 values indicating real vs. padding tokens
  • labels -- Copy of input_ids for causal LM loss computation
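The return contract above can be illustrated without downloading a model. The following is a minimal sketch that replaces the HuggingFace tokenizer with a toy whitespace tokenizer; toy_tokenize, preprocess, PAD_ID, and right-side padding are assumptions made for illustration and are not part of finetune_zero3.py. A real tokenizer pads according to its own padding_side setting.

```python
from typing import Any, Dict, List

PAD_ID = 0  # hypothetical padding token ID for this toy example


def toy_tokenize(text: str, max_length: int) -> Dict[str, List[int]]:
    """Stand-in for tokenizer(..., truncation=True, padding="max_length").

    Each whitespace-separated word becomes one fake token ID (1, 2, 3, ...).
    Right padding is assumed here; real tokenizers follow padding_side.
    """
    ids = list(range(1, len(text.split()) + 1))[:max_length]  # truncate
    real = len(ids)
    pad = max_length - real
    return {
        "input_ids": ids + [PAD_ID] * pad,         # pad up to max_length
        "attention_mask": [1] * real + [0] * pad,  # 1 = real token, 0 = padding
    }


def preprocess(text: str, max_length: int) -> Dict[str, Any]:
    tokenized = toy_tokenize(text, max_length)
    # Same step as the source: labels are a plain copy of input_ids,
    # so padded positions carry the pad ID in labels as well.
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized


short = preprocess("### Instruction:\nAdd 2 and 3", max_length=8)
long = preprocess("a b c d e f g h i j", max_length=8)

print(short["input_ids"])       # six fake tokens followed by two pad IDs
print(short["attention_mask"])  # ones for real tokens, zeros for padding
print(len(long["input_ids"]))   # truncated to max_length
```

Every output sequence has exactly max_length entries whether the text was shorter (padded) or longer (truncated), which is why batches from the DataLoader have a fixed width.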

Function: load_and_preprocess_dataset

Signature

def load_and_preprocess_dataset(
    dataset_name: str,
    dataset_percentage: float,
    tokenizer: AutoTokenizer,
    max_length: int,
    logger: logging.Logger
) -> Tuple[Any, DataLoader]:

Code Reference: File: training/DeepSpeed-SuperOffload/finetune_zero3.py, Lines 171-205

Description

Loads a HuggingFace dataset by name, optionally selects a percentage subset, tokenizes all examples using preprocess_alpaca_example, and wraps the result in a PyTorch DataLoader.

Implementation

def load_and_preprocess_dataset(
    dataset_name: str,
    dataset_percentage: float,
    tokenizer: AutoTokenizer,
    max_length: int,
    logger: logging.Logger
) -> Tuple[Any, DataLoader]:
    logger.debug(f"Loading dataset: {dataset_name}")

    dataset = load_dataset(dataset_name)
    original_size = len(dataset["train"])

    if dataset_percentage < 100.0:
        subset_size = int(original_size * dataset_percentage / 100.0)
        dataset["train"] = dataset["train"].select(range(subset_size))
        logger.debug(f"Using {dataset_percentage}% of dataset: "
                     f"{subset_size}/{original_size} examples")
    else:
        logger.debug(f"Using full dataset: {original_size} examples")

    logger.debug("Tokenizing dataset...")

    tokenized_dataset = dataset["train"].map(
        lambda x: preprocess_alpaca_example(x, tokenizer, max_length),
        batched=False,
        desc="Tokenizing"
    )

    train_dataloader = DataLoader(
        tokenized_dataset,
        batch_size=1,
        collate_fn=default_data_collator,
        shuffle=True
    )

    return tokenized_dataset, train_dataloader

I/O Contract

Parameter | Type | Description | Default
dataset_name | str | HuggingFace dataset identifier (e.g., "tatsu-lab/alpaca") | (required)
dataset_percentage | float | Percentage of the training split to use (1.0-100.0); values below 100.0 keep the first int(size * percentage / 100) examples, otherwise the full split is used | (required)
tokenizer | AutoTokenizer | HuggingFace tokenizer for the target model | (required)
max_length | int | Maximum sequence length for tokenization | (required)
logger | logging.Logger | Logger instance for debug output | (required)

Returns: Tuple[Any, DataLoader] containing:

  • Element 0 -- The tokenized HuggingFace Dataset object
  • Element 1 -- A PyTorch DataLoader with batch_size=1, shuffle=True, and default_data_collator
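The subset selection is plain arithmetic and can be checked in isolation. The sketch below repeats the calculation from the implementation in a hypothetical helper named subset_size; the dataset size of 52002 is only an illustrative number, not a value taken from the source.

```python
def subset_size(original_size: int, dataset_percentage: float) -> int:
    """Number of leading examples kept, mirroring the source's arithmetic."""
    if dataset_percentage < 100.0:
        # int() truncates toward zero, so fractional examples are dropped
        return int(original_size * dataset_percentage / 100.0)
    # 100.0 or more: the full training split is used
    return original_size


kept = subset_size(52002, 10.0)   # illustrative dataset size
indices = range(kept)             # the source passes this range to select()

print(kept)
print(indices[0], indices[-1])    # first and last selected row index
print(subset_size(50, 1.0))       # small split with a small percentage
```

Two consequences follow from this logic: the subset is always the leading slice of the split rather than a random sample, and a small split combined with a small percentage can truncate to zero examples.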

Template Constants

These constants are defined at the module level in finetune_zero3.py (Lines 55-57):

ALPACA_INSTRUCTION_TEMPLATE = "### Instruction:\n{instruction}\n\n"
ALPACA_INPUT_TEMPLATE = "### Input:\n{input}\n\n"
ALPACA_RESPONSE_TEMPLATE = "### Response:\n{output}"
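To see what these templates produce, the sketch below assembles prompts for two made-up examples, one with an input field and one whose input is only whitespace. The helper name build_prompt and the example records are hypothetical; the assembly steps repeat those shown in preprocess_alpaca_example.

```python
from typing import Dict

ALPACA_INSTRUCTION_TEMPLATE = "### Instruction:\n{instruction}\n\n"
ALPACA_INPUT_TEMPLATE = "### Input:\n{input}\n\n"
ALPACA_RESPONSE_TEMPLATE = "### Response:\n{output}"


def build_prompt(example: Dict[str, str]) -> str:
    """Hypothetical helper that repeats the prompt-assembly steps."""
    prompt = ALPACA_INSTRUCTION_TEMPLATE.format(instruction=example["instruction"])
    # The input section is skipped when the field is missing, empty, or whitespace
    if example.get("input", "").strip():
        prompt += ALPACA_INPUT_TEMPLATE.format(input=example["input"])
    prompt += ALPACA_RESPONSE_TEMPLATE.format(output=example["output"])
    return prompt


with_input = build_prompt(
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
)
no_input = build_prompt(
    {"instruction": "Name a prime number.", "input": "   ", "output": "7"}
)

print(with_input)
print("---")
print(no_input)
```

The response text is appended directly after the "### Response:" line, so the tokenized sequence contains both the prompt and the target output.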

Usage Example

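import logging

from transformers import AutoTokenizer

# Assumption: finetune_zero3.py is importable from the working directory
from finetune_zero3 import load_and_preprocess_dataset

# The function logs at DEBUG level, so enable it to see progress messages
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
# padding="max_length" needs a pad token; fall back to EOS if none is defined
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load and preprocess dataset (10% subset)
tokenized_dataset, train_dataloader = load_and_preprocess_dataset(
    dataset_name="tatsu-lab/alpaca",
    dataset_percentage=10.0,
    tokenizer=tokenizer,
    max_length=4096,
    logger=logger
)

# Iterate over batches
for batch in train_dataloader:
    # batch contains: input_ids, attention_mask, labels
    # Each tensor has shape [1, max_length]
    pass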

Invocation in Main Script

In the main() function (Lines 248-250):

tokenized_dataset, train_dataloader = load_and_preprocess_dataset(
    args.dataset_name, args.dataset_percentage, tokenizer, args.max_length, logger
)
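The call above reads three attributes from args. A minimal sketch of an argument parser that would supply them follows; only the attribute names dataset_name, dataset_percentage, and max_length come from the call shown above, while the flag spellings, types, and default values are assumptions and may differ from the actual script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser; flags and defaults are assumed, not from the source."""
    parser = argparse.ArgumentParser(description="Illustrative argument subset")
    parser.add_argument("--dataset_name", type=str, default="tatsu-lab/alpaca")
    parser.add_argument("--dataset_percentage", type=float, default=100.0)
    parser.add_argument("--max_length", type=int, default=2048)
    return parser


# Parse an explicit list instead of sys.argv so the example runs anywhere
args = build_parser().parse_args(["--dataset_percentage", "10", "--max_length", "4096"])

print(args.dataset_name, args.dataset_percentage, args.max_length)
```

Declaring dataset_percentage as a float keeps the comparison against 100.0 in load_and_preprocess_dataset well defined even when the value is passed as an integer on the command line.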
