Implementation:Microsoft DeepSpeedExamples Load And Preprocess Dataset
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation |
| Title | Load_And_Preprocess_Dataset |
| Repository | Microsoft/DeepSpeedExamples |
| Type | Direct Function |
| Code Reference | File: training/DeepSpeed-SuperOffload/finetune_zero3.py, Lines 171-205 |
| Import | Direct functions in finetune_zero3.py |
| Related Principle | Principle:Microsoft_DeepSpeedExamples_Instruction_Dataset_Preparation |
Overview
Concrete tool for loading and tokenizing Alpaca-format instruction datasets for SuperOffload fine-tuning. Provides two functions: preprocess_alpaca_example for per-example template formatting and tokenization, and load_and_preprocess_dataset for end-to-end dataset loading and DataLoader creation.
Function: preprocess_alpaca_example
Signature
def preprocess_alpaca_example(
    example: Dict[str, str],
    tokenizer: AutoTokenizer,
    max_length: int = 2048
) -> Dict[str, Any]:
Code Reference: File: training/DeepSpeed-SuperOffload/finetune_zero3.py, Lines 81-103
Description
Formats a single Alpaca-format example using the instruction/input/response template, tokenizes it, and sets labels equal to input_ids for causal LM training.
Implementation
def preprocess_alpaca_example(
    example: Dict[str, str],
    tokenizer: AutoTokenizer,
    max_length: int = 2048
) -> Dict[str, Any]:
    prompt = ALPACA_INSTRUCTION_TEMPLATE.format(instruction=example['instruction'])
    if example.get("input", "").strip():
        prompt += ALPACA_INPUT_TEMPLATE.format(input=example['input'])
    prompt += ALPACA_RESPONSE_TEMPLATE.format(output=example['output'])
    tokenized = tokenizer(
        prompt,
        truncation=True,
        max_length=max_length,
        padding="max_length",
        return_tensors=None
    )
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized
I/O Contract
| Parameter | Type | Description | Default |
|---|---|---|---|
| example | Dict[str, str] | A dictionary with keys instruction, input (optional), output | (required) |
| tokenizer | AutoTokenizer | HuggingFace tokenizer for the target model | (required) |
| max_length | int | Maximum sequence length for truncation and padding | 2048 |
Returns: Dict[str, Any] containing:
- input_ids -- List of token IDs, length = max_length
- attention_mask -- List of 0/1 values indicating real vs. padding tokens
- labels -- Copy of input_ids for causal LM loss computation
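The return contract can be illustrated without downloading a model. The following is a minimal sketch, not the script's code: `toy_tokenize`, `PAD_ID`, and `mask_padding_labels` are hypothetical names, the word-length "token IDs" are made up, and right-side padding is an assumption (a real tokenizer pads on whichever side it is configured for). Because labels are a plain copy of input_ids, padding positions are copied too; the second helper shows one common variation, assumed here and not part of finetune_zero3.py, that replaces those positions with -100 so they are ignored by the loss.

```python
from typing import Dict, List

PAD_ID = 0  # hypothetical pad token ID for this toy example


def toy_tokenize(text: str, max_length: int) -> Dict[str, List[int]]:
    """Toy stand-in for a tokenizer call with truncation=True, padding="max_length"."""
    # Map each whitespace-separated word to a made-up ID (its length), then truncate.
    ids = [len(word) for word in text.split()][:max_length]
    n_real = len(ids)
    n_pad = max_length - n_real
    return {
        "input_ids": ids + [PAD_ID] * n_pad,           # always max_length long
        "attention_mask": [1] * n_real + [0] * n_pad,  # 1 = real token, 0 = padding
    }


def mask_padding_labels(input_ids: List[int], attention_mask: List[int],
                        ignore_index: int = -100) -> List[int]:
    """Optional variation (assumption): ignore padding positions in the loss."""
    return [tok if m == 1 else ignore_index
            for tok, m in zip(input_ids, attention_mask)]


encoded = toy_tokenize("### Response: hello world", max_length=6)
encoded["labels"] = encoded["input_ids"].copy()  # same step as in the function above
masked = mask_padding_labels(encoded["input_ids"], encoded["attention_mask"])

print(encoded)
print(masked)
```

Masking by attention_mask instead of by token ID matters when the pad token is set to the EOS token, since a genuine EOS would otherwise be masked as well.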
Function: load_and_preprocess_dataset
Signature
def load_and_preprocess_dataset(
    dataset_name: str,
    dataset_percentage: float,
    tokenizer: AutoTokenizer,
    max_length: int,
    logger: logging.Logger
) -> Tuple[Any, DataLoader]:
Code Reference: File: training/DeepSpeed-SuperOffload/finetune_zero3.py, Lines 171-205
Description
Loads a HuggingFace dataset by name, optionally selects a percentage subset, tokenizes all examples using preprocess_alpaca_example, and wraps the result in a PyTorch DataLoader.
Implementation
def load_and_preprocess_dataset(
    dataset_name: str,
    dataset_percentage: float,
    tokenizer: AutoTokenizer,
    max_length: int,
    logger: logging.Logger
) -> Tuple[Any, DataLoader]:
    logger.debug(f"Loading dataset: {dataset_name}")
    dataset = load_dataset(dataset_name)
    original_size = len(dataset["train"])
    if dataset_percentage < 100.0:
        subset_size = int(original_size * dataset_percentage / 100.0)
        dataset["train"] = dataset["train"].select(range(subset_size))
        logger.debug(f"Using {dataset_percentage}% of dataset: "
                     f"{subset_size}/{original_size} examples")
    else:
        logger.debug(f"Using full dataset: {original_size} examples")
    logger.debug("Tokenizing dataset...")
    tokenized_dataset = dataset["train"].map(
        lambda x: preprocess_alpaca_example(x, tokenizer, max_length),
        batched=False,
        desc="Tokenizing"
    )
    train_dataloader = DataLoader(
        tokenized_dataset,
        batch_size=1,
        collate_fn=default_data_collator,
        shuffle=True
    )
    return tokenized_dataset, train_dataloader
I/O Contract
| Parameter | Type | Description | Default |
|---|---|---|---|
| dataset_name | str | HuggingFace dataset identifier (e.g., "tatsu-lab/alpaca") | (required) |
| dataset_percentage | float | Percentage of the training split to use (1.0-100.0) | (required) |
| tokenizer | AutoTokenizer | HuggingFace tokenizer for the target model | (required) |
| max_length | int | Maximum sequence length for tokenization | (required) |
| logger | logging.Logger | Logger instance for debug output | (required) |
Returns: Tuple[Any, DataLoader] containing:
- Element 0 -- The tokenized HuggingFace Dataset object
- Element 1 -- A PyTorch DataLoader with batch_size=1, shuffle=True, and default_data_collator
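The subset selection is plain integer arithmetic followed by a prefix slice, so it can be checked in isolation. The sketch below mirrors that logic on an ordinary Python list; `compute_subset_size` is a hypothetical helper name and the 1000-element list is a stand-in for the training split, not real data.

```python
def compute_subset_size(original_size: int, dataset_percentage: float) -> int:
    """Mirror the size computation in load_and_preprocess_dataset."""
    if dataset_percentage < 100.0:
        # int() truncates toward zero, so fractional examples are dropped.
        return int(original_size * dataset_percentage / 100.0)
    return original_size


examples = list(range(1000))   # stand-in for dataset["train"]
n = compute_subset_size(len(examples), 12.5)
subset = examples[:n]          # same effect as .select(range(n)): the first n rows

print(n, subset[0], subset[-1])
print(compute_subset_size(7, 50.0))
print(compute_subset_size(10, 5.0))
```

Two consequences follow from this logic: the subset is always the leading portion of the split and not a random sample, and a small percentage on a small dataset can truncate to zero examples.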
Template Constants
These constants are defined at the module level in finetune_zero3.py (Lines 55-57):
ALPACA_INSTRUCTION_TEMPLATE = "### Instruction:\n{instruction}\n\n"
ALPACA_INPUT_TEMPLATE = "### Input:\n{input}\n\n"
ALPACA_RESPONSE_TEMPLATE = "### Response:\n{output}"
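To make the resulting prompt text concrete, the sketch below assembles it from these constants using the same branching as preprocess_alpaca_example, leaving out tokenization. `build_prompt` is a hypothetical helper introduced only for illustration, and the two example records are invented.

```python
# Template strings copied from the constants listed above.
ALPACA_INSTRUCTION_TEMPLATE = "### Instruction:\n{instruction}\n\n"
ALPACA_INPUT_TEMPLATE = "### Input:\n{input}\n\n"
ALPACA_RESPONSE_TEMPLATE = "### Response:\n{output}"


def build_prompt(example: dict) -> str:
    """Hypothetical helper: build the prompt string only, without tokenizing."""
    prompt = ALPACA_INSTRUCTION_TEMPLATE.format(instruction=example["instruction"])
    # The input section is added only when the field is present and non-blank.
    if example.get("input", "").strip():
        prompt += ALPACA_INPUT_TEMPLATE.format(input=example["input"])
    prompt += ALPACA_RESPONSE_TEMPLATE.format(output=example["output"])
    return prompt


with_input = build_prompt(
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
)
without_input = build_prompt(
    {"instruction": "Name a prime number.", "input": "", "output": "7"}
)

print(with_input)
print("---")
print(without_input)
```

An empty or whitespace-only input field yields a two-section prompt (instruction and response), so both shapes of Alpaca records pass through the same code path.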
Usage Example
import logging

from transformers import AutoTokenizer

# Assumption: finetune_zero3.py is importable from the working directory
from finetune_zero3 import load_and_preprocess_dataset

logger = logging.getLogger(__name__)

# Load tokenizer; fall back to EOS as the pad token if none is defined
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load and preprocess dataset (10% subset)
tokenized_dataset, train_dataloader = load_and_preprocess_dataset(
    dataset_name="tatsu-lab/alpaca",
    dataset_percentage=10.0,
    tokenizer=tokenizer,
    max_length=4096,
    logger=logger
)

# Iterate over batches
for batch in train_dataloader:
    # batch contains: input_ids, attention_mask, labels
    # Each tensor has shape [1, max_length]
    pass
Invocation in Main Script
In the main() function (Lines 248-250):
tokenized_dataset, train_dataloader = load_and_preprocess_dataset(
    args.dataset_name, args.dataset_percentage, tokenizer, args.max_length, logger
)