Principle: BigScience Workshop Petals Data Preparation
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Engineering, Training |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Load a text dataset, tokenize it with the model's tokenizer, and create batched data loaders for efficient training or evaluation.
Description
Data Preparation transforms raw text datasets into tokenized tensor batches suitable for training or evaluating distributed Petals models. The pipeline consists of:
- Dataset loading: Using HuggingFace datasets.load_dataset() to download and cache datasets
- Tokenization: Applying the model's tokenizer to convert text to input_ids and attention_mask tensors
- Batching: Creating PyTorch DataLoader instances for efficient batched iteration
Key considerations for Petals:
- Max length: Must match the model's effective context window (accounting for prompt tuning prefix tokens)
- Padding: "max_length" padding ensures uniform tensor shapes for batching
- Label preparation: For classification, labels are integer class indices; for causal LM (chatbot), labels are shifted input_ids with padding tokens masked to -100
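The max-length consideration above can be sketched as a simple budget: text tokens must fit in whatever remains of the context window after the prompt-tuning prefix. The constants below (`MODEL_MAX_LENGTH`, `NUM_PREFIX_TOKENS`) are illustrative assumptions, not Petals-defined values.

```python
# Hedged sketch: effective max_length when prompt tuning prepends
# trainable prefix tokens to every sequence.
MODEL_MAX_LENGTH = 2048   # model's full context window (assumed value)
NUM_PREFIX_TOKENS = 16    # trainable soft-prompt prefix length (assumed value)

# Real text tokens must fit in what remains after the prefix.
effective_max_length = MODEL_MAX_LENGTH - NUM_PREFIX_TOKENS
print(effective_max_length)
```

Passing `max_length=effective_max_length` to the tokenizer then keeps prefix plus text within the model's context window.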
Usage
Use this principle when preparing training or evaluation data for any Petals fine-tuning workflow. The specific dataset and tokenization settings depend on the task (classification, dialogue, etc.).
Theoretical Basis
Tokenization pipeline:
```python
# Abstract data preparation pipeline
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

# Load the tokenizer matching the Petals model (example model shown)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

# Download and cache the dataset
dataset = load_dataset("glue", "sst2", split="train")

# Convert text to input_ids and attention_mask with uniform shapes
def tokenize(examples):
    return tokenizer(
        examples["sentence"],
        padding="max_length",
        max_length=128,
        truncation=True,
    )

tokenized = dataset.map(tokenize, batched=True)
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])
dataloader = DataLoader(tokenized, batch_size=32, shuffle=True)
```
For causal LM (dialogue):
- Input: concatenated dialogue turns
- Labels: same as input_ids but with padding positions set to -100 (ignored in loss)
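The label-masking step above can be sketched directly in PyTorch: labels mirror `input_ids`, but positions where the attention mask is zero are set to -100 so the cross-entropy loss ignores them. The pad token id of 0 is an assumption for illustration; use the tokenizer's actual `pad_token_id`.

```python
import torch

# Toy batch: one sequence of three real tokens followed by two pad tokens
input_ids = torch.tensor([[5, 6, 7, 0, 0]])        # 0 = pad token (assumed)
attention_mask = torch.tensor([[1, 1, 1, 0, 0]])

# Labels start as a copy of the inputs...
labels = input_ids.clone()
# ...then padding positions are masked to -100, the ignore_index of
# torch.nn.CrossEntropyLoss, so they contribute nothing to the loss.
labels[attention_mask == 0] = -100
print(labels.tolist())  # [[5, 6, 7, -100, -100]]
```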