Principle: BigScience Workshop Petals Data Preparation
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Engineering, Training |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Load a text dataset, tokenize it with the model's tokenizer, and create batched data loaders for efficient training or evaluation.
Description
Data Preparation transforms raw text datasets into tokenized tensor batches suitable for training or evaluating distributed Petals models. The pipeline consists of:
- Dataset loading: Using HuggingFace datasets.load_dataset() to download and cache datasets
- Tokenization: Applying the model's tokenizer to convert text to input_ids and attention_mask tensors
- Batching: Creating PyTorch DataLoader instances for efficient batched iteration
Key considerations for Petals:
- Max length: Must match the model's effective context window (accounting for prompt tuning prefix tokens)
- Padding: "max_length" padding ensures uniform tensor shapes for batching
- Label preparation: For classification, labels are integer class indices; for causal LM (chatbot), labels are shifted input_ids with padding tokens masked to -100
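The max-length consideration above can be sketched as a simple budget: text tokens must fit in whatever remains of the context window after the prompt-tuning prefix. The constants below (`MODEL_MAX_LENGTH`, `NUM_PREFIX_TOKENS`) are illustrative assumptions, not Petals-defined values.

```python
# Hedged sketch: effective max_length when prompt tuning prepends
# trainable prefix tokens to every sequence.
MODEL_MAX_LENGTH = 2048   # model's full context window (assumed value)
NUM_PREFIX_TOKENS = 16    # trainable soft-prompt prefix length (assumed value)

# Real text tokens must fit in what remains after the prefix.
effective_max_length = MODEL_MAX_LENGTH - NUM_PREFIX_TOKENS
print(effective_max_length)
```

Passing `max_length=effective_max_length` to the tokenizer then keeps prefix plus text within the model's context window.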
Usage
Use this principle when preparing training or evaluation data for any Petals fine-tuning workflow. The specific dataset and tokenization settings depend on the task (classification, dialogue, etc.).
Theoretical Basis
Tokenization pipeline:
```python
# Abstract data preparation pipeline
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

# Load the tokenizer matching the Petals model (example model shown)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

# Download and cache the dataset
dataset = load_dataset("glue", "sst2", split="train")

# Convert text to input_ids and attention_mask with uniform shapes
def tokenize(examples):
    return tokenizer(
        examples["sentence"],
        padding="max_length",
        max_length=128,
        truncation=True,
    )

tokenized = dataset.map(tokenize, batched=True)
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])
dataloader = DataLoader(tokenized, batch_size=32, shuffle=True)
```
For causal LM (dialogue):
- Input: concatenated dialogue turns
- Labels: same as input_ids but with padding positions set to -100 (ignored in loss)
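The label-masking step above can be sketched directly in PyTorch: labels mirror `input_ids`, but positions where the attention mask is zero are set to -100 so the cross-entropy loss ignores them. The pad token id of 0 is an assumption for illustration; use the tokenizer's actual `pad_token_id`.

```python
import torch

# Toy batch: one sequence of three real tokens followed by two pad tokens
input_ids = torch.tensor([[5, 6, 7, 0, 0]])        # 0 = pad token (assumed)
attention_mask = torch.tensor([[1, 1, 1, 0, 0]])

# Labels start as a copy of the inputs...
labels = input_ids.clone()
# ...then padding positions are masked to -100, the ignore_index of
# torch.nn.CrossEntropyLoss, so they contribute nothing to the loss.
labels[attention_mask == 0] = -100
print(labels.tolist())  # [[5, 6, 7, -100, -100]]
```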