
Implementation:Bigscience workshop Petals Dataset Loading Pipeline

From Leeroopedia


Knowledge Sources
Domains NLP, Data_Engineering, Training
Last Updated 2026-02-09 14:00 GMT

Overview

Concrete tool for loading, tokenizing, and batching text datasets using HuggingFace Datasets, Transformers tokenizers, and PyTorch DataLoader, as used in Petals training workflows.

Description

The data preparation pipeline combines three external libraries:

  • datasets.load_dataset(): Downloads datasets from HuggingFace Hub with automatic caching
  • tokenizer(): Converts text to token IDs with padding and truncation
  • DataLoader: Creates efficient batched iterators with shuffling and collation
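The three stages compose in a fixed order: load, tokenize, batch. As a rough sketch in pure Python (stand-ins only: an in-memory list plays the dataset, a whitespace split plays the tokenizer, and a generator plays the DataLoader), the flow looks like this:

```python
# Minimal sketch of the load -> tokenize -> batch pipeline.
# All names here are illustrative stand-ins, not the real library APIs.

def load_dataset_stub():
    # Stage 1: "download" a dataset (here: an in-memory list of examples)
    return [{"sentence": "a great movie", "label": 1},
            {"sentence": "dull and slow", "label": 0},
            {"sentence": "simply wonderful", "label": 1}]

def tokenize(text, vocab, max_length=4):
    # Stage 2: map tokens to ids, then truncate and pad to max_length
    ids = [vocab.setdefault(tok, len(vocab) + 1) for tok in text.split()]
    return ids[:max_length] + [0] * (max_length - len(ids))

def batches(examples, batch_size):
    # Stage 3: yield fixed-size batches, like a DataLoader
    for i in range(0, len(examples), batch_size):
        yield examples[i:i + batch_size]

vocab = {}
dataset = [{"input_ids": tokenize(ex["sentence"], vocab), "labels": ex["label"]}
           for ex in load_dataset_stub()]
for batch in batches(dataset, batch_size=2):
    print([ex["input_ids"] for ex in batch])
```

The real libraries add caching, fast batched tokenization, and multi-worker loading on top of this same shape.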

In Petals examples (prompt-tuning-sst2.ipynb and prompt-tuning-personachat.ipynb), this pipeline is used to prepare data for prompt tuning training with distributed models.

Usage

Use these external APIs together when setting up data for any Petals training workflow. The dataset and tokenization parameters vary by task.

Code Reference

Source Location

  • Repository: External (datasets, transformers, torch)
  • APIs: datasets.load_dataset, transformers.PreTrainedTokenizer.__call__, torch.utils.data.DataLoader

Signature

# datasets
def load_dataset(
    path: str,
    name: Optional[str] = None,
    split: Optional[str] = None,
    **kwargs,
) -> Dataset:
    """Load a dataset from HuggingFace Hub."""

# transformers tokenizer
class PreTrainedTokenizer:
    def __call__(
        self,
        text: Union[str, List[str]],
        padding: Union[bool, str] = False,
        max_length: Optional[int] = None,
        truncation: bool = False,
        return_tensors: Optional[str] = None,
        **kwargs,
    ) -> BatchEncoding:
        """Tokenize text input(s)."""

# torch DataLoader
class DataLoader:
    def __init__(
        self,
        dataset: Dataset,
        batch_size: int = 1,
        shuffle: bool = False,
        collate_fn: Optional[Callable] = None,
        **kwargs,
    ):
        """Create batched data iterator."""

Import

from datasets import load_dataset
from transformers import AutoTokenizer
from torch.utils.data import DataLoader

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| path | str | Yes | Dataset name (e.g. "glue", "bavard/personachat_truecased") |
| name | str | No | Task subset name (e.g. "sst2") |
| tokenizer | PreTrainedTokenizer | Yes | Tokenizer from the same model as the distributed model |
| max_length | int | Yes | Maximum token sequence length |
| batch_size | int | Yes | Training batch size |

Outputs

| Name | Type | Description |
|------|------|-------------|
| dataloader | DataLoader | Yields batches with input_ids, attention_mask, and labels tensors |
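Each yielded batch is a dict whose tensors have a leading batch dimension. A minimal sanity-check sketch on toy data (torch only; the 128-token sequence length is an illustrative assumption matching the SST-2 example on this page):

```python
import torch
from torch.utils.data import DataLoader

# Toy tokenized dataset: dicts of fixed-length tensors, the shape a
# Dataset with set_format("torch") would yield after padding to 128.
toy = [{"input_ids": torch.zeros(128, dtype=torch.long),
        "attention_mask": torch.ones(128, dtype=torch.long),
        "labels": torch.tensor(1)} for _ in range(4)]

loader = DataLoader(toy, batch_size=2, shuffle=True)
for batch in loader:
    # Default collation stacks per-example tensors into [batch, seq_len]
    assert batch["input_ids"].shape == (2, 128)
    assert batch["attention_mask"].shape == (2, 128)
    assert batch["labels"].shape == (2,)
```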

Usage Examples

SST-2 Classification Data

from datasets import load_dataset
from transformers import AutoTokenizer
from torch.utils.data import DataLoader

model_name = "bigscience/bloom-560m"  # illustrative; use the model served by your Petals swarm
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load and tokenize SST-2
dataset = load_dataset("glue", "sst2", split="train")

def tokenize_fn(examples):
    return tokenizer(
        examples["sentence"],
        padding="max_length",
        max_length=128,
        truncation=True,
    )

tokenized = dataset.map(tokenize_fn, batched=True)
tokenized = tokenized.rename_column("label", "labels")
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

train_loader = DataLoader(tokenized, batch_size=32, shuffle=True)

PersonaChat Dialogue Data

from datasets import load_dataset
from transformers import AutoTokenizer

# Tokenizer from the same model as the distributed model (see above)
tokenizer = AutoTokenizer.from_pretrained(model_name)

dataset = load_dataset("bavard/personachat_truecased", split="train")

def prepare_dialogue(examples):
    # Concatenate persona and dialogue turns
    texts = []
    for persona, history, candidates in zip(
        examples["personality"], examples["history"], examples["candidates"]
    ):
        context = " ".join(persona) + " " + " ".join(history)
        response = candidates[-1]  # Last candidate is the gold response
        texts.append(context + " " + response)

    encodings = tokenizer(texts, padding="max_length", max_length=256, truncation=True)
    encodings["labels"] = encodings["input_ids"].copy()
    return encodings

tokenized = dataset.map(prepare_dialogue, batched=True)
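With `batched=True`, `Dataset.map` passes `prepare_dialogue` a dict of columns, where each key maps to a list of per-example values. A sketch of the concatenation step above on toy data (hypothetical dialogue values, tokenizer omitted):

```python
# Toy batched-examples dict, shaped like what Dataset.map(batched=True)
# passes to prepare_dialogue: each key maps to a list of column values.
examples = {
    "personality": [["i love dogs.", "i work at a bakery."]],
    "history": [["hi there!", "hello, how are you?"]],
    "candidates": [["wrong answer", "i am great, thanks!"]],
}

texts = []
for persona, history, candidates in zip(
    examples["personality"], examples["history"], examples["candidates"]
):
    context = " ".join(persona) + " " + " ".join(history)
    response = candidates[-1]  # last candidate is the gold response
    texts.append(context + " " + response)

print(texts[0])
# i love dogs. i work at a bakery. hi there! hello, how are you? i am great, thanks!
```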

Related Pages

Implements Principle

Requires Environment
