Implementation: BigScience Workshop Petals Dataset Loading Pipeline
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Engineering, Training |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Concrete tool for loading, tokenizing, and batching text datasets using HuggingFace Datasets, Transformers tokenizers, and PyTorch DataLoader, as used in Petals training workflows.
Description
The data preparation pipeline combines three external libraries:
- datasets.load_dataset(): Downloads datasets from HuggingFace Hub with automatic caching
- tokenizer(): Converts text to token IDs with padding and truncation
- DataLoader: Creates efficient batched iterators with shuffling and collation
In Petals examples (prompt-tuning-sst2.ipynb and prompt-tuning-personachat.ipynb), this pipeline is used to prepare data for prompt tuning training with distributed models.
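The tokenizer's padding and truncation behavior is the core of step two. As a rough illustration (a simplified pure-Python model of what `padding="max_length"` and `truncation=True` do; the token IDs and `pad_id` are hypothetical, and real tokenizers handle many more options):

```python
# Simplified sketch of padding="max_length" + truncation=True semantics.
# Real tokenizers also emit an attention_mask marking non-pad positions.
def pad_and_truncate(token_ids, max_length, pad_id=0):
    token_ids = token_ids[:max_length]                # truncation=True
    attention_mask = [1] * len(token_ids)
    n_pad = max_length - len(token_ids)
    return {
        "input_ids": token_ids + [pad_id] * n_pad,    # padding="max_length"
        "attention_mask": attention_mask + [0] * n_pad,
    }

short = pad_and_truncate([5, 17, 42], max_length=5)       # padded up
long = pad_and_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)  # truncated down
```

Every sequence leaves this step with exactly `max_length` tokens, which is what lets the DataLoader stack them into rectangular batch tensors later.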
Usage
Use these external APIs together when setting up data for any Petals training workflow. The dataset and tokenization parameters vary by task.
Code Reference
Source Location
- Repository: External (datasets, transformers, torch)
- APIs: datasets.load_dataset, transformers.PreTrainedTokenizer.__call__, torch.utils.data.DataLoader
Signature
# datasets
def load_dataset(
    path: str,
    name: Optional[str] = None,
    split: Optional[str] = None,
    **kwargs,
) -> Dataset:
    """Load a dataset from the HuggingFace Hub."""

# transformers tokenizer
class PreTrainedTokenizer:
    def __call__(
        self,
        text: Union[str, List[str]],
        padding: Union[bool, str] = False,
        max_length: Optional[int] = None,
        truncation: bool = False,
        return_tensors: Optional[str] = None,
        **kwargs,
    ) -> BatchEncoding:
        """Tokenize text input(s)."""

# torch DataLoader
class DataLoader:
    def __init__(
        self,
        dataset: Dataset,
        batch_size: int = 1,
        shuffle: bool = False,
        collate_fn: Optional[Callable] = None,
        **kwargs,
    ):
        """Create a batched data iterator."""
Import
from datasets import load_dataset
from transformers import AutoTokenizer
from torch.utils.data import DataLoader
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | Yes | Dataset name (e.g. "glue", "bavard/personachat_truecased") |
| name | str | No | Task subset name (e.g. "sst2") |
| tokenizer | PreTrainedTokenizer | Yes | Tokenizer matching the distributed model (same vocabulary) |
| max_length | int | Yes | Maximum token sequence length |
| batch_size | int | Yes | Training batch size |
Outputs
| Name | Type | Description |
|---|---|---|
| dataloader | DataLoader | Yields batches with input_ids, attention_mask, and labels tensors |
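Conceptually, the DataLoader's default collation turns a list of per-example dicts into a single dict of batched values. A pure-Python sketch (plain lists standing in for the stacked torch tensors the real default collate produces):

```python
def collate(batch):
    # Mimics the shape of torch's default collation for dict-style examples:
    # a list of dicts becomes one dict whose values are batched together
    # (lists here; stacked tensors in the real DataLoader).
    return {key: [example[key] for example in batch] for key in batch[0]}

batch = collate([
    {"input_ids": [5, 17, 0], "attention_mask": [1, 1, 0], "labels": 1},
    {"input_ids": [9, 2, 4], "attention_mask": [1, 1, 1], "labels": 0},
])
```

This is why fixed-length tokenization matters: every `input_ids` list must have the same length for the batch to stack into one rectangular tensor.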
Usage Examples
SST-2 Classification Data
from datasets import load_dataset
from transformers import AutoTokenizer
from torch.utils.data import DataLoader

# Use the tokenizer of the model your Petals swarm serves
# (checkpoint name below is illustrative).
model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load and tokenize SST-2
dataset = load_dataset("glue", "sst2", split="train")

def tokenize_fn(examples):
    return tokenizer(
        examples["sentence"],
        padding="max_length",
        max_length=128,
        truncation=True,
    )

tokenized = dataset.map(tokenize_fn, batched=True)
tokenized = tokenized.rename_column("label", "labels")
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
train_loader = DataLoader(tokenized, batch_size=32, shuffle=True)
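Under the hood, `DataLoader(..., batch_size=32, shuffle=True)` amounts to shuffling the example indices once per epoch and slicing them into fixed-size chunks (the last chunk may be smaller). A minimal sketch of that sampling logic:

```python
import math
import random

def batched_indices(n_examples, batch_size, shuffle=True, seed=0):
    # One epoch of batch index lists, in the order a shuffling
    # DataLoader would draw them (seed fixed here for reproducibility).
    indices = list(range(n_examples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, n_examples, batch_size)]

batches = batched_indices(100, 32)
```

Every example appears exactly once per epoch, so an epoch over 100 examples with `batch_size=32` yields three full batches and one partial batch of 4.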
PersonaChat Dialogue Data
from datasets import load_dataset
from transformers import AutoTokenizer

# Same tokenizer as the distributed model (checkpoint name is illustrative)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

dataset = load_dataset("bavard/personachat_truecased", split="train")

def prepare_dialogue(examples):
    # Concatenate persona and dialogue turns
    texts = []
    for persona, history, candidates in zip(
        examples["personality"], examples["history"], examples["candidates"]
    ):
        context = " ".join(persona) + " " + " ".join(history)
        response = candidates[-1]  # Last candidate is the gold response
        texts.append(context + " " + response)
    encodings = tokenizer(texts, padding="max_length", max_length=256, truncation=True)
    encodings["labels"] = encodings["input_ids"].copy()
    return encodings

tokenized = dataset.map(prepare_dialogue, batched=True)
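Note that `labels = input_ids.copy()` trains the model to predict every position, including padding. A common refinement (not shown in the Petals notebooks, so treat it as an optional assumption) is to mask padded positions with -100, the index PyTorch's cross-entropy loss ignores by default:

```python
def mask_pad_labels(input_ids, attention_mask, ignore_index=-100):
    # Copy input_ids into labels, but replace padded positions
    # (attention_mask == 0) with ignore_index so the language-modeling
    # loss skips them.
    return [
        token if keep == 1 else ignore_index
        for token, keep in zip(input_ids, attention_mask)
    ]

labels = mask_pad_labels([12, 7, 99, 0, 0], [1, 1, 1, 0, 0])
```

Without this masking, the model also learns to predict pad tokens, which wastes loss signal on positions that carry no content.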
Related Pages
Implements Principle
Requires Environment