Implementation:Huggingface Transformers Create Packed Sequences

From Leeroopedia
Knowledge Sources
Domains: Data_Processing, Training
Last Updated: 2026-02-13 00:00 GMT

Overview

A concrete pattern for packing variable-length tokenized sequences into fixed-length training blocks, as used in the Hugging Face Transformers 3D parallel training example.

Description

This pattern implements the full data preparation pipeline for packed-sequence training in a distributed setting. It consists of three stages:

Stage 1 -- Tokenization: The raw text dataset is tokenized using the model's tokenizer with truncation to seq_len but without padding. Labels are set equal to input_ids for causal language modeling.

Stage 2 -- Packing: The create_packed_sequences function flattens all tokenized sequences in a batch into a single token stream, then slices it into non-overlapping blocks of seq_len + 1 tokens. For each block, the first seq_len tokens become input_ids and the last seq_len tokens (the same block shifted by one position) become labels. Tokens left over at the end of a batch that do not fill a complete block are dropped. The function is applied via dataset.map() with batched processing and multiprocessing for efficiency.
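As a small illustration of the packing stage, the sketch below runs the same flatten-and-slice logic on toy data with seq_len = 4. The helper name pack_tokens and the toy token values are assumptions made for this example; unlike create_packed_sequences, it takes seq_len as an explicit argument so that it is self-contained.

```python
def pack_tokens(batch_input_ids, seq_len):
    """Flatten token lists and slice them into (input_ids, labels) blocks.

    Hypothetical helper for illustration; it mirrors the packing logic
    described above but is not the code from the example script.
    """
    # Flatten every sequence in the batch into one token stream.
    stream = [tok for ids in batch_input_ids for tok in ids]

    block = seq_len + 1                 # one extra token so labels can be shifted
    num_blocks = len(stream) // block   # a trailing partial block is dropped

    input_ids, labels = [], []
    for i in range(num_blocks):
        chunk = stream[i * block:(i + 1) * block]
        input_ids.append(chunk[:-1])    # first seq_len tokens
        labels.append(chunk[1:])        # last seq_len tokens
    return {"input_ids": input_ids, "labels": labels}


# Toy batch: three "documents" of different lengths, 12 tokens in total.
toy_batch = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10, 11, 12]]
packed = pack_tokens(toy_batch, seq_len=4)
print(packed["input_ids"])  # [[1, 2, 3, 4], [6, 7, 8, 9]]
print(packed["labels"])     # [[2, 3, 4, 5], [7, 8, 9, 10]]
```

In this toy run, tokens 11 and 12 do not fill a block and are discarded, and a packed block can span the boundary between two source documents.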

Stage 3 -- DataLoader construction: The packed dataset is shuffled once with a fixed seed, then wrapped in a DataLoader with a DistributedSampler that partitions examples across data-parallel ranks. Because the dataset is already shuffled, the sampler itself is created with shuffle=False. The local batch size is computed as global_batch_size // dp_size (integer division). A custom collate_fn converts lists of dicts into batched tensors.
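To make the partitioning concrete, the following pure-Python sketch reproduces how a DistributedSampler with shuffle=False and the default drop_last=False assigns indices to ranks. It is an illustrative re-implementation, not the library code; the function name rank_indices and the values chosen for global_batch_size, dp_size, and the dataset length are assumptions.

```python
import math


def rank_indices(dataset_len, num_replicas, rank):
    """Indices one rank would iterate over (shuffle=False, drop_last=False).

    Hypothetical helper; assumes dataset_len >= num_replicas.
    """
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices = list(range(dataset_len))
    # Pad by wrapping around so every rank gets the same number of samples.
    indices += indices[:total_size - len(indices)]
    # Each rank takes every num_replicas-th index, starting at its own rank.
    return indices[rank:total_size:num_replicas]


global_batch_size = 8   # assumed value
dp_size = 2             # assumed number of data-parallel ranks
local_batch_size = global_batch_size // dp_size

for rank in range(dp_size):
    idx = rank_indices(dataset_len=9, num_replicas=dp_size, rank=rank)
    # Number of batches this rank's DataLoader yields when drop_last=False.
    steps = math.ceil(len(idx) / local_batch_size)
    print(rank, idx, steps)
# 0 [0, 2, 4, 6, 8] 2
# 1 [1, 3, 5, 7, 0] 2
```

With 9 examples and 2 ranks, index 0 is repeated on rank 1 so that both ranks see 5 examples and therefore run the same number of steps.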

Usage

Use this pattern whenever preparing data for distributed causal language model training with fixed-length sequences. It is especially important when using context parallelism, which requires all sequences to have the same length for even sharding across the CP dimension.
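The even-sharding requirement can be illustrated with a simple split of one sequence across context-parallel ranks. The helper shard_for_cp, the contiguous chunk layout, and the cp_size value are assumptions for illustration only; an actual context-parallel implementation may assign tokens to ranks in a different layout.

```python
def shard_for_cp(sequence, cp_size):
    """Split one sequence into cp_size equal contiguous chunks (hypothetical helper)."""
    if len(sequence) % cp_size != 0:
        raise ValueError(
            f"sequence length {len(sequence)} is not divisible by cp_size {cp_size}"
        )
    chunk = len(sequence) // cp_size
    return [sequence[r * chunk:(r + 1) * chunk] for r in range(cp_size)]


packed_block = list(range(8))           # stands in for a packed block with seq_len = 8
print(shard_for_cp(packed_block, 4))    # [[0, 1], [2, 3], [4, 5], [6, 7]]

# A variable-length (unpacked) sequence generally cannot be split evenly.
try:
    shard_for_cp(list(range(7)), 4)
except ValueError as err:
    print(err)
```

Because every packed block has exactly seq_len tokens, choosing seq_len as a multiple of the CP size guarantees that each rank receives the same number of tokens for every example.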

Code Reference

Source Location

  • Repository: transformers
  • File: examples/3D_parallel.py
  • Lines: 158-243

Signature

def create_packed_sequences(examples):
    """Pack variable-length tokenized sequences into fixed-length blocks."""
    ...

Import

import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| examples | dict | Yes | A batch of tokenized examples with "input_ids" key (list of lists of ints). |
| seq_len | int | Yes (closure) | Target sequence length for packed blocks (e.g. 1024). |

Outputs

| Name | Type | Description |
|---|---|---|
| input_ids | list[list[int]] | Packed input token sequences, each of length seq_len. |
| labels | list[list[int]] | Packed label sequences, each of length seq_len, shifted by one token from input_ids. |
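The output contract above can be checked mechanically. The sketch below is a hypothetical validation helper (check_packed_output is not part of the example script) that verifies the block lengths and the one-token shift between input_ids and labels.

```python
def check_packed_output(packed, seq_len):
    """Validate packed output against the I/O contract; return the block count.

    Hypothetical helper for illustration. Raises ValueError on a violation.
    """
    input_ids, labels = packed["input_ids"], packed["labels"]
    if len(input_ids) != len(labels):
        raise ValueError("input_ids and labels contain different numbers of blocks")
    for i, (ids, lab) in enumerate(zip(input_ids, labels)):
        if len(ids) != seq_len or len(lab) != seq_len:
            raise ValueError(f"block {i} does not have length {seq_len}")
        # labels are the same tokens shifted left by one position
        if ids[1:] != lab[:-1]:
            raise ValueError(f"block {i}: labels are not input_ids shifted by one")
    return len(input_ids)


example = {
    "input_ids": [[1, 2, 3, 4], [6, 7, 8, 9]],
    "labels": [[2, 3, 4, 5], [7, 8, 9, 10]],
}
print(check_packed_output(example, seq_len=4))  # 2
```

A check like this could be run on a handful of rows after the map step, before building the DataLoader.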

DataLoader Outputs

| Name | Type | Description |
|---|---|---|
| batch["input_ids"] | torch.Tensor | Shape (local_batch_size, seq_len), dtype torch.long. |
| batch["labels"] | torch.Tensor | Shape (local_batch_size, seq_len), dtype torch.long. |

Usage Examples

Basic Usage

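import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Assumed to be defined by the surrounding training script:
#   tokenizer          - the model's tokenizer
#   global_batch_size  - total batch size across all data-parallel ranks
#   dp_mesh            - the data-parallel device mesh

seq_len = 1024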

# Step 1: Load and tokenize
raw_dataset = load_dataset("roneneldan/TinyStories", split="train[:1%]")

def tokenize_function(examples):
    tokenized_batch = tokenizer(
        examples["text"], padding=False, truncation=True,
        max_length=seq_len, return_tensors=None
    )
    tokenized_batch["labels"] = tokenized_batch["input_ids"].copy()
    return tokenized_batch

tokenized_dataset = raw_dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# Step 2: Pack sequences
def create_packed_sequences(examples):
    # Flatten every tokenized sequence in this batch into one token stream.
    all_tokens = []
    for input_ids in examples["input_ids"]:
        all_tokens.extend(input_ids)

    # Each block holds seq_len + 1 tokens so that inputs and labels can be
    # offset by one position; a trailing partial block is dropped.
    num_sequences = len(all_tokens) // (seq_len + 1)
    packed_input_ids = []
    packed_labels = []

    for i in range(num_sequences):
        start_idx = i * (seq_len + 1)
        end_idx = start_idx + (seq_len + 1)
        full_sequence = all_tokens[start_idx:end_idx]
        packed_input_ids.append(full_sequence[:-1])  # first seq_len tokens
        packed_labels.append(full_sequence[1:])      # last seq_len tokens

    return {"input_ids": packed_input_ids, "labels": packed_labels}

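packed_dataset = tokenized_dataset.map(
    create_packed_sequences, batched=True,
    remove_columns=tokenized_dataset.column_names,
    batch_size=1000, num_proc=60,  # num_proc is hard-coded; adjust to the available CPU cores
)
# Shuffle once with a fixed seed so that every rank sees the same order.
packed_dataset = packed_dataset.shuffle(seed=42)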

# Step 3: Create DataLoader
local_batch_size = global_batch_size // dp_mesh.size()

def collate_fn(batch):
    input_ids = torch.tensor([item["input_ids"] for item in batch], dtype=torch.long)
    labels = torch.tensor([item["labels"] for item in batch], dtype=torch.long)
    return {"input_ids": input_ids, "labels": labels}

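# The dataset was already shuffled above, so the sampler keeps that order
# and only partitions it across data-parallel ranks.
sampler = DistributedSampler(
    packed_dataset, num_replicas=dp_mesh.size(),
    rank=dp_mesh.get_local_rank(), shuffle=False,
)

# shuffle must stay False here because a sampler is supplied.
dataloader = DataLoader(
    packed_dataset, batch_size=local_batch_size,
    sampler=sampler, shuffle=False,
    collate_fn=collate_fn, pin_memory=True,
)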

Related Pages

Implements Principle
