Implementation:Huggingface Transformers Create Packed Sequences

Knowledge Sources	Transformers, Hugging Face Datasets
Domains	Data_Processing, Training
Last Updated	2026-02-13 00:00 GMT

Overview

Concrete pattern for packing variable-length tokenized sequences into fixed-length training blocks, as used in the Hugging Face Transformers 3D parallel training example.

Description

This pattern implements the full data preparation pipeline for packed-sequence training in a distributed setting. It consists of three stages:

Stage 1 -- Tokenization: The raw text dataset is tokenized using the model's tokenizer with truncation to seq_len but without padding. Labels are set equal to input_ids for causal language modeling; Stage 2 later replaces them with shifted labels.

Stage 2 -- Packing: The create_packed_sequences function flattens all tokenized sequences in a batch into a single token stream, then slices it into non-overlapping blocks of seq_len + 1 tokens. For each block, the first seq_len tokens become input_ids and the last seq_len tokens (the same block shifted by one position) become labels. Tokens left over after the last full block of a batch are discarded. The function is applied via dataset.map() with batched processing and multiprocessing for efficiency.

Stage 3 -- DataLoader construction: The packed dataset is shuffled, then wrapped in a DataLoader with a DistributedSampler that partitions examples across data-parallel ranks. The local batch size is computed as global_batch_size // dp_size. A custom collate_fn converts lists of dicts into batched tensors.
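
The packing stage can be illustrated on a toy input. The following is a minimal sketch, not the example script itself: pack_tokens is a hypothetical helper that takes seq_len as an argument instead of reading it from the enclosing scope, and it also returns the discarded tail so the effect is visible.

```python
def pack_tokens(sequences, seq_len):
    """Toy version of the packing step; seq_len is passed explicitly here."""
    # Flatten every tokenized sequence into one token stream.
    stream = [tok for seq in sequences for tok in seq]

    block = seq_len + 1  # one extra token so inputs and labels can be offset
    num_blocks = len(stream) // block

    input_ids, labels = [], []
    for i in range(num_blocks):
        chunk = stream[i * block:(i + 1) * block]
        input_ids.append(chunk[:-1])  # first seq_len tokens of the block
        labels.append(chunk[1:])      # last seq_len tokens (shifted by one)

    leftover = stream[num_blocks * block:]  # tail that does not fill a block
    return input_ids, labels, leftover


toy_sequences = [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11, 12]]
packed_inputs, packed_labels, dropped = pack_tokens(toy_sequences, seq_len=4)
print(packed_inputs)  # [[1, 2, 3, 4], [6, 7, 8, 9]]
print(packed_labels)  # [[2, 3, 4, 5], [7, 8, 9, 10]]
print(dropped)        # [11, 12]
```

Because the blocks do not overlap, the final token of each block (5 and 10 here) appears only as a label and never as an input.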

Usage

Use this pattern whenever preparing data for distributed causal language model training with fixed-length sequences. It is especially important when using context parallelism, which requires all sequences to have the same length for even sharding across the CP dimension.
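
To see why uniform lengths matter for context parallelism, consider a simplified sketch in which each CP rank receives one contiguous slice of a sequence. The helper split_for_context_parallel and the contiguous layout are assumptions made for illustration; real context-parallel implementations may arrange the shards differently.

```python
def split_for_context_parallel(token_ids, cp_size):
    """Split one sequence into cp_size equal contiguous shards (illustrative only)."""
    if len(token_ids) % cp_size != 0:
        # A ragged length cannot be divided evenly across the CP ranks.
        raise ValueError(
            f"sequence length {len(token_ids)} is not divisible by cp_size {cp_size}"
        )
    shard_len = len(token_ids) // cp_size
    return [token_ids[r * shard_len:(r + 1) * shard_len] for r in range(cp_size)]


cp_shards = split_for_context_parallel(list(range(8)), cp_size=4)
print(cp_shards)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

With packed blocks every example has exactly seq_len tokens, so choosing seq_len as a multiple of the CP size keeps this check satisfied for the whole dataset.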

Code Reference

Source Location

Repository: transformers
File: examples/3D_parallel.py
Lines: 158-243

Signature

def create_packed_sequences(examples):
    """Pack variable-length tokenized sequences into fixed-length blocks."""
    ...

Import

import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

I/O Contract

Inputs

Name	Type	Required	Description
examples	dict	Yes	A batch of tokenized examples with `"input_ids"` key (list of lists of ints).
seq_len	int	Yes (closure)	Target sequence length for packed blocks (e.g. 1024).

Outputs

Name	Type	Description
input_ids	list[list[int]]	Packed input token sequences, each of length `seq_len`.
labels	list[list[int]]	Packed label sequences, each of length `seq_len`, shifted by one token from input_ids.
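
The output contract above can be checked mechanically. This is a small sketch using a hypothetical helper, check_packed_batch, which is not part of the example script; it verifies the row lengths and the one-token shift between input_ids and labels.

```python
def check_packed_batch(batch, seq_len):
    """Return True if a packed batch matches the documented output contract."""
    inputs, labels = batch["input_ids"], batch["labels"]
    if len(inputs) != len(labels):
        return False
    for row_in, row_lab in zip(inputs, labels):
        # Every packed row must have exactly seq_len tokens.
        if len(row_in) != seq_len or len(row_lab) != seq_len:
            return False
        # labels[k] must be the token that follows input_ids[k].
        if row_in[1:] != row_lab[:-1]:
            return False
    return True


good_batch = {"input_ids": [[1, 2, 3, 4]], "labels": [[2, 3, 4, 5]]}
bad_batch = {"input_ids": [[1, 2, 3, 4]], "labels": [[1, 2, 3, 4]]}
print(check_packed_batch(good_batch, seq_len=4))  # True
print(check_packed_batch(bad_batch, seq_len=4))   # False
```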

DataLoader Outputs

Name	Type	Description
batch["input_ids"]	torch.Tensor	Shape `(local_batch_size, seq_len)`, dtype `torch.long`.
batch["labels"]	torch.Tensor	Shape `(local_batch_size, seq_len)`, dtype `torch.long`.
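
The split across data-parallel ranks can be emulated in plain Python. The sketch below mirrors, to the best of my understanding, how DistributedSampler assigns indices when shuffle=False and drop_last is left at its default of False: the index list is padded by wrapping around to the start, then each rank takes every num_replicas-th index. The function name sampler_indices and the sizes used are illustrative assumptions.

```python
import math


def sampler_indices(dataset_len, num_replicas, rank):
    """Emulate unshuffled index assignment for one data-parallel rank."""
    # Each rank must receive the same number of samples.
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas

    # Pad by repeating indices from the start until total_size is reached.
    repeats = math.ceil(total_size / dataset_len)
    indices = (list(range(dataset_len)) * repeats)[:total_size]

    # Strided slice: rank r takes positions r, r + num_replicas, ...
    return indices[rank:total_size:num_replicas]


global_batch_size = 32  # illustrative value
dp_size = 4             # illustrative value
local_batch_size = global_batch_size // dp_size
print(local_batch_size)  # 8

rank_shards = [sampler_indices(10, dp_size, r) for r in range(dp_size)]
print(rank_shards)  # [[0, 4, 8], [1, 5, 9], [2, 6, 0], [3, 7, 1]]
```

Since the packed dataset is already shuffled before the sampler is created, leaving the sampler unshuffled still gives each rank a random subset of blocks.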

Usage Examples

Basic Usage

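import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Assumed to be defined elsewhere: tokenizer (the model's tokenizer),
# global_batch_size (int), and dp_mesh (the data-parallel device mesh).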

seq_len = 1024

# Step 1: Load and tokenize
raw_dataset = load_dataset("roneneldan/TinyStories", split="train[:1%]")

def tokenize_function(examples):
    tokenized_batch = tokenizer(
        examples["text"], padding=False, truncation=True,
        max_length=seq_len, return_tensors=None
    )
    tokenized_batch["labels"] = tokenized_batch["input_ids"].copy()
    return tokenized_batch

tokenized_dataset = raw_dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# Step 2: Pack sequences
def create_packed_sequences(examples):
    all_tokens = []
    for input_ids in examples["input_ids"]:
        all_tokens.extend(input_ids)

    num_sequences = len(all_tokens) // (seq_len + 1)
    packed_input_ids = []
    packed_labels = []

    for i in range(num_sequences):
        start_idx = i * (seq_len + 1)
        end_idx = start_idx + (seq_len + 1)
        full_sequence = all_tokens[start_idx:end_idx]
        packed_input_ids.append(full_sequence[:-1])
        packed_labels.append(full_sequence[1:])

    return {"input_ids": packed_input_ids, "labels": packed_labels}

packed_dataset = tokenized_dataset.map(
    create_packed_sequences, batched=True,
    remove_columns=tokenized_dataset.column_names,
    batch_size=1000, num_proc=60,  # 60 worker processes; lower this on machines with fewer CPU cores
)
packed_dataset = packed_dataset.shuffle(seed=42)

# Step 3: Create DataLoader
local_batch_size = global_batch_size // dp_mesh.size()

def collate_fn(batch):
    input_ids = torch.tensor([item["input_ids"] for item in batch], dtype=torch.long)
    labels = torch.tensor([item["labels"] for item in batch], dtype=torch.long)
    return {"input_ids": input_ids, "labels": labels}

sampler = DistributedSampler(
    packed_dataset, num_replicas=dp_mesh.size(),
    rank=dp_mesh.get_local_rank(), shuffle=False,
)

dataloader = DataLoader(
    packed_dataset, batch_size=local_batch_size,
    sampler=sampler, shuffle=False,
    collate_fn=collate_fn, pin_memory=True,
)

Related Pages

Implements Principle

Principle:Huggingface_Transformers_Sequence_Packing
