Implementation:Huggingface Transformers Create Packed Sequences
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Training |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete pattern for packing variable-length tokenized sequences into fixed-length training blocks as used in the Hugging Face Transformers 3D parallel training example.
Description
This pattern implements the full data preparation pipeline for packed-sequence training in a distributed setting. It consists of three stages:
Stage 1 -- Tokenization: The raw text dataset is tokenized using the model's tokenizer with truncation to seq_len but without padding. Labels are set equal to input_ids for causal language modeling.
Stage 2 -- Packing: The create_packed_sequences function flattens all tokenized sequences in a batch into a single token stream, then slices it into blocks of seq_len + 1 tokens. For each block, the first seq_len tokens become input_ids and the last seq_len tokens (the same block shifted by one position) become labels, replacing the labels set in Stage 1. Tokens left over at the end of a batch that do not fill a complete block are dropped. This is applied via dataset.map() with batched processing and multiprocessing for efficiency.
Stage 3 -- DataLoader construction: The packed dataset is shuffled, then wrapped in a DataLoader with a DistributedSampler that partitions examples across data-parallel ranks. The local batch size is computed as global_batch_size // dp_size. A custom collate_fn converts lists of dicts into batched tensors.
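The packing step in Stage 2 can be traced on a toy input. The sketch below is a minimal standalone version of the same flatten, slice, and shift logic; the helper name pack_tokens and the toy token values are illustrative assumptions, not part of the example script.

```python
def pack_tokens(sequences, seq_len):
    """Flatten token lists and cut the stream into (input_ids, labels) blocks."""
    # Concatenate every sequence into one token stream.
    stream = [tok for seq in sequences for tok in seq]
    block = seq_len + 1  # one extra token so labels can be shifted by one
    num_blocks = len(stream) // block  # tokens that do not fill a block are dropped
    input_ids, labels = [], []
    for i in range(num_blocks):
        chunk = stream[i * block:(i + 1) * block]
        input_ids.append(chunk[:-1])  # first seq_len tokens
        labels.append(chunk[1:])      # last seq_len tokens
    return {"input_ids": input_ids, "labels": labels}


# Hypothetical toy data: three "documents" with 12 tokens in total.
toy = [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11, 12]]
packed = pack_tokens(toy, seq_len=4)
print(packed["input_ids"])  # [[1, 2, 3, 4], [6, 7, 8, 9]]
print(packed["labels"])     # [[2, 3, 4, 5], [7, 8, 9, 10]]
```

With 12 tokens and blocks of 5, two blocks are produced and tokens 11 and 12 are discarded. Document boundaries are not preserved: a block may span several source sequences.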
Usage
Use this pattern whenever preparing data for distributed causal language model training with fixed-length sequences. It is especially important when using context parallelism, which requires all sequences to have the same length for even sharding across the CP dimension.
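To illustrate why a uniform length matters for context parallelism, the sketch below splits one sequence into equal contiguous shards, one per CP rank. It is a simplified illustration only: the helper shard_sequence is hypothetical, and real context-parallel implementations may assign chunks to ranks differently.

```python
def shard_sequence(tokens, cp_size):
    """Split one sequence into cp_size contiguous, equal-length shards."""
    if len(tokens) % cp_size != 0:
        raise ValueError(
            f"sequence length {len(tokens)} is not divisible by cp_size={cp_size}"
        )
    shard_len = len(tokens) // cp_size
    return [tokens[r * shard_len:(r + 1) * shard_len] for r in range(cp_size)]


# A packed block of length 8 splits evenly across 4 hypothetical CP ranks.
shards = shard_sequence(list(range(8)), cp_size=4)
print(shards)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

Because every packed block has exactly seq_len tokens, choosing seq_len as a multiple of the CP size is enough for this kind of split to succeed on every example.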
Code Reference
Source Location
- Repository: transformers
- File: examples/3D_parallel.py
- Lines: 158-243
Signature
def create_packed_sequences(examples):
    """Pack variable-length tokenized sequences into fixed-length blocks."""
    ...
Import
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| examples | dict | Yes | A batch of tokenized examples with "input_ids" key (list of lists of ints). |
| seq_len | int | Yes (closure) | Target sequence length for packed blocks (e.g. 1024). |
Outputs
| Name | Type | Description |
|---|---|---|
| input_ids | list[list[int]] | Packed input token sequences, each of length seq_len. |
| labels | list[list[int]] | Packed label sequences, each of length seq_len, shifted by one token from input_ids. |
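The output contract above can be checked mechanically. The following is a minimal sketch of such a check; the function name check_packed_batch is a hypothetical helper and does not appear in the example script.

```python
def check_packed_batch(batch, seq_len):
    """Raise ValueError if a packed batch violates the length or shift contract."""
    input_ids, labels = batch["input_ids"], batch["labels"]
    if len(input_ids) != len(labels):
        raise ValueError("input_ids and labels contain different numbers of rows")
    for row, (x, y) in enumerate(zip(input_ids, labels)):
        if len(x) != seq_len or len(y) != seq_len:
            raise ValueError(f"row {row}: expected length {seq_len}")
        # labels are input_ids shifted left by one token, so they overlap
        # on seq_len - 1 positions.
        if x[1:] != y[:-1]:
            raise ValueError(f"row {row}: labels are not input_ids shifted by one")
    return True


good = {"input_ids": [[1, 2, 3, 4]], "labels": [[2, 3, 4, 5]]}
print(check_packed_batch(good, seq_len=4))  # True
```

Running a check like this on the first packed batch is a cheap way to catch an off-by-one in the block size before starting a long training run.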
DataLoader Outputs
| Name | Type | Description |
|---|---|---|
| batch["input_ids"] | torch.Tensor | Shape (local_batch_size, seq_len), dtype torch.long. |
| batch["labels"] | torch.Tensor | Shape (local_batch_size, seq_len), dtype torch.long. |
Usage Examples
Basic Usage
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Assumed to be defined earlier in the script: tokenizer, global_batch_size, dp_mesh
seq_len = 1024
# Step 1: Load and tokenize
raw_dataset = load_dataset("roneneldan/TinyStories", split="train[:1%]")
def tokenize_function(examples):
    # No padding here: packing in step 2 produces the fixed-length blocks
    tokenized_batch = tokenizer(
        examples["text"], padding=False, truncation=True,
        max_length=seq_len, return_tensors=None
    )
    tokenized_batch["labels"] = tokenized_batch["input_ids"].copy()
    return tokenized_batch

tokenized_dataset = raw_dataset.map(tokenize_function, batched=True, remove_columns=["text"])
# Step 2: Pack sequences
def create_packed_sequences(examples):
    # Flatten every sequence in the batch into one token stream
    all_tokens = []
    for input_ids in examples["input_ids"]:
        all_tokens.extend(input_ids)

    # Each block holds seq_len + 1 tokens; any remainder is dropped
    num_sequences = len(all_tokens) // (seq_len + 1)
    packed_input_ids = []
    packed_labels = []
    for i in range(num_sequences):
        start_idx = i * (seq_len + 1)
        end_idx = start_idx + (seq_len + 1)
        full_sequence = all_tokens[start_idx:end_idx]
        packed_input_ids.append(full_sequence[:-1])  # first seq_len tokens
        packed_labels.append(full_sequence[1:])      # shifted by one token
    return {"input_ids": packed_input_ids, "labels": packed_labels}

# num_proc=60 is the value used in the example; lower it to match the available CPU cores
packed_dataset = tokenized_dataset.map(
    create_packed_sequences, batched=True,
    remove_columns=tokenized_dataset.column_names,
    batch_size=1000, num_proc=60,
)
packed_dataset = packed_dataset.shuffle(seed=42)
# Step 3: Create DataLoader
local_batch_size = global_batch_size // dp_mesh.size()
def collate_fn(batch):
    # Stack equal-length rows into (local_batch_size, seq_len) tensors
    input_ids = torch.tensor([item["input_ids"] for item in batch], dtype=torch.long)
    labels = torch.tensor([item["labels"] for item in batch], dtype=torch.long)
    return {"input_ids": input_ids, "labels": labels}

# The dataset was already shuffled above, so the sampler keeps shuffle=False
sampler = DistributedSampler(
    packed_dataset, num_replicas=dp_mesh.size(),
    rank=dp_mesh.get_local_rank(), shuffle=False,
)
dataloader = DataLoader(
    packed_dataset, batch_size=local_batch_size,
    sampler=sampler, shuffle=False,
    collate_fn=collate_fn, pin_memory=True,
)
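
To make the partitioning in step 3 concrete, the sketch below simulates in plain Python how examples would be assigned to data-parallel ranks when the sampler is created with shuffle=False and the default drop_last=False. It is an approximation based on the documented behavior of DistributedSampler, not the library code; the function name simulate_rank_indices and the sizes used are assumptions for illustration.

```python
import math


def simulate_rank_indices(dataset_len, num_replicas, rank):
    """Approximate the indices one rank receives (shuffle=False, drop_last=False)."""
    if dataset_len <= 0:
        raise ValueError("dataset_len must be positive")
    indices = list(range(dataset_len))
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    padding = total_size - dataset_len
    if padding > 0:
        # Pad by repeating examples from the start so every rank gets the same count.
        repeats = math.ceil(padding / dataset_len)
        indices += (indices * repeats)[:padding]
    # Each rank takes every num_replicas-th index, starting at its own rank.
    return indices[rank:total_size:num_replicas]


# Hypothetical sizes: 10 packed examples, 4 data-parallel ranks.
global_batch_size, dp_size = 8, 4
local_batch_size = global_batch_size // dp_size
per_rank = [simulate_rank_indices(10, dp_size, r) for r in range(dp_size)]
print(local_batch_size)  # 2
print(per_rank)          # [[0, 4, 8], [1, 5, 9], [2, 6, 0], [3, 7, 1]]
```

Two details are worth noting. When the dataset size is not a multiple of the number of ranks, some examples are repeated so that all ranks see the same number of steps. The floor division for local_batch_size also silently reduces the effective global batch size if global_batch_size is not a multiple of the data-parallel size.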