Implementation:Microsoft LoRA Pack Dataset

Overview

The pack_dataset.py script packs multiple short sequence-to-sequence examples into longer composite examples to improve training efficiency by reducing padding waste.

Description

In seq2seq tasks like summarization and translation, training datasets often contain many short examples that are padded to the maximum sequence length during batching, which wastes computation on padding tokens. This script concatenates consecutive source-target pairs for as long as both the combined source and the combined target stay within the token limit, producing fewer but longer training examples.
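To make the padding cost concrete, the following sketch computes the share of token slots taken up by padding when every example is padded to a fixed length. The function name `padding_fraction` and the example lengths are illustrative assumptions, not part of the script.

```python
from typing import Sequence


def padding_fraction(lengths: Sequence[int], max_len: int) -> float:
    """Share of token slots that hold padding when each example is padded to max_len."""
    total_slots = len(lengths) * max_len
    # Examples longer than max_len would be truncated, so cap them at max_len.
    real_tokens = sum(min(n, max_len) for n in lengths)
    return 1.0 - real_tokens / total_slots


# Four short examples (hypothetical token counts), each padded to 128 tokens.
unpacked = padding_fraction([12, 30, 8, 25], max_len=128)

# The same 75 tokens placed in a single 128-token sequence.
packed = padding_fraction([75], max_len=128)
```

With these made-up lengths, roughly 85% of the slots are padding before packing and roughly 41% after. The comparison assumes token counts add up when texts are joined, which holds only approximately for real tokenizers.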

The packing logic works as follows:

Examples are processed sequentially (not sorted by length).
For each new example pair, the script checks whether appending it to the current accumulated example (separated by a space) would exceed max_tokens for either source or target.
If it would exceed the limit, the current accumulated example is finalized and a new one begins.
If it fits, the new text is appended with a space separator.
Validation and test splits are copied as-is (not packed), since packing is only beneficial for training.
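The steps above can be sketched as a greedy, order-preserving loop. This is a minimal illustration under stated assumptions rather than the script's own code: `pack_pairs` is an illustrative name, and `count_tokens` is a hypothetical whitespace-based stand-in for the tokenizer length check.

```python
from typing import Callable, List, Tuple


def count_tokens(text: str) -> int:
    """Hypothetical stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def pack_pairs(
    sources: List[str],
    targets: List[str],
    max_tokens: int,
    length_fn: Callable[[str], int] = count_tokens,
) -> Tuple[List[str], List[str]]:
    """Greedily merge consecutive (source, target) pairs without reordering."""
    if not sources:
        return [], []
    packed_src: List[str] = []
    packed_tgt: List[str] = []
    cur_src, cur_tgt = sources[0], targets[0]
    for src, tgt in zip(sources[1:], targets[1:]):
        cand_src = cur_src + " " + src
        cand_tgt = cur_tgt + " " + tgt
        if length_fn(cand_src) > max_tokens or length_fn(cand_tgt) > max_tokens:
            # Either side would overflow: finalize the accumulated example
            # and start a new one from the current pair.
            packed_src.append(cur_src)
            packed_tgt.append(cur_tgt)
            cur_src, cur_tgt = src, tgt
        else:
            # Both sides still fit: keep accumulating.
            cur_src, cur_tgt = cand_src, cand_tgt
    # Flush the last accumulated example.
    packed_src.append(cur_src)
    packed_tgt.append(cur_tgt)
    return packed_src, packed_tgt


packed = pack_pairs(["a b", "c d", "e f g", "h"], ["x", "y", "z", "w"], max_tokens=4)
```

In this sketch a single pair that already exceeds `max_tokens` is emitted on its own and is not truncated. Because the loop never reorders examples, the result depends on the order of the input files.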

Key functions:

pack_examples(tok, src_examples, tgt_examples, max_tokens=1024): Core packing logic. Uses the provided tokenizer to check sequence lengths. Returns packed source and target lists.
pack_data_dir(tok, data_dir, max_tokens, save_path): Processes the train split from {split}.source/{split}.target files. Copies val and test splits unchanged.
packer_cli(): CLI entry point that parses arguments, loads a tokenizer via AutoTokenizer, and calls pack_data_dir.
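The directory-level flow described for pack_data_dir can be approximated as follows. This is a sketch under assumptions, not the script's implementation: `pack_dir_sketch` and the `pack_fn` callback are hypothetical names, and the packing step is passed in so the example stays independent of any tokenizer.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Tuple

# Hypothetical callback type: takes source and target lines, returns packed lists.
PackFn = Callable[[List[str], List[str]], Tuple[List[str], List[str]]]


def pack_dir_sketch(data_dir: Path, save_path: Path, pack_fn: PackFn) -> Tuple[int, int]:
    """Pack the train split with pack_fn; copy val and test files unchanged.

    Returns the number of train examples before and after packing.
    """
    save_path.mkdir(parents=True, exist_ok=True)

    # Only the train split is packed.
    src_lines = (data_dir / "train.source").read_text(encoding="utf-8").splitlines()
    tgt_lines = (data_dir / "train.target").read_text(encoding="utf-8").splitlines()
    packed_src, packed_tgt = pack_fn(src_lines, tgt_lines)
    (save_path / "train.source").write_text("\n".join(packed_src), encoding="utf-8")
    (save_path / "train.target").write_text("\n".join(packed_tgt), encoding="utf-8")

    # Validation and test files are copied byte for byte.
    for split in ("val", "test"):
        for side in ("source", "target"):
            name = f"{split}.{side}"
            shutil.copyfile(data_dir / name, save_path / name)

    return len(src_lines), len(packed_src)
```

Passing the packing step as a callback is a choice made for this sketch only; the script itself takes a tokenizer and a token limit, as shown in its signature.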

⚠️ DEPRECATED: This file resides in the legacy/ directory and is not actively maintained. Prefer modern equivalents where available.

Usage

Use this script when:

Training seq2seq models on datasets with many short examples (e.g., short sentence pairs in translation).
Reducing the number of training steps by combining several examples into each training sequence.
Pre-processing data for models like BART or T5, where the maximum sequence length is much longer than a typical example.

Code Reference

Source Location

examples/NLU/examples/legacy/seq2seq/pack_dataset.py (88 lines)

Signature

def pack_examples(tok, src_examples: list, tgt_examples: list, max_tokens: int = 1024) -> tuple: ...
def pack_data_dir(tok, data_dir: Path, max_tokens: int, save_path: str) -> None: ...
def packer_cli() -> None: ...

Import / CLI Usage

# Paths are relative to examples/NLU in the repository (see Source Location)
python examples/legacy/seq2seq/pack_dataset.py \
    --tok_name facebook/bart-large-cnn \
    --max_seq_len 128 \
    --data_dir /path/to/data \
    --save_path /path/to/packed_data

I/O Contract

Inputs

Input	Type	Description
`--tok_name`	str	HuggingFace tokenizer name or path (e.g., `facebook/bart-large-cnn`, `t5-base`)
`--max_seq_len`	int	Maximum number of tokens allowed in each packed source and each packed target. Default: 128
`--data_dir`	str	Directory containing `train.source`, `train.target`, `val.source`, `val.target`, `test.source`, `test.target`
`--save_path`	str	Output directory for packed dataset files

Outputs

Output	Type	Description
`{save_path}/train.source`	Text file	Packed training source examples (fewer lines than original)
`{save_path}/train.target`	Text file	Packed training target examples (aligned with source)
`{save_path}/val.source`, `{save_path}/val.target`	Text files	Copied unchanged from input
`{save_path}/test.source`, `{save_path}/test.target`	Text files	Copied unchanged from input
Console output	stdout	Reports the number of training examples before and after packing (e.g., "packed train split from 10000 examples -> 3500.")
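Since each packed source line must stay aligned with its target line, a quick line-count check on the output directory can catch a mismatch before training. The helper below is a hypothetical utility written for illustration and is not part of pack_dataset.py.

```python
from pathlib import Path
from typing import Dict, Sequence


def count_lines(path: Path) -> int:
    """Count lines in a text file without loading it all into memory."""
    with path.open(encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_alignment(
    directory: Path, splits: Sequence[str] = ("train", "val", "test")
) -> Dict[str, int]:
    """Return {split: line_count}; raise if source and target counts differ."""
    counts: Dict[str, int] = {}
    for split in splits:
        n_src = count_lines(directory / f"{split}.source")
        n_tgt = count_lines(directory / f"{split}.target")
        if n_src != n_tgt:
            raise ValueError(f"{split}: {n_src} source lines vs {n_tgt} target lines")
        counts[split] = n_src
    return counts
```

Calling `check_alignment(Path("/path/to/packed_data"))` on the input and output directories also gives the before and after example counts for the train split.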

Usage Examples

# Pack a summarization dataset for BART
python examples/legacy/seq2seq/pack_dataset.py \
    --tok_name facebook/bart-large-cnn \
    --max_seq_len 1024 \
    --data_dir ./cnn_dm/ \
    --save_path ./cnn_dm_packed/

# Prints a line of the form (the packed count here is illustrative): packed train split from 287113 examples -> 95000.

# Pack a translation dataset for T5 with shorter sequences
python examples/legacy/seq2seq/pack_dataset.py \
    --tok_name t5-base \
    --max_seq_len 128 \
    --data_dir ./wmt_en_de/ \
    --save_path ./wmt_en_de_packed/
