Implementation:Microsoft LoRA Pack Dataset

From Leeroopedia
Revision as of 15:43, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Microsoft_LoRA_Pack_Dataset.md)

Overview

The pack_dataset.py script packs multiple short sequence-to-sequence examples into longer composite examples to improve training efficiency by reducing padding waste.

Description

In seq2seq tasks like summarization and translation, training datasets often contain many short examples that are padded to the maximum sequence length during batching, so much of the computation is spent on padding tokens. This script concatenates consecutive source-target pairs for as long as both the combined source and the combined target stay within the token limit, producing fewer but longer training examples.
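To make the padding cost concrete, the following sketch computes the fraction of padded positions before and after packing. The token counts are made-up illustrative values, not measurements from any dataset, and `padding_fraction` is a hypothetical helper that is not part of the script.

```python
def padding_fraction(lengths, max_len):
    """Fraction of positions that are padding when each example is padded to max_len."""
    total_slots = len(lengths) * max_len
    real_tokens = sum(min(n, max_len) for n in lengths)
    return 1 - real_tokens / total_slots


# Hypothetical token counts for four short examples.
unpacked = [20, 35, 15, 30]
# The same four examples concatenated into one sequence (separator tokens ignored).
packed = [sum(unpacked)]

print(padding_fraction(unpacked, max_len=128))  # roughly 0.80
print(padding_fraction(packed, max_len=128))    # roughly 0.22
```

Under these assumed lengths, packing four examples into one 128-token sequence cuts the padded share from about 80% to about 22%.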

The packing logic works as follows:

  1. Examples are processed sequentially (not sorted by length).
  2. For each new example pair, the script checks whether appending it to the current accumulated example (separated by a space) would exceed max_tokens for either source or target.
  3. If it would exceed the limit, the current accumulated example is finalized and a new one begins.
  4. If it fits, the new text is appended with a space separator.
  5. Validation and test splits are copied as-is (not packed), since packing is only beneficial for training.
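The steps above can be sketched as a short greedy loop. This is a minimal illustration under stated assumptions rather than the script's actual code: `pack_pairs` and `whitespace_count` are hypothetical names, and a whitespace word count stands in for the tokenizer so that the example runs without any model download.

```python
def pack_pairs(count_tokens, src_examples, tgt_examples, max_tokens=1024):
    """Greedily merge consecutive (source, target) pairs under a token budget.

    count_tokens is an assumed callable that returns the token count of a
    string; the real script measures length with a tokenizer instead.
    """
    if not src_examples:
        return [], []
    packed_src, packed_tgt = [], []
    pairs = list(zip(src_examples, tgt_examples))
    cur_src, cur_tgt = pairs[0]
    for src, tgt in pairs[1:]:
        # Candidate: current accumulated example plus the new pair, space-separated.
        cand_src = cur_src + " " + src
        cand_tgt = cur_tgt + " " + tgt
        if count_tokens(cand_src) > max_tokens or count_tokens(cand_tgt) > max_tokens:
            # Does not fit: finalize the accumulated example and start a new one.
            packed_src.append(cur_src)
            packed_tgt.append(cur_tgt)
            cur_src, cur_tgt = src, tgt
        else:
            # Fits: keep accumulating.
            cur_src, cur_tgt = cand_src, cand_tgt
    # Flush the last accumulated example.
    packed_src.append(cur_src)
    packed_tgt.append(cur_tgt)
    return packed_src, packed_tgt


def whitespace_count(text):
    """Stand-in length function: counts whitespace-separated words."""
    return len(text.split())


src = ["a b", "c d", "e f g", "h"]
tgt = ["1", "2", "3", "4"]
packed_src, packed_tgt = pack_pairs(whitespace_count, src, tgt, max_tokens=4)
print(packed_src)  # ['a b c d', 'e f g h']
print(packed_tgt)  # ['1 2', '3 4']
```

Note that in this sketch a single example that already exceeds the budget is passed through unchanged rather than truncated.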

Key functions:

  • pack_examples(tok, src_examples, tgt_examples, max_tokens=1024): Core packing logic. Uses the provided tokenizer to check sequence lengths. Returns packed source and target lists.
  • pack_data_dir(tok, data_dir, max_tokens, save_path): Reads the train split from train.source and train.target, packs it, and writes the result to save_path. Copies the val and test splits unchanged.
  • packer_cli(): CLI entry point that parses arguments, loads a tokenizer via AutoTokenizer, and calls pack_data_dir.
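A directory-level wrapper consistent with the description of pack_data_dir might look like the sketch below. It is an assumption-laden illustration, not the script's code: `pack_dir` and its `pack_fn` parameter (any callable that takes source and target lists and returns packed lists) are hypothetical, and the demo uses a trivial merge function on a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def pack_dir(pack_fn, data_dir, save_path):
    """Pack the train split with pack_fn and copy val/test unchanged."""
    data_dir, save_path = Path(data_dir), Path(save_path)
    save_path.mkdir(parents=True, exist_ok=True)
    src_docs = (data_dir / "train.source").read_text().splitlines()
    tgt_docs = (data_dir / "train.target").read_text().splitlines()
    packed_src, packed_tgt = pack_fn(src_docs, tgt_docs)
    print(f"packed train split from {len(src_docs)} examples -> {len(packed_src)}.")
    (save_path / "train.source").write_text("\n".join(packed_src))
    (save_path / "train.target").write_text("\n".join(packed_tgt))
    # Evaluation splits are copied as-is.
    for split in ("val", "test"):
        for ext in ("source", "target"):
            shutil.copyfile(data_dir / f"{split}.{ext}", save_path / f"{split}.{ext}")
    return len(src_docs), len(packed_src)


# Demo with a hypothetical pack function that merges everything into one example.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    for split in ("train", "val", "test"):
        (data_dir / f"{split}.source").write_text("s1\ns2\n")
        (data_dir / f"{split}.target").write_text("t1\nt2\n")
    merge_all = lambda src, tgt: ([" ".join(src)], [" ".join(tgt)])
    out_dir = Path(tmp) / "packed"
    n_before, n_after = pack_dir(merge_all, data_dir, out_dir)
    packed_train = (out_dir / "train.source").read_text()
    copied_val = (out_dir / "val.source").read_text()
```

Passing the packing function as an argument keeps the file handling separate from the length check, which makes the wrapper easy to exercise without a tokenizer.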

⚠️ DEPRECATED: This file resides in the legacy/ directory and is not actively maintained. Prefer modern equivalents where available.

Usage

Use this script when:

  • Training seq2seq models on datasets with many short examples (e.g., short sentence pairs in translation).
  • Reducing the number of training steps per epoch by combining several examples into each batch element.
  • Pre-processing data for models like BART or T5 where the tokenizer supports long sequences.

Code Reference

Source Location

examples/NLU/examples/legacy/seq2seq/pack_dataset.py (88 lines)

Signature

def pack_examples(tok, src_examples: list, tgt_examples: list, max_tokens: int = 1024) -> tuple: ...
def pack_data_dir(tok, data_dir: Path, max_tokens: int, save_path: str) -> None: ...
def packer_cli() -> None: ...

Import / CLI Usage

# Run from the examples/NLU directory, so the relative path matches the source location above
python examples/legacy/seq2seq/pack_dataset.py \
    --tok_name facebook/bart-large-cnn \
    --max_seq_len 128 \
    --data_dir /path/to/data \
    --save_path /path/to/packed_data

I/O Contract

Inputs

Input | Type | Description
--tok_name | str | Hugging Face tokenizer name or path (e.g., facebook/bart-large-cnn, t5-base)
--max_seq_len | int | Maximum number of tokens in each packed source or target sequence. Default: 128
--data_dir | str | Directory containing train.source, train.target, val.source, val.target, test.source, test.target
--save_path | str | Output directory for packed dataset files
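The inputs above map directly onto an argparse parser. The following is a sketch of how such a parser could be declared; `build_parser` is a hypothetical helper, and only the flag names and the default of 128 are taken from this page.

```python
import argparse


def build_parser():
    """Declare the four CLI flags listed in the Inputs table (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Pack a seq2seq dataset.")
    parser.add_argument("--tok_name", type=str, help="tokenizer name or path")
    parser.add_argument("--max_seq_len", type=int, default=128,
                        help="token budget for each packed source or target")
    parser.add_argument("--data_dir", type=str, help="directory with {split}.source/.target files")
    parser.add_argument("--save_path", type=str, help="output directory")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is self-contained.
args = build_parser().parse_args(
    ["--tok_name", "t5-base", "--data_dir", "./wmt_en_de/", "--save_path", "./wmt_en_de_packed/"]
)
print(args.max_seq_len)  # 128, because --max_seq_len was omitted
```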

Outputs

Output | Type | Description
{save_path}/train.source | Text file | Packed training source examples (fewer lines than original)
{save_path}/train.target | Text file | Packed training target examples (aligned with source)
{save_path}/val.source, {save_path}/val.target | Text files | Copied unchanged from input
{save_path}/test.source, {save_path}/test.target | Text files | Copied unchanged from input
Console output | stdout | Reports packing ratio (e.g., "packed train split from 10000 examples -> 3500.")
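Because the packed source and target files must stay line-aligned, a quick sanity check after packing can be useful. The sketch below is not part of the script: `count_lines` and `check_split_alignment` are hypothetical helpers, and the demo writes two tiny files to a temporary directory.

```python
import tempfile
from pathlib import Path


def count_lines(path):
    """Number of lines in a text file."""
    return len(Path(path).read_text().splitlines())


def check_split_alignment(save_path, split="train"):
    """Raise if {split}.source and {split}.target differ in line count."""
    save_path = Path(save_path)
    n_src = count_lines(save_path / f"{split}.source")
    n_tgt = count_lines(save_path / f"{split}.target")
    if n_src != n_tgt:
        raise ValueError(f"{split}: {n_src} source lines vs {n_tgt} target lines")
    return n_src


# Demo on a throwaway directory holding two aligned packed examples.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "train.source").write_text("a b c d\ne f g h")
    Path(tmp, "train.target").write_text("1 2\n3 4")
    n_packed = check_split_alignment(tmp)
```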

Usage Examples

# Pack a summarization dataset for BART
python examples/legacy/seq2seq/pack_dataset.py \
    --tok_name facebook/bart-large-cnn \
    --max_seq_len 1024 \
    --data_dir ./cnn_dm/ \
    --save_path ./cnn_dm_packed/

# Output (the packed count depends on the data and tokenizer): packed train split from 287113 examples -> 95000.

# Pack a translation dataset for T5 with shorter sequences
python examples/legacy/seq2seq/pack_dataset.py \
    --tok_name t5-base \
    --max_seq_len 128 \
    --data_dir ./wmt_en_de/ \
    --save_path ./wmt_en_de_packed/
