Implementation:Microsoft LoRA Legacy Seq2Seq Utils
Template:Implementation metadata
Overview
Comprehensive utility module for seq2seq fine-tuning providing dataset classes, data collators, custom samplers, loss functions, metric computation (ROUGE, BLEU), and parameter freezing helpers.
Description
utils.py is a utility library in the legacy seq2seq examples directory of the Microsoft LoRA NLU repository. It provides the full data processing and evaluation infrastructure for seq2seq fine-tuning of summarization and translation models. The module contains:
- Dataset classes:
AbstractSeq2SeqDataset(base class with sortish sampling and dynamic batching),Seq2SeqDataset(usingprepare_seq2seq_batch), andLegacySeq2SeqDataset(using manual tokenization). All datasets read from paired.source/.targetline-indexed text files vialinecache. - Data collator:
Seq2SeqDataCollatorthat handles tokenization, padding, trimming, and decoder input preparation with support for T5-style and BART-style right-shifting of labels. - Custom samplers:
SortishSampler(sorts by source length with randomness for efficiency) andDistributedSortishSampler(distributed-aware variant). - Loss functions:
label_smoothed_nll_loss()adapted from fairseq for label smoothing regularization. - Metrics:
calculate_rouge()usingrouge_scorerwith bootstrap aggregation, andcalculate_bleu()usingsacrebleu. - Parameter freezing:
freeze_params(),freeze_embeds()(model-type-aware for T5/mT5, FSMT, and BART),assert_all_frozen(),assert_not_all_frozen(). - I/O utilities:
save_json(),load_json(),pickle_load(),pickle_save(),write_txt_file(),save_git_info().
This module is part of the HuggingFace Transformers library (legacy examples) bundled in the Microsoft LoRA repository.
⚠️ DEPRECATED: This file resides in the legacy/ directory and is not actively maintained. Prefer modern equivalents where available.
Usage
Use this module's classes and functions when building seq2seq training pipelines for summarization or translation. Import Seq2SeqDataset and Seq2SeqDataCollator for data loading, build_compute_metrics_fn for evaluation metrics, and the freezing utilities for transfer learning with frozen components.
Code Reference
Source Location
| Property | Value |
|---|---|
| File path | examples/NLU/examples/legacy/seq2seq/utils.py
|
| Lines | 664 |
| Module | utils (within seq2seq directory)
|
Key Classes
| Name | Type | Signature / Description |
|---|---|---|
AbstractSeq2SeqDataset |
class | __init__(self, tokenizer, data_dir, max_source_length, max_target_length, type_path="train", n_obs=None, prefix="", **dataset_kwargs) -- base class with sortish/dynamic sampling
|
Seq2SeqDataset |
class | Returns {"tgt_texts": str, "src_texts": str, "id": int} per item; uses prepare_seq2seq_batch in collate_fn
|
LegacySeq2SeqDataset |
class | Returns {"input_ids": Tensor, "attention_mask": Tensor, "labels": Tensor} per item; manual tokenization with encode_line
|
Seq2SeqDataCollator |
class | __init__(self, tokenizer, data_args, decoder_start_token_id, tpu_num_cores=None) -- callable collator handling T5 and BART decoder input preparation
|
SortishSampler |
class | __init__(self, data, batch_size, shuffle=True) -- sorts by source length with randomness for efficient batching
|
DistributedSortishSampler |
class | __init__(self, dataset, batch_size, num_replicas=None, rank=None, add_extra_examples=True, shuffle=True)
|
Key Functions
| Name | Signature | Description |
|---|---|---|
label_smoothed_nll_loss |
label_smoothed_nll_loss(lprobs, target, epsilon, ignore_index=-100) |
Label smoothing loss from fairseq; returns (smoothed_loss, nll_loss)
|
calculate_rouge |
calculate_rouge(pred_lns, tgt_lns, use_stemmer=True, rouge_keys=ROUGE_KEYS, return_precision_and_recall=False, bootstrap_aggregation=True, newline_sep=True) |
Computes ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum scores |
calculate_bleu |
calculate_bleu(output_lns, refs_lns, **kwargs) |
Computes corpus BLEU score using sacrebleu |
build_compute_metrics_fn |
build_compute_metrics_fn(task_name, tokenizer) |
Returns a metrics function: ROUGE for summarization, BLEU for translation |
trim_batch |
trim_batch(input_ids, pad_token_id, attention_mask=None) |
Removes columns that are entirely padding |
freeze_params |
freeze_params(model: nn.Module) |
Sets requires_grad=False for all parameters
|
freeze_embeds |
freeze_embeds(model) |
Freezes embedding layers based on model type (T5/mT5, FSMT, BART) |
assert_all_frozen |
assert_all_frozen(model) |
Asserts no parameters require gradients |
use_task_specific_params |
use_task_specific_params(model, task) |
Updates model config with task-specific params from model.config.task_specific_params
|
lmap |
lmap(f, x) |
Shorthand for list(map(f, x))
|
save_json |
save_json(content, path, indent=4, **json_dump_kwargs) |
Writes JSON to file |
check_output_dir |
check_output_dir(args, expected_items=0) |
Validates output directory state before training |
Import Usage
from utils import (
Seq2SeqDataCollator,
Seq2SeqDataset,
assert_all_frozen,
build_compute_metrics_fn,
check_output_dir,
freeze_embeds,
freeze_params,
label_smoothed_nll_loss,
calculate_rouge,
calculate_bleu,
lmap,
save_json,
use_task_specific_params,
write_txt_file,
)
I/O Contract
Inputs (Seq2SeqDataset)
| Input | Type | Description |
|---|---|---|
tokenizer |
PreTrainedTokenizer |
HuggingFace tokenizer for encoding source and target |
data_dir |
str |
Directory containing {type_path}.source and {type_path}.target files
|
max_source_length |
int |
Maximum source tokenization length |
max_target_length |
int |
Maximum target tokenization length |
type_path |
str (default "train") |
Split name: "train", "val", or "test"
|
n_obs |
Optional[int] |
Limit number of observations (None for all) |
prefix |
str |
Prefix to prepend to source lines (e.g., "summarize: " for T5)
|
Inputs (label_smoothed_nll_loss)
| Input | Type | Description |
|---|---|---|
lprobs |
Tensor |
Log probabilities from model output, shape (batch, seq_len, vocab_size)
|
target |
Tensor |
Target token IDs, shape (batch, seq_len)
|
epsilon |
float |
Label smoothing factor |
ignore_index |
int (default -100) |
Index to ignore in loss computation (padding) |
Outputs
| Output | Type | Description |
|---|---|---|
Seq2SeqDataset[i] |
Dict[str, str] |
{"tgt_texts": str, "src_texts": str, "id": int}
|
Seq2SeqDataCollator(batch) |
Dict[str, Tensor] |
{"input_ids", "attention_mask", "decoder_input_ids", "labels"}
|
label_smoothed_nll_loss() |
Tuple[Tensor, Tensor] |
(smoothed_loss, nll_loss)
|
calculate_rouge() |
Dict[str, float] |
{"rouge1": float, "rouge2": float, "rougeL": float, "rougeLsum": float}
|
calculate_bleu() |
Dict[str, float] |
{"bleu": float}
|
Usage Examples
Creating a Seq2SeqDataset
from transformers import AutoTokenizer
from utils import Seq2SeqDataset
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
dataset = Seq2SeqDataset(
tokenizer=tokenizer,
data_dir="/data/cnn_dm/",
max_source_length=1024,
max_target_length=128,
type_path="train",
prefix="",
)
print(f"Dataset size: {len(dataset)}")
print(f"First example: {dataset[0]}")
Computing ROUGE Metrics
from utils import calculate_rouge
predictions = ["The cat sat on the mat.", "Dogs are loyal animals."]
references = ["A cat was sitting on a mat.", "Dogs are very loyal pets."]
scores = calculate_rouge(predictions, references)
print(scores)
# {"rouge1": 65.2, "rouge2": 33.1, "rougeL": 60.0, "rougeLsum": 60.0}
Using Label Smoothing Loss
import torch
from utils import label_smoothed_nll_loss
# lprobs: (batch_size, seq_len, vocab_size)
lprobs = torch.randn(2, 10, 50265).log_softmax(dim=-1)
target = torch.randint(0, 50265, (2, 10))
loss, nll_loss = label_smoothed_nll_loss(lprobs, target, epsilon=0.1)
print(f"Smoothed loss: {loss.item()}, NLL loss: {nll_loss.item()}")
Freezing Embeddings
from transformers import AutoModelForSeq2SeqLM
from utils import freeze_embeds, assert_all_frozen
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")
freeze_embeds(model)
# Verify encoder embeddings are frozen
# assert_all_frozen(model.model.encoder.embed_tokens)