Implementation:Microsoft LoRA Seq2Seq Trainer

Overview

The seq2seq_trainer.py module provides a custom Seq2SeqTrainer class that extends the HuggingFace Trainer with label smoothing, pad token loss exclusion, custom optimizer/scheduler setup, and seq2seq-specific prediction steps.

Description

The Seq2SeqTrainer class extends transformers.Trainer to handle the specific needs of sequence-to-sequence training:

Initialization:

Accepts optional config and data_args parameters; if no config is provided, it falls back to the model's own config.
Handles FSMTConfig specially by using tgt_vocab_size instead of vocab_size.
When label_smoothing == 0, uses standard CrossEntropyLoss with ignore_index set to the pad token ID; otherwise dynamically imports label_smoothed_nll_loss.
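The three initialization decisions above can be sketched without any framework dependency. This is a minimal illustration, not the trainer's code: `StubConfig`, `StubFSMTConfig`, and `resolve_setup` are hypothetical stand-ins for the real config classes and for the logic inside `__init__`, and the loss is represented by a descriptive string rather than a real loss object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubConfig:
    """Hypothetical stand-in for a model config."""
    vocab_size: int = 32000
    pad_token_id: Optional[int] = 0


@dataclass
class StubFSMTConfig(StubConfig):
    """Hypothetical stand-in for FSMTConfig, which has a separate target vocabulary size."""
    tgt_vocab_size: int = 40000


def resolve_setup(model_config, label_smoothing, config=None):
    """Mirror the described decisions: config fallback, vocab size, loss choice."""
    # Fall back to the model's own config when none is passed explicitly.
    cfg = config if config is not None else model_config
    # FSMT-style configs expose the target vocabulary under a different attribute.
    if isinstance(cfg, StubFSMTConfig):
        vocab_size = cfg.tgt_vocab_size
    else:
        vocab_size = cfg.vocab_size
    # Zero smoothing means plain cross-entropy that skips pad positions.
    if label_smoothing == 0:
        loss_name = f"CrossEntropyLoss(ignore_index={cfg.pad_token_id})"
    else:
        loss_name = "label_smoothed_nll_loss"
    return cfg, vocab_size, loss_name
```

Calling `resolve_setup(StubConfig(), 0)` selects the pad-ignoring cross-entropy path, while any nonzero `label_smoothing` selects the smoothed loss.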

Optimizer and Scheduler:

create_optimizer_and_scheduler(num_training_steps): Configures either AdamW or Adafactor optimizer with parameter groups (weight decay excluded for bias and LayerNorm). Supports FairScale's OSS (Optimizer State Sharding) for distributed training.
_get_lr_scheduler(num_training_steps): Maps scheduler name to implementation via arg_to_scheduler:

Scheduler Name	Implementation
`linear`	`get_linear_schedule_with_warmup`
`cosine`	`get_cosine_schedule_with_warmup`
`cosine_w_restarts`	`get_cosine_with_hard_restarts_schedule_with_warmup`
`polynomial`	`get_polynomial_decay_schedule_with_warmup`
`constant`	`get_constant_schedule`
`constant_w_warmup`	`get_constant_schedule_with_warmup`
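
To make the optimizer grouping and the name-based scheduler lookup concrete, here is a small dependency-free sketch. It is an illustration under assumptions, not the trainer's implementation: `group_parameters`, the `*_multiplier` functions, `SCHEDULES`, and `lr_at` are hypothetical names, the name fragments used to exempt parameters from weight decay are assumed, and the multipliers are simplified re-derivations of warmup schedules rather than the `transformers` functions listed in the table (which take an optimizer and return a scheduler object).

```python
import math


def group_parameters(named_params, weight_decay):
    """Split parameter names into a decayed group and an exempt group."""
    no_decay = ("bias", "LayerNorm.weight")  # assumed name fragments
    decayed = [n for n in named_params if not any(nd in n for nd in no_decay)]
    exempt = [n for n in named_params if any(nd in n for nd in no_decay)]
    return [
        {"params": decayed, "weight_decay": weight_decay},
        {"params": exempt, "weight_decay": 0.0},
    ]


def linear_multiplier(step, warmup_steps, total_steps):
    """Linear warmup from 0 to 1, then linear decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


def cosine_multiplier(step, warmup_steps, total_steps):
    """Linear warmup from 0 to 1, then half-cosine decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


def constant_multiplier(step, warmup_steps, total_steps):
    """No warmup and no decay."""
    return 1.0


# Name-to-function lookup, analogous in spirit to arg_to_scheduler.
SCHEDULES = {
    "linear": linear_multiplier,
    "cosine": cosine_multiplier,
    "constant": constant_multiplier,
}


def lr_at(name, base_lr, step, warmup_steps, total_steps):
    """Learning rate at a given step for the schedule selected by name."""
    return base_lr * SCHEDULES[name](step, warmup_steps, total_steps)
```

A dictionary lookup keeps the scheduler choice a plain string argument, so an unknown name fails immediately with a `KeyError` instead of silently falling back to a default.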

Training:

_get_train_sampler(): Returns appropriate sampler (TPU, sortish, random, or distributed).
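_compute_loss(model, inputs, labels): Computes the loss in one of three ways: (1) with no label smoothing and ignore_pad_token_for_loss set, runs the model without labels and applies CrossEntropyLoss so that pad positions are excluded; (2) with no label smoothing otherwise, lets the model's forward pass compute the loss from the labels; (3) with label smoothing, applies label_smoothed_nll_loss to the log-softmax of the logits. Returns both the loss and the logits.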
compute_loss(model, inputs): Public interface that pops labels and delegates to _compute_loss.
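The label-smoothed mode is the least obvious of the three, so a pure-Python sketch may help. It assumes the common fairseq-style formulation, in which the smoothed loss mixes the ordinary negative log-likelihood with a uniform penalty over the vocabulary; the helper actually imported by the trainer is not reproduced here, and `log_softmax` and `label_smoothed_nll` are hypothetical names operating on plain lists instead of tensors.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def label_smoothed_nll(logits_per_token, targets, epsilon, ignore_index):
    """Return (smoothed_loss, plain_nll), each summed over non-ignored tokens."""
    nll_total = 0.0
    smooth_total = 0.0
    vocab_size = len(logits_per_token[0])
    for logits, target in zip(logits_per_token, targets):
        if target == ignore_index:
            continue  # pad positions contribute nothing to the loss
        lprobs = log_softmax(logits)
        nll_total += -lprobs[target]   # standard negative log-likelihood term
        smooth_total += -sum(lprobs)   # uniform-over-vocabulary term
    loss = (1.0 - epsilon) * nll_total + (epsilon / vocab_size) * smooth_total
    return loss, nll_total
```

With `epsilon` set to 0 the smoothed loss reduces to the plain negative log-likelihood, which is why the trainer can treat zero smoothing as a separate, simpler path.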

Evaluation:

prediction_step(...): Overrides the Trainer's prediction step to optionally generate sequences (when predict_with_generate is enabled) and pad tensors to uniform length.
_pad_tensors_to_max_len(tensor, max_length): Pads generated or label tensors to the max generation length using the pad or EOS token ID.
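The padding step can be illustrated with nested lists standing in for a 2-D tensor of token IDs. This is a sketch under assumptions rather than the method itself: `pad_to_max_len` is a hypothetical name, and the choice to raise when neither token ID is available is assumed behavior for the illustration.

```python
def pad_to_max_len(rows, max_length, pad_token_id=None, eos_token_id=None):
    """Right-pad every row of token IDs to max_length.

    Uses the pad token ID when available and the EOS token ID otherwise.
    """
    fill = pad_token_id if pad_token_id is not None else eos_token_id
    if fill is None:
        raise ValueError("Either pad_token_id or eos_token_id is required for padding.")
    padded = []
    for row in rows:
        if len(row) > max_length:
            raise ValueError("Row is already longer than max_length.")
        padded.append(list(row) + [fill] * (max_length - len(row)))
    return padded
```

Padding generated sequences and labels to the same fixed length lets per-batch outputs be concatenated across an evaluation run even when individual batches stop generating at different lengths.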

⚠️ DEPRECATED: This file resides in the legacy/ directory and is not actively maintained. Prefer modern equivalents where available.

Usage

Use this trainer when you need to:

Train seq2seq models (BART, T5, mBART, FSMT, Pegasus) with the HuggingFace training loop.
Apply label smoothing to seq2seq tasks.
Select a learning rate scheduler by name from the arg_to_scheduler mapping instead of relying on the standard Trainer default.
Evaluate with generation-based metrics (BLEU, ROUGE) via predict_with_generate.

Code Reference

Source Location

examples/NLU/examples/legacy/seq2seq/seq2seq_trainer.py (258 lines)

Signature

arg_to_scheduler = {
    "linear": get_linear_schedule_with_warmup,
    "cosine": get_cosine_schedule_with_warmup,
    "cosine_w_restarts": get_cosine_with_hard_restarts_schedule_with_warmup,
    "polynomial": get_polynomial_decay_schedule_with_warmup,
    "constant": get_constant_schedule,
    "constant_w_warmup": get_constant_schedule_with_warmup,
}

class Seq2SeqTrainer(Trainer):
    def __init__(self, config=None, data_args=None, *args, **kwargs): ...
    def create_optimizer_and_scheduler(self, num_training_steps: int) -> None: ...
    def _get_lr_scheduler(self, num_training_steps: int) -> scheduler: ...
    def _get_train_sampler(self) -> Optional[Sampler]: ...
    def _compute_loss(self, model, inputs, labels) -> Tuple[loss, logits]: ...
    def compute_loss(self, model, inputs) -> loss: ...
    def prediction_step(self, model, inputs, prediction_loss_only, ignore_keys=None) -> Tuple: ...
    def _pad_tensors_to_max_len(self, tensor, max_length) -> Tensor: ...

Import / CLI Usage

from seq2seq_trainer import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_args=data_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)
trainer.train()

I/O Contract

Inputs

Input	Type	Description
`config`	PretrainedConfig (optional)	Model configuration; inferred from model if not provided
`data_args`	dataclass (optional)	Data arguments including `ignore_pad_token_for_loss`, `val_max_target_length`, `eval_beams`
`model`	PreTrainedModel	The seq2seq model to train
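`args`	TrainingArguments subclass	Training arguments; the trainer reads `label_smoothing`, `lr_scheduler`, `adafactor`, `sortish_sampler`, and `predict_with_generate`, several of which are not fields of the base `TrainingArguments` and are expected to come from the legacy `Seq2SeqTrainingArguments` subclass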
`train_dataset`, `eval_dataset`	Dataset	Training and evaluation datasets
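
For readers unsure what `data_args` should look like, the sketch below shows one possible shape. It is an assumption-laden illustration: `DataArgs` is a hypothetical dataclass holding only the three fields named in the table, its default values are illustrative, and `generation_kwargs` shows an assumed resolution order in which values from `data_args` take precedence and the model config is the fallback when `data_args` is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataArgs:
    """Hypothetical container for the fields the trainer reads from data_args."""
    ignore_pad_token_for_loss: bool = True
    val_max_target_length: int = 142      # illustrative default
    eval_beams: Optional[int] = None      # None defers to the model's own setting


def generation_kwargs(data_args, config_max_length, config_num_beams):
    """Assumed resolution order: data_args first, model config as fallback."""
    if data_args is None:
        return {"max_length": config_max_length, "num_beams": config_num_beams}
    return {
        "max_length": data_args.val_max_target_length,
        "num_beams": data_args.eval_beams,
    }
```

Any object exposing these three attributes would serve the same purpose; a dataclass is used here only because the table describes `data_args` as one.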

Outputs

Output	Type	Description
Training loss	float	Cross-entropy or label-smoothed loss
Prediction output	Tuple[loss, logits/generated_tokens, labels]	Loss, generated sequences or logits, and padded labels during evaluation
Trained model	PreTrainedModel	Fine-tuned seq2seq model saved via Trainer's save methods

Usage Examples

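from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from seq2seq_trainer import Seq2SeqTrainer
# label_smoothing, predict_with_generate, and lr_scheduler are not fields of the
# base TrainingArguments. This assumes the companion legacy module
# seq2seq_training_args.py sits next to seq2seq_trainer.py on the import path.
from seq2seq_training_args import Seq2SeqTrainingArguments

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

# A nonzero label_smoothing makes the trainer import label_smoothed_nll_loss,
# so the module that provides it must be importable as well.
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    label_smoothing=0.1,
    predict_with_generate=True,
    lr_scheduler="cosine",
    warmup_steps=500,
)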

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)

# Train
trainer.train()

# Evaluate with generation
results = trainer.evaluate()
