Implementation:Microsoft LoRA Seq2Seq Trainer
Overview
The seq2seq_trainer.py module provides a custom Seq2SeqTrainer class that extends the HuggingFace Trainer with label smoothing, pad token loss exclusion, custom optimizer/scheduler setup, and seq2seq-specific prediction steps.
Description
The Seq2SeqTrainer class extends transformers.Trainer to handle the specific needs of sequence-to-sequence training:
Initialization:
- Accepts optional config and data_args parameters; if no config is provided, it falls back to the model's own config.
- Handles FSMTConfig specially by using tgt_vocab_size instead of vocab_size.
- When label_smoothing == 0, uses standard CrossEntropyLoss with ignore_index set to the pad token ID; otherwise dynamically imports label_smoothed_nll_loss.
Optimizer and Scheduler:
- create_optimizer_and_scheduler(num_training_steps): Configures either an AdamW or an Adafactor optimizer with parameter groups (weight decay is excluded for bias and LayerNorm weights). Supports FairScale's OSS (Optimizer State Sharding) for distributed training.
- _get_lr_scheduler(num_training_steps): Maps the scheduler name to its implementation via arg_to_scheduler:
| Scheduler Name | Implementation |
|---|---|
| linear | get_linear_schedule_with_warmup |
| cosine | get_cosine_schedule_with_warmup |
| cosine_w_restarts | get_cosine_with_hard_restarts_schedule_with_warmup |
| polynomial | get_polynomial_decay_schedule_with_warmup |
| constant | get_constant_schedule |
| constant_w_warmup | get_constant_schedule_with_warmup |
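The name-based dispatch above can be illustrated without any framework. The following is a minimal sketch, assuming the usual definitions of the constant, constant-with-warmup, and linear-with-warmup schedules (a multiplier applied to the base learning rate at each step). The names NAME_TO_MULTIPLIER and lr_multiplier are hypothetical and do not appear in seq2seq_trainer.py; the real module passes the optimizer to the transformers schedule functions instead.

```python
# Illustrative only: pure-Python learning-rate multipliers, looked up by name.
# NAME_TO_MULTIPLIER and lr_multiplier are hypothetical helpers for this sketch.


def constant(step, warmup_steps=0, num_training_steps=0):
    # The learning rate never changes.
    return 1.0


def constant_w_warmup(step, warmup_steps, num_training_steps=0):
    # Ramp up linearly during warmup, then hold the base learning rate.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return 1.0


def linear(step, warmup_steps, num_training_steps):
    # Ramp up linearly during warmup, then decay linearly to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - warmup_steps))


NAME_TO_MULTIPLIER = {
    "constant": constant,
    "constant_w_warmup": constant_w_warmup,
    "linear": linear,
}


def lr_multiplier(name, step, warmup_steps=0, num_training_steps=0):
    # Resolve the schedule by name, mirroring a dictionary lookup by scheduler name.
    try:
        schedule = NAME_TO_MULTIPLIER[name]
    except KeyError:
        raise ValueError(f"unknown scheduler name: {name!r}") from None
    return schedule(step, warmup_steps, num_training_steps)
```

Note that the three schedules need different arguments: the constant schedule ignores both step counts, the warmup-only schedule ignores the total number of training steps, and the linear schedule needs both.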
Training:
- _get_train_sampler(): Returns the appropriate sampler (TPU, sortish, random, or distributed).
- _compute_loss(model, inputs, labels): Handles three loss computation modes: (1) standard with pad token exclusion, (2) standard via model forward, (3) label-smoothed NLL loss.
- compute_loss(model, inputs): Public interface that pops labels and delegates to _compute_loss.
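To make the label-smoothed mode concrete, here is a minimal framework-free sketch. It assumes the common fairseq-style formulation, in which the loss is (1 - epsilon) times the NLL plus epsilon / vocab_size times the negative sum of all log-probabilities, summed over non-pad tokens. The actual label_smoothed_nll_loss imported by the trainer operates on tensors and may differ in details such as reduction; the function names below are hypothetical.

```python
import math


def log_softmax(logits):
    # Numerically stable log-softmax over one token's logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def label_smoothed_nll(logits, targets, epsilon, pad_id):
    """Return (smoothed_loss, nll_loss), both summed over non-pad tokens.

    logits  -- list of per-token logit lists, each of length vocab_size
    targets -- list of target token ids, one per token
    """
    vocab_size = len(logits[0])
    nll_sum = 0.0
    smooth_sum = 0.0
    for token_logits, target in zip(logits, targets):
        if target == pad_id:
            continue  # pad positions contribute nothing to the loss
        lprobs = log_softmax(token_logits)
        nll_sum += -lprobs[target]   # standard negative log-likelihood
        smooth_sum += -sum(lprobs)   # uniform-prior term over the vocabulary
    loss = (1.0 - epsilon) * nll_sum + (epsilon / vocab_size) * smooth_sum
    return loss, nll_sum
```

With epsilon set to 0 the smoothed loss reduces to the plain NLL, which matches the description above of label smoothing as an optional alternative to the standard cross-entropy path.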
Evaluation:
- prediction_step(...): Overrides the Trainer's prediction step to optionally generate sequences (when predict_with_generate is enabled) and pad tensors to uniform length.
- _pad_tensors_to_max_len(tensor, max_length): Pads generated or label tensors to the max generation length using the pad or EOS token ID.
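The padding step can be sketched with plain Python lists. This is an illustration under stated assumptions, not the module's code: the real method works on rectangular tensors, whereas the hypothetical pad_to_max_len below accepts nested lists of token IDs. It keeps the behaviour described above of falling back to the EOS token ID when no pad token ID is defined.

```python
def pad_to_max_len(batch, max_length, pad_token_id=None, eos_token_id=None):
    """Right-pad every sequence in `batch` to `max_length`.

    Uses pad_token_id when it is defined, otherwise falls back to eos_token_id.
    """
    fill = pad_token_id if pad_token_id is not None else eos_token_id
    if fill is None:
        raise ValueError("either pad_token_id or eos_token_id must be defined")

    padded = []
    for seq in batch:
        if len(seq) > max_length:
            raise ValueError("sequence is longer than max_length")
        # Copy the tokens and append as many fill tokens as needed.
        padded.append(list(seq) + [fill] * (max_length - len(seq)))
    return padded
```

Padding to a uniform length matters during evaluation because per-batch outputs of different lengths cannot be concatenated into a single array for metric computation.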
⚠️ DEPRECATED: This file resides in the legacy/ directory and is not actively maintained. Prefer modern equivalents where available.
Usage
Use this trainer when:
- Training seq2seq models (BART, T5, mBART, FSMT, Pegasus) with the HuggingFace training loop.
- Needing label smoothing for seq2seq tasks.
- Requiring custom learning rate schedulers beyond the standard Trainer defaults.
- Evaluating with generation-based metrics (BLEU, ROUGE) via predict_with_generate.
Code Reference
Source Location
examples/NLU/examples/legacy/seq2seq/seq2seq_trainer.py (258 lines)
Signature
arg_to_scheduler = {
    "linear": get_linear_schedule_with_warmup,
    "cosine": get_cosine_schedule_with_warmup,
    "cosine_w_restarts": get_cosine_with_hard_restarts_schedule_with_warmup,
    "polynomial": get_polynomial_decay_schedule_with_warmup,
    "constant": get_constant_schedule,
    "constant_w_warmup": get_constant_schedule_with_warmup,
}

class Seq2SeqTrainer(Trainer):
    def __init__(self, config=None, data_args=None, *args, **kwargs): ...
    def create_optimizer_and_scheduler(self, num_training_steps: int) -> None: ...
    def _get_lr_scheduler(self, num_training_steps: int) -> scheduler: ...
    def _get_train_sampler(self) -> Optional[Sampler]: ...
    def _compute_loss(self, model, inputs, labels) -> Tuple[loss, logits]: ...
    def compute_loss(self, model, inputs) -> loss: ...
    def prediction_step(self, model, inputs, prediction_loss_only, ignore_keys=None) -> Tuple: ...
    def _pad_tensors_to_max_len(self, tensor, max_length) -> Tensor: ...
Import / CLI Usage
from seq2seq_trainer import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_args=data_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)
trainer.train()
I/O Contract
Inputs
| Input | Type | Description |
|---|---|---|
| config | PretrainedConfig (optional) | Model configuration; inferred from the model if not provided |
| data_args | dataclass (optional) | Data arguments including ignore_pad_token_for_loss, val_max_target_length, eval_beams |
| model | PreTrainedModel | The seq2seq model to train |
| args | Seq2SeqTrainingArguments | Training arguments; the legacy Seq2SeqTrainingArguments subclass of TrainingArguments provides label_smoothing, lr_scheduler, adafactor, sortish_sampler, predict_with_generate |
| train_dataset, eval_dataset | Dataset | Training and evaluation datasets |
Outputs
| Output | Type | Description |
|---|---|---|
| Training loss | float | Cross-entropy or label-smoothed loss |
| Prediction output | Tuple[loss, logits/generated_tokens, labels] | Loss, generated sequences or logits, and padded labels during evaluation |
| Trained model | PreTrainedModel | Fine-tuned seq2seq model saved via Trainer's save methods |
Usage Examples
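from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from seq2seq_trainer import Seq2SeqTrainer
# label_smoothing, predict_with_generate, and lr_scheduler are fields of the legacy
# Seq2SeqTrainingArguments (seq2seq_training_args.py in the same directory), not of
# the base transformers.TrainingArguments.
from seq2seq_training_args import Seq2SeqTrainingArguments

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    label_smoothing=0.1,
    predict_with_generate=True,
    lr_scheduler="cosine",
    warmup_steps=500,
)

# train_dataset and eval_dataset are assumed to have been prepared beforehand.
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)

# Train
trainer.train()

# Evaluate with generation
results = trainer.evaluate()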