Implementation:Microsoft LoRA Seq2Seq Trainer

From Leeroopedia


Overview

The seq2seq_trainer.py module provides a custom Seq2SeqTrainer class that extends the HuggingFace Trainer with label smoothing, pad token loss exclusion, custom optimizer/scheduler setup, and seq2seq-specific prediction steps.

Description

The Seq2SeqTrainer class extends transformers.Trainer to handle the specific needs of sequence-to-sequence training:

Initialization:

  • Accepts optional config and data_args parameters; if no config is provided, it falls back to the model's own config.
  • Handles FSMTConfig specially by using tgt_vocab_size instead of vocab_size.
  • When label_smoothing == 0, uses standard CrossEntropyLoss with ignore_index set to the pad token ID; otherwise dynamically imports label_smoothed_nll_loss.
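
The effect of setting ignore_index to the pad token ID can be illustrated without any deep learning dependency. The sketch below is a plain-Python approximation, not the trainer's code: the function names are hypothetical, and returning 0.0 for an all-padding batch is a choice made only for this example.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy_ignoring_pad(
    logits: Sequence[Sequence[float]],
    labels: Sequence[int],
    pad_token_id: int,
) -> float:
    """Mean negative log-likelihood over positions whose label is not the pad ID."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == pad_token_id:
            continue  # padded positions contribute nothing to the loss
        total -= log_softmax(row)[label]
        count += 1
    # Returning 0.0 for an all-padding batch is a simplification of this sketch.
    return total / count if count else 0.0


# Two positions over a 3-token vocabulary; the second label is padding (ID 0).
loss = cross_entropy_ignoring_pad(
    logits=[[0.0, 0.0, 0.0], [9.0, -3.0, 1.0]],
    labels=[1, 0],
    pad_token_id=0,
)
```

Because the second position is padding, its logits have no influence on the result; only the first position is averaged.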

Optimizer and Scheduler:

  • create_optimizer_and_scheduler(num_training_steps): Configures either AdamW or Adafactor optimizer with parameter groups (weight decay excluded for bias and LayerNorm). Supports FairScale's OSS (Optimizer State Sharding) for distributed training.
  • _get_lr_scheduler(num_training_steps): Maps scheduler name to implementation via arg_to_scheduler:
    Scheduler Name       Implementation
    linear               get_linear_schedule_with_warmup
    cosine               get_cosine_schedule_with_warmup
    cosine_w_restarts    get_cosine_with_hard_restarts_schedule_with_warmup
    polynomial           get_polynomial_decay_schedule_with_warmup
    constant             get_constant_schedule
    constant_w_warmup    get_constant_schedule_with_warmup
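
The two ideas in this section, excluding some parameters from weight decay and choosing scheduler arguments by name, can be sketched in plain Python. This is an illustration under assumptions: the helper names are hypothetical, the substrings "bias" and "LayerNorm.weight" are an assumed matching rule, and the groups hold parameter names here where a real optimizer would receive the tensors themselves. The keyword sets follow the signatures of the listed schedule functions (the constant schedule takes no warmup or step count, the constant-with-warmup schedule takes only warmup steps).

```python
from typing import Dict, Iterable, List

# Assumed substrings marking parameters that should not receive weight decay.
NO_DECAY_MARKERS = ("bias", "LayerNorm.weight")

SCHEDULER_NAMES = {
    "linear", "cosine", "cosine_w_restarts",
    "polynomial", "constant", "constant_w_warmup",
}


def split_by_weight_decay(param_names: Iterable[str], weight_decay: float) -> List[Dict]:
    """Build two optimizer groups: one decayed, one with weight decay disabled."""
    decay, no_decay = [], []
    for name in param_names:
        target = no_decay if any(m in name for m in NO_DECAY_MARKERS) else decay
        target.append(name)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


def scheduler_kwargs(name: str, warmup_steps: int, num_training_steps: int) -> Dict[str, int]:
    """Return the keyword arguments each schedule family expects besides the optimizer."""
    if name not in SCHEDULER_NAMES:
        raise KeyError(f"unknown scheduler name: {name}")
    if name == "constant":
        return {}
    if name == "constant_w_warmup":
        return {"num_warmup_steps": warmup_steps}
    return {"num_warmup_steps": warmup_steps, "num_training_steps": num_training_steps}


groups = split_by_weight_decay(
    ["encoder.fc.weight", "encoder.fc.bias", "encoder.LayerNorm.weight"],
    weight_decay=0.01,
)
```

In a real setup the resulting groups would be passed to the optimizer constructor and the keyword arguments to the function looked up in arg_to_scheduler.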

Training:

  • _get_train_sampler(): Returns the appropriate sampler for the current setup (TPU, sortish, random, or distributed).
  • _compute_loss(model, inputs, labels): Handles three loss computation modes: (1) cross-entropy computed in the trainer with pad tokens excluded, (2) the standard loss returned by the model's own forward pass, (3) label-smoothed NLL loss.
  • compute_loss(model, inputs): Public interface that pops labels and delegates to _compute_loss.
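
For readers unfamiliar with the third mode, the following is a minimal plain-Python sketch of one common formulation of label-smoothed NLL (summed over tokens, with the smoothing mass spread uniformly over the vocabulary). It is an assumption that the label_smoothed_nll_loss helper imported by the trainer follows this formulation; the function name and signature below are hypothetical.

```python
import math
from typing import Sequence, Tuple


def label_smoothed_nll(
    lprobs: Sequence[Sequence[float]],
    targets: Sequence[int],
    epsilon: float,
    ignore_index: int,
) -> Tuple[float, float]:
    """Return (smoothed loss, plain NLL), both summed over non-ignored positions.

    lprobs holds one row of log-probabilities per target position.
    """
    nll_sum, smooth_sum = 0.0, 0.0
    for row, target in zip(lprobs, targets):
        if target == ignore_index:
            continue  # padding positions are excluded from both terms
        nll_sum -= row[target]    # negative log-likelihood of the gold token
        smooth_sum -= sum(row)    # negative log-likelihood summed over all tokens
    vocab_size = len(lprobs[0])
    loss = (1.0 - epsilon) * nll_sum + (epsilon / vocab_size) * smooth_sum
    return loss, nll_sum


# Uniform distribution over a 4-token vocabulary; the last target is padding (ID 0).
uniform = [math.log(0.25)] * 4
loss, nll = label_smoothed_nll(
    [uniform, uniform, uniform], targets=[1, 2, 0], epsilon=0.1, ignore_index=0
)
```

With epsilon set to 0 the smoothed loss reduces to the plain NLL, which matches the trainer's choice of ordinary cross-entropy when label_smoothing == 0.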

Evaluation:

  • prediction_step(...): Overrides the Trainer's prediction step to optionally generate sequences (when predict_with_generate is enabled) and pad tensors to uniform length.
  • _pad_tensors_to_max_len(tensor, max_length): Pads generated or label tensors to the maximum generation length, filling with the pad token ID or, if none is defined, the EOS token ID.
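
The padding step can be sketched with nested lists standing in for tensors. This is an illustration only: the function name is hypothetical, and the fallback from the pad token ID to the EOS token ID when no pad ID is configured is an assumption about the helper's behavior.

```python
from typing import List, Optional, Sequence


def pad_to_max_len(
    batch: Sequence[Sequence[int]],
    max_length: int,
    pad_token_id: Optional[int],
    eos_token_id: Optional[int] = None,
) -> List[List[int]]:
    """Right-pad every sequence in the batch to max_length."""
    # Assumed fallback: use the EOS ID when no pad ID is available.
    fill = pad_token_id if pad_token_id is not None else eos_token_id
    if fill is None:
        raise ValueError("Either pad_token_id or eos_token_id must be set to pad sequences.")
    # Sequences already at or beyond max_length are returned unchanged.
    return [list(seq) + [fill] * (max_length - len(seq)) for seq in batch]


padded = pad_to_max_len([[5, 6], [7, 8, 9]], max_length=4, pad_token_id=0)
```

Padding to one fixed length lets predictions and labels from different batches be concatenated before metrics are computed.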

⚠️ DEPRECATED: This file resides in the legacy/ directory and is not actively maintained. Prefer modern equivalents where available.

Usage

Use this trainer when:

  • You are training seq2seq models (BART, T5, mBART, FSMT, Pegasus) with the HuggingFace training loop.
  • You need label smoothing for seq2seq tasks.
  • You require a learning rate scheduler, selected by name, other than the standard Trainer default.
  • You evaluate with generation-based metrics (BLEU, ROUGE) via predict_with_generate.

Code Reference

Source Location

examples/NLU/examples/legacy/seq2seq/seq2seq_trainer.py (258 lines)

Signature

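from typing import Optional, Tuple

import torch
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import Sampler
from transformers import Trainer
from transformers.optimization import (
    get_constant_schedule,
    get_constant_schedule_with_warmup,
    get_cosine_schedule_with_warmup,
    get_cosine_with_hard_restarts_schedule_with_warmup,
    get_linear_schedule_with_warmup,
    get_polynomial_decay_schedule_with_warmup,
)

# Maps the value of the lr_scheduler argument to a schedule factory.
arg_to_scheduler = {
    "linear": get_linear_schedule_with_warmup,
    "cosine": get_cosine_schedule_with_warmup,
    "cosine_w_restarts": get_cosine_with_hard_restarts_schedule_with_warmup,
    "polynomial": get_polynomial_decay_schedule_with_warmup,
    "constant": get_constant_schedule,
    "constant_w_warmup": get_constant_schedule_with_warmup,
}

# Method bodies are elided; only the signatures are shown.
class Seq2SeqTrainer(Trainer):
    def __init__(self, config=None, data_args=None, *args, **kwargs): ...
    def create_optimizer_and_scheduler(self, num_training_steps: int) -> None: ...
    def _get_lr_scheduler(self, num_training_steps: int) -> LambdaLR: ...
    def _get_train_sampler(self) -> Optional[Sampler]: ...
    def _compute_loss(self, model, inputs, labels) -> Tuple[torch.Tensor, torch.Tensor]: ...  # (loss, logits)
    def compute_loss(self, model, inputs) -> torch.Tensor: ...
    def prediction_step(self, model, inputs, prediction_loss_only, ignore_keys=None) -> Tuple: ...
    def _pad_tensors_to_max_len(self, tensor, max_length) -> torch.Tensor: ...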

Import / CLI Usage

from seq2seq_trainer import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_args=data_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)
trainer.train()

I/O Contract

Inputs

Input                        Type                         Description
config                       PretrainedConfig (optional)  Model configuration; taken from the model if not provided
data_args                    dataclass (optional)         Data arguments, including ignore_pad_token_for_loss, val_max_target_length, and eval_beams
model                        PreTrainedModel              The seq2seq model to train
args                         TrainingArguments            HuggingFace training arguments; must expose label_smoothing, lr_scheduler, adafactor, sortish_sampler, and predict_with_generate
train_dataset, eval_dataset  Dataset                      Training and evaluation datasets
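
As an illustration of the data_args input, the sketch below defines a hypothetical dataclass carrying the three fields named in the table. Only the field names come from this page; the class name, the default values, and the mapping of val_max_target_length and eval_beams onto the max_length and num_beams generation arguments are assumptions.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class DataArgs:
    """Hypothetical container; only the field names come from the Inputs table."""

    ignore_pad_token_for_loss: bool = True
    val_max_target_length: int = 128  # illustrative default, not from the source
    eval_beams: int = 4               # illustrative default, not from the source


def generation_kwargs(data_args: DataArgs) -> Dict[str, int]:
    """Assumed mapping from data arguments to generation keyword arguments."""
    return {
        "max_length": data_args.val_max_target_length,
        "num_beams": data_args.eval_beams,
    }


data_args = DataArgs(val_max_target_length=64, eval_beams=2)
gen_kwargs = generation_kwargs(data_args)
```

An object of this shape could then be passed as data_args=data_args when constructing the trainer.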

Outputs

Output             Type                                          Description
Training loss      float                                         Cross-entropy or label-smoothed loss
Prediction output  Tuple[loss, logits/generated_tokens, labels]  Loss, generated sequences or logits, and padded labels during evaluation
Trained model      PreTrainedModel                               Fine-tuned seq2seq model saved via Trainer's save methods

Usage Examples

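from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Both modules are expected in the same legacy seq2seq directory as this trainer.
# Assumption: seq2seq_training_args.py defines Seq2SeqTrainingArguments with the
# label_smoothing, predict_with_generate, and lr_scheduler fields used below.
# The base transformers.TrainingArguments does not accept these keyword arguments.
from seq2seq_trainer import Seq2SeqTrainer
from seq2seq_training_args import Seq2SeqTrainingArguments

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

# train_dataset and eval_dataset must be prepared beforehand; they are not built here.

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    label_smoothing=0.1,         # non-zero value selects the label-smoothed NLL loss
    predict_with_generate=True,  # evaluate on generated sequences instead of logits
    lr_scheduler="cosine",       # key into arg_to_scheduler
    warmup_steps=500,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)

# Train
trainer.train()

# Evaluate with generation
results = trainer.evaluate()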
