Implementation:Microsoft LoRA Legacy Finetune Trainer Seq2Seq


Overview

A legacy seq2seq fine-tuning script that uses Seq2SeqTrainer for summarization and translation with BART, mBART, T5, mT5, Pegasus, and other encoder-decoder models.

Description

finetune_trainer.py is a legacy HuggingFace Transformers seq2seq example script included in the Microsoft LoRA NLU example directory. It uses a custom Seq2SeqTrainer (from a co-located seq2seq_trainer module) and Seq2SeqTrainingArguments to fine-tune encoder-decoder models for conditional text generation tasks such as summarization and translation.

The script uses HfArgumentParser to parse command-line arguments into three dataclasses: ModelArguments (model path, config, tokenizer, freeze options), DataTrainingArguments (data directory, task type, sequence lengths, beam search parameters, language IDs), and Seq2SeqTrainingArguments (which extends the standard TrainingArguments). It supports:

  • Model freezing: Optional freezing of the embedding layers via freeze_embeds() and of the encoder parameters via freeze_params()
  • mBART language handling: Automatic decoder_start_token_id configuration for mBART models, based on the target language
  • Task-specific parameters: Applies the task-specific parameters stored in the model config via use_task_specific_params()
  • Custom metrics: Builds a compute_metrics function via build_compute_metrics_fn(), using ROUGE for summarization or BLEU for translation
  • JSON config support: Reads all arguments from a JSON file when the only command-line argument is a path ending in .json
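
The JSON-config behaviour in the last bullet can be illustrated with a small standard-library sketch. It is not the script's code: the real script relies on HfArgumentParser, whereas DemoArguments and parse_demo_args below are hypothetical names, and only a handful of fields are modelled.

```python
import argparse
import json
from dataclasses import MISSING, dataclass, fields
from typing import List


@dataclass
class DemoArguments:
    # Hypothetical stand-in for the script's argument dataclasses.
    model_name_or_path: str
    data_dir: str
    task: str = "summarization"
    max_source_length: int = 1024


def parse_demo_args(argv: List[str]) -> DemoArguments:
    """Read arguments from a JSON file if argv is a single .json path,
    otherwise parse them as ordinary command-line flags."""
    if len(argv) == 1 and argv[0].endswith(".json"):
        with open(argv[0], encoding="utf-8") as fh:
            return DemoArguments(**json.load(fh))

    parser = argparse.ArgumentParser()
    for field in fields(DemoArguments):
        if field.default is MISSING:
            # Fields without a default become required flags.
            parser.add_argument(f"--{field.name}", type=field.type, required=True)
        else:
            parser.add_argument(f"--{field.name}", type=field.type, default=field.default)
    return DemoArguments(**vars(parser.parse_args(argv)))
```

Under these assumptions, parse_demo_args(["args.json"]) and parse_demo_args(["--model_name_or_path", "m", "--data_dir", "d"]) both return a populated DemoArguments; keys omitted from the JSON file fall back to the dataclass defaults.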

The pipeline supports train, evaluate, and predict phases, saving all metrics to JSON and optionally writing decoded test predictions to a text file.

This script is part of the HuggingFace Transformers library (legacy examples) bundled in the Microsoft LoRA repository.

⚠️ DEPRECATED: This file resides in the legacy/ directory and is not actively maintained. Prefer modern equivalents where available.

Usage

Use this script to fine-tune BART, mBART, T5, mT5, Pegasus, or other seq2seq models for summarization or translation tasks. It expects data in the standard seq2seq format with .source and .target files for each split (train, val, test).

Code Reference

Source Location

| Property | Value |
|---|---|
| File path | examples/NLU/examples/legacy/seq2seq/finetune_trainer.py |
| Lines | 367 |
| Module | finetune_trainer |

Key Classes and Functions

| Name | Type | Signature / Description |
|---|---|---|
| ModelArguments | dataclass | Fields: model_name_or_path, config_name, tokenizer_name, cache_dir, freeze_encoder, freeze_embeds |
| DataTrainingArguments | dataclass | Fields: data_dir, task, max_source_length, max_target_length, val_max_target_length, test_max_target_length, n_train, n_val, n_test, src_lang, tgt_lang, eval_beams, ignore_pad_token_for_loss |
| handle_metrics | function | handle_metrics(split, metrics, output_dir): logs metrics and saves them to JSON |
| main | function | Entry point: parses args, builds model/tokenizer/datasets/trainer, runs train/eval/predict |
| _mp_fn | function | TPU spawn entry point |

CLI Usage

python finetune_trainer.py \
  --model_name_or_path facebook/bart-large \
  --data_dir /path/to/summarization_data \
  --output_dir /path/to/output \
  --do_train \
  --do_eval \
  --task summarization \
  --max_source_length 1024 \
  --max_target_length 128 \
  --per_device_train_batch_size 4

I/O Contract

Inputs

| Input | Type | Description |
|---|---|---|
| --model_name_or_path | str (required) | Pretrained seq2seq model name or path |
| --data_dir | str (required) | Directory containing {split}.source and {split}.target files |
| --task | str (default "summarization") | Task name: summarization, summarization_{dataset}, or translation |
| --max_source_length | int (default 1024) | Maximum source sequence length |
| --max_target_length | int (default 128) | Maximum target sequence length for training |
| --val_max_target_length | int (default 142) | Maximum target length for validation (also used as max_length for model.generate) |
| --test_max_target_length | int (default 142) | Maximum target length for test prediction |
| --eval_beams | Optional[int] | Number of beams for evaluation generation (defaults to model.config.num_beams) |
| --freeze_encoder | flag | Freeze all encoder parameters |
| --freeze_embeds | flag | Freeze embedding layers |
| --src_lang | Optional[str] | Source language ID (required for mBART) |
| --tgt_lang | Optional[str] | Target language ID (required for mBART) |
| --n_train | int (default -1) | Number of training examples (-1 for all) |

Data Directory Structure

data_dir/
  train.source    # One source document per line
  train.target    # One target summary/translation per line
  val.source
  val.target
  test.source
  test.target
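
A minimal sketch of reading one split from this layout, assuming one example per line and line-aligned .source and .target files. load_split and its n_obs parameter are hypothetical names chosen for illustration (n_obs mirrors the "-1 for all" convention of --n_train); the script's actual dataset loading is not reproduced here.

```python
import os
from typing import List, Tuple


def _read_lines(path: str) -> List[str]:
    # One example per line; strip only the trailing newline.
    with open(path, encoding="utf-8") as fh:
        return [line.rstrip("\n") for line in fh]


def load_split(data_dir: str, split: str, n_obs: int = -1) -> List[Tuple[str, str]]:
    """Return aligned (source, target) pairs for one split (train, val, or test)."""
    sources = _read_lines(os.path.join(data_dir, f"{split}.source"))
    targets = _read_lines(os.path.join(data_dir, f"{split}.target"))
    if len(sources) != len(targets):
        raise ValueError(
            f"{split}: {len(sources)} source lines but {len(targets)} target lines"
        )
    pairs = list(zip(sources, targets))
    # A negative n_obs keeps every example; otherwise keep the first n_obs.
    return pairs if n_obs < 0 else pairs[:n_obs]
```

Checking the line counts up front is a cheap way to catch misaligned files before training starts, since a single missing line would otherwise silently pair every later source with the wrong target.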

Outputs

| Output | Type | Description |
|---|---|---|
| train_results.json | JSON | Training metrics |
| val_results.json | JSON | Validation metrics (loss, ROUGE or BLEU) |
| test_results.json | JSON | Test metrics |
| all_results.json | JSON | Combined metrics from all phases |
| test_generations.txt | text file | Decoded test predictions (when predict_with_generate is enabled) |
| trainer_state.json | JSON | Trainer state for resuming |
| Saved model | directory | Model, tokenizer, and config saved to output_dir |
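
The per-split and combined metric files can be pictured with the following sketch. It assumes handle_metrics does no more than log each key and write {split}_results.json, as its one-line description in the function table suggests. The function body, the returned path, and the save_all_metrics helper are illustrative assumptions, not the script's code.

```python
import json
import logging
import os
from typing import Dict

logger = logging.getLogger(__name__)


def handle_metrics(split: str, metrics: Dict[str, float], output_dir: str) -> str:
    """Log the metrics for one split and write them to {split}_results.json.

    Returning the path is an addition made here for convenience.
    """
    logger.info("***** %s metrics *****", split)
    for key in sorted(metrics):
        logger.info("  %s = %s", key, metrics[key])
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, f"{split}_results.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(metrics, fh, indent=2)
    return path


def save_all_metrics(per_split: Dict[str, Dict[str, float]], output_dir: str) -> str:
    """Hypothetical helper: merge every split's metrics into all_results.json."""
    combined: Dict[str, float] = {}
    for metrics in per_split.values():
        combined.update(metrics)
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, "all_results.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(combined, fh, indent=2)
    return path
```

Because the merge uses a flat dictionary, the sketch only keeps splits distinguishable when metric keys carry a split prefix (for example val_loss and test_loss); unprefixed keys from a later split would overwrite earlier ones.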

Usage Examples

Summarization with BART

python finetune_trainer.py \
  --model_name_or_path facebook/bart-large-cnn \
  --data_dir /data/cnn_dm/ \
  --output_dir /output/bart_summarization/ \
  --do_train \
  --do_eval \
  --do_predict \
  --task summarization \
  --max_source_length 1024 \
  --max_target_length 142 \
  --val_max_target_length 142 \
  --per_device_train_batch_size 4 \
  --per_device_eval_batch_size 4 \
  --predict_with_generate \
  --eval_beams 4 \
  --overwrite_output_dir

Translation with mBART

python finetune_trainer.py \
  --model_name_or_path facebook/mbart-large-cc25 \
  --data_dir /data/en_de_translation/ \
  --output_dir /output/mbart_translation/ \
  --do_train \
  --do_eval \
  --task translation \
  --src_lang en_XX \
  --tgt_lang de_DE \
  --max_source_length 512 \
  --max_target_length 128 \
  --per_device_train_batch_size 8 \
  --predict_with_generate \
  --overwrite_output_dir

Training with Frozen Embeddings

python finetune_trainer.py \
  --model_name_or_path facebook/bart-large \
  --data_dir /data/summarization/ \
  --output_dir /output/bart_frozen/ \
  --do_train \
  --do_eval \
  --task summarization \
  --freeze_embeds \
  --max_source_length 1024 \
  --max_target_length 128 \
  --overwrite_output_dir
