Implementation:Microsoft LoRA Run Translation
Overview
run_translation.py is a sequence-to-sequence fine-tuning script for machine translation using AutoModelForSeq2SeqLM, Seq2SeqTrainer, and SacreBLEU metrics.
Description
This script fine-tunes encoder-decoder models (BART, mBART, mT5, T5, MarianMT) on translation datasets. It shares the same architectural pattern as run_summarization.py but is specialized for bilingual translation with language-specific handling.
Key implementation details:
- Language pair configuration: Requires --source_lang and --target_lang arguments. DataTrainingArguments.__post_init__ validates that both are provided.
- mBART special handling: Detects MBartTokenizer or MBartTokenizerFast instances and sets tokenizer.src_lang and tokenizer.tgt_lang. For mBART, also sets model.config.decoder_start_token_id from the target language code.
- JSON-only data format for custom files: Custom training/validation/test files must be JSON with a "translation" field containing source and target language keys (e.g., {"translation": {"en": "Hello", "de": "Hallo"}}).
- Data preprocessing: Extracts source and target texts from the translation field using the language code keys (split on "_" for mBART language codes like en_XX). Applies the optional source_prefix.
- Target tokenization: Uses the tokenizer.as_target_tokenizer() context manager to ensure proper target language tokenization.
- Metrics: Computes the SacreBLEU score using the sacrebleu metric from the datasets library. Labels are wrapped in a list (reference format for BLEU). Reports the BLEU score plus average generation length.
- Three-phase pipeline: Supports do_train, do_eval, and do_predict, with decoded translations saved to test_generations.txt.
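The data preprocessing step described in the list can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the script's actual code: the function name extract_pairs and the batch layout (a dict with a "translation" column holding one dict per example) are hypothetical, and tokenization is left out.

```python
from typing import Dict, List, Optional, Tuple


def extract_pairs(
    examples: Dict[str, List[Dict[str, str]]],
    source_lang: str,
    target_lang: str,
    source_prefix: Optional[str] = None,
) -> Tuple[List[str], List[str]]:
    """Return (inputs, targets) from a batch that has a "translation" column.

    Hypothetical helper: it mirrors the behaviour described above, not the
    script's own function.
    """
    # mBART-style codes such as "en_XX" are reduced to the dataset key "en".
    src_key = source_lang.split("_")[0]
    tgt_key = target_lang.split("_")[0]
    prefix = source_prefix or ""

    # The prefix is applied to source texts only, never to targets.
    inputs = [prefix + pair[src_key] for pair in examples["translation"]]
    targets = [pair[tgt_key] for pair in examples["translation"]]
    return inputs, targets


batch = {"translation": [{"en": "Hello", "de": "Hallo"}]}
inputs, targets = extract_pairs(
    batch, "en_XX", "de_DE", source_prefix="translate English to German: "
)
print(inputs)   # ['translate English to German: Hello']
print(targets)  # ['Hallo']
```

Splitting on "_" keeps one lookup path for both plain codes (en) and mBART codes (en_XX), since "en".split("_")[0] is still "en".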
T5 model detection warns if --source_prefix is not provided (e.g., "translate English to German: ").
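A check of this kind might look like the sketch below. The detection rule is an assumption: the set of checkpoint names in T5_CHECKPOINTS and the helper missing_t5_prefix are hypothetical, chosen only to illustrate warning when a T5 model is used without a prefix.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)

# Assumed list of T5 checkpoint names; the script's actual rule is not shown here.
T5_CHECKPOINTS = {"t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"}


def missing_t5_prefix(model_name_or_path: str, source_prefix: Optional[str]) -> bool:
    """Log a warning and return True when a T5 checkpoint has no source prefix."""
    if source_prefix is None and model_name_or_path in T5_CHECKPOINTS:
        logger.warning(
            "Running a T5 model without --source_prefix; T5 usually expects one, "
            "e.g. 'translate English to German: '."
        )
        return True
    return False


warned = missing_t5_prefix("t5-base", None)
not_warned = missing_t5_prefix("t5-base", "translate English to German: ")
```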
Usage
Use this script when you need to:
- Fine-tune seq2seq models on machine translation tasks
- Work with mBART, mT5, MarianMT, or T5 for translation
- Evaluate with SacreBLEU metrics and beam search generation
Code Reference
Source Location
| Property | Value |
|---|---|
| File | examples/NLU/examples/seq2seq/run_translation.py |
| Lines | 562 |
| Module | run_translation |
| Entry Point | main() |
Signature/CLI
python run_translation.py \
--model_name_or_path MODEL_NAME \
--source_lang SOURCE_LANG \
--target_lang TARGET_LANG \
--dataset_name DATASET_NAME \
--output_dir OUTPUT_DIR \
--do_train \
--do_eval \
[--do_predict] \
[--dataset_config_name CONFIG] \
[--train_file TRAIN_FILE] \
[--validation_file VALIDATION_FILE] \
[--test_file TEST_FILE] \
[--source_prefix PREFIX] \
[--max_source_length 1024] \
[--max_target_length 128] \
[--num_beams 4] \
[--predict_with_generate]
Import
from transformers import (
AutoConfig,
AutoModelForSeq2SeqLM,
AutoTokenizer,
DataCollatorForSeq2Seq,
HfArgumentParser,
MBartTokenizer,
MBartTokenizerFast,
Seq2SeqTrainer,
Seq2SeqTrainingArguments,
default_data_collator,
set_seed,
)
from datasets import load_dataset, load_metric
I/O Contract
Inputs
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --model_name_or_path | str | Yes | - | Pretrained model (e.g., Helsinki-NLP/opus-mt-en-de) |
| --source_lang | str | Yes | - | Source language code (e.g., en or en_XX for mBART) |
| --target_lang | str | Yes | - | Target language code (e.g., de or de_DE for mBART) |
| --output_dir | str | Yes | - | Directory for checkpoints and results |
| --dataset_name | str | No | None | HuggingFace dataset name (e.g., wmt16) |
| --train_file | str | No | None | Custom JSON training file with translation field |
| --source_prefix | str | No | None | Prefix for source text (e.g., "translate English to German: ") |
| --max_source_length | int | No | 1024 | Max source tokenized length |
| --max_target_length | int | No | 128 | Max target tokenized length |
| --num_beams | int | No | None | Beam search width for generation |
| --ignore_pad_token_for_loss | bool | No | True | Replace pad tokens with -100 in labels |
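The effect of --ignore_pad_token_for_loss can be illustrated on plain lists of token ids. This is a sketch, not the script's code: mask_pad_labels is a hypothetical helper and the pad id 0 is an assumed value. The value -100 matters because it is the default ignore_index of PyTorch's cross-entropy loss, so masked positions contribute nothing to the loss.

```python
from typing import List

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_pad_labels(label_ids: List[List[int]], pad_token_id: int) -> List[List[int]]:
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]


# Two padded target sequences; pad_token_id=0 is assumed for illustration.
padded = [[12, 45, 2, 0, 0], [7, 2, 0, 0, 0]]
masked = mask_pad_labels(padded, pad_token_id=0)
print(masked)  # [[12, 45, 2, -100, -100], [7, 2, -100, -100, -100]]
```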
Outputs
| Output | Location | Description |
|---|---|---|
| Trained model | {output_dir}/ | Saved model, config, and tokenizer |
| Training metrics | {output_dir}/train_results.json | Training loss and throughput |
| Evaluation metrics | {output_dir}/eval_results.json | SacreBLEU score and gen_len |
| Test metrics | {output_dir}/test_results.json | BLEU score on test set |
| Test generations | {output_dir}/test_generations.txt | Decoded translations, one per line |
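The two reported evaluation values depend on a small amount of post-processing, sketched below without calling the metric itself. The helper names postprocess_text and average_gen_len are hypothetical, and counting generation length as the number of non-padding tokens is an assumption about how gen_len is measured.

```python
from statistics import mean
from typing import List, Tuple


def postprocess_text(
    preds: List[str], labels: List[str]
) -> Tuple[List[str], List[List[str]]]:
    """Strip whitespace and wrap each label in a list of references."""
    preds = [p.strip() for p in preds]
    # BLEU accepts several references per prediction, so even a single
    # reference is passed as a one-element list.
    references = [[l.strip()] for l in labels]
    return preds, references


def average_gen_len(pred_ids: List[List[int]], pad_token_id: int) -> float:
    """Mean number of non-padding tokens per generated sequence (assumed definition)."""
    return mean(sum(tok != pad_token_id for tok in seq) for seq in pred_ids)


preds, refs = postprocess_text([" Hallo Welt "], ["Hallo Welt\n"])
gen_len = average_gen_len([[5, 6, 7, 0], [8, 9, 0, 0]], pad_token_id=0)
print(preds, refs)  # ['Hallo Welt'] [['Hallo Welt']]
print(gen_len)      # 2.5
```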
Usage Examples
Fine-tune MarianMT for English-to-German
python examples/NLU/examples/seq2seq/run_translation.py \
--model_name_or_path Helsinki-NLP/opus-mt-en-de \
--source_lang en \
--target_lang de \
--dataset_name wmt16 \
--dataset_config_name de-en \
--do_train \
--do_eval \
--predict_with_generate \
--per_device_train_batch_size 8 \
--num_beams 4 \
--output_dir /tmp/en_de_translation
Fine-tune mBART for multilingual translation
python examples/NLU/examples/seq2seq/run_translation.py \
--model_name_or_path facebook/mbart-large-cc25 \
--source_lang en_XX \
--target_lang de_DE \
--dataset_name wmt16 \
--dataset_config_name de-en \
--do_train \
--do_eval \
--do_predict \
--predict_with_generate \
--output_dir /tmp/mbart_translation
Fine-tune T5 with source prefix
python examples/NLU/examples/seq2seq/run_translation.py \
--model_name_or_path t5-base \
--source_lang en \
--target_lang de \
--source_prefix "translate English to German: " \
--train_file /path/to/train.json \
--validation_file /path/to/val.json \
--do_train \
--do_eval \
--predict_with_generate \
--output_dir /tmp/t5_translation
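A custom file for --train_file or --validation_file can be produced with the standard library, as in the sketch below. It assumes one JSON object per line (JSON Lines), which is not stated on this page and should be checked against the script. The helpers write_translation_file and validate_translation_file are hypothetical, and the file is written to a temporary directory rather than /path/to/train.json.

```python
import json
import os
import tempfile
from typing import Iterable, Tuple


def write_translation_file(
    path: str, pairs: Iterable[Tuple[str, str]], src: str = "en", tgt: str = "de"
) -> None:
    """Write one {"translation": {src: ..., tgt: ...}} record per line."""
    with open(path, "w", encoding="utf-8") as f:
        for source_text, target_text in pairs:
            record = {"translation": {src: source_text, tgt: target_text}}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def validate_translation_file(path: str, src: str = "en", tgt: str = "de") -> int:
    """Check that every record has both language keys; return the record count."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)
            missing = {src, tgt} - set(record.get("translation", {}))
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            count += 1
    return count


train_path = os.path.join(tempfile.mkdtemp(), "train.json")
write_translation_file(train_path, [("Hello", "Hallo"), ("Thank you", "Danke")])
num_records = validate_translation_file(train_path)
print(num_records)  # 2
```

The keys inside "translation" should match --source_lang and --target_lang (or their prefix before "_" for mBART codes).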
Related Pages
- Environment:Microsoft_LoRA_NLU_Conda_Environment
- Implementation:Microsoft_LoRA_Run_Summarization - Similar seq2seq fine-tuning for summarization