Implementation:Microsoft LoRA Run Translation

Overview

run_translation.py is a sequence-to-sequence fine-tuning script for machine translation using AutoModelForSeq2SeqLM, Seq2SeqTrainer, and SacreBLEU metrics.

Description

This script fine-tunes encoder-decoder models (BART, mBART, mT5, T5, MarianMT) on translation datasets. It follows the same structure as run_summarization.py but is specialized for bilingual translation, with language-specific handling.

Key implementation details:

Language pair configuration: Requires --source_lang and --target_lang arguments. The DataTrainingArguments.__post_init__ validates that both are provided.
mBART special handling: Detects MBartTokenizer or MBartTokenizerFast instances and sets tokenizer.src_lang and tokenizer.tgt_lang. For mBART, also sets model.config.decoder_start_token_id from the target language code.
JSON-only data format for custom files: Custom training/validation/test files must be JSON with a "translation" field containing source and target language keys (e.g., {"translation": {"en": "Hello", "de": "Hallo"}}).
Data preprocessing: Extracts source and target texts from the translation field, keyed by the part of each language code before the first "_" (so the mBART code en_XX reads the "en" key). Prepends source_prefix to each source text when one is given.
Target tokenization: Uses tokenizer.as_target_tokenizer() context manager to ensure proper target language tokenization.
Metrics: Computes the SacreBLEU score using the sacrebleu metric from the datasets library. Each reference label is wrapped in its own list, which is the reference format SacreBLEU expects. Reports the BLEU score plus the average generation length.
Three-phase pipeline: Supports do_train, do_eval, and do_predict with decoded translations saved to test_generations.txt.
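
The data preprocessing step in the list above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the script's own code: the function name extract_pairs is hypothetical, and tokenization is left out so that only the key lookup and prefix handling are shown.

```python
from typing import Dict, List, Optional, Tuple


def extract_pairs(
    examples: Dict[str, List[Dict[str, str]]],
    source_lang: str,
    target_lang: str,
    source_prefix: Optional[str] = None,
) -> Tuple[List[str], List[str]]:
    """Pull source and target texts out of a batch with a "translation" field.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # mBART-style codes such as "en_XX" map to the dataset key "en".
    src_key = source_lang.split("_")[0]
    tgt_key = target_lang.split("_")[0]
    # An absent prefix behaves like an empty string.
    prefix = source_prefix or ""
    inputs = [prefix + pair[src_key] for pair in examples["translation"]]
    targets = [pair[tgt_key] for pair in examples["translation"]]
    return inputs, targets
```

Called with source_lang "en_XX", target_lang "de_DE", and a batch holding {"en": "Hello", "de": "Hallo"}, the helper returns the prefixed source texts and the raw target texts as two parallel lists.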

When the model is a T5 checkpoint and --source_prefix is not provided, the script logs a warning, because T5 expects a task prefix such as "translate English to German: ".

Usage

Use this script when you need to:

Fine-tune seq2seq models on machine translation tasks
Work with mBART, mT5, MarianMT, or T5 for translation
Evaluate with SacreBLEU metrics and beam search generation

Code Reference

Source Location

Property	Value
File	`examples/NLU/examples/seq2seq/run_translation.py`
Lines	562
Module	`run_translation`
Entry Point	`main()`

Signature/CLI

python run_translation.py \
    --model_name_or_path MODEL_NAME \
    --source_lang SOURCE_LANG \
    --target_lang TARGET_LANG \
    --dataset_name DATASET_NAME \
    --output_dir OUTPUT_DIR \
    --do_train \
    --do_eval \
    [--do_predict] \
    [--dataset_config_name CONFIG] \
    [--train_file TRAIN_FILE] \
    [--validation_file VALIDATION_FILE] \
    [--test_file TEST_FILE] \
    [--source_prefix PREFIX] \
    [--max_source_length 1024] \
    [--max_target_length 128] \
    [--num_beams NUM_BEAMS] \
    [--predict_with_generate]
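
The language-pair validation performed in DataTrainingArguments.__post_init__ can be sketched as a small dataclass. This is an illustration only: the class name TranslationDataArgs is hypothetical, only a few fields are shown, and the error messages and the extension-based JSON check are assumptions rather than the script's exact behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TranslationDataArgs:
    """Hypothetical stand-in for a subset of DataTrainingArguments."""

    source_lang: Optional[str] = None
    target_lang: Optional[str] = None
    dataset_name: Optional[str] = None
    train_file: Optional[str] = None

    def __post_init__(self) -> None:
        # Both language codes must be present before any data is loaded.
        if self.source_lang is None or self.target_lang is None:
            raise ValueError("Both --source_lang and --target_lang are required.")
        # Custom files must be JSON (assumed here to be checked by extension).
        if self.train_file is not None and not self.train_file.endswith(".json"):
            raise ValueError("--train_file must be a JSON file.")
```

Validating in __post_init__ means a missing language code fails at argument-parsing time rather than partway through dataset preprocessing.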

Import

from transformers import (
    AutoConfig,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    HfArgumentParser,
    MBartTokenizer,
    MBartTokenizerFast,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    default_data_collator,
    set_seed,
)
from datasets import load_dataset, load_metric

I/O Contract

Inputs

Parameter	Type	Required	Default	Description
`--model_name_or_path`	str	Yes	-	Pretrained model (e.g., `Helsinki-NLP/opus-mt-en-de`)
`--source_lang`	str	Yes	-	Source language code (e.g., `en` or `en_XX` for mBART)
`--target_lang`	str	Yes	-	Target language code (e.g., `de` or `de_DE` for mBART)
`--output_dir`	str	Yes	-	Directory for checkpoints and results
`--dataset_name`	str	No	None	HuggingFace dataset name (e.g., `wmt16`)
`--train_file`	str	No	None	Custom JSON training file with `translation` field
`--source_prefix`	str	No	None	Prefix for source text (e.g., `"translate English to German: "`)
`--max_source_length`	int	No	1024	Max source tokenized length
`--max_target_length`	int	No	128	Max target tokenized length
`--num_beams`	int	No	None	Beam search width for generation
`--ignore_pad_token_for_loss`	bool	No	True	Replace pad tokens with -100 in labels so that padded positions are ignored by the loss
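
The effect of --ignore_pad_token_for_loss can be illustrated with plain lists of token ids. The value -100 is the default ignore_index of PyTorch's cross-entropy loss. The function names below are hypothetical, and the sketch assumes the pad id never appears as a real target token.

```python
from typing import List

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_pad_tokens(label_ids: List[List[int]], pad_token_id: int) -> List[List[int]]:
    """Replace pad ids with -100 so padded positions add nothing to the loss."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in row]
        for row in label_ids
    ]


def unmask_pad_tokens(label_ids: List[List[int]], pad_token_id: int) -> List[List[int]]:
    """Inverse step: restore the pad id so the labels can be decoded to text."""
    return [
        [pad_token_id if token == IGNORE_INDEX else token for token in row]
        for row in label_ids
    ]
```

The inverse step matters at evaluation time, since a tokenizer cannot decode -100 back into text.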

Outputs

Output	Location	Description
Trained model	`{output_dir}/`	Saved model, config, and tokenizer
Training metrics	`{output_dir}/train_results.json`	Training loss and throughput
Evaluation metrics	`{output_dir}/eval_results.json`	SacreBLEU score and gen_len
Test metrics	`{output_dir}/test_results.json`	BLEU score on test set
Test generations	`{output_dir}/test_generations.txt`	Decoded translations, one per line
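
The text post-processing behind the reported BLEU score and gen_len can be sketched without any external dependency. This is a hedged illustration: the helper names are hypothetical, and the BLEU computation itself is left to the sacrebleu metric and is not reproduced here.

```python
from typing import List, Tuple


def postprocess_text(preds: List[str], labels: List[str]) -> Tuple[List[str], List[List[str]]]:
    """Strip whitespace and wrap each reference in its own list.

    SacreBLEU accepts several references per prediction, so a single
    reference is passed as a one-element list.
    """
    clean_preds = [pred.strip() for pred in preds]
    clean_labels = [[label.strip()] for label in labels]
    return clean_preds, clean_labels


def mean_generation_length(pred_ids: List[List[int]], pad_token_id: int) -> float:
    """Average count of non-pad tokens per generated sequence (gen_len)."""
    lengths = [sum(1 for token in row if token != pad_token_id) for row in pred_ids]
    return sum(lengths) / len(lengths)
```

The cleaned predictions and wrapped references are the inputs a SacreBLEU-style metric consumes; the generation length is computed on token ids rather than on decoded text.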

Usage Examples

Fine-tune MarianMT for English-to-German

python examples/NLU/examples/seq2seq/run_translation.py \
    --model_name_or_path Helsinki-NLP/opus-mt-en-de \
    --source_lang en \
    --target_lang de \
    --dataset_name wmt16 \
    --dataset_config_name de-en \
    --do_train \
    --do_eval \
    --predict_with_generate \
    --per_device_train_batch_size 8 \
    --num_beams 4 \
    --output_dir /tmp/en_de_translation

Fine-tune mBART for multilingual translation

python examples/NLU/examples/seq2seq/run_translation.py \
    --model_name_or_path facebook/mbart-large-cc25 \
    --source_lang en_XX \
    --target_lang de_DE \
    --dataset_name wmt16 \
    --dataset_config_name de-en \
    --do_train \
    --do_eval \
    --do_predict \
    --predict_with_generate \
    --output_dir /tmp/mbart_translation

Fine-tune T5 with source prefix

python examples/NLU/examples/seq2seq/run_translation.py \
    --model_name_or_path t5-base \
    --source_lang en \
    --target_lang de \
    --source_prefix "translate English to German: " \
    --train_file /path/to/train.json \
    --validation_file /path/to/val.json \
    --do_train \
    --do_eval \
    --predict_with_generate \
    --output_dir /tmp/t5_translation
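
A custom file for --train_file or --validation_file can be produced from parallel sentence pairs as follows. This is a sketch under assumptions: the helper name to_json_lines is hypothetical, and it assumes one JSON object per line, each carrying the "translation" field described under Key implementation details.

```python
import json
from typing import Iterable, List, Tuple


def to_json_lines(
    pairs: Iterable[Tuple[str, str]],
    source_lang: str = "en",
    target_lang: str = "de",
) -> List[str]:
    """Serialize (source, target) pairs as one JSON object per line."""
    lines = []
    for source_text, target_text in pairs:
        record = {"translation": {source_lang: source_text, target_lang: target_text}}
        # ensure_ascii=False keeps characters such as umlauts readable on disk.
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


# Usage (illustrative path): write the lines to a UTF-8 file.
# with open("/path/to/train.json", "w", encoding="utf-8") as handle:
#     handle.write("\n".join(to_json_lines([("Hello", "Hallo")])) + "\n")
```

The keys inside "translation" should match the language codes passed as --source_lang and --target_lang, without any mBART region suffix.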

Related Pages

Environment:Microsoft_LoRA_NLU_Conda_Environment
Implementation:Microsoft_LoRA_Run_Summarization - Similar seq2seq fine-tuning for summarization
