Implementation:Microsoft LoRA Run Translation

From Leeroopedia
Revision as of 15:44, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Microsoft_LoRA_Run_Translation.md)


Overview

run_translation.py is a sequence-to-sequence fine-tuning script for machine translation using AutoModelForSeq2SeqLM, Seq2SeqTrainer, and SacreBLEU metrics.

Description

This script fine-tunes encoder-decoder models (BART, mBART, mT5, T5, MarianMT) on translation datasets. It shares the same architectural pattern as run_summarization.py but is specialized for bilingual translation with language-specific handling.

Key implementation details:

  • Language pair configuration: Requires --source_lang and --target_lang arguments. The DataTrainingArguments.__post_init__ validates that both are provided.
  • mBART special handling: Detects MBartTokenizer or MBartTokenizerFast instances and sets tokenizer.src_lang and tokenizer.tgt_lang. For mBART, also sets model.config.decoder_start_token_id from the target language code.
  • JSON-only data format for custom files: Custom training/validation/test files must be JSON with a "translation" field containing source and target language keys (e.g., {"translation": {"en": "Hello", "de": "Hallo"}}).
  • Data preprocessing: Reads the source and target texts from each translation record, keyed by the language code with anything after "_" dropped (so the mBART code en_XX looks up the key en). Prepends the optional source_prefix to each source text.
  • Target tokenization: Uses tokenizer.as_target_tokenizer() context manager to ensure proper target language tokenization.
  • Metrics: Computes the SacreBLEU score with the sacrebleu metric loaded through the datasets library. Each label is wrapped in its own list because SacreBLEU accepts multiple references per prediction. Reports the BLEU score plus the average generation length (gen_len).
  • Three-phase pipeline: Supports do_train, do_eval, and do_predict with decoded translations saved to test_generations.txt.
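
The data preprocessing step described above can be sketched in plain Python. This is a minimal illustration, not the script's own code: the helper names base_lang and extract_pairs are hypothetical, and tokenization is left out so the example runs without any third-party library.

```python
from typing import Dict, List, Optional, Tuple


def base_lang(code: str) -> str:
    """Return the part of a language code before "_" (en_XX -> en)."""
    return code.split("_")[0]


def extract_pairs(
    examples: Dict[str, List[Dict[str, str]]],
    source_lang: str,
    target_lang: str,
    source_prefix: Optional[str] = None,
) -> Tuple[List[str], List[str]]:
    """Pull source and target strings out of a batch of translation records.

    Hypothetical helper: it assumes the batch layout
    {"translation": [{"en": ..., "de": ...}, ...]} described in this page.
    """
    src_key, tgt_key = base_lang(source_lang), base_lang(target_lang)
    prefix = source_prefix or ""
    inputs = [prefix + record[src_key] for record in examples["translation"]]
    targets = [record[tgt_key] for record in examples["translation"]]
    return inputs, targets


batch = {"translation": [{"en": "Hello", "de": "Hallo"}]}
inputs, targets = extract_pairs(
    batch, "en_XX", "de_DE", source_prefix="translate English to German: "
)
print(inputs)
print(targets)
```

Passing mBART-style codes (en_XX, de_DE) and plain codes (en, de) gives the same lookup keys, which is why the same JSON records can serve both model families.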

When a T5 model is detected and --source_prefix is not provided, the script logs a warning, because T5 checkpoints expect a task prefix such as "translate English to German: ".
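
A check of this kind might look like the sketch below. The function name missing_t5_prefix and the set of checkpoint names are assumptions made for illustration; the script's own detection logic and list of names may differ.

```python
from typing import Optional

# Assumption: a set of well-known T5 checkpoint names used only for this sketch.
T5_CHECKPOINTS = {"t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"}


def missing_t5_prefix(model_name_or_path: str, source_prefix: Optional[str]) -> bool:
    """Return True when a known T5 checkpoint is used without a source prefix."""
    return source_prefix is None and model_name_or_path in T5_CHECKPOINTS


if missing_t5_prefix("t5-base", None):
    # A real script would route this through its logger instead of print.
    print('Warning: T5 checkpoint used without --source_prefix, '
          'e.g. "translate English to German: "')
```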

Usage

Use this script when you need to:

  • Fine-tune seq2seq models on machine translation tasks
  • Work with mBART, mT5, MarianMT, or T5 for translation
  • Evaluate with SacreBLEU metrics and beam search generation

Code Reference

Source Location

Property | Value
File | examples/NLU/examples/seq2seq/run_translation.py
Lines | 562
Module | run_translation
Entry Point | main()

Signature/CLI

python run_translation.py \
    --model_name_or_path MODEL_NAME \
    --source_lang SOURCE_LANG \
    --target_lang TARGET_LANG \
    --dataset_name DATASET_NAME \
    --output_dir OUTPUT_DIR \
    --do_train \
    --do_eval \
    [--do_predict] \
    [--dataset_config_name CONFIG] \
    [--train_file TRAIN_FILE] \
    [--validation_file VALIDATION_FILE] \
    [--test_file TEST_FILE] \
    [--source_prefix PREFIX] \
    [--max_source_length 1024] \
    [--max_target_length 128] \
    [--num_beams 4] \
    [--predict_with_generate]

Import

from transformers import (
    AutoConfig,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    HfArgumentParser,
    MBartTokenizer,
    MBartTokenizerFast,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    default_data_collator,
    set_seed,
)
from datasets import load_dataset, load_metric

I/O Contract

Inputs

Parameter | Type | Required | Default | Description
--model_name_or_path | str | Yes | - | Pretrained model (e.g., Helsinki-NLP/opus-mt-en-de)
--source_lang | str | Yes | - | Source language code (e.g., en or en_XX for mBART)
--target_lang | str | Yes | - | Target language code (e.g., de or de_DE for mBART)
--output_dir | str | Yes | - | Directory for checkpoints and results
--dataset_name | str | No | None | HuggingFace dataset name (e.g., wmt16)
--train_file | str | No | None | Custom JSON training file with translation field
--source_prefix | str | No | None | Prefix for source text (e.g., "translate English to German: ")
--max_source_length | int | No | 1024 | Max source tokenized length
--max_target_length | int | No | 128 | Max target tokenized length
--num_beams | int | No | None | Beam search width for generation
--ignore_pad_token_for_loss | bool | No | True | Replace pad tokens with -100 in labels
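
The effect of --ignore_pad_token_for_loss can be illustrated with a small pure-Python sketch. The helper mask_pad_tokens and the pad id 0 are hypothetical; the value -100 is the default index that PyTorch's cross-entropy loss skips, so padded label positions do not contribute to the loss.

```python
from typing import List

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_pad_tokens(label_ids: List[List[int]], pad_token_id: int) -> List[List[int]]:
    """Replace every pad token id in the labels with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in sequence]
        for sequence in label_ids
    ]


# Assumption for the demo: the tokenizer's pad token id is 0.
labels = [[17, 42, 1, 0, 0], [8, 1, 0, 0, 0]]
masked = mask_pad_tokens(labels, pad_token_id=0)
print(masked)
```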

Outputs

Output | Location | Description
Trained model | {output_dir}/ | Saved model, config, and tokenizer
Training metrics | {output_dir}/train_results.json | Training loss and throughput
Evaluation metrics | {output_dir}/eval_results.json | SacreBLEU score and gen_len
Test metrics | {output_dir}/test_results.json | BLEU score on test set
Test generations | {output_dir}/test_generations.txt | Decoded translations, one per line
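
The post-processing behind the reported BLEU score and gen_len can be sketched as follows. This is an illustration under assumptions rather than the script's code: the function names are hypothetical, the pad id 0 is made up for the demo, and the SacreBLEU call itself is omitted so the example needs no external package.

```python
from typing import List, Tuple


def postprocess_text(preds: List[str], labels: List[str]) -> Tuple[List[str], List[List[str]]]:
    """Strip whitespace and wrap each label in a list.

    SacreBLEU accepts several references per prediction, so a single
    reference is passed as a one-element list.
    """
    clean_preds = [pred.strip() for pred in preds]
    wrapped_labels = [[label.strip()] for label in labels]
    return clean_preds, wrapped_labels


def mean_generation_length(pred_ids: List[List[int]], pad_token_id: int) -> float:
    """Average number of non-pad tokens per generated sequence (gen_len)."""
    lengths = [sum(1 for token in seq if token != pad_token_id) for seq in pred_ids]
    return sum(lengths) / len(lengths)


preds, refs = postprocess_text([" Hallo Welt \n"], ["Hallo Welt "])
gen_len = mean_generation_length([[5, 6, 7, 0], [8, 9, 0, 0]], pad_token_id=0)
print(preds, refs, gen_len)
```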

Usage Examples

Fine-tune MarianMT for English-to-German

python examples/NLU/examples/seq2seq/run_translation.py \
    --model_name_or_path Helsinki-NLP/opus-mt-en-de \
    --source_lang en \
    --target_lang de \
    --dataset_name wmt16 \
    --dataset_config_name de-en \
    --do_train \
    --do_eval \
    --predict_with_generate \
    --per_device_train_batch_size 8 \
    --num_beams 4 \
    --output_dir /tmp/en_de_translation

Fine-tune mBART for multilingual translation

python examples/NLU/examples/seq2seq/run_translation.py \
    --model_name_or_path facebook/mbart-large-cc25 \
    --source_lang en_XX \
    --target_lang de_DE \
    --dataset_name wmt16 \
    --dataset_config_name de-en \
    --do_train \
    --do_eval \
    --do_predict \
    --predict_with_generate \
    --output_dir /tmp/mbart_translation

Fine-tune T5 with source prefix

python examples/NLU/examples/seq2seq/run_translation.py \
    --model_name_or_path t5-base \
    --source_lang en \
    --target_lang de \
    --source_prefix "translate English to German: " \
    --train_file /path/to/train.json \
    --validation_file /path/to/val.json \
    --do_train \
    --do_eval \
    --predict_with_generate \
    --output_dir /tmp/t5_translation
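
The files passed to --train_file and --validation_file need records with a "translation" field. The sketch below writes and reads back such a file using only the standard library. It assumes a one-record-per-line layout; the helper names and the temporary path are hypothetical.

```python
import json
import os
import tempfile
from typing import Dict, Iterable, List, Tuple


def write_translation_file(
    path: str, pairs: Iterable[Tuple[str, str]], src: str = "en", tgt: str = "de"
) -> None:
    """Write one {"translation": {src: ..., tgt: ...}} record per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for source_text, target_text in pairs:
            record = {"translation": {src: source_text, tgt: target_text}}
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_translation_file(path: str) -> List[Dict[str, Dict[str, str]]]:
    """Read the records back, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Temporary location used only for this demo.
demo_path = os.path.join(tempfile.mkdtemp(), "train.json")
write_translation_file(demo_path, [("Hello", "Hallo"), ("Thank you", "Danke")])
records = read_translation_file(demo_path)
print(records[0])
```

The keys inside "translation" should match the values given to --source_lang and --target_lang (without any "_XX" style suffix).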
