
Implementation:Microsoft LoRA Run Distributed Eval

From Leeroopedia



Overview

The run_distributed_eval.py script performs distributed multi-GPU evaluation of seq2seq models, computing BLEU or ROUGE metrics across partitioned data using torch.distributed.

Description

This script scales evaluation of seq2seq models (summarization, translation) by distributing the workload across multiple GPUs with torch.distributed and the NCCL backend. The workflow is:

  1. Distributed Initialization: Each process initializes a torch.distributed process group with the NCCL backend, using its local_rank to select its GPU.
  2. Data Partitioning: The Seq2SeqDataset is loaded with a sortish_sampler configured for distributed mode, automatically partitioning data across ranks.
  3. Generation: Each rank runs model.generate() on its partition with configurable beam search parameters (num_beams, num_return_sequences) and saves results as JSON to rank_{rank}_output.json.
  4. Aggregation (Rank 0): The master process waits for all rank JSON files (with a configurable timeout, default 600s), combines them by sorting on example IDs, and computes metrics.
  5. Metrics: For translation tasks, BLEU is computed; for summarization, ROUGE is used. The metrics dictionary also records the number of evaluated examples (n_obs), the average time per example (seconds_per_sample), and the GPU count (n_gpus).
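As an illustration of step 2, the sketch below shows one way a dataset can be split across ranks with a strided slice. It is a simplified stand-in, not the script's sortish_sampler: the function name partition_indices and the pad option are hypothetical, and the length-based ordering a sortish sampler applies is left out.

```python
import math
from typing import List


def partition_indices(n_examples: int, num_replicas: int, rank: int, pad: bool = False) -> List[int]:
    """Return the example indices one rank would process (strided split).

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_examples))
    if pad:
        # Repeat leading examples so that every rank receives the same count.
        total = math.ceil(n_examples / num_replicas) * num_replicas
        indices += indices[: total - n_examples]
    # Rank r takes items r, r + num_replicas, r + 2 * num_replicas, ...
    return indices[rank::num_replicas]


# Ten examples over four ranks: ranks 0 and 1 get three items, ranks 2 and 3 get two.
for rank in range(4):
    print(rank, partition_indices(10, num_replicas=4, rank=rank))
```

Without padding, every example is assigned to exactly one rank, which is what an evaluation run needs so that no prediction is counted twice when rank 0 merges results by example ID.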

Key functions:

  • eval_data_dir(...): Core distributed evaluation function. Initializes the process group, loads model and tokenizer, creates the distributed DataLoader, generates predictions, and saves per-rank results.
  • run_generate(): CLI entry point that parses arguments (including extra model.generate() kwargs), orchestrates evaluation, and handles result aggregation on rank 0.
  • combine_partial_results(partial_results): Sorts and flattens per-rank results by example ID.
  • gather_results_from_each_node(num_replicas, save_dir, timeout): Polls for rank JSON files with timeout handling.
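The aggregation helpers can be approximated with the standard library alone. The sketch below is written under assumptions, not copied from the script: the record keys "id" and "pred", the poll_interval parameter, and the function name gather_results are illustrative, and the real functions may differ in details such as return types and error handling.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Dict, List


def combine_partial_results(partial_results: List[List[Dict]]) -> List[Dict]:
    """Flatten the per-rank lists and restore dataset order via the 'id' key."""
    records = [rec for rank_results in partial_results for rec in rank_results]
    return sorted(records, key=lambda rec: rec["id"])


def gather_results(num_replicas: int, save_dir: Path, timeout: float,
                   poll_interval: float = 1.0) -> List[List[Dict]]:
    """Poll save_dir until every rank file exists and parses, or time out."""
    deadline = time.time() + timeout
    while True:
        paths = sorted(save_dir.glob("rank_*_output.json"))
        if len(paths) >= num_replicas:
            try:
                return [json.loads(p.read_text()) for p in paths]
            except json.JSONDecodeError:
                pass  # a rank is probably still writing its file; retry
        if time.time() >= deadline:
            raise TimeoutError(f"only {len(paths)} of {num_replicas} rank files were ready")
        time.sleep(poll_interval)


# Simulate two ranks that have already written their partial results.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    rank0 = [{"id": 0, "pred": "a"}, {"id": 2, "pred": "c"}]
    rank1 = [{"id": 1, "pred": "b"}]
    (tmp_dir / "rank_0_output.json").write_text(json.dumps(rank0))
    (tmp_dir / "rank_1_output.json").write_text(json.dumps(rank1))
    partial = gather_results(num_replicas=2, save_dir=tmp_dir, timeout=5)
    combined = combine_partial_results(partial)

print([rec["pred"] for rec in combined])  # ['a', 'b', 'c']
```

Retrying on a JSON decode error covers the case where rank 0 sees a file that another rank has created but not finished writing.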

The script supports FP16 inference, task-specific model parameters, configurable source/target languages, custom prefixes, and pseudolabel generation (when num_return_sequences > 1).
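When num_return_sequences > 1, each input yields several candidate outputs. The sketch below shows one way to regroup a flat list of decoded strings so that every input keeps its candidates together. It assumes the candidates for one input are adjacent in the flat list; the helper name chunks and the sample strings are illustrative.

```python
from typing import Iterator, List, Sequence


def chunks(items: Sequence[str], n: int) -> Iterator[List[str]]:
    """Yield successive groups of n items."""
    for start in range(0, len(items), n):
        yield list(items[start : start + n])


# Two inputs, two candidates each (num_return_sequences = 2).
flat_predictions = ["s1 cand A", "s1 cand B", "s2 cand A", "s2 cand B"]
grouped = list(chunks(flat_predictions, 2))
print(grouped)  # [['s1 cand A', 's1 cand B'], ['s2 cand A', 's2 cand B']]
```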

⚠️ DEPRECATED: This file resides in the legacy/ directory and is not actively maintained. Prefer modern equivalents where available.

Usage

Use this script when:

  • Evaluating large seq2seq models where single-GPU evaluation is too slow.
  • Computing BLEU scores for translation tasks across multiple GPUs.
  • Computing ROUGE scores for summarization tasks with distributed inference.
  • Generating pseudolabels from multiple return sequences for knowledge distillation.

Code Reference

Source Location

examples/NLU/examples/legacy/seq2seq/run_distributed_eval.py (261 lines)

Signature

def eval_data_dir(
    data_dir,
    save_dir: str,
    model_name: str,
    bs: int = 8,
    max_source_length: int = 1024,
    type_path: str = "val",
    n_obs: int = None,
    fp16: bool = False,
    task: str = "summarization",
    local_rank: int = None,
    num_return_sequences: int = 1,
    dataset_kwargs: Dict = None,
    prefix: str = "",
    **generate_kwargs,
) -> Tuple[List[Dict], int]: ...

def run_generate() -> None: ...
def combine_partial_results(partial_results: List) -> List: ...
def gather_results_from_each_node(num_replicas: int, save_dir, timeout: int) -> List[Dict[str, List]]: ...

Import / CLI Usage

# Launch with torch.distributed (e.g., 4 GPUs); the script path assumes the working directory is examples/NLU
python -m torch.distributed.launch --nproc_per_node=4 \
    examples/legacy/seq2seq/run_distributed_eval.py \
    --model_name facebook/bart-large-cnn \
    --data_dir ./cnn_dm \
    --save_dir ./eval_output \
    --bs 8 \
    --task summarization \
    --fp16

# With extra generate kwargs
python -m torch.distributed.launch --nproc_per_node=2 \
    examples/legacy/seq2seq/run_distributed_eval.py \
    --model_name Helsinki-NLP/opus-mt-en-de \
    --data_dir ./wmt_en_de \
    --save_dir ./eval_output \
    --task translation \
    --num_beams=5 --length_penalty=1.0

I/O Contract

Inputs

| Input | Type | Description |
|---|---|---|
| --model_name | str | HuggingFace model name or path. Default: sshleifer/distilbart-xsum-12-3 |
| --data_dir | str | Directory containing {type_path}.source and {type_path}.target files |
| --save_dir | str | Output directory for metrics and generation files. Default: tmp_gen |
| --type_path | str | Data split to evaluate: train, val, or test. Default: test |
| --task | str | Task name (summarization or translation). Default: summarization |
| --bs | int | Batch size per GPU. Default: 8 |
| --local_rank | int | GPU rank, passed by torch.distributed.launch. Default: -1 |
| --fp16 | flag | Enable half-precision inference |
| --sync_timeout | int | Seconds for rank 0 to wait for other ranks. Default: 600 |
| --num_return_sequences | int | Number of sequences to generate per input. Default: 1 |
| --src_lang, --tgt_lang | str | Optional source/target language codes |
| --prefix | str | Optional prefix prepended to source examples |
| Extra kwargs | varied | Unrecognized args passed to model.generate() (e.g., --num_beams=5) |
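The "Extra kwargs" row relies on separating known CLI flags from everything else. The sketch below shows one way to do this with argparse.parse_known_args and a small converter; parse_extra_kwargs is a hypothetical helper written for this page, and the script's own parsing may treat edge cases differently.

```python
import argparse
from typing import Dict, List, Union

Value = Union[int, float, bool, str]


def parse_extra_kwargs(unparsed: List[str]) -> Dict[str, Value]:
    """Convert leftover '--key=value' or '--key value' tokens into typed kwargs."""
    tokens: List[str] = []
    for arg in unparsed:
        if arg.startswith("--") and "=" in arg:
            tokens.extend(arg.split("=", 1))  # "--num_beams=5" -> "--num_beams", "5"
        else:
            tokens.append(arg)
    if len(tokens) % 2 != 0:
        raise ValueError(f"expected key/value pairs, got {tokens}")

    result: Dict[str, Value] = {}
    for key, raw in zip(tokens[0::2], tokens[1::2]):
        if not key.startswith("--"):
            raise ValueError(f"expected a --flag, got {key!r}")
        value: Value
        if raw.lower() in ("true", "false"):
            value = raw.lower() == "true"
        else:
            try:
                value = int(raw)
            except ValueError:
                try:
                    value = float(raw)
                except ValueError:
                    value = raw  # keep anything else as a string
        result[key[2:]] = value
    return result


parser = argparse.ArgumentParser()
parser.add_argument("--task", default="summarization")
parser.add_argument("--bs", type=int, default=8)

# parse_known_args returns the recognized namespace plus the untouched leftovers.
args, rest = parser.parse_known_args(
    ["--task", "translation", "--num_beams=5", "--length_penalty=1.0"]
)
generate_kwargs = parse_extra_kwargs(rest)
print(args.task, generate_kwargs)  # translation {'num_beams': 5, 'length_penalty': 1.0}
```

The resulting dictionary could then be forwarded as **generate_kwargs, matching the eval_data_dir signature shown above.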

Outputs

| Output | Type | Description |
|---|---|---|
| {save_dir}/{type_path}_{metric}.json | JSON file | Metrics dictionary (BLEU or ROUGE scores, n_obs, seconds_per_sample, n_gpus) |
| {save_dir}/{type_path}_generations.txt | Text file | Generated predictions, one per line |
| {save_dir}/pseudolabel_results.json | JSON file | Multi-sequence results when num_return_sequences > 1 |
| {save_dir}_tmp/rank_{rank}_output.json | JSON files | Intermediate per-rank results (cleaned up unless --debug) |
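To make the metrics file layout concrete, the sketch below assembles a metrics dictionary and derives its file name. The helper names, the rule for choosing between "bleu" and "rouge", and the four-decimal rounding are assumptions inferred from the table and the example output on this page, and the score values are placeholders, not measured results.

```python
from pathlib import Path
from typing import Dict


def finalize_metrics(scores: Dict[str, float], n_obs: int, runtime_s: float, n_gpus: int) -> Dict:
    """Attach bookkeeping fields to the raw BLEU/ROUGE scores."""
    metrics = dict(scores)
    metrics["n_obs"] = n_obs
    # Wall-clock seconds divided by the number of evaluated examples.
    metrics["seconds_per_sample"] = round(runtime_s / n_obs, 4)
    metrics["n_gpus"] = n_gpus
    return metrics


def metrics_path(save_dir: Path, type_path: str, task: str) -> Path:
    """Build {save_dir}/{type_path}_{metric}.json for the given task."""
    metric_name = "bleu" if "translation" in task else "rouge"
    return save_dir / f"{type_path}_{metric_name}.json"


# Placeholder scores, for illustration only.
metrics = finalize_metrics({"rouge1": 1.0}, n_obs=200, runtime_s=50.0, n_gpus=4)
print(metrics_path(Path("eval_output"), "test", "summarization").name)  # test_rouge.json
print(metrics["seconds_per_sample"])  # 0.25
```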

Usage Examples

# Distributed ROUGE evaluation of a summarization model on 4 GPUs
python -m torch.distributed.launch --nproc_per_node=4 \
    examples/legacy/seq2seq/run_distributed_eval.py \
    --model_name facebook/bart-large-cnn \
    --data_dir ./cnn_dm \
    --save_dir ./eval_results \
    --type_path test \
    --task summarization \
    --bs 16 \
    --fp16

# Output metrics (example):
# {'rouge1': 44.16, 'rouge2': 21.28, 'rougeL': 40.90, 'n_obs': 11490,
#  'seconds_per_sample': 0.0821, 'n_gpus': 4}

# Distributed BLEU evaluation for translation
python -m torch.distributed.launch --nproc_per_node=2 \
    examples/legacy/seq2seq/run_distributed_eval.py \
    --model_name Helsinki-NLP/opus-mt-en-de \
    --data_dir ./wmt_en_de \
    --save_dir ./translation_eval \
    --task translation \
    --num_beams=5
