Implementation:Microsoft LoRA Run Distributed Eval

Overview

The run_distributed_eval.py script performs distributed multi-GPU evaluation of seq2seq models, computing BLEU or ROUGE metrics across partitioned data using torch.distributed.

Description

This script enables scalable evaluation of seq2seq models (summarization, translation) by distributing the evaluation workload across multiple GPUs through torch.distributed with the NCCL backend. The workflow is:

Distributed Initialization: Each process initializes a torch.distributed process group with the NCCL backend, using its local_rank to select its GPU.
Data Partitioning: The Seq2SeqDataset is loaded with a sortish_sampler configured for distributed mode, automatically partitioning data across ranks.
Generation: Each rank runs model.generate() on its partition with configurable beam search parameters (num_beams, num_return_sequences) and saves results as JSON to rank_{rank}_output.json.
Aggregation (Rank 0): The master process waits for all rank JSON files (with a configurable timeout, default 600s), combines them by sorting on example IDs, and computes metrics.
Metrics: For translation tasks, BLEU is computed; for summarization, ROUGE is used. The saved metrics also record the number of observations (n_obs), the per-sample runtime (seconds_per_sample), and the GPU count (n_gpus).
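
The partitioning and runtime bookkeeping in the workflow above can be illustrated without GPUs. The sketch below is a simplified stand-in, not the script's sampler: it assumes a plain strided split (rank r takes examples r, r + world_size, ...) and ignores the length-based "sortish" ordering. The helper names partition_indices and add_runtime_metrics are hypothetical, and the rounding to four decimals is an assumption.

```python
from typing import Dict, List


def partition_indices(n_examples: int, world_size: int, rank: int) -> List[int]:
    """Return the example IDs handled by `rank` under a simple strided split.

    Assumption: the strided split stands in for the distributed sortish sampler.
    """
    return list(range(rank, n_examples, world_size))


def add_runtime_metrics(metrics: Dict, n_obs: int, runtime_s: float, n_gpus: int) -> Dict:
    """Return a copy of `metrics` with sample count, per-sample runtime and GPU count."""
    out = dict(metrics)
    out["n_obs"] = n_obs
    out["seconds_per_sample"] = round(runtime_s / n_obs, 4)
    out["n_gpus"] = n_gpus
    return out


# 10 examples split over 4 ranks: every ID is assigned to exactly one rank.
parts = [partition_indices(10, 4, rank) for rank in range(4)]
print(parts)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
print(add_runtime_metrics({"rouge1": 44.16}, n_obs=10, runtime_s=2.5, n_gpus=4))
```

Because each example keeps its ID, rank 0 can later restore the original order regardless of how the sampler distributed the work.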

Key functions:

eval_data_dir(...): Core distributed evaluation function. Initializes the process group, loads model and tokenizer, creates the distributed DataLoader, generates predictions, and saves per-rank results.
run_generate(): CLI entry point that parses arguments (including extra model.generate() kwargs), orchestrates evaluation, and handles result aggregation on rank 0.
combine_partial_results(partial_results): Flattens the per-rank results into a single list and sorts it by example ID.
gather_results_from_each_node(num_replicas, save_dir, timeout): Polls for rank JSON files with timeout handling.
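
The polling and merging steps described for the last two functions might look roughly like the following. This is a minimal sketch rather than the script's code: the names gather_rank_files and combine_by_id are hypothetical, and the per-record layout ({"id": ..., "pred": ...}) is an assumption made for illustration.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Dict, List


def gather_rank_files(num_replicas: int, save_dir: Path, timeout: float) -> List[List[Dict]]:
    """Poll `save_dir` until every rank's JSON file exists and parses, or time out."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        json_files = sorted(save_dir.glob("rank_*_output.json"))
        if len(json_files) >= num_replicas:
            try:
                return [json.loads(path.read_text()) for path in json_files]
            except json.JSONDecodeError:
                pass  # a file is still being written; retry on the next poll
        time.sleep(0.1)
    raise TimeoutError("gave up waiting for per-rank result files")


def combine_by_id(partial_results: List[List[Dict]]) -> List[str]:
    """Flatten per-rank records, sort them by example ID, and keep the predictions."""
    records = [record for part in partial_results for record in part]
    records.sort(key=lambda record: record["id"])
    return [record["pred"] for record in records]


# Simulate two ranks that have already written their partitions.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    rank0 = [{"id": 0, "pred": "a"}, {"id": 2, "pred": "c"}]
    rank1 = [{"id": 1, "pred": "b"}]
    (tmp_dir / "rank_0_output.json").write_text(json.dumps(rank0))
    (tmp_dir / "rank_1_output.json").write_text(json.dumps(rank1))
    print(combine_by_id(gather_rank_files(2, tmp_dir, timeout=5.0)))  # ['a', 'b', 'c']
```

Retrying on a JSON decode error covers the case where rank 0 sees a file that another rank has created but not finished writing.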

The script supports FP16 inference, task-specific model parameters, configurable source/target languages, custom prefixes, and pseudolabel generation (when num_return_sequences > 1).
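
For the pseudolabel case, a generation call with num_return_sequences > 1 yields several candidates per input, so the flat prediction list has to be regrouped per example. The sketch below shows one way to do that; the helper name group_generations is hypothetical, and it assumes the candidates for each input are adjacent in the flat list.

```python
from typing import List


def group_generations(flat_preds: List[str], num_return_sequences: int) -> List[List[str]]:
    """Regroup a flat list of generations into one list of candidates per input."""
    if num_return_sequences < 1:
        raise ValueError("num_return_sequences must be at least 1")
    if len(flat_preds) % num_return_sequences != 0:
        raise ValueError("prediction count is not a multiple of num_return_sequences")
    return [
        flat_preds[start:start + num_return_sequences]
        for start in range(0, len(flat_preds), num_return_sequences)
    ]


# Two inputs, three candidates each.
flat = ["a1", "a2", "a3", "b1", "b2", "b3"]
print(group_generations(flat, 3))  # [['a1', 'a2', 'a3'], ['b1', 'b2', 'b3']]
```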

⚠️ DEPRECATED: This file resides in the legacy/ directory and is not actively maintained. Prefer modern equivalents where available.

Usage

Use this script when:

Evaluating large seq2seq models where single-GPU evaluation is too slow.
Computing BLEU scores for translation tasks across multiple GPUs.
Computing ROUGE scores for summarization tasks with distributed inference.
Generating pseudolabels from multiple return sequences for knowledge distillation.

Code Reference

Source Location

examples/NLU/examples/legacy/seq2seq/run_distributed_eval.py (261 lines)

Signature

def eval_data_dir(
    data_dir,
    save_dir: str,
    model_name: str,
    bs: int = 8,
    max_source_length: int = 1024,
    type_path: str = "val",
    n_obs: int = None,
    fp16: bool = False,
    task: str = "summarization",
    local_rank: int = None,
    num_return_sequences: int = 1,
    dataset_kwargs: Dict = None,
    prefix: str = "",
    **generate_kwargs,
) -> Tuple[List[Dict], int]: ...

def run_generate() -> None: ...
def combine_partial_results(partial_results: List) -> List: ...
def gather_results_from_each_node(num_replicas: int, save_dir, timeout: int) -> List[Dict[str, List]]: ...

Import / CLI Usage

# Launch with torch.distributed.launch (e.g., 4 GPUs); script paths are relative to examples/NLU
python -m torch.distributed.launch --nproc_per_node=4 \
    examples/legacy/seq2seq/run_distributed_eval.py \
    --model_name facebook/bart-large-cnn \
    --data_dir ./cnn_dm \
    --save_dir ./eval_output \
    --bs 8 \
    --task summarization \
    --fp16

# With extra generate kwargs
python -m torch.distributed.launch --nproc_per_node=2 \
    examples/legacy/seq2seq/run_distributed_eval.py \
    --model_name Helsinki-NLP/opus-mt-en-de \
    --data_dir ./wmt_en_de \
    --save_dir ./eval_output \
    --task translation \
    --num_beams 5 --length_penalty 1.0

I/O Contract

Inputs

Input	Type	Description
`--model_name`	str	HuggingFace model name or path. Default: `sshleifer/distilbart-xsum-12-3`
`--data_dir`	str	Directory containing `{type_path}.source` and `{type_path}.target` files
`--save_dir`	str	Output directory for metrics and generation files. Default: `tmp_gen`
`--type_path`	str	Data split to evaluate: `train`, `val`, or `test`. Default: `test`
`--task`	str	Task name (`summarization` or `translation`). Default: `summarization`
`--bs`	int	Batch size per GPU. Default: 8
`--local_rank`	int	GPU rank, passed by `torch.distributed.launch`. Default: -1
`--fp16`	flag	Enable half-precision inference
`--sync_timeout`	int	Seconds for rank 0 to wait for other ranks. Default: 600
`--num_return_sequences`	int	Number of sequences to generate per input. Default: 1
`--src_lang`, `--tgt_lang`	str	Optional source/target language codes
`--prefix`	str	Optional prefix prepended to source examples
Extra kwargs	varied	Unrecognized args passed to `model.generate()` (e.g., `--num_beams 5`)
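
One way to separate the script's own arguments from the extra generation kwargs is argparse's parse_known_args, followed by a small conversion step. The sketch below is illustrative only and does not reproduce the script's helper: split_cli_args and parse_extra_kwargs are hypothetical names, only two of the documented arguments are declared, and accepting both "--key value" and "--key=value" forms is a design choice of this sketch.

```python
import argparse
from typing import Dict, List, Tuple, Union

Value = Union[bool, int, float]


def parse_extra_kwargs(unparsed: List[str]) -> Dict[str, Value]:
    """Convert leftover CLI tokens into a kwargs dict with bool/int/float values."""
    tokens: List[str] = []
    for token in unparsed:
        if token.startswith("--") and "=" in token:
            tokens.extend(token.split("=", 1))  # "--k=v" -> "--k", "v"
        else:
            tokens.append(token)
    if len(tokens) % 2 != 0:
        raise ValueError("expected '--key value' pairs")
    result: Dict[str, Value] = {}
    for key, raw in zip(tokens[::2], tokens[1::2]):
        if not key.startswith("--"):
            raise ValueError(f"unexpected token: {key}")
        if raw.lower() in ("true", "false"):
            result[key[2:]] = raw.lower() == "true"
        else:
            try:
                result[key[2:]] = int(raw)
            except ValueError:
                result[key[2:]] = float(raw)  # raises ValueError if not numeric
    return result


def split_cli_args(argv: List[str]) -> Tuple[argparse.Namespace, Dict[str, Value]]:
    """Parse known arguments and return the rest as generation kwargs."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", type=str, default="sshleifer/distilbart-xsum-12-3")
    parser.add_argument("--bs", type=int, default=8)
    known, rest = parser.parse_known_args(argv)
    return known, parse_extra_kwargs(rest)


args, generate_kwargs = split_cli_args(["--bs", "16", "--num_beams", "5", "--length_penalty=1.0"])
print(args.bs, generate_kwargs)  # 16 {'num_beams': 5, 'length_penalty': 1.0}
```

The resulting dictionary could then be forwarded as **generate_kwargs, matching the eval_data_dir signature shown earlier.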

Outputs

Output	Type	Description
`{save_dir}/{type_path}_{metric}.json`	JSON file	Metrics dictionary (BLEU or ROUGE scores, n_obs, seconds_per_sample, n_gpus)
`{save_dir}/{type_path}_generations.txt`	Text file	Generated predictions, one per line
`{save_dir}/pseudolabel_results.json`	JSON file	Multi-sequence results when `num_return_sequences > 1`
`{save_dir}_tmp/rank_{rank}_output.json`	JSON files	Intermediate per-rank results (cleaned up unless `--debug`)
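
The output locations in the table can be derived from the CLI arguments along these lines. This is a sketch under stated assumptions: output_paths is a hypothetical helper, and both the metric names ("bleu", "rouge") and the substring test on the task name are assumptions, not confirmed details of the script.

```python
from pathlib import Path
from typing import Dict


def output_paths(save_dir: str, type_path: str, task: str) -> Dict[str, Path]:
    """Build the output file locations listed in the Outputs table."""
    # Assumption: translation tasks report BLEU, everything else reports ROUGE.
    metric = "bleu" if "translation" in task else "rouge"
    root = Path(save_dir)
    return {
        "metrics": root / f"{type_path}_{metric}.json",
        "generations": root / f"{type_path}_generations.txt",
        "pseudolabels": root / "pseudolabel_results.json",
        # Per-rank intermediate files live in a sibling directory.
        "rank_tmp_dir": Path(save_dir + "_tmp"),
    }


paths = output_paths("eval_results", "test", "summarization")
print(paths["metrics"].as_posix())       # eval_results/test_rouge.json
print(paths["rank_tmp_dir"].as_posix())  # eval_results_tmp
```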

Usage Examples

# Distributed ROUGE evaluation of a summarization model on 4 GPUs
python -m torch.distributed.launch --nproc_per_node=4 \
    examples/legacy/seq2seq/run_distributed_eval.py \
    --model_name facebook/bart-large-cnn \
    --data_dir ./cnn_dm \
    --save_dir ./eval_results \
    --type_path test \
    --task summarization \
    --bs 16 \
    --fp16

# Output metrics (example):
# {'rouge1': 44.16, 'rouge2': 21.28, 'rougeL': 40.90, 'n_obs': 11490,
#  'seconds_per_sample': 0.0821, 'n_gpus': 4}

# Distributed BLEU evaluation for translation
python -m torch.distributed.launch --nproc_per_node=2 \
    examples/legacy/seq2seq/run_distributed_eval.py \
    --model_name Helsinki-NLP/opus-mt-en-de \
    --data_dir ./wmt_en_de \
    --save_dir ./translation_eval \
    --task translation \
    --num_beams 5
