Implementation:Microsoft LoRA Run Distributed Eval
Overview
The run_distributed_eval.py script performs distributed multi-GPU evaluation of seq2seq models, computing BLEU or ROUGE metrics across partitioned data using torch.distributed.
Description
This script enables scalable evaluation of seq2seq models (summarization, translation) by distributing the evaluation workload across multiple GPUs using PyTorch's NCCL distributed backend. The workflow is:
- Distributed Initialization: Each process initializes a torch.distributed process group with the NCCL backend, using its local_rank to select its GPU.
- Data Partitioning: The Seq2SeqDataset is loaded with a sortish_sampler configured for distributed mode, automatically partitioning data across ranks.
- Generation: Each rank runs model.generate() on its partition with configurable beam search parameters (num_beams, num_return_sequences) and saves results as JSON to rank_{rank}_output.json.
- Aggregation (Rank 0): The master process waits for all rank JSON files (with a configurable timeout, default 600s), combines them by sorting on example IDs, and computes metrics.
- Metrics: For translation tasks, BLEU is computed; for summarization, ROUGE is used. Metrics include throughput (seconds_per_sample) and GPU count.
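The partition-then-save part of this workflow can be sketched without any GPU. The example below is a minimal simulation under stated assumptions: partition_ids uses a plain round-robin split as a stand-in for the distributed sortish sampler (which also groups examples by length), fake_generate is a hypothetical placeholder for model.generate() plus decoding, and the "pred"/"id" record keys are illustrative.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def partition_ids(num_examples: int, num_replicas: int, rank: int) -> List[int]:
    """Round-robin split of example ids (assumed stand-in for the real sampler)."""
    return list(range(rank, num_examples, num_replicas))


def fake_generate(example_id: int) -> str:
    """Hypothetical placeholder for model.generate() followed by decoding."""
    return f"prediction for example {example_id}"


def run_rank(rank: int, num_replicas: int, num_examples: int, tmp_dir: Path) -> Path:
    """Process one rank's partition and save it as rank_{rank}_output.json."""
    results = [
        {"pred": fake_generate(i), "id": i}
        for i in partition_ids(num_examples, num_replicas, rank)
    ]
    out_path = tmp_dir / f"rank_{rank}_output.json"
    out_path.write_text(json.dumps(results))
    return out_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Simulate two ranks sequentially inside a single process.
        for rank in range(2):
            print(run_rank(rank, num_replicas=2, num_examples=5, tmp_dir=Path(tmp)).name)
```

Keeping the example id next to each prediction is what lets rank 0 restore the original order later, regardless of how the sampler shuffled the data.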
Key functions:
- eval_data_dir(...): Core distributed evaluation function. Initializes the process group, loads model and tokenizer, creates the distributed DataLoader, generates predictions, and saves per-rank results.
- run_generate(): CLI entry point that parses arguments (including extra model.generate() kwargs), orchestrates evaluation, and handles result aggregation on rank 0.
- combine_partial_results(partial_results): Sorts and flattens per-rank results by example ID.
- gather_results_from_each_node(num_replicas, save_dir, timeout): Polls for rank JSON files with timeout handling.
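The aggregation side (polling for rank files, then merging by example ID) might look roughly like the sketch below. It is not the script's code: gather_rank_files and combine_by_id are illustrative names, the poll_interval parameter and the "id" record key are assumptions, and the retry on a partially written file is one possible way to handle that race.

```python
import json
import time
from pathlib import Path
from typing import Dict, List


def gather_rank_files(
    num_replicas: int, save_dir: Path, timeout: float, poll_interval: float = 0.5
) -> List[List[Dict]]:
    """Poll save_dir until every rank has written its JSON file, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        paths = sorted(save_dir.glob("rank_*_output.json"))
        if len(paths) >= num_replicas:
            try:
                return [json.loads(p.read_text()) for p in paths]
            except json.JSONDecodeError:
                pass  # a file is still being written; retry on the next poll
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"found {len(paths)} of {num_replicas} rank files in {save_dir}"
            )
        time.sleep(poll_interval)


def combine_by_id(partial_results: List[List[Dict]]) -> List[Dict]:
    """Flatten the per-rank lists and restore dataset order via the example id."""
    records = [record for part in partial_results for record in part]
    return sorted(records, key=lambda record: record["id"])
```

File-based polling avoids a collective gather of variable-length Python objects over NCCL; the trade-off is that a crashed rank only shows up as a timeout on rank 0.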
The script supports FP16 inference, task-specific model parameters, configurable source/target languages, custom prefixes, and pseudolabel generation (when num_return_sequences > 1).
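For pseudolabel generation, each input yields several candidate outputs. A minimal sketch of the regrouping step follows, assuming the decoded outputs arrive as one flat list in input order with num_return_sequences consecutive entries per input; chunk_predictions is a hypothetical helper name, not a function from the script.

```python
from typing import List


def chunk_predictions(flat_preds: List[str], num_return_sequences: int) -> List[List[str]]:
    """Group a flat list of decoded outputs into one list of candidates per input."""
    if num_return_sequences < 1:
        raise ValueError("num_return_sequences must be at least 1")
    if len(flat_preds) % num_return_sequences != 0:
        raise ValueError("prediction count is not a multiple of num_return_sequences")
    return [
        flat_preds[i : i + num_return_sequences]
        for i in range(0, len(flat_preds), num_return_sequences)
    ]


if __name__ == "__main__":
    # Two inputs with two candidates each.
    print(chunk_predictions(["a1", "a2", "b1", "b2"], 2))
```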
⚠️ DEPRECATED: This file resides in the legacy/ directory and is not actively maintained. Prefer modern equivalents where available.
Usage
Use this script when:
- Evaluating large seq2seq models where single-GPU evaluation is too slow.
- Computing BLEU scores for translation tasks across multiple GPUs.
- Computing ROUGE scores for summarization tasks with distributed inference.
- Generating pseudolabels from multiple return sequences for knowledge distillation.
Code Reference
Source Location
examples/NLU/examples/legacy/seq2seq/run_distributed_eval.py (261 lines)
Signature
def eval_data_dir(
    data_dir,
    save_dir: str,
    model_name: str,
    bs: int = 8,
    max_source_length: int = 1024,
    type_path: str = "val",
    n_obs: int = None,
    fp16: bool = False,
    task: str = "summarization",
    local_rank: int = None,
    num_return_sequences: int = 1,
    dataset_kwargs: Dict = None,
    prefix: str = "",
    **generate_kwargs,
) -> Tuple[List[Dict], int]: ...

def run_generate() -> None: ...

def combine_partial_results(partial_results: List) -> List: ...

def gather_results_from_each_node(num_replicas: int, save_dir, timeout: int) -> List[Dict[str, List]]: ...
Import / CLI Usage
# Launch with torch.distributed (e.g., 4 GPUs)
python -m torch.distributed.launch --nproc_per_node=4 \
examples/legacy/seq2seq/run_distributed_eval.py \
--model_name facebook/bart-large-cnn \
--data_dir ./cnn_dm \
--save_dir ./eval_output \
--bs 8 \
--task summarization \
--fp16
# With extra generate kwargs
python -m torch.distributed.launch --nproc_per_node=2 \
examples/legacy/seq2seq/run_distributed_eval.py \
--model_name Helsinki-NLP/opus-mt-en-de \
--data_dir ./wmt_en_de \
--save_dir ./eval_output \
--task translation \
--num_beams=5 --length_penalty=1.0
I/O Contract
Inputs
| Input | Type | Description |
|---|---|---|
| --model_name | str | HuggingFace model name or path. Default: sshleifer/distilbart-xsum-12-3 |
| --data_dir | str | Directory containing {type_path}.source and {type_path}.target files |
| --save_dir | str | Output directory for metrics and generation files. Default: tmp_gen |
| --type_path | str | Data split to evaluate: train, val, or test. Default: test |
| --task | str | Task name (summarization or translation). Default: summarization |
| --bs | int | Batch size per GPU. Default: 8 |
| --local_rank | int | GPU rank, passed by torch.distributed.launch. Default: -1 |
| --fp16 | flag | Enable half-precision inference |
| --sync_timeout | int | Seconds for rank 0 to wait for other ranks. Default: 600 |
| --num_return_sequences | int | Number of sequences to generate per input. Default: 1 |
| --src_lang, --tgt_lang | str | Optional source/target language codes |
| --prefix | str | Optional prefix prepended to source examples |
| Extra kwargs | varied | Unrecognized args passed to model.generate() (e.g., --num_beams=5) |
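One way unrecognized arguments can be turned into a kwargs dictionary is sketched below. This is an illustration only: parse_extra_kwargs and parse_value are hypothetical helpers rather than the script's own parsing code, and accepting both the --key=value and --key value forms is an assumption of this sketch. The leftover tokens are obtained with argparse's parse_known_args.

```python
import argparse
from typing import Dict, List, Union

Value = Union[bool, int, float, str]


def parse_value(raw: str) -> Value:
    """Convert a raw CLI string to bool, int, or float, falling back to str."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def parse_extra_kwargs(unparsed: List[str]) -> Dict[str, Value]:
    """Turn leftover CLI tokens into keyword arguments (hypothetical helper)."""
    result: Dict[str, Value] = {}
    i = 0
    while i < len(unparsed):
        token = unparsed[i]
        if not token.startswith("--"):
            raise ValueError(f"expected an option starting with '--', got {token!r}")
        if "=" in token:
            key, raw = token[2:].split("=", 1)
            i += 1
        else:
            if i + 1 >= len(unparsed):
                raise ValueError(f"option {token!r} has no value")
            key, raw = token[2:], unparsed[i + 1]
            i += 2
        result[key] = parse_value(raw)
    return result


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--task", default="summarization")
    # parse_known_args returns the arguments the parser does not recognize.
    args, rest = parser.parse_known_args(
        ["--task", "translation", "--num_beams=5", "--length_penalty=1.0"]
    )
    print(args.task, parse_extra_kwargs(rest))
```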
Outputs
| Output | Type | Description |
|---|---|---|
| {save_dir}/{type_path}_{metric}.json | JSON file | Metrics dictionary (BLEU or ROUGE scores, n_obs, seconds_per_sample, n_gpus) |
| {save_dir}/{type_path}_generations.txt | Text file | Generated predictions, one per line |
| {save_dir}/pseudolabel_results.json | JSON file | Multi-sequence results when num_return_sequences > 1 |
| {save_dir}_tmp/rank_{rank}_output.json | JSON files | Intermediate per-rank results (cleaned up unless --debug) |
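The naming of the metrics file and the bookkeeping fields can be illustrated with a short sketch. The assumptions here are that {metric} resolves to "bleu" when the task name contains "translation" and to "rouge" otherwise, and that seconds_per_sample is total runtime divided by n_obs, rounded to four decimals; the helper names are hypothetical.

```python
from pathlib import Path
from typing import Dict


def metric_name_for_task(task: str) -> str:
    """Pick the metric family from the task name (assumed substring check)."""
    return "bleu" if "translation" in task else "rouge"


def metrics_path(save_dir: str, type_path: str, task: str) -> Path:
    """Build {save_dir}/{type_path}_{metric}.json."""
    return Path(save_dir) / f"{type_path}_{metric_name_for_task(task)}.json"


def finalize_metrics(
    scores: Dict[str, float], n_obs: int, runtime_seconds: float, n_gpus: int
) -> Dict[str, float]:
    """Attach the observation count, throughput, and GPU count to the scores."""
    metrics = dict(scores)
    metrics["n_obs"] = n_obs
    metrics["seconds_per_sample"] = round(runtime_seconds / n_obs, 4)
    metrics["n_gpus"] = n_gpus
    return metrics


if __name__ == "__main__":
    print(metrics_path("eval_results", "test", "summarization"))
    print(finalize_metrics({"rouge1": 44.16}, n_obs=4, runtime_seconds=10.0, n_gpus=2))
```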
Usage Examples
# Distributed ROUGE evaluation of a summarization model on 4 GPUs
python -m torch.distributed.launch --nproc_per_node=4 \
examples/legacy/seq2seq/run_distributed_eval.py \
--model_name facebook/bart-large-cnn \
--data_dir ./cnn_dm \
--save_dir ./eval_results \
--type_path test \
--task summarization \
--bs 16 \
--fp16
# Output metrics (example):
# {'rouge1': 44.16, 'rouge2': 21.28, 'rougeL': 40.90, 'n_obs': 11490,
# 'seconds_per_sample': 0.0821, 'n_gpus': 4}
# Distributed BLEU evaluation for translation
python -m torch.distributed.launch --nproc_per_node=2 \
examples/legacy/seq2seq/run_distributed_eval.py \
--model_name Helsinki-NLP/opus-mt-en-de \
--data_dir ./wmt_en_de \
--save_dir ./translation_eval \
--task translation \
--num_beams=5