# Implementation: Microsoft LoRA Run GLUE Evaluation
## Overview

Run GLUE Evaluation is a wrapper doc for the evaluation pipeline in `run_glue.py` from the microsoft/LoRA repository. It documents the LoRA weight loading mechanism, the evaluation loop for GLUE tasks, and the `compute_metrics` function that computes task-specific scores.
## Source File

| File | Lines | Description |
|---|---|---|
| `examples/NLU/examples/text-classification/run_glue.py` | 379-385 | LoRA weight loading with `strict=False` |
| `examples/NLU/examples/text-classification/run_glue.py` | 572-589 | Evaluation loop (including MNLI double evaluation) |
| `examples/NLU/examples/text-classification/run_glue.py` | 515-526 | `compute_metrics` function |
## CLI Signature

```shell
python -m torch.distributed.launch --nproc_per_node=<N> \
    examples/text-classification/run_glue.py \
    --model_name_or_path roberta-base \
    --task_name mnli --do_eval \
    --apply_lora --lora_r 8 --lora_alpha 16 \
    --lora_path ./output/roberta_base_mnli/model/lora_weights.pt
```
## Key Flags

| Flag | Description |
|---|---|
| `--do_eval` | Enable evaluation on the validation set |
| `--apply_lora` | Inject LoRA layers into the model architecture |
| `--lora_r` | LoRA rank (must match the trained checkpoint) |
| `--lora_alpha` | LoRA alpha scaling (must match the trained checkpoint) |
| `--lora_path` | Path to the LoRA weight file (`.pt`) |
| `--model_name_or_path` | Base pretrained model identifier |
| `--task_name` | GLUE task name (determines metric and dataset) |
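As an illustrative aside, the rank/alpha pair determines the adapter's effective scaling (`lora_alpha / lora_r` in standard LoRA), which is why both flags must match the values used at training time. The parser below is a hypothetical stand-in for demonstration, not the script's actual argument handling:

```python
# Hypothetical mini-parser illustrating how the LoRA flags relate.
# (The real script parses these via HuggingFace dataclass arguments.)
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--apply_lora", action="store_true")
parser.add_argument("--lora_r", type=int, default=8)
parser.add_argument("--lora_alpha", type=int, default=16)
parser.add_argument("--lora_path", type=str, default=None)

args = parser.parse_args(["--apply_lora", "--lora_r", "8", "--lora_alpha", "16"])

# The effective LoRA update is scaled by alpha / r, so a checkpoint
# trained with r=8, alpha=16 is only reproduced with the same pair.
scaling = args.lora_alpha / args.lora_r
print(scaling)  # 2.0
```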
## Input / Output

| Direction | Description |
|---|---|
| Input | Base pretrained model (from the HuggingFace Hub) + LoRA checkpoint file (`.pt`) + GLUE validation dataset |
| Output | Evaluation metrics dictionary (accuracy, F1, MCC, Pearson/Spearman, etc.) |
## LoRA Weight Loading (Lines 379-385)
When --apply_lora is set and --lora_path is provided, the script loads LoRA weights onto the model:
```python
if model_args.apply_lora:
    if model_args.lora_path is not None:
        lora_state_dict = torch.load(model_args.lora_path)
        logger.info(f"Apply LoRA state dict from {model_args.lora_path}.")
        logger.info(lora_state_dict.keys())
        model.load_state_dict(lora_state_dict, strict=False)
    trainable_params.append('lora')
```
The `strict=False` parameter is essential because the LoRA checkpoint contains only:

- LoRA matrices (`lora_A`, `lora_B` for each adapted layer)
- Classifier head weights (not prefixed with `roberta` or `deberta`)

All other keys in the model's state dict (the pretrained backbone) are missing from the checkpoint and are silently retained from `from_pretrained()`.
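A minimal sketch (with made-up module names) of why `strict=False` matters: loading a partial state dict touches only the keys present in the checkpoint, and `load_state_dict` reports the untouched backbone keys as missing rather than raising:

```python
# Toy model standing in for "pretrained backbone + LoRA adapters".
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 4)                 # pretrained body
        self.lora_A = nn.Parameter(torch.zeros(2, 4))   # adapter matrix A
        self.lora_B = nn.Parameter(torch.zeros(4, 2))   # adapter matrix B

model = TinyModel()

# A checkpoint holding only adapter keys, like a LoRA .pt file.
ckpt = {"lora_A": torch.ones(2, 4), "lora_B": torch.ones(4, 2)}

result = model.load_state_dict(ckpt, strict=False)
# The backbone keys are reported as missing but keep their current
# (here: randomly initialized; in run_glue.py: pretrained) values.
print(sorted(result.missing_keys))
print(result.unexpected_keys)
```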
## Evaluation Loop (Lines 572-589)
The evaluation loop iterates over validation datasets, with special handling for MNLI:
```python
if training_args.do_eval:
    logger.info("*** Evaluate ***")
    tasks = [data_args.task_name]
    eval_datasets = [eval_dataset]
    if data_args.task_name == "mnli":
        tasks.append("mnli-mm")
        eval_datasets.append(datasets["validation_mismatched"])

    for eval_dataset, task in zip(eval_datasets, tasks):
        metrics = trainer.evaluate(eval_dataset=eval_dataset)
        max_val_samples = (
            data_args.max_val_samples
            if data_args.max_val_samples is not None
            else len(eval_dataset)
        )
        metrics["eval_samples"] = min(max_val_samples, len(eval_dataset))
        trainer.log_metrics("eval", metrics)
        trainer.save_metrics("eval", metrics)
```
For MNLI, this produces two separate evaluation result files:

- `eval_results_mnli.json` -- matched validation accuracy
- `eval_results_mnli-mm.json` -- mismatched validation accuracy
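The matched/mismatched pairing can be sketched with plain Python; the dataset contents and result values below are placeholders, not real evaluation data:

```python
# Stand-in for the MNLI double-evaluation loop: each dataset is zipped
# with its task name, producing one result entry per validation split.
tasks = ["mnli", "mnli-mm"]
eval_sets = [["ex1", "ex2", "ex3"], ["ex4", "ex5"]]  # fake examples

results = {}
for ds, task in zip(eval_sets, tasks):
    # trainer.evaluate(...) would go here; we record only the sample count.
    results[f"eval_results_{task}"] = {"eval_samples": len(ds)}

print(sorted(results))  # ['eval_results_mnli', 'eval_results_mnli-mm']
```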
## compute_metrics Function (Lines 515-526)

```python
def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.squeeze(preds) if is_regression else np.argmax(preds, axis=1)
    if data_args.task_name is not None:
        result = metric.compute(predictions=preds, references=p.label_ids)
        if len(result) > 1:
            result["combined_score"] = np.mean(list(result.values())).item()
        return result
    elif is_regression:
        return {"mse": ((preds - p.label_ids) ** 2).mean().item()}
    else:
        return {"accuracy": (preds == p.label_ids).astype(np.float32).mean().item()}
```
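The `combined_score` branch can be exercised in isolation; the metric values below are illustrative, not real MRPC results:

```python
# For multi-metric tasks (e.g. MRPC's accuracy + F1), compute_metrics
# averages all metric values into a single "combined_score".
import numpy as np

result = {"accuracy": 0.88, "f1": 0.91}  # made-up values
if len(result) > 1:
    result["combined_score"] = np.mean(list(result.values())).item()

print(result["combined_score"])  # mean of 0.88 and 0.91
```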
## Metric Dispatch by Task

- CoLA -- `metric.compute()` returns `{"matthews_correlation": float}`
- SST-2, QNLI, RTE, WNLI -- returns `{"accuracy": float}`
- MRPC, QQP -- returns `{"accuracy": float, "f1": float}` plus `"combined_score"`
- STS-B -- returns `{"pearson": float, "spearmanr": float}` plus `"combined_score"`
- MNLI -- returns `{"accuracy": float}` for each of the matched and mismatched sets
## Prediction Mode (Lines 591-617)
When --do_predict is used instead of --do_eval, the script generates predictions on the test set and writes them to a TSV file for GLUE leaderboard submission:
```python
predictions = trainer.predict(test_dataset=test_dataset).predictions
predictions = np.squeeze(predictions) if is_regression else np.argmax(predictions, axis=1)

output_test_file = os.path.join(training_args.output_dir, f"test_results_{task}.txt")
with open(output_test_file, "w") as writer:
    writer.write("index\tprediction\n")
    for index, item in enumerate(predictions):
        if is_regression:
            writer.write(f"{index}\t{item:3.3f}\n")
        else:
            item = label_list[item]
            writer.write(f"{index}\t{item}\n")
```
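The resulting TSV layout can be sketched with an in-memory buffer; the label names and prediction indices below are invented for illustration:

```python
# Classification branch of the prediction writer: class indices are
# mapped to label strings and written as index/prediction pairs.
import io

label_list = ["entailment", "contradiction", "neutral"]  # fake label set
predictions = [0, 2, 1]                                  # fake argmax output

buf = io.StringIO()
buf.write("index\tprediction\n")
for index, item in enumerate(predictions):
    buf.write(f"{index}\t{label_list[item]}\n")

print(buf.getvalue())
```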