Implementation:Microsoft LoRA Run GLUE Evaluation

From Leeroopedia


Overview

Run GLUE Evaluation is a Wrapper Doc for the evaluation pipeline inside run_glue.py from the microsoft/LoRA repository. This page documents the LoRA weight loading mechanism, the evaluation loop for GLUE tasks, and the compute_metrics function that computes task-specific scores.

Source File

All excerpts below come from examples/NLU/examples/text-classification/run_glue.py:

  • Lines 379-385 -- LoRA weight loading with strict=False
  • Lines 572-589 -- Evaluation loop (including MNLI double evaluation)
  • Lines 515-526 -- compute_metrics function

CLI Signature

python -m torch.distributed.launch --nproc_per_node=<N> \
    examples/text-classification/run_glue.py \
    --model_name_or_path roberta-base \
    --task_name mnli --do_eval \
    --apply_lora --lora_r 8 --lora_alpha 16 \
    --lora_path ./output/roberta_base_mnli/model/lora_weights.pt
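
The script can also be launched as a plain single-process run for a quick single-GPU check (a minimal sketch; it assumes the same working directory, presumably examples/NLU given the relative path above, and ./output/roberta_base_mnli is a hypothetical --output_dir, an argument HuggingFace's TrainingArguments requires):

python examples/text-classification/run_glue.py \
    --model_name_or_path roberta-base \
    --task_name mnli --do_eval \
    --apply_lora --lora_r 8 --lora_alpha 16 \
    --lora_path ./output/roberta_base_mnli/model/lora_weights.pt \
    --output_dir ./output/roberta_base_mnli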

Key Flags

  • --do_eval -- Enable evaluation on the validation set
  • --apply_lora -- Inject LoRA layers into the model architecture
  • --lora_r -- LoRA rank (must match the trained checkpoint)
  • --lora_alpha -- LoRA alpha scaling (must match the trained checkpoint)
  • --lora_path -- Path to the LoRA weight file (.pt)
  • --model_name_or_path -- Base pretrained model identifier
  • --task_name -- GLUE task name (determines metric and dataset)

Input / Output

  • Input -- Base pretrained model (from the HuggingFace Hub) + LoRA checkpoint file (.pt) + GLUE validation dataset
  • Output -- Evaluation metrics dictionary (accuracy, F1, MCC, Pearson/Spearman, etc.)

LoRA Weight Loading (Lines 379-385)

When --apply_lora is set and --lora_path is provided, the script loads LoRA weights onto the model:

if model_args.apply_lora:
    if model_args.lora_path is not None:
        lora_state_dict = torch.load(model_args.lora_path)
        logger.info(f"Apply LoRA state dict from {model_args.lora_path}.")
        logger.info(lora_state_dict.keys())
        model.load_state_dict(lora_state_dict, strict=False)
    trainable_params.append('lora')

The strict=False parameter is essential because the LoRA checkpoint contains only:

  • LoRA matrices (lora_A, lora_B for each adapted layer)
  • Classifier head weights (not prefixed with roberta or deberta)

All other keys in the model's state dict (the pretrained backbone) are absent from the checkpoint, so load_state_dict leaves them untouched and they silently retain the values loaded by from_pretrained().
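
The effect is easy to reproduce on a toy module. The sketch below is illustrative, not from the repository: the Toy class, its dimensions, and the file name are made up, and it assumes the loralib package that this repository provides.

import torch
import torch.nn as nn
import loralib as lora  # the package shipped by the microsoft/LoRA repository

# Hypothetical stand-in for the adapted model: one LoRA-wrapped projection
# plus a classifier head, mimicking the checkpoint layout described above.
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.query = lora.Linear(768, 768, r=8, lora_alpha=16)  # adds lora_A / lora_B
        self.classifier = nn.Linear(768, 3)

model = Toy()

# Save only the LoRA matrices plus the (unprefixed) classifier head,
# matching the two bullet points above.
ckpt = lora.lora_state_dict(model)
ckpt.update({k: v for k, v in model.state_dict().items() if k.startswith("classifier")})
torch.save(ckpt, "lora_weights.pt")

# Reload with strict=False: backbone keys are reported as "missing" and keep
# their current values; only the keys present in the checkpoint are overwritten.
result = model.load_state_dict(torch.load("lora_weights.pt"), strict=False)
print(result.missing_keys)         # ['query.weight', 'query.bias']
assert not result.unexpected_keys  # non-empty would mean the checkpoint targets layers this model lacks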

Evaluation Loop (Lines 572-589)

The evaluation loop iterates over validation datasets, with special handling for MNLI:

if training_args.do_eval:
    logger.info("*** Evaluate ***")

    tasks = [data_args.task_name]
    eval_datasets = [eval_dataset]
    if data_args.task_name == "mnli":
        tasks.append("mnli-mm")
        eval_datasets.append(datasets["validation_mismatched"])

    for eval_dataset, task in zip(eval_datasets, tasks):
        metrics = trainer.evaluate(eval_dataset=eval_dataset)

        max_val_samples = (data_args.max_val_samples
                           if data_args.max_val_samples is not None
                           else len(eval_dataset))
        metrics["eval_samples"] = min(max_val_samples, len(eval_dataset))

        trainer.log_metrics("eval", metrics)
        trainer.save_metrics("eval", metrics)

For MNLI, this produces two separate evaluation result files:

  • eval_results_mnli.json -- Matched validation accuracy
  • eval_results_mnli-mm.json -- Mismatched validation accuracy
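
A short sketch of reading those metrics back (the output directory is hypothetical, and the exact file names may vary across transformers versions, so treat the names above as the documented convention):

import json
import os

output_dir = "./output/roberta_base_mnli"  # hypothetical --output_dir
for task in ("mnli", "mnli-mm"):
    path = os.path.join(output_dir, f"eval_results_{task}.json")
    with open(path) as f:
        metrics = json.load(f)
    # Trainer prefixes metric names with "eval_", e.g. eval_accuracy.
    print(task, metrics.get("eval_accuracy"), metrics.get("eval_samples"))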

compute_metrics Function (Lines 515-526)

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.squeeze(preds) if is_regression else np.argmax(preds, axis=1)
    if data_args.task_name is not None:
        result = metric.compute(predictions=preds, references=p.label_ids)
        if len(result) > 1:
            result["combined_score"] = np.mean(list(result.values())).item()
        return result
    elif is_regression:
        return {"mse": ((preds - p.label_ids) ** 2).mean().item()}
    else:
        return {"accuracy": (preds == p.label_ids).astype(np.float32).mean().item()}

Metric Dispatch by Task

  • CoLA -- metric.compute() returns {"matthews_correlation": float}
  • SST-2, QNLI, RTE, WNLI -- returns {"accuracy": float}
  • MRPC, QQP -- returns {"accuracy": float, "f1": float} plus "combined_score"
  • STS-B -- returns {"pearson": float, "spearmanr": float} plus "combined_score"
  • MNLI -- returns {"accuracy": float} for each of the matched and mismatched sets
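
As an illustration of the MRPC row, here is a minimal sketch with made-up predictions; the script builds its metric object the same way, via load_metric("glue", data_args.task_name):

import numpy as np
from datasets import load_metric

metric = load_metric("glue", "mrpc")
preds = np.array([1, 0, 1, 1])   # made-up predictions
labels = np.array([1, 0, 0, 1])  # made-up references

result = metric.compute(predictions=preds, references=labels)
# -> {'accuracy': 0.75, 'f1': 0.8}
if len(result) > 1:  # mirrors compute_metrics above
    result["combined_score"] = np.mean(list(result.values())).item()
print(result)  # {'accuracy': 0.75, 'f1': 0.8, 'combined_score': 0.775}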

Prediction Mode (Lines 591-617)

When --do_predict is used instead of --do_eval, the script generates predictions on the test set and writes them to a TSV file for GLUE leaderboard submission:

predictions = trainer.predict(test_dataset=test_dataset).predictions
predictions = np.squeeze(predictions) if is_regression else np.argmax(predictions, axis=1)

output_test_file = os.path.join(training_args.output_dir, f"test_results_{task}.txt")
with open(output_test_file, "w") as writer:
    writer.write("index\tprediction\n")
    for index, item in enumerate(predictions):
        if is_regression:
            writer.write(f"{index}\t{item:3.3f}\n")
        else:
            item = label_list[item]
            writer.write(f"{index}\t{item}\n")
