Principle:Microsoft LoRA LoRA Checkpoint Evaluation

From Leeroopedia


Overview

LoRA Checkpoint Evaluation describes the process of evaluating trained LoRA checkpoints on GLUE validation sets using the run_glue.py script from the microsoft/LoRA repository. Evaluation involves loading the pretrained base model, injecting LoRA architecture via configuration flags, loading LoRA-specific weights with strict=False, and computing task-specific metrics.

Loading Base Model + LoRA Weights

Evaluation of a LoRA checkpoint requires two distinct weight sources:

  • Base model weights -- The original pretrained model (e.g., roberta-base) downloaded from HuggingFace Hub, providing the frozen backbone parameters
  • LoRA weights -- The trained low-rank adaptation matrices saved during training, loaded via --lora_path

The loading process works as follows:

  • AutoModelForSequenceClassification.from_pretrained() loads the full pretrained model with LoRA architecture injected (controlled by apply_lora=True in the config)
  • model.load_state_dict(lora_state_dict, strict=False) overlays the LoRA weights onto the model

The strict=False flag is critical because the LoRA checkpoint contains only a subset of the model's parameters (LoRA matrices plus the classifier head). Without strict=False, PyTorch would raise an error about missing keys for all the pretrained backbone parameters.
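A minimal sketch of this two-stage load (the checkpoint path and num_labels are illustrative; the LoRA-aware model classes come from the microsoft/LoRA fork, so stock transformers alone would not create the LoRA matrices):

import torch
from transformers import AutoConfig, AutoModelForSequenceClassification

# In the microsoft/LoRA fork, the config carries apply_lora=True (plus rank/alpha),
# so from_pretrained() builds the backbone with LoRA matrices already injected.
config = AutoConfig.from_pretrained("roberta-base", num_labels=3)  # num_labels illustrative (MNLI)
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", config=config)

# The LoRA checkpoint holds only the low-rank matrices and the classifier head.
# "lora_checkpoint.bin" is an illustrative stand-in for the --lora_path argument.
lora_state_dict = torch.load("lora_checkpoint.bin", map_location="cpu")

# strict=False: the many "missing" backbone keys are expected and left untouched.
result = model.load_state_dict(lora_state_dict, strict=False)
print(len(result.missing_keys), len(result.unexpected_keys))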

Task-Specific Metrics

Each GLUE task uses a different evaluation metric, loaded from the HuggingFace datasets library:

Task     Metric(s)                                     Type
CoLA     Matthews Correlation Coefficient (MCC)        Single metric
SST-2    Accuracy                                      Single metric
MRPC     Accuracy + F1                                 Combined score (average)
QQP      Accuracy + F1                                 Combined score (average)
STS-B    Pearson Correlation + Spearman Correlation    Combined score (average)
MNLI     Matched Accuracy + Mismatched Accuracy        Two separate evaluations
QNLI     Accuracy                                      Single metric
RTE      Accuracy                                      Single metric
WNLI     Accuracy                                      Single metric
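
For reference, a sketch of how a task's metric object is obtained from the datasets library (load_metric is the API used by this generation of run_glue.py; MRPC is used here only as an example):

from datasets import load_metric

# The second argument selects the task-specific metric bundle,
# e.g. accuracy + F1 for MRPC, Matthews correlation for CoLA.
metric = load_metric("glue", "mrpc")

result = metric.compute(predictions=[1, 0, 1], references=[1, 0, 0])
# result holds both metrics, e.g. {'accuracy': 0.667, 'f1': 0.667} (approximately)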

Metric Computation

The compute_metrics function dispatches to the appropriate metric:

import numpy as np
from transformers import EvalPrediction

def compute_metrics(p: EvalPrediction):
    # Predictions may arrive as a tuple (logits, extra outputs); keep only the logits
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    # STS-B is a regression task; all other GLUE tasks are classification
    preds = np.squeeze(preds) if is_regression else np.argmax(preds, axis=1)
    if data_args.task_name is not None:
        result = metric.compute(predictions=preds, references=p.label_ids)
        # Tasks reporting multiple metrics also get their arithmetic mean
        if len(result) > 1:
            result["combined_score"] = np.mean(list(result.values())).item()
        return result

For tasks with multiple metrics (MRPC, QQP, STS-B), a combined_score is computed as the arithmetic mean of all metric values.
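
A worked example of the combined score for MRPC (the metric values are illustrative):

import numpy as np

result = {"accuracy": 0.88, "f1": 0.91}  # illustrative MRPC metric values
result["combined_score"] = np.mean(list(result.values())).item()
# combined_score == 0.895, the arithmetic mean of accuracy and F1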

MNLI Double Evaluation

MNLI is unique among GLUE tasks because it has two validation sets:

  • validation_matched -- Examples from the same genres as the training data
  • validation_mismatched -- Examples from different genres (cross-domain generalization)

The evaluation loop handles this by appending a second evaluation dataset:

tasks = [data_args.task_name]
eval_datasets = [eval_dataset]
if data_args.task_name == "mnli":
    # Add the cross-genre validation split as a second evaluation pass
    tasks.append("mnli-mm")
    eval_datasets.append(datasets["validation_mismatched"])

# Each (dataset, task) pair is evaluated and logged separately
for eval_dataset, task in zip(eval_datasets, tasks):
    metrics = trainer.evaluate(eval_dataset=eval_dataset)

Evaluation Output

The Trainer's evaluate() method returns a metrics dictionary, which the script then writes to:

  • Console logs -- Via trainer.log_metrics("eval", metrics)
  • JSON file -- Via trainer.save_metrics("eval", metrics), saved to output_dir/eval_results.json
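
A condensed sketch of the calls made after each evaluate() pass (the eval_samples bookkeeping shown here is an assumption modeled on the standard run_glue.py; the Trainer methods themselves are as named above):

# One pass per (eval_dataset, task) pair from the evaluation loop
metrics = trainer.evaluate(eval_dataset=eval_dataset)
metrics["eval_samples"] = len(eval_dataset)  # assumed bookkeeping, as in the stock script

trainer.log_metrics("eval", metrics)   # prints the "***** eval metrics *****" block
trainer.save_metrics("eval", metrics)  # writes eval_results.json under output_dir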

A typical evaluation output for MNLI looks like:

***** eval metrics *****
  eval_accuracy           =     0.8751
  eval_loss               =     0.3824
  eval_samples            =      9815

Evaluation vs. Prediction

The script distinguishes between:

  • Evaluation (--do_eval) -- Computes metrics against gold labels on the validation set
  • Prediction (--do_predict) -- Generates predictions on the test set (no gold labels available; writes predictions to test_results_<task>.txt for submission to the GLUE leaderboard)
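
A hedged sketch of the prediction branch for a single task (the file layout follows the test_results_<task>.txt convention above; label_list and the exact bookkeeping are assumptions based on the standard run_glue.py):

import os
import numpy as np

# No gold labels here: convert logits to class ids (or squeeze for the STS-B regression task)
predictions = trainer.predict(test_dataset=test_dataset).predictions
predictions = np.squeeze(predictions) if is_regression else np.argmax(predictions, axis=1)

output_test_file = os.path.join(training_args.output_dir, f"test_results_{task}.txt")
with open(output_test_file, "w") as writer:
    writer.write("index\tprediction\n")
    for index, item in enumerate(predictions):
        # label_list (assumed) maps class ids back to label strings for the GLUE submission format
        item = item if is_regression else label_list[item]
        writer.write(f"{index}\t{item}\n")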

Metadata

Field     Value
Source    Repo (microsoft/LoRA)
Domains   Evaluation, NLU, LoRA
Related   Implementation:Microsoft_LoRA_Run_GLUE_Evaluation
