Principle:Microsoft LoRA LoRA Checkpoint Evaluation
Overview
LoRA Checkpoint Evaluation describes the process of evaluating trained LoRA checkpoints on GLUE validation sets using the `run_glue.py` script from the microsoft/LoRA repository. Evaluation involves loading the pretrained base model, injecting the LoRA architecture via configuration flags, loading LoRA-specific weights with `strict=False`, and computing task-specific metrics.
Loading Base Model + LoRA Weights
Evaluation of a LoRA checkpoint requires two distinct weight sources:
- Base model weights -- The original pretrained model (e.g., `roberta-base`) downloaded from the HuggingFace Hub, providing the frozen backbone parameters
- LoRA weights -- The trained low-rank adaptation matrices saved during training, loaded via `--lora_path`
The loading process works as follows:
1. `AutoModelForSequenceClassification.from_pretrained()` loads the full pretrained model with the LoRA architecture injected (controlled by `apply_lora=True` in the config)
2. `model.load_state_dict(lora_state_dict, strict=False)` overlays the LoRA weights onto the model
The `strict=False` flag is critical because the LoRA checkpoint contains only a subset of the model's parameters (the LoRA matrices plus the classifier head). Without `strict=False`, PyTorch would raise an error about missing keys for all of the pretrained backbone parameters.
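The effect of `strict=False` can be demonstrated with a toy module (a hypothetical stand-in for the real RoBERTa model, kept small so the behavior is easy to see):

```python
# Toy sketch: a checkpoint that covers only a subset of parameters loads
# cleanly with strict=False, while strict=True would raise on missing keys.
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 4)                # stands in for frozen pretrained weights
        self.lora_A = nn.Parameter(torch.zeros(2, 4))  # stands in for a LoRA matrix
        self.classifier = nn.Linear(4, 2)              # task head saved with the checkpoint

model = TinyModel()

# A LoRA-style checkpoint: only the adaptation matrix and the classifier head.
lora_state_dict = {
    "lora_A": torch.randn(2, 4),
    "classifier.weight": torch.randn(2, 4),
    "classifier.bias": torch.randn(2),
}

# strict=False tolerates the absent backbone.* keys and reports them instead.
result = model.load_state_dict(lora_state_dict, strict=False)
print(result.missing_keys)  # the backbone weights absent from the checkpoint
```

The same pattern applies at full scale: the pretrained backbone keys are already populated by `from_pretrained()`, so only the checkpoint's LoRA and classifier tensors need to be overlaid.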
Task-Specific Metrics
Each GLUE task uses a different evaluation metric, loaded from the HuggingFace datasets library:
| Task | Metric(s) | Type |
|---|---|---|
| CoLA | Matthews Correlation Coefficient (MCC) | Single metric |
| SST-2 | Accuracy | Single metric |
| MRPC | Accuracy + F1 | Combined score (average) |
| QQP | Accuracy + F1 | Combined score (average) |
| STS-B | Pearson Correlation + Spearman Correlation | Combined score (average) |
| MNLI | Matched Accuracy + Mismatched Accuracy | Two separate evaluations |
| QNLI | Accuracy | Single metric |
| RTE | Accuracy | Single metric |
| WNLI | Accuracy | Single metric |
Metric Computation
The `compute_metrics` function dispatches to the appropriate metric:

```python
def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.squeeze(preds) if is_regression else np.argmax(preds, axis=1)
    if data_args.task_name is not None:
        result = metric.compute(predictions=preds, references=p.label_ids)
        if len(result) > 1:
            result["combined_score"] = np.mean(list(result.values())).item()
        return result
```
For tasks with multiple metrics (MRPC, QQP, STS-B), a `combined_score` is computed as the arithmetic mean of all metric values.
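The combined-score logic can be exercised in isolation (the helper name below is illustrative, extracted from the branch shown above):

```python
# Sketch of the combined_score branch: when a task reports more than one
# metric, the arithmetic mean of all values is appended under "combined_score".
import numpy as np

def add_combined_score(result: dict) -> dict:
    if len(result) > 1:
        result["combined_score"] = np.mean(list(result.values())).item()
    return result

# e.g. MRPC reports both accuracy and F1
scores = add_combined_score({"accuracy": 0.90, "f1": 0.80})
print(scores["combined_score"])  # mean of accuracy and F1
```

A single-metric task such as SST-2 passes through unchanged, since `len(result)` is 1.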
MNLI Double Evaluation
MNLI is unique among GLUE tasks because it has two validation sets:
- validation_matched -- Examples from the same genres as the training data
- validation_mismatched -- Examples from different genres (cross-domain generalization)
The evaluation loop handles this by appending a second evaluation dataset:
```python
tasks = [data_args.task_name]
eval_datasets = [eval_dataset]
if data_args.task_name == "mnli":
    tasks.append("mnli-mm")
    eval_datasets.append(datasets["validation_mismatched"])

for eval_dataset, task in zip(eval_datasets, tasks):
    metrics = trainer.evaluate(eval_dataset=eval_dataset)
```
Evaluation Output
The Trainer's `evaluate()` method returns a metrics dictionary and writes results to:
- Console logs -- via `trainer.log_metrics("eval", metrics)`
- JSON file -- via `trainer.save_metrics("eval", metrics)`, saved to `output_dir/eval_results.json`
A typical evaluation output for MNLI looks like:
```
***** eval metrics *****
  eval_accuracy = 0.8751
  eval_loss     = 0.3824
  eval_samples  = 9815
```
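The saved JSON can be consumed by downstream tooling. A minimal sketch (the directory is a temporary stand-in for `output_dir`, and the metric values simply mirror the example above):

```python
# Sketch: round-tripping the flat JSON dict that save_metrics("eval", ...)
# writes to <output_dir>/eval_results.json. Paths and values are illustrative.
import json
import os
import tempfile

output_dir = tempfile.mkdtemp()  # stand-in for the training output_dir

# Simulate what save_metrics produces: a flat dict of eval_* keys.
metrics = {"eval_accuracy": 0.8751, "eval_loss": 0.3824, "eval_samples": 9815}
with open(os.path.join(output_dir, "eval_results.json"), "w") as f:
    json.dump(metrics, f, indent=4)

# A results-collection script can then load the metrics programmatically.
with open(os.path.join(output_dir, "eval_results.json")) as f:
    loaded = json.load(f)
print(loaded["eval_accuracy"])
```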
Evaluation vs. Prediction
The script distinguishes between:
- Evaluation (`--do_eval`) -- Computes metrics against gold labels on the validation set
- Prediction (`--do_predict`) -- Generates predictions on the test set (no gold labels available; writes predictions to `test_results_<task>.txt` for submission to the GLUE leaderboard)
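The submission file can be sketched as follows; the file-name pattern comes from the section above, but the exact header and the use of label names (rather than indices) are assumptions about typical `run_glue.py` output, not confirmed here:

```python
# Sketch of a GLUE-style prediction file: one tab-separated row per test
# example. Task name, label list, and predictions are illustrative.
import os
import tempfile

task = "rte"
label_list = ["entailment", "not_entailment"]
predictions = [0, 1, 1, 0]  # argmax class indices from the model

path = os.path.join(tempfile.mkdtemp(), f"test_results_{task}.txt")
with open(path, "w") as writer:
    writer.write("index\tprediction\n")
    for index, item in enumerate(predictions):
        # Map each class index back to its label string for the leaderboard.
        writer.write(f"{index}\t{label_list[item]}\n")
```

For a regression task such as STS-B, the prediction column would instead hold the raw float score.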
Metadata
| Field | Value |
|---|---|
| Source | Repo (microsoft/LoRA) |
| Domains | Evaluation, NLU, LoRA |
| Related | Implementation:Microsoft_LoRA_Run_GLUE_Evaluation |