Principle:Microsoft LoRA LoRA Checkpoint Evaluation
Overview
LoRA Checkpoint Evaluation describes the process of evaluating trained LoRA checkpoints on GLUE validation sets using the `run_glue.py` script from the microsoft/LoRA repository. Evaluation involves loading the pretrained base model, injecting the LoRA architecture via configuration flags, loading LoRA-specific weights with `strict=False`, and computing task-specific metrics.
Loading Base Model + LoRA Weights
Evaluation of a LoRA checkpoint requires two distinct weight sources:
- Base model weights -- The original pretrained model (e.g., `roberta-base`) downloaded from the HuggingFace Hub, providing the frozen backbone parameters
- LoRA weights -- The trained low-rank adaptation matrices saved during training, loaded via `--lora_path`
The loading process works as follows:
1. `AutoModelForSequenceClassification.from_pretrained()` loads the full pretrained model with the LoRA architecture injected (controlled by `apply_lora=True` in the config)
2. `model.load_state_dict(lora_state_dict, strict=False)` overlays the LoRA weights onto the model
The `strict=False` flag is critical because the LoRA checkpoint contains only a subset of the model's parameters (the LoRA matrices plus the classifier head). Without `strict=False`, PyTorch would raise an error about missing keys for all of the pretrained backbone parameters.
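The effect of `strict=False` can be demonstrated with a toy module (a hypothetical stand-in for the real RoBERTa model, kept small so the behavior is easy to see):

```python
# Toy sketch: a checkpoint that covers only a subset of parameters loads
# cleanly with strict=False, while strict=True would raise on missing keys.
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 4)                # stands in for frozen pretrained weights
        self.lora_A = nn.Parameter(torch.zeros(2, 4))  # stands in for a LoRA matrix
        self.classifier = nn.Linear(4, 2)              # task head saved with the checkpoint

model = TinyModel()

# A LoRA-style checkpoint: only the adaptation matrix and the classifier head.
lora_state_dict = {
    "lora_A": torch.randn(2, 4),
    "classifier.weight": torch.randn(2, 4),
    "classifier.bias": torch.randn(2),
}

# strict=False tolerates the absent backbone.* keys and reports them instead.
result = model.load_state_dict(lora_state_dict, strict=False)
print(result.missing_keys)  # the backbone weights absent from the checkpoint
```

The same pattern applies at full scale: the pretrained backbone keys are already populated by `from_pretrained()`, so only the checkpoint's LoRA and classifier tensors need to be overlaid.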
Task-Specific Metrics
Each GLUE task uses a different evaluation metric, loaded from the HuggingFace datasets library:
| Task | Metric(s) | Type |
|---|---|---|
| CoLA | Matthews Correlation Coefficient (MCC) | Single metric |
| SST-2 | Accuracy | Single metric |
| MRPC | Accuracy + F1 | Combined score (average) |
| QQP | Accuracy + F1 | Combined score (average) |
| STS-B | Pearson Correlation + Spearman Correlation | Combined score (average) |
| MNLI | Matched Accuracy + Mismatched Accuracy | Two separate evaluations |
| QNLI | Accuracy | Single metric |
| RTE | Accuracy | Single metric |
| WNLI | Accuracy | Single metric |
Metric Computation
The `compute_metrics` function dispatches to the appropriate metric:

```python
def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.squeeze(preds) if is_regression else np.argmax(preds, axis=1)
    if data_args.task_name is not None:
        result = metric.compute(predictions=preds, references=p.label_ids)
        if len(result) > 1:
            result["combined_score"] = np.mean(list(result.values())).item()
        return result
```
For tasks with multiple metrics (MRPC, QQP, STS-B), a `combined_score` is computed as the arithmetic mean of all metric values.
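The combined-score logic can be exercised in isolation (the helper name below is illustrative, extracted from the branch shown above):

```python
# Sketch of the combined_score branch: when a task reports more than one
# metric, the arithmetic mean of all values is appended under "combined_score".
import numpy as np

def add_combined_score(result: dict) -> dict:
    if len(result) > 1:
        result["combined_score"] = np.mean(list(result.values())).item()
    return result

# e.g. MRPC reports both accuracy and F1
scores = add_combined_score({"accuracy": 0.90, "f1": 0.80})
print(scores["combined_score"])  # mean of accuracy and F1
```

A single-metric task such as SST-2 passes through unchanged, since `len(result)` is 1.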
MNLI Double Evaluation
MNLI is unique among GLUE tasks because it has two validation sets:
- validation_matched -- Examples from the same genres as the training data
- validation_mismatched -- Examples from different genres (cross-domain generalization)
The evaluation loop handles this by appending a second evaluation dataset:
```python
tasks = [data_args.task_name]
eval_datasets = [eval_dataset]
if data_args.task_name == "mnli":
    tasks.append("mnli-mm")
    eval_datasets.append(datasets["validation_mismatched"])

for eval_dataset, task in zip(eval_datasets, tasks):
    metrics = trainer.evaluate(eval_dataset=eval_dataset)
```
Evaluation Output
The Trainer's `evaluate()` method returns a metrics dictionary and writes results to:
- Console logs -- via `trainer.log_metrics("eval", metrics)`
- JSON file -- via `trainer.save_metrics("eval", metrics)`, saved to `output_dir/eval_results.json`
A typical evaluation output for MNLI looks like:
```
***** eval metrics *****
  eval_accuracy = 0.8751
  eval_loss     = 0.3824
  eval_samples  = 9815
```
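The saved JSON can be consumed by downstream tooling. A minimal sketch (the directory is a temporary stand-in for `output_dir`, and the metric values simply mirror the example above):

```python
# Sketch: round-tripping the flat JSON dict that save_metrics("eval", ...)
# writes to <output_dir>/eval_results.json. Paths and values are illustrative.
import json
import os
import tempfile

output_dir = tempfile.mkdtemp()  # stand-in for the training output_dir

# Simulate what save_metrics produces: a flat dict of eval_* keys.
metrics = {"eval_accuracy": 0.8751, "eval_loss": 0.3824, "eval_samples": 9815}
with open(os.path.join(output_dir, "eval_results.json"), "w") as f:
    json.dump(metrics, f, indent=4)

# A results-collection script can then load the metrics programmatically.
with open(os.path.join(output_dir, "eval_results.json")) as f:
    loaded = json.load(f)
print(loaded["eval_accuracy"])
```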
Evaluation vs. Prediction
The script distinguishes between:
- Evaluation (`--do_eval`) -- Computes metrics against gold labels on the validation set
- Prediction (`--do_predict`) -- Generates predictions on the test set (no gold labels available; writes predictions to `test_results_<task>.txt` for submission to the GLUE leaderboard)
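The submission file can be sketched as follows; the file-name pattern comes from the section above, but the exact header and the use of label names (rather than indices) are assumptions about typical `run_glue.py` output, not confirmed here:

```python
# Sketch of a GLUE-style prediction file: one tab-separated row per test
# example. Task name, label list, and predictions are illustrative.
import os
import tempfile

task = "rte"
label_list = ["entailment", "not_entailment"]
predictions = [0, 1, 1, 0]  # argmax class indices from the model

path = os.path.join(tempfile.mkdtemp(), f"test_results_{task}.txt")
with open(path, "w") as writer:
    writer.write("index\tprediction\n")
    for index, item in enumerate(predictions):
        # Map each class index back to its label string for the leaderboard.
        writer.write(f"{index}\t{label_list[item]}\n")
```

For a regression task such as STS-B, the prediction column would instead hold the raw float score.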
Metadata
| Field | Value |
|---|---|
| Source | Repo (microsoft/LoRA) |
| Domains | Evaluation, NLU, LoRA |
| Related | Implementation:Microsoft_LoRA_Run_GLUE_Evaluation |