# Implementation: Microsoft LoRA Run GLUE Evaluation
## Overview

Run GLUE Evaluation is a wrapper doc for the evaluation pipeline in `run_glue.py` from the microsoft/LoRA repository. It documents the LoRA weight loading mechanism, the evaluation loop for GLUE tasks, and the `compute_metrics` function that computes task-specific scores.
## Source File

| File | Lines | Description |
|---|---|---|
| `examples/NLU/examples/text-classification/run_glue.py` | 379-385 | LoRA weight loading with `strict=False` |
| `examples/NLU/examples/text-classification/run_glue.py` | 572-589 | Evaluation loop (including MNLI double evaluation) |
| `examples/NLU/examples/text-classification/run_glue.py` | 515-526 | `compute_metrics` function |
## CLI Signature

```shell
python -m torch.distributed.launch --nproc_per_node=<N> \
    examples/text-classification/run_glue.py \
    --model_name_or_path roberta-base \
    --task_name mnli --do_eval \
    --apply_lora --lora_r 8 --lora_alpha 16 \
    --lora_path ./output/roberta_base_mnli/model/lora_weights.pt
```
## Key Flags

| Flag | Description |
|---|---|
| `--do_eval` | Enable evaluation on the validation set |
| `--apply_lora` | Inject LoRA layers into the model architecture |
| `--lora_r` | LoRA rank (must match the trained checkpoint) |
| `--lora_alpha` | LoRA alpha scaling (must match the trained checkpoint) |
| `--lora_path` | Path to the LoRA weight file (`.pt`) |
| `--model_name_or_path` | Base pretrained model identifier |
| `--task_name` | GLUE task name (determines metric and dataset) |
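As an illustrative aside, the rank/alpha pair determines the adapter's effective scaling (`lora_alpha / lora_r` in standard LoRA), which is why both flags must match the values used at training time. The parser below is a hypothetical stand-in for demonstration, not the script's actual argument handling:

```python
# Hypothetical mini-parser illustrating how the LoRA flags relate.
# (The real script parses these via HuggingFace dataclass arguments.)
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--apply_lora", action="store_true")
parser.add_argument("--lora_r", type=int, default=8)
parser.add_argument("--lora_alpha", type=int, default=16)
parser.add_argument("--lora_path", type=str, default=None)

args = parser.parse_args(["--apply_lora", "--lora_r", "8", "--lora_alpha", "16"])

# The effective LoRA update is scaled by alpha / r, so a checkpoint
# trained with r=8, alpha=16 is only reproduced with the same pair.
scaling = args.lora_alpha / args.lora_r
print(scaling)  # 2.0
```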
## Input / Output

| Direction | Description |
|---|---|
| Input | Base pretrained model (from the HuggingFace Hub) + LoRA checkpoint file (`.pt`) + GLUE validation dataset |
| Output | Evaluation metrics dictionary (accuracy, F1, MCC, Pearson/Spearman, etc.) |
## LoRA Weight Loading (Lines 379-385)
When --apply_lora is set and --lora_path is provided, the script loads LoRA weights onto the model:
```python
if model_args.apply_lora:
    if model_args.lora_path is not None:
        lora_state_dict = torch.load(model_args.lora_path)
        logger.info(f"Apply LoRA state dict from {model_args.lora_path}.")
        logger.info(lora_state_dict.keys())
        model.load_state_dict(lora_state_dict, strict=False)
    trainable_params.append('lora')
```
The `strict=False` parameter is essential because the LoRA checkpoint contains only:

- LoRA matrices (`lora_A`, `lora_B` for each adapted layer)
- Classifier head weights (not prefixed with `roberta` or `deberta`)

All other keys in the model's state dict (the pretrained backbone) are missing from the checkpoint and are silently retained from `from_pretrained()`.
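A minimal sketch (with made-up module names) of why `strict=False` matters: loading a partial state dict touches only the keys present in the checkpoint, and `load_state_dict` reports the untouched backbone keys as missing rather than raising:

```python
# Toy model standing in for "pretrained backbone + LoRA adapters".
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 4)                 # pretrained body
        self.lora_A = nn.Parameter(torch.zeros(2, 4))   # adapter matrix A
        self.lora_B = nn.Parameter(torch.zeros(4, 2))   # adapter matrix B

model = TinyModel()

# A checkpoint holding only adapter keys, like a LoRA .pt file.
ckpt = {"lora_A": torch.ones(2, 4), "lora_B": torch.ones(4, 2)}

result = model.load_state_dict(ckpt, strict=False)
# The backbone keys are reported as missing but keep their current
# (here: randomly initialized; in run_glue.py: pretrained) values.
print(sorted(result.missing_keys))
print(result.unexpected_keys)
```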
## Evaluation Loop (Lines 572-589)
The evaluation loop iterates over validation datasets, with special handling for MNLI:
```python
if training_args.do_eval:
    logger.info("*** Evaluate ***")
    tasks = [data_args.task_name]
    eval_datasets = [eval_dataset]
    if data_args.task_name == "mnli":
        tasks.append("mnli-mm")
        eval_datasets.append(datasets["validation_mismatched"])

    for eval_dataset, task in zip(eval_datasets, tasks):
        metrics = trainer.evaluate(eval_dataset=eval_dataset)
        max_val_samples = (
            data_args.max_val_samples
            if data_args.max_val_samples is not None
            else len(eval_dataset)
        )
        metrics["eval_samples"] = min(max_val_samples, len(eval_dataset))
        trainer.log_metrics("eval", metrics)
        trainer.save_metrics("eval", metrics)
```
For MNLI, this produces two separate evaluation result files:

- `eval_results_mnli.json` -- matched validation accuracy
- `eval_results_mnli-mm.json` -- mismatched validation accuracy
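The matched/mismatched pairing can be sketched with plain Python; the dataset contents and result values below are placeholders, not real evaluation data:

```python
# Stand-in for the MNLI double-evaluation loop: each dataset is zipped
# with its task name, producing one result entry per validation split.
tasks = ["mnli", "mnli-mm"]
eval_sets = [["ex1", "ex2", "ex3"], ["ex4", "ex5"]]  # fake examples

results = {}
for ds, task in zip(eval_sets, tasks):
    # trainer.evaluate(...) would go here; we record only the sample count.
    results[f"eval_results_{task}"] = {"eval_samples": len(ds)}

print(sorted(results))  # ['eval_results_mnli', 'eval_results_mnli-mm']
```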
## compute_metrics Function (Lines 515-526)

```python
def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.squeeze(preds) if is_regression else np.argmax(preds, axis=1)
    if data_args.task_name is not None:
        result = metric.compute(predictions=preds, references=p.label_ids)
        if len(result) > 1:
            result["combined_score"] = np.mean(list(result.values())).item()
        return result
    elif is_regression:
        return {"mse": ((preds - p.label_ids) ** 2).mean().item()}
    else:
        return {"accuracy": (preds == p.label_ids).astype(np.float32).mean().item()}
```
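The `combined_score` branch can be exercised in isolation; the metric values below are illustrative, not real MRPC results:

```python
# For multi-metric tasks (e.g. MRPC's accuracy + F1), compute_metrics
# averages all metric values into a single "combined_score".
import numpy as np

result = {"accuracy": 0.88, "f1": 0.91}  # made-up values
if len(result) > 1:
    result["combined_score"] = np.mean(list(result.values())).item()

print(result["combined_score"])  # mean of 0.88 and 0.91
```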
## Metric Dispatch by Task

- CoLA -- `metric.compute()` returns `{"matthews_correlation": float}`
- SST-2, QNLI, RTE, WNLI -- returns `{"accuracy": float}`
- MRPC, QQP -- returns `{"accuracy": float, "f1": float}` plus `"combined_score"`
- STS-B -- returns `{"pearson": float, "spearmanr": float}` plus `"combined_score"`
- MNLI -- returns `{"accuracy": float}` for each of the matched and mismatched sets
## Prediction Mode (Lines 591-617)
When --do_predict is used instead of --do_eval, the script generates predictions on the test set and writes them to a TSV file for GLUE leaderboard submission:
```python
predictions = trainer.predict(test_dataset=test_dataset).predictions
predictions = np.squeeze(predictions) if is_regression else np.argmax(predictions, axis=1)

output_test_file = os.path.join(training_args.output_dir, f"test_results_{task}.txt")
with open(output_test_file, "w") as writer:
    writer.write("index\tprediction\n")
    for index, item in enumerate(predictions):
        if is_regression:
            writer.write(f"{index}\t{item:3.3f}\n")
        else:
            item = label_list[item]
            writer.write(f"{index}\t{item}\n")
```
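The resulting TSV layout can be sketched with an in-memory buffer; the label names and prediction indices below are invented for illustration:

```python
# Classification branch of the prediction writer: class indices are
# mapped to label strings and written as index/prediction pairs.
import io

label_list = ["entailment", "contradiction", "neutral"]  # fake label set
predictions = [0, 2, 1]                                  # fake argmax output

buf = io.StringIO()
buf.write("index\tprediction\n")
for index, item in enumerate(predictions):
    buf.write(f"{index}\t{label_list[item]}\n")

print(buf.getvalue())
```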