Principle:Microsoft LoRA GLUE Data Download
Overview
GLUE Data Download describes the process of downloading and preparing the GLUE (General Language Understanding Evaluation) benchmark datasets for NLU evaluation in the microsoft/LoRA repository. The GLUE benchmark (Wang et al., 2018) is the primary evaluation suite used to demonstrate LoRA's effectiveness on natural language understanding tasks.
The download script retrieves all nine GLUE tasks from the Firebase-hosted MTL Sentence Representations storage, extracting them into a local directory structure that is compatible with both the HuggingFace Datasets library and direct TSV file loading.
The Nine GLUE Tasks
The GLUE benchmark comprises nine single-sentence or sentence-pair tasks (eight classification tasks and one regression task):
Single-Sentence Tasks
- CoLA (Corpus of Linguistic Acceptability) -- Binary classification of whether an English sentence is grammatically acceptable. Evaluated using Matthews Correlation Coefficient (MCC). Approximately 8.5K training examples.
- SST-2 (Stanford Sentiment Treebank) -- Binary sentiment classification of movie review sentences as positive or negative. Evaluated using accuracy. Approximately 67K training examples.
Similarity and Paraphrase Tasks
- MRPC (Microsoft Research Paraphrase Corpus) -- Binary classification of whether two sentences are semantically equivalent. Evaluated using accuracy and F1 score. Approximately 3.7K training examples.
- QQP (Quora Question Pairs) -- Binary classification of whether two questions are semantically equivalent. Evaluated using accuracy and F1 score. Approximately 364K training examples.
- STS-B (Semantic Textual Similarity Benchmark) -- Regression task predicting similarity scores between 0 and 5 for sentence pairs. Evaluated using Pearson correlation and Spearman correlation. Approximately 5.7K training examples.
Natural Language Inference Tasks
- MNLI (Multi-Genre Natural Language Inference) -- Three-way classification (entailment, contradiction, neutral) of premise-hypothesis pairs drawn from ten genres. Evaluated using matched accuracy and mismatched accuracy (in-domain and cross-domain). Approximately 393K training examples.
- QNLI (Question Natural Language Inference) -- Binary classification of whether a context sentence contains the answer to a question. Derived from the Stanford Question Answering Dataset. Evaluated using accuracy. Approximately 105K training examples.
- RTE (Recognizing Textual Entailment) -- Binary entailment classification, aggregated from RTE1, RTE2, RTE3, and RTE5 challenges. Evaluated using accuracy. Approximately 2.5K training examples.
- WNLI (Winograd Natural Language Inference) -- Binary entailment classification based on Winograd Schema Challenge sentences. Evaluated using accuracy. Approximately 634 training examples.
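The task inventory above can be collected into a small lookup table. This is a hedged summary sketch: the counts and metric labels are taken from the descriptions above, and the key names and dictionary layout are illustrative, not identifiers from any particular library.

```python
# Summary of the nine GLUE tasks as described above. Metric strings are
# descriptive labels, not exact metric identifiers from an evaluation library.
GLUE_TASKS = {
    "cola": {"n_train": 8_500,   "metric": "matthews_correlation",          "n_sentences": 1},
    "sst2": {"n_train": 67_000,  "metric": "accuracy",                      "n_sentences": 1},
    "mrpc": {"n_train": 3_700,   "metric": "accuracy/f1",                   "n_sentences": 2},
    "qqp":  {"n_train": 364_000, "metric": "accuracy/f1",                   "n_sentences": 2},
    "stsb": {"n_train": 5_700,   "metric": "pearson/spearman",              "n_sentences": 2},
    "mnli": {"n_train": 393_000, "metric": "matched/mismatched accuracy",   "n_sentences": 2},
    "qnli": {"n_train": 105_000, "metric": "accuracy",                      "n_sentences": 2},
    "rte":  {"n_train": 2_500,   "metric": "accuracy",                      "n_sentences": 2},
    "wnli": {"n_train": 634,     "metric": "accuracy",                      "n_sentences": 2},
}
```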
Data Format
All GLUE tasks are distributed as TSV (tab-separated values) files with the following structure:
- A header row naming the columns
- One example per line, with fields separated by tabs
- A label column containing the classification (or, for STS-B, regression) target
- One or two sentence columns, depending on the task
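The format above can be read with Python's standard csv module. This is a minimal sketch: the column names (sentence, label) follow SST-2, and other tasks use different headers. Note `QUOTE_NONE`, since GLUE TSVs contain unescaped quote characters inside sentences.

```python
import csv
import io

# Sample SST-2-style TSV split: header row, then one tab-separated example per line.
sample = "sentence\tlabel\nit 's a charming film .\t1\nan utter mess\t0\n"

# QUOTE_NONE prevents the csv module from treating stray quotes in sentences
# as field delimiters, which would silently corrupt rows.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)
print(rows[0]["sentence"], rows[0]["label"])  # → it 's a charming film . 1
```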
The download script creates a directory per task under the specified data_dir:
```
glue_data/
    CoLA/
        train.tsv
        dev.tsv
        test.tsv
    SST-2/
        train.tsv
        dev.tsv
        test.tsv
    MRPC/
        train.tsv
        dev.tsv
        test.tsv
    ...
```
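The per-task layout can be sketched as a small helper. This is illustrative only: the function and its parameter names are assumptions, not code from the actual download script.

```python
import os
import tempfile

def make_task_dirs(data_dir, tasks=("CoLA", "SST-2", "MRPC")):
    """Create one directory per GLUE task under data_dir, as in the tree above."""
    created = []
    for task in tasks:
        path = os.path.join(data_dir, task)
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created

# Usage against a throwaway directory:
with tempfile.TemporaryDirectory() as tmp:
    paths = make_task_dirs(tmp)
    print([os.path.basename(p) for p in paths])  # → ['CoLA', 'SST-2', 'MRPC']
```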
MRPC Special Handling
The MRPC dataset requires special treatment because its original corpus is distributed as a Microsoft MSI installer rather than a plain archive. The download script:
- Downloads the paraphrase training and test data from Facebook's SentEval mirrors
- Downloads development set IDs from the Firebase GLUE storage
- Splits the training data into train and dev sets based on the development IDs
- Reformats the test file with a simplified header
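The train/dev split step can be sketched as follows. This is a hedged illustration, not the script's actual code: it assumes rows follow the MSRP TSV column order (label, id1, id2, sentence1, sentence2) and that dev_ids is a set of (id1, id2) pairs read from the downloaded development-ID file.

```python
def split_mrpc(train_rows, dev_ids):
    """Route each row to the dev set if its (id1, id2) pair appears in dev_ids.

    Assumes MSRP column order: label, id1, id2, sentence1, sentence2.
    """
    train, dev = [], []
    for row in train_rows:
        id1, id2 = row[1], row[2]
        (dev if (id1, id2) in dev_ids else train).append(row)
    return train, dev
```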
If a local copy of MRPC data is available (from manual extraction of the MSI file), it can be provided via the --path_to_mrpc flag.
Usage in LoRA Experiments
In the LoRA NLU workflow, the GLUE data is typically downloaded once and then accessed via the HuggingFace datasets library in run_glue.py:
```python
from datasets import load_dataset

datasets = load_dataset("glue", data_args.task_name)
```
The load_dataset call downloads from the HuggingFace Hub by default, but the local download script provides an alternative for offline or air-gapped environments.
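For such offline environments, the locally downloaded TSVs can be loaded through the generic "csv" builder instead of the Hub. The builder name, `data_files`, and `delimiter` arguments are real features of the HuggingFace datasets library; the function below and its paths are an illustrative sketch, not part of the LoRA repository.

```python
def load_local_sst2(data_dir="glue_data"):
    """Load SST-2 from local TSVs, assuming the directory layout shown earlier."""
    # Imported lazily so the sketch can be read without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "csv",
        data_files={
            "train": f"{data_dir}/SST-2/train.tsv",
            "validation": f"{data_dir}/SST-2/dev.tsv",
        },
        delimiter="\t",
    )
```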
Metadata
| Field | Value |
|---|---|
| Source | Repo (microsoft/LoRA) |
| Domains | Data, NLU |
| Related | Implementation:Microsoft_LoRA_Download_GLUE_Data |