Principle:Microsoft LoRA GLUE Data Download
Overview
GLUE Data Download describes the process of downloading and preparing the GLUE (General Language Understanding Evaluation) benchmark datasets for NLU evaluation in the microsoft/LoRA repository. The GLUE benchmark (Wang et al., 2018) is the primary evaluation suite used to demonstrate LoRA's effectiveness on natural language understanding tasks.
The download script retrieves all nine GLUE tasks from the Firebase-hosted MTL Sentence Representations storage, extracting them into a local directory structure that is compatible with both the HuggingFace Datasets library and direct TSV file loading.
The Nine GLUE Tasks
The GLUE benchmark comprises nine single-sentence or sentence-pair tasks (eight classification tasks and one regression task):
Single-Sentence Tasks
- CoLA (Corpus of Linguistic Acceptability) -- Binary classification of whether an English sentence is grammatically acceptable. Evaluated using Matthews Correlation Coefficient (MCC). Approximately 8.5K training examples.
- SST-2 (Stanford Sentiment Treebank) -- Binary sentiment classification of movie review sentences as positive or negative. Evaluated using accuracy. Approximately 67K training examples.
Similarity and Paraphrase Tasks
- MRPC (Microsoft Research Paraphrase Corpus) -- Binary classification of whether two sentences are semantically equivalent. Evaluated using accuracy and F1 score. Approximately 3.7K training examples.
- QQP (Quora Question Pairs) -- Binary classification of whether two questions are semantically equivalent. Evaluated using accuracy and F1 score. Approximately 364K training examples.
- STS-B (Semantic Textual Similarity Benchmark) -- Regression task predicting similarity scores between 0 and 5 for sentence pairs. Evaluated using Pearson correlation and Spearman correlation. Approximately 5.7K training examples.
Natural Language Inference Tasks
- MNLI (Multi-Genre Natural Language Inference) -- Three-way classification (entailment, contradiction, neutral) of premise-hypothesis pairs drawn from ten genres. Evaluated using matched accuracy and mismatched accuracy (in-domain and cross-domain). Approximately 393K training examples.
- QNLI (Question Natural Language Inference) -- Binary classification of whether a context sentence contains the answer to a question. Derived from the Stanford Question Answering Dataset. Evaluated using accuracy. Approximately 105K training examples.
- RTE (Recognizing Textual Entailment) -- Binary entailment classification, aggregated from RTE1, RTE2, RTE3, and RTE5 challenges. Evaluated using accuracy. Approximately 2.5K training examples.
- WNLI (Winograd Natural Language Inference) -- Binary entailment classification based on Winograd Schema Challenge sentences. Evaluated using accuracy. Approximately 634 training examples.
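The task inventory above can be collected into a small lookup table. This is a hedged summary sketch: the counts and metric labels are taken from the descriptions above, and the key names and dictionary layout are illustrative, not identifiers from any particular library.

```python
# Summary of the nine GLUE tasks as described above. Metric strings are
# descriptive labels, not exact metric identifiers from an evaluation library.
GLUE_TASKS = {
    "cola": {"n_train": 8_500,   "metric": "matthews_correlation",          "n_sentences": 1},
    "sst2": {"n_train": 67_000,  "metric": "accuracy",                      "n_sentences": 1},
    "mrpc": {"n_train": 3_700,   "metric": "accuracy/f1",                   "n_sentences": 2},
    "qqp":  {"n_train": 364_000, "metric": "accuracy/f1",                   "n_sentences": 2},
    "stsb": {"n_train": 5_700,   "metric": "pearson/spearman",              "n_sentences": 2},
    "mnli": {"n_train": 393_000, "metric": "matched/mismatched accuracy",   "n_sentences": 2},
    "qnli": {"n_train": 105_000, "metric": "accuracy",                      "n_sentences": 2},
    "rte":  {"n_train": 2_500,   "metric": "accuracy",                      "n_sentences": 2},
    "wnli": {"n_train": 634,     "metric": "accuracy",                      "n_sentences": 2},
}
```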
Data Format
All GLUE tasks are distributed as TSV (tab-separated values) files with the following structure:
- A header row naming the columns
- One example per line, with fields separated by tabs
- A label column containing the classification (or, for STS-B, regression) target
- One or two sentence columns, depending on the task
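The format above can be read with Python's standard csv module. This is a minimal sketch: the column names (sentence, label) follow SST-2, and other tasks use different headers. Note `QUOTE_NONE`, since GLUE TSVs contain unescaped quote characters inside sentences.

```python
import csv
import io

# Sample SST-2-style TSV split: header row, then one tab-separated example per line.
sample = "sentence\tlabel\nit 's a charming film .\t1\nan utter mess\t0\n"

# QUOTE_NONE prevents the csv module from treating stray quotes in sentences
# as field delimiters, which would silently corrupt rows.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)
print(rows[0]["sentence"], rows[0]["label"])  # → it 's a charming film . 1
```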
The download script creates a directory per task under the specified data_dir:
```
glue_data/
    CoLA/
        train.tsv
        dev.tsv
        test.tsv
    SST-2/
        train.tsv
        dev.tsv
        test.tsv
    MRPC/
        train.tsv
        dev.tsv
        test.tsv
    ...
```
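The per-task layout can be sketched as a small helper. This is illustrative only: the function and its parameter names are assumptions, not code from the actual download script.

```python
import os
import tempfile

def make_task_dirs(data_dir, tasks=("CoLA", "SST-2", "MRPC")):
    """Create one directory per GLUE task under data_dir, as in the tree above."""
    created = []
    for task in tasks:
        path = os.path.join(data_dir, task)
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created

# Usage against a throwaway directory:
with tempfile.TemporaryDirectory() as tmp:
    paths = make_task_dirs(tmp)
    print([os.path.basename(p) for p in paths])  # → ['CoLA', 'SST-2', 'MRPC']
```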
MRPC Special Handling
The MRPC dataset requires special treatment because its original corpus is distributed as a Microsoft MSI installer rather than a plain archive. The download script:
- Downloads the paraphrase training and test data from Facebook's SentEval mirrors
- Downloads development set IDs from the Firebase GLUE storage
- Splits the training data into train and dev sets based on the development IDs
- Reformats the test file with a simplified header
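The train/dev split step can be sketched as follows. This is a hedged illustration, not the script's actual code: it assumes rows follow the MSRP TSV column order (label, id1, id2, sentence1, sentence2) and that dev_ids is a set of (id1, id2) pairs read from the downloaded development-ID file.

```python
def split_mrpc(train_rows, dev_ids):
    """Route each row to the dev set if its (id1, id2) pair appears in dev_ids.

    Assumes MSRP column order: label, id1, id2, sentence1, sentence2.
    """
    train, dev = [], []
    for row in train_rows:
        id1, id2 = row[1], row[2]
        (dev if (id1, id2) in dev_ids else train).append(row)
    return train, dev
```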
If a local copy of MRPC data is available (from manual extraction of the MSI file), it can be provided via the --path_to_mrpc flag.
Usage in LoRA Experiments
In the LoRA NLU workflow, the GLUE data is typically downloaded once and then accessed via the HuggingFace datasets library in run_glue.py:
```python
from datasets import load_dataset

datasets = load_dataset("glue", data_args.task_name)
```
The load_dataset call downloads from the HuggingFace Hub by default, but the local download script provides an alternative for offline or air-gapped environments.
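For such offline environments, the locally downloaded TSVs can be loaded through the generic "csv" builder instead of the Hub. The builder name, `data_files`, and `delimiter` arguments are real features of the HuggingFace datasets library; the function below and its paths are an illustrative sketch, not part of the LoRA repository.

```python
def load_local_sst2(data_dir="glue_data"):
    """Load SST-2 from local TSVs, assuming the directory layout shown earlier."""
    # Imported lazily so the sketch can be read without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "csv",
        data_files={
            "train": f"{data_dir}/SST-2/train.tsv",
            "validation": f"{data_dir}/SST-2/dev.tsv",
        },
        delimiter="\t",
    )
```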
Metadata
| Field | Value |
|---|---|
| Source | Repo (microsoft/LoRA) |
| Domains | Data, NLU |
| Related | Implementation:Microsoft_LoRA_Download_GLUE_Data |