Implementation:Microsoft LoRA Download GLUE Data

From Leeroopedia


Overview

This page documents the download_glue_data.py script in the microsoft/LoRA repository. The script downloads the nine GLUE benchmark datasets, plus SNLI and a diagnostic set, and organizes them into a local directory structure of TSV files. Most data comes from Firebase-hosted mirrors; the diagnostic set and the MRPC train and test files come from other hosts, as described below.

Source File

File | Lines | Description
examples/NLU/utils/download_glue_data.py | 126-154 | Main entry point and CLI argument parsing
examples/NLU/utils/download_glue_data.py | 47-54 | download_and_extract function
examples/NLU/utils/download_glue_data.py | 57-101 | format_mrpc function

CLI Signature

python utils/download_glue_data.py --data_dir glue_data --tasks all

Arguments

Flag | Type | Default | Description
--data_dir | str | glue_data | Directory to save downloaded data
--tasks | str | all | Comma-separated list of tasks to download (e.g., CoLA,SST,MRPC) or all
--path_to_mrpc | str | "" | Optional path to locally extracted MRPC data directory

Input / Output

Direction | Description
Input | Internet access to Firebase storage URLs; optionally a local MRPC data directory
Output | GLUE TSV datasets organized in data_dir/ with one subdirectory per task
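
After a run, the output layout can be checked with a short helper such as the one below. This is a sketch rather than part of the script: list_task_files is a hypothetical name, and it assumes only what the table states, namely that each task gets its own subdirectory of TSV files under data_dir.

```python
import os


def list_task_files(data_dir):
    """Map each task subdirectory of data_dir to its sorted TSV file names.

    Hypothetical helper for inspection; not part of download_glue_data.py.
    """
    layout = {}
    for entry in sorted(os.listdir(data_dir)):
        task_dir = os.path.join(data_dir, entry)
        if not os.path.isdir(task_dir):
            continue  # ignore stray files at the top level
        layout[entry] = sorted(
            name for name in os.listdir(task_dir) if name.endswith(".tsv")
        )
    return layout
```

Calling list_task_files("glue_data") after a download returns a dictionary keyed by subdirectory name, which makes it easy to spot a task whose files are missing.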

Internal Functions

main(arguments)

Entry point (lines 126-154). Parses CLI arguments, creates the output directory if it does not exist (via os.mkdir, so the parent directory must already exist), resolves the task list, and dispatches each task to its download function:

def main(arguments):
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir", help="directory to save data to",
                        type=str, default="glue_data")
    parser.add_argument("--tasks",
                        help="tasks to download data for as a comma separated string",
                        type=str, default="all")
    parser.add_argument("--path_to_mrpc",
                        help="path to directory containing extracted MRPC data",
                        type=str, default="")
    args = parser.parse_args(arguments)

    if not os.path.isdir(args.data_dir):
        os.mkdir(args.data_dir)
    tasks = get_tasks(args.tasks)

    for task in tasks:
        if task == "MRPC":
            format_mrpc(args.data_dir, args.path_to_mrpc)
        elif task == "diagnostic":
            download_diagnostic(args.data_dir)
        else:
            download_and_extract(task, args.data_dir)

download_and_extract(task, data_dir)

Generic download handler (lines 47-54). Downloads the task's ZIP archive from its URL in TASK2PATH to <task>.zip in the current working directory, extracts it into data_dir, and deletes the archive:

def download_and_extract(task, data_dir):
    print("Downloading and extracting %s..." % task)
    data_file = "%s.zip" % task
    urllib.request.urlretrieve(TASK2PATH[task], data_file)
    with zipfile.ZipFile(data_file) as zip_ref:
        zip_ref.extractall(data_dir)
    os.remove(data_file)
    print("\tCompleted!")
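
The function above leaves <task>.zip in the working directory if extraction raises. A possible variant, sketched below, keeps the archive in a temporary directory that is cleaned up either way. The name download_and_extract_url and its url parameter are hypothetical; it takes the URL directly instead of looking it up in TASK2PATH, and it is not how the repository script is written.

```python
import os
import tempfile
import urllib.request
import zipfile


def download_and_extract_url(url, data_dir):
    """Fetch a ZIP archive from url and unpack it into data_dir.

    Hypothetical variant of download_and_extract: the archive lives in a
    temporary directory that is removed even if extraction fails.
    """
    # Unlike os.mkdir, makedirs also creates missing parent directories.
    os.makedirs(data_dir, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp_dir:
        zip_path = os.path.join(tmp_dir, "task.zip")
        urllib.request.urlretrieve(url, zip_path)
        with zipfile.ZipFile(zip_path) as zip_ref:
            zip_ref.extractall(data_dir)
```

In the script's terms, a call would look like download_and_extract_url(TASK2PATH[task], data_dir).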

format_mrpc(data_dir, path_to_data)

MRPC-specific handler (lines 57-101). Because the GLUE mirrors do not provide MRPC as a ready-made ZIP, this function assembles the dataset from separate sources:

  • Downloads train and test paraphrase files from Facebook's SentEval mirrors (if no local path is provided)
  • Downloads development set ID pairs from the GLUE Firebase storage
  • Splits the training data into train.tsv and dev.tsv based on development IDs
  • Reformats test.tsv with a simplified header (index, #1 ID, #2 ID, #1 String, #2 String)
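
The split and reformat steps above can be sketched in memory as follows. The function names are hypothetical, and the row layout (label, id1, id2, sentence1, sentence2) and the zero-based index are assumptions made for illustration; the actual script works on files and may differ in detail.

```python
def split_train_dev(rows, dev_id_pairs):
    """Partition rows into (train, dev) by membership of their ID pair.

    rows: iterable of (label, id1, id2, sentence1, sentence2) tuples
          (column order is an assumption of this sketch).
    dev_id_pairs: iterable of (id1, id2) pairs that belong in the dev split.
    """
    dev_ids = set(map(tuple, dev_id_pairs))  # set membership is cheap per row
    train, dev = [], []
    for row in rows:
        _, id1, id2, _, _ = row
        (dev if (id1, id2) in dev_ids else train).append(row)
    return train, dev


def to_test_lines(rows):
    """Render test rows with an index column and without the label."""
    lines = ["index\t#1 ID\t#2 ID\t#1 String\t#2 String"]
    for idx, (_, id1, id2, s1, s2) in enumerate(rows):
        lines.append("%d\t%s\t%s\t%s\t%s" % (idx, id1, id2, s1, s2))
    return lines
```

Rows whose ID pair appears in the development list go to dev.tsv and all others stay in train.tsv, so the two files never overlap.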

get_tasks(task_names)

Task name resolver (lines 114-123). Accepts a comma-separated string and returns a list of validated task names. If "all" is specified, returns the full TASKS list:

TASKS = ["CoLA", "SST", "MRPC", "QQP", "STS", "MNLI",
         "SNLI", "QNLI", "RTE", "WNLI", "diagnostic"]
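
A resolver with the behavior described above might look like the sketch below. It is not the repository's get_tasks: resolve_tasks is a hypothetical name, and raising ValueError for unknown names and stripping surrounding whitespace are choices made for this illustration.

```python
TASKS = ["CoLA", "SST", "MRPC", "QQP", "STS", "MNLI",
         "SNLI", "QNLI", "RTE", "WNLI", "diagnostic"]


def resolve_tasks(task_names):
    """Turn a comma-separated string into a list of known task names."""
    names = [name.strip() for name in task_names.split(",")]
    if "all" in names:
        return list(TASKS)  # copy so callers cannot mutate TASKS
    unknown = [name for name in names if name not in TASKS]
    if unknown:
        raise ValueError("Unknown task(s): %s" % ", ".join(unknown))
    return names
```

Matching is case-sensitive and uses the keys in TASKS, so SST-2 is requested as SST and a name such as SST-2 or cola is rejected.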

Task URL Mapping

The script maps each task to its download URL in the TASK2PATH dictionary (lines 29-41). All URLs point to firebasestorage.googleapis.com except the one for the diagnostic set, which uses storage.googleapis.com.

Example Usage

Download only CoLA and SST-2 (task key SST):

python utils/download_glue_data.py --data_dir ./glue_data --tasks CoLA,SST

Download all tasks with a local MRPC source:

python utils/download_glue_data.py \
    --data_dir ./glue_data \
    --tasks all \
    --path_to_mrpc /path/to/extracted/MRPC
