Implementation: Microsoft LoRA Download GLUE Data
Overview
This page documents the download_glue_data.py script in the microsoft/LoRA repository. The script downloads the nine GLUE benchmark datasets (plus SNLI and a diagnostic set) from Firebase-hosted mirrors and organizes them into a local directory structure of TSV files.
Source File
| File | Lines | Description |
|---|---|---|
| examples/NLU/utils/download_glue_data.py | 126-154 | Main entry point and CLI argument parsing |
| examples/NLU/utils/download_glue_data.py | 47-54 | download_and_extract function |
| examples/NLU/utils/download_glue_data.py | 57-101 | format_mrpc function |
CLI Signature
```
python utils/download_glue_data.py --data_dir glue_data --tasks all
```
Arguments
| Flag | Type | Default | Description |
|---|---|---|---|
| --data_dir | str | glue_data | Directory to save downloaded data |
| --tasks | str | all | Comma-separated list of tasks to download (e.g., CoLA,SST,MRPC) or all |
| --path_to_mrpc | str | "" | Optional path to a locally extracted MRPC data directory |
Input / Output
| Direction | Description |
|---|---|
| Input | Internet access to Firebase storage URLs; optionally a local MRPC data directory |
| Output | GLUE TSV datasets organized in data_dir/ with one subdirectory per task |
Internal Functions
main(arguments)
Entry point (lines 126-154). Parses CLI arguments, creates the output directory if needed, resolves the task list, and dispatches to task-specific download functions:
```python
def main(arguments):
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir", help="directory to save data to",
                        type=str, default="glue_data")
    parser.add_argument("--tasks",
                        help="tasks to download data for as a comma separated string",
                        type=str, default="all")
    parser.add_argument("--path_to_mrpc",
                        help="path to directory containing extracted MRPC data",
                        type=str, default="")
    args = parser.parse_args(arguments)
    if not os.path.isdir(args.data_dir):
        os.mkdir(args.data_dir)
    tasks = get_tasks(args.tasks)
    for task in tasks:
        if task == "MRPC":
            format_mrpc(args.data_dir, args.path_to_mrpc)
        elif task == "diagnostic":
            download_diagnostic(args.data_dir)
        else:
            download_and_extract(task, args.data_dir)
```
download_and_extract(task, data_dir)
Generic download handler (lines 47-54). Downloads a ZIP file from the TASK2PATH URL mapping, extracts it into data_dir, and removes the temporary ZIP:
```python
def download_and_extract(task, data_dir):
    print("Downloading and extracting %s..." % task)
    data_file = "%s.zip" % task
    urllib.request.urlretrieve(TASK2PATH[task], data_file)
    with zipfile.ZipFile(data_file) as zip_ref:
        zip_ref.extractall(data_dir)
    os.remove(data_file)
    print("\tCompleted!")
```
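The same download-and-extract pattern can be exercised without network access by pointing the URL mapping at a local file:// URL. The sketch below is not the script verbatim: the task2path parameter and the temp-directory download location are assumptions (the script reads a module-level TASK2PATH and writes the ZIP to the current directory).

```python
import os
import tempfile
import urllib.request
import zipfile
from pathlib import Path

def download_and_extract(task, data_dir, task2path):
    """Download the ZIP for `task` from task2path[task] and extract it into data_dir."""
    # Download to a temp path (assumption: the script uses the current directory).
    data_file = os.path.join(tempfile.gettempdir(), "%s.zip" % task)
    urllib.request.urlretrieve(task2path[task], data_file)
    with zipfile.ZipFile(data_file) as zip_ref:
        zip_ref.extractall(data_dir)
    os.remove(data_file)  # clean up the temporary archive

# Demonstration against a local file:// URL standing in for the Firebase mirror.
with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in ZIP containing one TSV file.
    src_zip = os.path.join(tmp, "CoLA_src.zip")
    with zipfile.ZipFile(src_zip, "w") as zf:
        zf.writestr("CoLA/train.tsv", "sentence\tlabel\nhello\t1\n")
    data_dir = os.path.join(tmp, "glue_data")
    os.mkdir(data_dir)
    download_and_extract("CoLA", data_dir, {"CoLA": Path(src_zip).as_uri()})
    extracted_ok = os.path.exists(os.path.join(data_dir, "CoLA", "train.tsv"))
print(extracted_ok)  # True
```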
format_mrpc(data_dir, path_to_data)
MRPC-specific handler (lines 57-101). Because MRPC cannot be directly distributed as a ZIP, this function:
- Downloads train and test paraphrase files from Facebook's SentEval mirrors (if no local path is provided)
- Downloads development set ID pairs from the GLUE Firebase storage
- Splits the training data into train.tsv and dev.tsv based on the development IDs
- Reformats test.tsv with a simplified header (index, #1 ID, #2 ID, #1 String, #2 String)
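The dev-ID split step above can be sketched on toy data. This is a hedged sketch of the described logic, not the script's code; the helper name split_mrpc_train and the set-of-tuples representation of the dev IDs are assumptions.

```python
def split_mrpc_train(train_lines, dev_ids):
    """Split MRPC training rows into train/dev by (#1 ID, #2 ID) pairs.

    train_lines: rows of the original MRPC training file (header excluded),
    each formatted as: label \t id1 \t id2 \t string1 \t string2
    dev_ids: set of (id1, id2) tuples that belong in dev.tsv
    """
    train_rows, dev_rows = [], []
    for line in train_lines:
        label, id1, id2, s1, s2 = line.strip().split("\t")
        if (id1, id2) in dev_ids:
            dev_rows.append(line)
        else:
            train_rows.append(line)
    return train_rows, dev_rows

rows = [
    "1\t100\t200\tA sentence.\tA paraphrase.\n",
    "0\t101\t201\tOther text.\tUnrelated text.\n",
]
train_rows, dev_rows = split_mrpc_train(rows, {("101", "201")})
print(len(train_rows), len(dev_rows))  # 1 1
```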
get_tasks(task_names)
Task name resolver (lines 114-123). Accepts a comma-separated string and returns a list of validated task names. If "all" is specified, returns the full TASKS list:
```python
TASKS = ["CoLA", "SST", "MRPC", "QQP", "STS", "MNLI",
         "SNLI", "QNLI", "RTE", "WNLI", "diagnostic"]
```
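A minimal sketch of the resolver's behavior follows; the exact error handling is an assumption (asserting membership in TASKS), not necessarily the script's wording.

```python
TASKS = ["CoLA", "SST", "MRPC", "QQP", "STS", "MNLI",
         "SNLI", "QNLI", "RTE", "WNLI", "diagnostic"]

def get_tasks(task_names):
    """Resolve a comma-separated task string into a validated list of names."""
    task_names = task_names.split(",")
    if "all" in task_names:
        return TASKS[:]  # "all" expands to every known task
    for name in task_names:
        assert name in TASKS, "Task %s not found!" % name
    return task_names

print(get_tasks("all") == TASKS)  # True
print(get_tasks("CoLA,SST"))      # ['CoLA', 'SST']
```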
Task URL Mapping
The script maps each task to its Firebase download URL in the TASK2PATH dictionary (lines 29-41). All URLs point to firebasestorage.googleapis.com except the diagnostic set, which uses storage.googleapis.com.
Example Usage
Download only CoLA and SST-2:
```
python utils/download_glue_data.py --data_dir ./glue_data --tasks CoLA,SST
```
Download all tasks with a local MRPC source:
```
python utils/download_glue_data.py \
    --data_dir ./glue_data \
    --tasks all \
    --path_to_mrpc /path/to/extracted/MRPC
```