Implementation: Get MIND Data Set (recommenders-team/recommenders)
| Knowledge Sources | |
|---|---|
| Domains | News Recommendation, Dataset Preparation |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tool for downloading and extracting the MIND (Microsoft News Dataset) files needed by neural news recommendation models.
Description
This implementation consists of two functions that work together:
get_mind_data_set resolves the Azure-hosted URLs and archive names for a given MIND dataset size. It accepts a type parameter ("large", "small", or "demo") and returns a tuple of four strings: the base URL, the training archive name, the validation archive name, and the utilities archive name.
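The size-to-name mapping can be pictured with a minimal sketch. Note the base URL below is a placeholder, not the real Azure endpoint the library returns, and only the "small" archive names are confirmed by the examples in this document; the others follow the same pattern by assumption:

```python
def resolve_mind_archives(size: str) -> tuple[str, str, str, str]:
    """Hypothetical resolver mirroring get_mind_data_set's contract.

    Returns (base_url, train_zip, valid_zip, utils_zip) for a MIND size.
    """
    if size not in ("large", "small", "demo"):
        raise ValueError(f"size must be 'large', 'small', or 'demo', got {size!r}")
    base_url = "https://example.blob.core.windows.net/mind/"  # placeholder URL
    prefix = f"MIND{size}"
    return (
        base_url,
        f"{prefix}_train.zip",  # training archive
        f"{prefix}_dev.zip",    # validation archive
        f"{prefix}_utils.zip",  # embeddings + dictionaries
    )
```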
download_deeprec_resources performs the actual download and extraction. It creates the target directory if it does not exist, downloads the specified zip archive from Azure Blob Storage, extracts its contents, and removes the zip file afterward to conserve disk space.
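The download-extract-cleanup cycle can be sketched as follows. This is a simplified stand-in using only the standard library; the real function delegates downloading to maybe_download and may differ in details such as progress reporting:

```python
import os
import zipfile
from urllib.request import urlretrieve


def fetch_and_extract(container_url: str, data_path: str, zip_name: str) -> None:
    """Simplified sketch of the download/extract/cleanup cycle."""
    os.makedirs(data_path, exist_ok=True)            # create target dir if missing
    zip_path = os.path.join(data_path, zip_name)
    urlretrieve(container_url + zip_name, zip_path)  # download the archive
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(data_path)                     # unpack contents in place
    os.remove(zip_path)                              # drop the zip to save disk space
```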
In a typical dataset preparation, get_mind_data_set is called once and download_deeprec_resources three times: once each for the training set, validation set, and utilities (embeddings and dictionaries).
Usage
Use these functions at the beginning of a news recommendation pipeline to acquire the MIND dataset. Typically called in a notebook or script before any model configuration or training steps.
Code Reference
Source Location
- Repository: recommenders-team/recommenders
- File (get_mind_data_set): recommenders/models/newsrec/newsrec_utils.py (lines 300-333)
- File (download_deeprec_resources): recommenders/models/deeprec/deeprec_utils.py (lines 430-444)
Signature
```python
def get_mind_data_set(type: str) -> tuple[str, str, str, str]:
    """Get MIND dataset address.

    Args:
        type (str): type of mind dataset, must be in ['large', 'small', 'demo']

    Returns:
        tuple: (url, train_zip, valid_zip, utils_zip)
    """

def download_deeprec_resources(azure_container_url, data_path, remote_resource_name) -> None:
    """Download resources from Azure, extract zip, and clean up.

    Args:
        azure_container_url (str): URL of Azure container.
        data_path (str): Path to download the resources.
        remote_resource_name (str): Name of the resource zip file.
    """
```
Import
```python
from recommenders.models.newsrec.newsrec_utils import get_mind_data_set
from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources
```
I/O Contract
get_mind_data_set
| Parameter | Type | Description |
|---|---|---|
| type | str | Dataset size: "large", "small", or "demo" |

| Return | Type | Description |
|---|---|---|
| url | str | Base Azure Blob Storage URL for the dataset |
| train_zip | str | Training archive filename (e.g., MINDsmall_train.zip) |
| valid_zip | str | Validation archive filename (e.g., MINDsmall_dev.zip) |
| utils_zip | str | Utilities archive filename (e.g., MINDsmall_utils.zip) |
download_deeprec_resources
| Parameter | Type | Description |
|---|---|---|
| azure_container_url | str | Base URL of the Azure container |
| data_path | str | Local directory to store extracted files |
| remote_resource_name | str | Name of the zip file to download |

| Return | Type | Description |
|---|---|---|
| (none) | None | Files are extracted to data_path as a side effect |
Usage Examples
```python
import os

from recommenders.models.newsrec.newsrec_utils import get_mind_data_set
from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources

data_path = "/tmp/mind_data"
mind_type = "demo"

# Step 1: Get dataset URL and archive names
mind_url, mind_train_dataset, mind_dev_dataset, mind_utils = get_mind_data_set(mind_type)

# Step 2: Download and extract training data
train_path = os.path.join(data_path, "train")
download_deeprec_resources(mind_url, train_path, mind_train_dataset)

# Step 3: Download and extract validation data
valid_path = os.path.join(data_path, "valid")
download_deeprec_resources(mind_url, valid_path, mind_dev_dataset)

# Step 4: Download and extract utilities (embeddings, dictionaries)
utils_path = os.path.join(data_path, "utils")
download_deeprec_resources(mind_url, utils_path, mind_utils)

# Expected files after extraction
train_news_file = os.path.join(train_path, "news.tsv")
train_behaviors_file = os.path.join(train_path, "behaviors.tsv")
wordEmb_file = os.path.join(utils_path, "embedding.npy")
wordDict_file = os.path.join(utils_path, "word_dict.pkl")
userDict_file = os.path.join(utils_path, "uid2index.pkl")
```
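Before passing these paths to a model configuration, it can be worth failing fast if any expected file is missing. A small convenience check along these lines (not part of the library) would do:

```python
import os


def assert_mind_files_exist(paths):
    """Raise early if any expected MIND file is missing after extraction."""
    missing = [p for p in paths if not os.path.exists(p)]
    if missing:
        raise FileNotFoundError(f"Missing MIND files: {missing}")
```

For example: assert_mind_files_exist([train_news_file, train_behaviors_file, wordEmb_file, wordDict_file, userDict_file]).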
Dependencies
- zipfile: standard library module for zip extraction
- os: standard library module for filesystem operations
- recommenders.datasets.download_utils.maybe_download: utility for downloading files from URLs