
Implementation:Recommenders team Recommenders Get Mind Data Set

From Leeroopedia


Knowledge Sources
Domains: News Recommendation, Dataset Preparation
Last Updated: 2026-02-10 00:00 GMT

Overview

A pair of utility functions for downloading and extracting the MIND (Microsoft News Dataset) archives required by neural news recommendation models.

Description

This implementation consists of two functions that work together:

get_mind_data_set resolves the Azure-hosted URLs and archive names for a given MIND dataset size. It accepts a type parameter ("large", "small", or "demo") and returns a tuple of four strings: the base URL, the training archive name, the validation archive name, and the utilities archive name.
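The size-to-archive mapping can be sketched as follows. This is an illustrative reimplementation, not the library's actual code: the real get_mind_data_set also resolves the Azure base URL for each size, which is omitted here. The archive names follow the pattern visible in the I/O contract below (e.g., MINDsmall_train.zip).

```python
def mind_archive_names(size: str) -> tuple[str, str, str]:
    """Illustrative sketch: map a MIND size to its three archive names.

    The real get_mind_data_set additionally returns the Azure base URL.
    """
    if size not in ("large", "small", "demo"):
        raise ValueError(f"Unknown MIND size: {size!r}")
    prefix = f"MIND{size}"
    # Training, validation ("dev"), and utilities archives share the prefix.
    return (f"{prefix}_train.zip", f"{prefix}_dev.zip", f"{prefix}_utils.zip")
```

Validating the size up front mirrors the library's behavior of rejecting anything outside ['large', 'small', 'demo'].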

download_deeprec_resources performs the actual download and extraction. It creates the target directory if it does not exist, downloads the specified zip archive from Azure Blob Storage, extracts its contents, and removes the zip file afterward to conserve disk space.
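The download-extract-delete sequence can be sketched with the standard library alone. This is a hedged stand-in, not the library's implementation: the real function delegates the download to recommenders' maybe_download helper, and the function name below is hypothetical.

```python
import os
import zipfile
from urllib.request import urlretrieve


def fetch_and_extract(base_url: str, data_path: str, zip_name: str) -> None:
    """Illustrative stand-in for download_deeprec_resources:
    download <base_url><zip_name>, extract into data_path, delete the zip.
    """
    os.makedirs(data_path, exist_ok=True)        # create target dir if missing
    zip_path = os.path.join(data_path, zip_name)
    urlretrieve(base_url + zip_name, zip_path)   # download the archive
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(data_path)                 # unpack the contents
    os.remove(zip_path)                          # reclaim disk space
```

Deleting the archive after extraction matters for MIND large, whose zips are several gigabytes each.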

In a typical dataset preparation, get_mind_data_set is called once, and download_deeprec_resources is then called three times: once each for the training set, validation set, and utilities (embeddings and dictionaries).

Usage

Use these functions at the beginning of a news recommendation pipeline to acquire the MIND dataset. They are typically called in a notebook or script before any model configuration or training steps.

Code Reference

Source Location

  • Repository: recommenders-team/recommenders
  • File (get_mind_data_set): recommenders/models/newsrec/newsrec_utils.py (lines 300-333)
  • File (download_deeprec_resources): recommenders/models/deeprec/deeprec_utils.py (lines 430-444)

Signature

def get_mind_data_set(type: str) -> tuple[str, str, str, str]:
    """Get MIND dataset address.

    Args:
        type (str): type of mind dataset, must be in ['large', 'small', 'demo']

    Returns:
        tuple: (url, train_zip, valid_zip, utils_zip)
    """
def download_deeprec_resources(azure_container_url: str, data_path: str, remote_resource_name: str) -> None:
    """Download resources from Azure, extract zip, and clean up.

    Args:
        azure_container_url (str): URL of Azure container.
        data_path (str): Path to download the resources.
        remote_resource_name (str): Name of the resource zip file.
    """

Import

from recommenders.models.newsrec.newsrec_utils import get_mind_data_set
from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources

I/O Contract

get_mind_data_set

Parameters:
  type (str): Dataset size, one of "large", "small", or "demo"

Returns (tuple of four str):
  url: Base Azure Blob Storage URL for the dataset
  train_zip: Training archive filename (e.g., MINDsmall_train.zip)
  valid_zip: Validation archive filename (e.g., MINDsmall_dev.zip)
  utils_zip: Utilities archive filename (e.g., MINDsmall_utils.zip)

download_deeprec_resources

Parameters:
  azure_container_url (str): Base URL of the Azure container
  data_path (str): Local directory to store extracted files
  remote_resource_name (str): Name of the zip file to download

Returns:
  None: Files are extracted to data_path as a side effect

Usage Examples

from recommenders.models.newsrec.newsrec_utils import get_mind_data_set
from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources
import os

data_path = "/tmp/mind_data"
mind_type = "demo"

# Step 1: Get dataset URLs and archive names
mind_url, mind_train_dataset, mind_dev_dataset, mind_utils = get_mind_data_set(mind_type)

# Step 2: Download and extract training data
train_path = os.path.join(data_path, "train")
download_deeprec_resources(mind_url, train_path, mind_train_dataset)

# Step 3: Download and extract validation data
valid_path = os.path.join(data_path, "valid")
download_deeprec_resources(mind_url, valid_path, mind_dev_dataset)

# Step 4: Download and extract utilities (embeddings, dictionaries)
utils_path = os.path.join(data_path, "utils")
download_deeprec_resources(mind_url, utils_path, mind_utils)

# Verify expected files
train_news_file = os.path.join(train_path, "news.tsv")
train_behaviors_file = os.path.join(train_path, "behaviors.tsv")
wordEmb_file = os.path.join(utils_path, "embedding.npy")
wordDict_file = os.path.join(utils_path, "word_dict.pkl")
userDict_file = os.path.join(utils_path, "uid2index.pkl")
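After extraction, it is worth confirming that the expected files actually landed on disk before wiring their paths into a model configuration. The helper below is a defensive check added for illustration, not part of the recommenders library.

```python
import os


def missing_files(paths):
    """Return the subset of expected paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


# Continuing the example above (commented out, since it needs the
# downloaded files and the path variables defined there):
# missing = missing_files([train_news_file, train_behaviors_file,
#                          wordEmb_file, wordDict_file, userDict_file])
# if missing:
#     raise FileNotFoundError(f"MIND extraction incomplete: {missing}")
```

Failing fast here is cheaper than debugging a file-not-found error deep inside model training.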

Dependencies

  • zipfile — Standard library module for zip extraction
  • os — Standard library module for filesystem operations
  • recommenders.datasets.download_utils.maybe_download — Utility for downloading files from URLs

Related Pages

Implements Principle

Requires Environment
