
Implementation:Recommenders team Recommenders Get Mind Data Set

From Leeroopedia


Knowledge Sources
Domains: News Recommendation, Dataset Preparation
Last Updated: 2026-02-10 00:00 GMT

Overview

A pair of utility functions for downloading and extracting the MIND (Microsoft News Dataset) archives required by neural news recommendation models.

Description

This implementation consists of two functions that work together:

get_mind_data_set resolves the Azure-hosted URLs and archive names for a given MIND dataset size. It accepts a type parameter ("large", "small", or "demo") and returns a tuple of four strings: the base URL, the training archive name, the validation archive name, and the utilities archive name.
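The size-to-archive mapping can be sketched as follows. This is an illustrative reimplementation, not the library's actual code: the real get_mind_data_set also resolves the Azure base URL for each size, which is omitted here. The archive names follow the pattern visible in the I/O contract below (e.g., MINDsmall_train.zip).

```python
def mind_archive_names(size: str) -> tuple[str, str, str]:
    """Illustrative sketch: map a MIND size to its three archive names.

    The real get_mind_data_set additionally returns the Azure base URL.
    """
    if size not in ("large", "small", "demo"):
        raise ValueError(f"Unknown MIND size: {size!r}")
    prefix = f"MIND{size}"
    # Training, validation ("dev"), and utilities archives share the prefix.
    return (f"{prefix}_train.zip", f"{prefix}_dev.zip", f"{prefix}_utils.zip")
```

Validating the size up front mirrors the library's behavior of rejecting anything outside ['large', 'small', 'demo'].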

download_deeprec_resources performs the actual download and extraction. It creates the target directory if it does not exist, downloads the specified zip archive from Azure Blob Storage, extracts its contents, and removes the zip file afterward to conserve disk space.
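The download-extract-delete sequence can be sketched with the standard library alone. This is a hedged stand-in, not the library's implementation: the real function delegates the download to recommenders' maybe_download helper, and the function name below is hypothetical.

```python
import os
import zipfile
from urllib.request import urlretrieve


def fetch_and_extract(base_url: str, data_path: str, zip_name: str) -> None:
    """Illustrative stand-in for download_deeprec_resources:
    download <base_url><zip_name>, extract into data_path, delete the zip.
    """
    os.makedirs(data_path, exist_ok=True)        # create target dir if missing
    zip_path = os.path.join(data_path, zip_name)
    urlretrieve(base_url + zip_name, zip_path)   # download the archive
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(data_path)                 # unpack the contents
    os.remove(zip_path)                          # reclaim disk space
```

Deleting the archive after extraction matters for MIND large, whose zips are several gigabytes each.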

In a typical dataset preparation, get_mind_data_set is called once, and download_deeprec_resources is then called three times: once each for the training set, validation set, and utilities (embeddings and dictionaries).

Usage

Use these functions at the beginning of a news recommendation pipeline to acquire the MIND dataset. They are typically called in a notebook or script before any model configuration or training steps.

Code Reference

Source Location

  • Repository: recommenders-team/recommenders
  • File (get_mind_data_set): recommenders/models/newsrec/newsrec_utils.py (lines 300-333)
  • File (download_deeprec_resources): recommenders/models/deeprec/deeprec_utils.py (lines 430-444)

Signature

def get_mind_data_set(type: str) -> tuple[str, str, str, str]:
    """Get MIND dataset address.

    Args:
        type (str): type of mind dataset, must be in ['large', 'small', 'demo']

    Returns:
        tuple: (url, train_zip, valid_zip, utils_zip)
    """
def download_deeprec_resources(azure_container_url: str, data_path: str, remote_resource_name: str) -> None:
    """Download resources from Azure, extract zip, and clean up.

    Args:
        azure_container_url (str): URL of Azure container.
        data_path (str): Path to download the resources.
        remote_resource_name (str): Name of the resource zip file.
    """

Import

from recommenders.models.newsrec.newsrec_utils import get_mind_data_set
from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources

I/O Contract

get_mind_data_set

Parameters:
  type (str): Dataset size, one of "large", "small", or "demo"

Returns (tuple of four str):
  url: Base Azure Blob Storage URL for the dataset
  train_zip: Training archive filename (e.g., MINDsmall_train.zip)
  valid_zip: Validation archive filename (e.g., MINDsmall_dev.zip)
  utils_zip: Utilities archive filename (e.g., MINDsmall_utils.zip)

download_deeprec_resources

Parameters:
  azure_container_url (str): Base URL of the Azure container
  data_path (str): Local directory to store extracted files
  remote_resource_name (str): Name of the zip file to download

Returns:
  None: Files are extracted to data_path as a side effect

Usage Examples

from recommenders.models.newsrec.newsrec_utils import get_mind_data_set
from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources
import os

data_path = "/tmp/mind_data"
mind_type = "demo"

# Step 1: Get dataset URLs and archive names
mind_url, mind_train_dataset, mind_dev_dataset, mind_utils = get_mind_data_set(mind_type)

# Step 2: Download and extract training data
train_path = os.path.join(data_path, "train")
download_deeprec_resources(mind_url, train_path, mind_train_dataset)

# Step 3: Download and extract validation data
valid_path = os.path.join(data_path, "valid")
download_deeprec_resources(mind_url, valid_path, mind_dev_dataset)

# Step 4: Download and extract utilities (embeddings, dictionaries)
utils_path = os.path.join(data_path, "utils")
download_deeprec_resources(mind_url, utils_path, mind_utils)

# Verify expected files
train_news_file = os.path.join(train_path, "news.tsv")
train_behaviors_file = os.path.join(train_path, "behaviors.tsv")
wordEmb_file = os.path.join(utils_path, "embedding.npy")
wordDict_file = os.path.join(utils_path, "word_dict.pkl")
userDict_file = os.path.join(utils_path, "uid2index.pkl")
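After extraction, it is worth confirming that the expected files actually landed on disk before wiring their paths into a model configuration. The helper below is a defensive check added for illustration, not part of the recommenders library.

```python
import os


def missing_files(paths):
    """Return the subset of expected paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


# Continuing the example above (commented out, since it needs the
# downloaded files and the path variables defined there):
# missing = missing_files([train_news_file, train_behaviors_file,
#                          wordEmb_file, wordDict_file, userDict_file])
# if missing:
#     raise FileNotFoundError(f"MIND extraction incomplete: {missing}")
```

Failing fast here is cheaper than debugging a file-not-found error deep inside model training.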

Dependencies

  • zipfile — Standard library module for zip extraction
  • os — Standard library module for filesystem operations
  • recommenders.datasets.download_utils.maybe_download — Utility for downloading files from URLs

Related Pages

Implements Principle

Requires Environment
