Principle:Recommenders team Recommenders MIND Dataset Preparation

Knowledge Sources	Recommenders; MIND: A Large-scale Dataset for News Recommendation
Domains	News Recommendation, Dataset Preparation, NLP
Last Updated	2026-02-10 00:00 GMT

Overview

Preparing the Microsoft News Dataset (MIND) for news recommendation involves downloading and extracting news articles, user behavior logs, pre-trained word embeddings, and auxiliary dictionaries required by neural news recommender models.

Description

The Microsoft News Dataset (MIND) provides large-scale click logs collected from Microsoft News for training and evaluating news recommendation systems. The dataset comes in three sizes:

MIND Large — Full-scale dataset for production-level experiments.
MIND Small — A smaller subset for rapid prototyping and development.
MIND Demo — A minimal demo set for quick testing and tutorials.

Each dataset variant contains:

Training set — News articles and user click behaviors for model training (e.g., MINDsmall_train.zip).
Validation set — Held-out behaviors for evaluation during training (e.g., MINDsmall_dev.zip).
Utilities — Pre-trained GloVe word embeddings (embedding.npy), word dictionaries (word_dict.pkl), and user dictionaries (uid2index.pkl) packaged in a utils archive (e.g., MINDsmall_utils.zip).

The preparation workflow consists of:

Selecting the dataset size (large, small, or demo) via get_mind_data_set.
Downloading each zip archive from the Azure-hosted repository using download_deeprec_resources.
Extracting the archives into a local data directory.
Verifying that the expected files (news.tsv, behaviors.tsv, embedding.npy, word_dict.pkl, uid2index.pkl) are present.
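The steps above can be sketched as follows. This is a minimal illustration, not the library's reference workflow. It assumes that get_mind_data_set returns a (url, train archive, validation archive, utils archive) tuple, that download_deeprec_resources(url, directory, archive) both downloads and extracts, and that both live at the module paths shown. Check these against the installed Recommenders version. The helper names missing_files and prepare_mind are hypothetical.

```python
import os

# Files the workflow expects after extraction, relative to the data directory.
EXPECTED_FILES = [
    os.path.join("train", "news.tsv"),
    os.path.join("train", "behaviors.tsv"),
    os.path.join("valid", "news.tsv"),
    os.path.join("valid", "behaviors.tsv"),
    os.path.join("utils", "embedding.npy"),
    os.path.join("utils", "word_dict.pkl"),
    os.path.join("utils", "uid2index.pkl"),
]


def missing_files(data_path):
    """Return the expected files that are absent under data_path."""
    return [
        rel for rel in EXPECTED_FILES
        if not os.path.isfile(os.path.join(data_path, rel))
    ]


def prepare_mind(data_path, mind_type="demo"):
    """Fetch any missing MIND archives, then verify the extracted layout."""
    # Imported lazily so missing_files() works without the package installed.
    # ASSUMPTION: module paths and signatures match the installed version.
    from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources
    from recommenders.models.newsrec.newsrec_utils import get_mind_data_set

    url, train_zip, valid_zip, utils_zip = get_mind_data_set(mind_type)
    archives = (("train", train_zip), ("valid", valid_zip), ("utils", utils_zip))
    for subdir, archive in archives:
        # Only download an archive if one of its files is still missing.
        if any(rel.startswith(subdir + os.sep) for rel in missing_files(data_path)):
            download_deeprec_resources(url, os.path.join(data_path, subdir), archive)

    still_missing = missing_files(data_path)
    if still_missing:
        raise FileNotFoundError(f"Missing after extraction: {still_missing}")
```

Calling prepare_mind("/tmp/mind", "demo") would, under these assumptions, populate the train, valid, and utils subdirectories. missing_files can also be run on its own to check an existing directory without any network access.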

Usage

Use MIND dataset preparation at the start of any NRMS (or other neural news recommendation) workflow. It is the data acquisition step of the pipeline and must be completed before hyperparameter configuration, model initialization, training, or evaluation can proceed.

Theoretical Basis

The MIND dataset is structured around impression logs. Each impression records:

An impression ID
A user ID
A timestamp
The user's click history (list of previously clicked news article IDs)
A set of candidate news articles displayed to the user, each labeled as clicked (1) or not clicked (0)
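To make the impression layout concrete, here is a small parsing sketch. It assumes the public MIND file format, in which each behaviors.tsv row has five tab-separated columns (impression ID, user ID, time, history, impressions) and each candidate is written as a news ID and label joined by a hyphen. The Impression type, the parse_behavior_line helper, and the sample IDs are made up for illustration.

```python
from typing import List, NamedTuple, Tuple


class Impression(NamedTuple):
    impression_id: str
    user_id: str
    time: str
    history: List[str]                 # previously clicked news IDs
    candidates: List[Tuple[str, int]]  # (news ID, 1 = clicked / 0 = not clicked)


def parse_behavior_line(line: str) -> Impression:
    """Split one behaviors.tsv row into its five fields."""
    impression_id, user_id, time, history, impressions = line.rstrip("\n").split("\t")
    candidates = []
    for item in impressions.split():
        # The label follows the last hyphen, e.g. "N4-1".
        news_id, label = item.rsplit("-", 1)
        candidates.append((news_id, int(label)))
    # An empty history column yields an empty list.
    return Impression(impression_id, user_id, time, history.split(), candidates)


# Made-up sample row, not taken from the dataset.
sample = "1\tU100\t11/15/2019 10:22:32 AM\tN1 N2 N3\tN4-1 N5-0 N6-0"
imp = parse_behavior_line(sample)
print(imp.history)
print(imp.candidates)
```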

The news articles file contains:

News ID, category, subcategory, title, abstract, URL, title entities, and abstract entities
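A hedged sketch of reading the news file follows. It assumes news.tsv has no header row and eight tab-separated columns in the order used by the public MIND release (with separate entity columns for title and abstract). The column names and the read_news helper are illustrative choices, not part of the library.

```python
import csv
import io

# Illustrative column names; the file itself has no header row.
NEWS_COLUMNS = [
    "news_id", "category", "subcategory", "title",
    "abstract", "url", "title_entities", "abstract_entities",
]


def read_news(handle):
    """Return a dict mapping news ID to a record of its columns."""
    # QUOTE_NONE keeps double quotes inside titles and abstracts as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    news = {}
    for row in reader:
        record = dict(zip(NEWS_COLUMNS, row))
        news[record["news_id"]] = record
    return news


# Made-up sample row, not taken from the dataset.
sample = io.StringIO(
    'N1\tsports\tsoccer\tA "quoted" title\tShort abstract\t'
    "https://example.com/n1\t[]\t[]\n"
)
news = read_news(sample)
print(news["N1"]["title"])
```

For a real file, pass an open handle instead of the in-memory sample, for example open(path, encoding="utf-8", newline="").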

This impression-based structure enables training with negative sampling: each positive click is paired with a fixed number of negative (non-clicked) candidates drawn from the same impression. The npratio setting gives the number of negatives per positive.
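The pairing of one positive with npratio negatives can be illustrated as below. This is a sketch of the general idea, not the library's implementation. The function name, the padding of short impressions with a placeholder ID, and the "PAD" value are all assumptions made for the example.

```python
import random


def sample_training_instances(candidates, npratio, rng, pad_id="PAD"):
    """Build one (news IDs, labels) instance per clicked candidate.

    candidates: list of (news_id, label) pairs from a single impression.
    npratio:    number of negatives paired with each positive.
    """
    positives = [nid for nid, label in candidates if label == 1]
    negatives = [nid for nid, label in candidates if label == 0]
    instances = []
    for pos in positives:
        if len(negatives) >= npratio:
            sampled = rng.sample(negatives, npratio)
        else:
            # ASSUMPTION: pad with a placeholder when negatives run short.
            sampled = negatives + [pad_id] * (npratio - len(negatives))
        # The positive always comes first, followed by its negatives.
        instances.append(([pos] + sampled, [1] + [0] * npratio))
    return instances


rng = random.Random(0)  # seeded for repeatable sampling
impression = [("N4", 1), ("N5", 0), ("N6", 0), ("N7", 0)]
instances = sample_training_instances(impression, npratio=2, rng=rng)
```

Each instance then holds npratio + 1 candidates with exactly one positive label, which suits a softmax-style loss over the candidate set.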

Dataset structure after extraction:
data_path/
  train/
    news.tsv          # News article metadata (ID, category, title, abstract, ...)
    behaviors.tsv      # User impression logs (impression ID, user, time, history, impressions)
  valid/
    news.tsv
    behaviors.tsv
  utils/
    embedding.npy      # Pre-trained word embedding matrix (numpy array)
    word_dict.pkl      # Word-to-index dictionary (pickle)
    uid2index.pkl      # User ID-to-index dictionary (pickle)

Related Pages

Implemented By

Implementation:Recommenders_team_Recommenders_Get_Mind_Data_Set
