Principle:Recommenders team Recommenders MIND Dataset Preparation
| Knowledge Sources | |
|---|---|
| Domains | News Recommendation, Dataset Preparation, NLP |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Preparing the Microsoft News Dataset (MIND) for news recommendation involves downloading and extracting news articles, user behavior logs, pre-trained word embeddings, and auxiliary dictionaries required by neural news recommender models.
Description
The Microsoft News Dataset (MIND) provides large-scale click logs collected from Microsoft News for training and evaluating news recommendation systems. The dataset comes in three sizes:
- MIND Large — Full-scale dataset for production-level experiments.
- MIND Small — A smaller subset for rapid prototyping and development.
- MIND Demo — A minimal demo set for quick testing and tutorials.
Each dataset variant contains:
- Training set: News articles and user click behaviors for model training (e.g., MINDsmall_train.zip).
- Validation set: Held-out behaviors for evaluation during training (e.g., MINDsmall_dev.zip).
- Utilities: Pre-trained GloVe word embeddings (embedding.npy), word dictionaries (word_dict.pkl), and user dictionaries (uid2index.pkl) packaged in a utils archive (e.g., MINDsmall_utils.zip).
The preparation workflow consists of:
- Selecting the dataset size (large, small, or demo) via get_mind_data_set.
- Downloading each zip archive from the Azure-hosted repository using download_deeprec_resources.
- Extracting the archives into a local data directory.
- Verifying that the expected files (news.tsv, behaviors.tsv, embedding.npy, word_dict.pkl, uid2index.pkl) are present.
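The workflow above can be sketched as follows. This is a minimal sketch, not the library's own code: it assumes the `recommenders` package exposes `get_mind_data_set` in `recommenders.models.newsrec.newsrec_utils` and `download_deeprec_resources` in `recommenders.models.deeprec.deeprec_utils`, with the argument order (container URL, target directory, archive name). Check these against your installed version. The names `EXPECTED_FILES`, `missing_files`, and `prepare_mind` are hypothetical helpers introduced for illustration.

```python
import os

# Files expected after extraction, relative to the data directory.
EXPECTED_FILES = [
    os.path.join("train", "news.tsv"),
    os.path.join("train", "behaviors.tsv"),
    os.path.join("valid", "news.tsv"),
    os.path.join("valid", "behaviors.tsv"),
    os.path.join("utils", "embedding.npy"),
    os.path.join("utils", "word_dict.pkl"),
    os.path.join("utils", "uid2index.pkl"),
]


def missing_files(data_path):
    """Return the expected files that are absent under data_path."""
    return [f for f in EXPECTED_FILES
            if not os.path.exists(os.path.join(data_path, f))]


def prepare_mind(data_path, mind_type="demo"):
    """Download and extract the MIND archives, then verify the result."""
    # Imported lazily so missing_files() works without the library installed.
    # The module paths below are an assumption; adjust to your version.
    from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources
    from recommenders.models.newsrec.newsrec_utils import get_mind_data_set

    url, train_zip, valid_zip, utils_zip = get_mind_data_set(mind_type)
    archives = (("train", train_zip), ("valid", valid_zip), ("utils", utils_zip))
    for subdir, archive in archives:
        # Download only when some file of this subdirectory is missing.
        if any(f.startswith(subdir + os.sep) for f in missing_files(data_path)):
            download_deeprec_resources(url, os.path.join(data_path, subdir), archive)

    still_missing = missing_files(data_path)
    if still_missing:
        raise FileNotFoundError(f"Missing after extraction: {still_missing}")
```

Calling `prepare_mind("/tmp/mind", "demo")` would need network access. `missing_files` can be run on its own to check an existing directory.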
Usage
Use MIND dataset preparation at the start of any NRMS (or other neural news recommendation) workflow. This step must be completed before hyperparameter configuration, model initialization, training, or evaluation can proceed. It is the foundational data acquisition step in the NRMS pipeline.
Theoretical Basis
The MIND dataset is structured around impression logs. Each impression records:
- An impression ID
- A user ID
- A timestamp
- The user's click history (list of previously clicked news article IDs)
- A set of candidate news articles displayed to the user, each labeled as clicked (1) or not clicked (0)
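A single line of behaviors.tsv can be parsed along these lines. The sketch assumes the publicly documented MIND layout: five tab-separated columns (impression ID, user ID, time, history, impressions), with the history given as space-separated news IDs and each candidate written as `<news_id>-<label>`. The `Impression` class and `parse_behavior_line` function are hypothetical names, and the sample line is made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Impression:
    impression_id: str
    user_id: str
    time: str
    history: List[str]                 # previously clicked news IDs
    candidates: List[Tuple[str, int]]  # (news ID, 1 = clicked / 0 = not clicked)


def parse_behavior_line(line: str) -> Impression:
    """Parse one tab-separated line of behaviors.tsv (assumed 5 columns)."""
    impression_id, user_id, time, history, impressions = line.rstrip("\n").split("\t")
    candidates = []
    for item in impressions.split():
        # Split on the last hyphen so the label is always the trailing part.
        news_id, label = item.rsplit("-", 1)
        candidates.append((news_id, int(label)))
    # An empty history field yields an empty list.
    return Impression(impression_id, user_id, time, history.split(), candidates)


# Made-up sample line, not taken from the dataset.
sample = "1\tU100\t11/15/2019 8:55:22 AM\tN1 N2\tN3-1 N4-0 N5-0"
imp = parse_behavior_line(sample)
print(imp.history)
print([news_id for news_id, label in imp.candidates if label == 1])
```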
The news articles file contains:
- News ID, category, subcategory, title, abstract, URL, title entities, and abstract entities
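One line of news.tsv could be read as follows. This sketch assumes eight tab-separated columns in the order given by the public MIND description, with the two entity columns stored as JSON lists. The column names in `NEWS_COLUMNS` and the function `parse_news_line` are hypothetical, and the sample line is invented.

```python
import json

# Assumed column order; the names are labels chosen for this sketch.
NEWS_COLUMNS = [
    "news_id", "category", "subcategory", "title",
    "abstract", "url", "title_entities", "abstract_entities",
]


def parse_news_line(line):
    """Parse one tab-separated line of news.tsv into a dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(NEWS_COLUMNS, fields))
    for key in ("title_entities", "abstract_entities"):
        raw = record.get(key, "")
        # Entity columns may be empty; treat that as "no entities".
        record[key] = json.loads(raw) if raw else []
    return record


# Invented sample line, not taken from the dataset.
sample_news = "\t".join([
    "N1", "sports", "soccer", "A made-up title", "A made-up abstract.",
    "https://example.com/n1", '[{"Label": "Example", "WikidataId": "Q1"}]', "[]",
])
news = parse_news_line(sample_news)
print(news["category"], news["title_entities"][0]["Label"])
```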
This impression-based structure enables training with negative sampling: each positive click is paired with a fixed number (npratio) of negative (non-clicked) candidates drawn from the same impression.
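The pairing of positives with negatives can be illustrated with a short sketch. This is not the library's sampler: the function name `sample_training_rows` is hypothetical, and the choice to sample with replacement when an impression has fewer than npratio negatives is an assumption of this example (a real implementation might pad instead).

```python
import random


def sample_training_rows(candidates, npratio, rng=None):
    """Pair each clicked item with npratio non-clicked items.

    candidates: list of (news_id, label) tuples from one impression.
    Returns a list of (positive_id, [negative_ids]) rows.
    """
    rng = rng or random.Random()
    positives = [news_id for news_id, label in candidates if label == 1]
    negatives = [news_id for news_id, label in candidates if label == 0]
    if not negatives:
        # Nothing to contrast against, so this impression yields no rows.
        return []

    rows = []
    for pos in positives:
        if len(negatives) >= npratio:
            # Enough negatives: draw npratio distinct ones.
            negs = rng.sample(negatives, npratio)
        else:
            # Too few negatives: draw with replacement (assumption of this sketch).
            negs = [rng.choice(negatives) for _ in range(npratio)]
        rows.append((pos, negs))
    return rows


impression = [("N3", 1), ("N4", 0), ("N5", 0), ("N6", 0), ("N7", 1)]
rows = sample_training_rows(impression, npratio=2, rng=random.Random(0))
print(len(rows))  # one row per clicked item
```

With npratio set to 2, each training row holds one clicked article and two non-clicked articles from the same impression.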
Dataset structure after extraction:
    data_path/
        train/
            news.tsv          # News article metadata (ID, category, title, abstract, ...)
            behaviors.tsv     # User impression logs (user, time, history, impressions)
        valid/
            news.tsv
            behaviors.tsv
        utils/
            embedding.npy     # Pre-trained word embedding matrix (numpy array)
            word_dict.pkl     # Word-to-index dictionary (pickle)
            uid2index.pkl     # User ID-to-index dictionary (pickle)
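The utility files can be loaded with standard tools, as in the sketch below. It assumes the file formats stated above (a NumPy array and two pickled dictionaries). `load_pickle` and `load_utils` are hypothetical helper names. Pickle files can execute code when loaded, so only open archives obtained from a trusted source.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load a pickled object, e.g. word_dict.pkl or uid2index.pkl."""
    with open(path, "rb") as f:
        return pickle.load(f)


def load_utils(utils_dir):
    """Load the embedding matrix and both dictionaries from utils/."""
    # NumPy is a third-party dependency; imported here so load_pickle()
    # stays usable without it.
    import numpy as np

    utils_dir = Path(utils_dir)
    embedding = np.load(utils_dir / "embedding.npy")
    word_dict = load_pickle(utils_dir / "word_dict.pkl")
    uid2index = load_pickle(utils_dir / "uid2index.pkl")
    return embedding, word_dict, uid2index
```

A call such as `load_utils(Path(data_path) / "utils")` would return the three objects in the order shown, where `data_path` is the extraction directory.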