Principle:Recommenders team Recommenders MIND Dataset Preparation

From Leeroopedia


Knowledge Sources
Domains: News Recommendation, Dataset Preparation, NLP
Last Updated: 2026-02-10 00:00 GMT

Overview

Preparing the Microsoft News Dataset (MIND) for news recommendation involves downloading and extracting news articles, user behavior logs, pre-trained word embeddings, and auxiliary dictionaries required by neural news recommender models.

Description

The Microsoft News Dataset (MIND) provides large-scale click logs collected from Microsoft News for training and evaluating news recommendation systems. The dataset comes in three sizes:

  • MIND Large — Full-scale dataset for production-level experiments.
  • MIND Small — A smaller subset for rapid prototyping and development.
  • MIND Demo — A minimal demo set for quick testing and tutorials.

Each dataset variant contains:

  • Training set — News articles and user click behaviors for model training (e.g., MINDsmall_train.zip).
  • Validation set — Held-out behaviors for evaluation during training (e.g., MINDsmall_dev.zip).
  • Utilities — Pre-trained GloVe word embeddings (embedding.npy), word dictionaries (word_dict.pkl), and user dictionaries (uid2index.pkl) packaged in a utils archive (e.g., MINDsmall_utils.zip).

The preparation workflow consists of:

  1. Selecting the dataset size (large, small, or demo) via get_mind_data_set.
  2. Downloading each zip archive from the Azure-hosted repository using download_deeprec_resources.
  3. Extracting the archives into a local data directory.
  4. Verifying that the expected files (news.tsv, behaviors.tsv, embedding.npy, word_dict.pkl, uid2index.pkl) are present.
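Steps 3 and 4 can be sketched with the standard library alone. This is a minimal illustration, not the library's own implementation: the helper names `extract_archive` and `missing_mind_files` are hypothetical, and the sketch assumes each archive is extracted into its own `train`, `valid`, or `utils` sub-directory, matching the layout shown under "Dataset structure after extraction".

```python
import zipfile
from pathlib import Path

# Expected files per sub-directory, taken from the verification step above.
# The train/valid/utils split is an assumption based on the extracted layout.
EXPECTED_FILES = {
    "train": ["news.tsv", "behaviors.tsv"],
    "valid": ["news.tsv", "behaviors.tsv"],
    "utils": ["embedding.npy", "word_dict.pkl", "uid2index.pkl"],
}


def extract_archive(zip_path, dest_dir):
    """Extract one downloaded archive into dest_dir (created if missing)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return dest


def missing_mind_files(data_path):
    """Return relative paths of expected files that are not present."""
    root = Path(data_path)
    missing = []
    for subdir, names in EXPECTED_FILES.items():
        for name in names:
            if not (root / subdir / name).is_file():
                missing.append(f"{subdir}/{name}")
    return missing
```

An empty list from `missing_mind_files(data_path)` indicates that every expected file was found; any returned entries name the archive that likely needs to be downloaded or extracted again.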

Usage

Prepare the MIND dataset at the start of any NRMS (or other neural news recommendation) workflow. It is the data acquisition step of the pipeline, so it must be completed before hyperparameter configuration, model initialization, training, or evaluation can proceed.

Theoretical Basis

The MIND dataset is structured around impression logs. Each impression records:

  • A user ID
  • A timestamp
  • The user's click history (list of previously clicked news article IDs)
  • A set of candidate news articles displayed to the user, each labeled as clicked (1) or not clicked (0)
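To make the impression structure concrete, the sketch below parses one line of behaviors.tsv. It assumes the layout of the public MIND release: five tab-separated columns, with an impression ID preceding the fields listed above, space-separated history IDs, and candidates written as `newsID-label`. The `Impression` type and `parse_behavior_line` function are hypothetical names used only for illustration.

```python
from typing import List, NamedTuple, Tuple


class Impression(NamedTuple):
    impression_id: str
    user_id: str
    time: str
    history: List[str]                 # previously clicked news IDs
    candidates: List[Tuple[str, int]]  # (news ID, 1 = clicked / 0 = not clicked)


def parse_behavior_line(line: str) -> Impression:
    """Parse one tab-separated impression record (assumed 5-column layout)."""
    impression_id, user_id, time, history, impressions = (
        line.rstrip("\r\n").split("\t")
    )
    candidates = []
    for item in impressions.split():
        # Split on the last hyphen so the label is always the final token.
        news_id, label = item.rsplit("-", 1)
        candidates.append((news_id, int(label)))
    # An empty history column yields an empty list.
    return Impression(impression_id, user_id, time, history.split(), candidates)
```

For example, a made-up line such as `"1\tU1\t11/11/2019 9:05:58 AM\tN10 N11\tN20-1 N21-0"` would yield a history of two articles and two labeled candidates.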

The news articles file contains:

  • News ID, category, subcategory, title, abstract, URL, and entities
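A row of news.tsv can be read in a similar way. The sketch assumes the layout of the public MIND release: eight tab-separated columns, with the entities split into a title-entities column and an abstract-entities column, each holding a JSON list. The column names and `parse_news_line` are illustrative choices, not identifiers from the library.

```python
import json
from typing import Any, Dict

# Assumed column order of news.tsv; the names are illustrative.
NEWS_COLUMNS = [
    "news_id", "category", "subcategory", "title",
    "abstract", "url", "title_entities", "abstract_entities",
]


def parse_news_line(line: str) -> Dict[str, Any]:
    """Parse one tab-separated news record into a dictionary."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(NEWS_COLUMNS):
        raise ValueError(
            f"expected {len(NEWS_COLUMNS)} columns, got {len(fields)}"
        )
    record: Dict[str, Any] = dict(zip(NEWS_COLUMNS, fields))
    for key in ("title_entities", "abstract_entities"):
        # Entity columns hold JSON lists; treat an empty column as no entities.
        record[key] = json.loads(record[key]) if record[key] else []
    return record
```

Raising on an unexpected column count is a deliberate choice here: a silently misaligned row would otherwise put text into the wrong field.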

This impression-based structure enables training with negative sampling: each positive click is paired with a fixed number (npratio) of negative (non-clicked) candidates drawn from the same impression.
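A minimal sketch of this pairing follows. It is not the library's sampler: `sample_training_instances` is a hypothetical helper, and padding with a placeholder ID when an impression has fewer than npratio negatives is an assumed policy chosen to keep every instance the same length.

```python
import random
from typing import List, Optional, Tuple


def sample_training_instances(
    candidates: List[Tuple[str, int]],
    npratio: int,
    rng: Optional[random.Random] = None,
    pad_id: str = "<pad>",  # assumed placeholder for missing negatives
) -> List[Tuple[str, List[str]]]:
    """Pair each clicked article with npratio negatives from one impression."""
    if rng is None:
        rng = random.Random()
    positives = [news_id for news_id, label in candidates if label == 1]
    negatives = [news_id for news_id, label in candidates if label == 0]
    instances = []
    for positive in positives:
        if len(negatives) >= npratio:
            # Sample without replacement within a single training instance.
            sampled = rng.sample(negatives, npratio)
        else:
            # Too few negatives: keep them all and pad to a fixed length.
            sampled = negatives + [pad_id] * (npratio - len(negatives))
        instances.append((positive, sampled))
    return instances
```

Passing a seeded `random.Random` makes the sampling reproducible, which helps when comparing training runs.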

Dataset structure after extraction:
data_path/
  train/
    news.tsv          # News article metadata (ID, category, title, abstract, ...)
    behaviors.tsv      # User impression logs (user, time, history, impressions)
  valid/
    news.tsv
    behaviors.tsv
  utils/
    embedding.npy      # Pre-trained word embedding matrix (numpy array)
    word_dict.pkl      # Word-to-index dictionary (pickle)
    uid2index.pkl      # User ID-to-index dictionary (pickle)
