
Implementation:Recommenders team Recommenders MINDAll Iterator

From Leeroopedia
Revision as of 16:29, 16 February 2026 by Admin (auto-imported from implementations/Recommenders_team_Recommenders_MINDAll_Iterator.md)


Knowledge Sources
Domains: News Recommendation, Data Loading, MIND Dataset
Last Updated: 2026-02-10 00:00 GMT

Overview

MINDAllIterator is a data loader that reads and parses MIND dataset files for the NAML news recommendation model. It handles multi-field news representations: titles, bodies (abstracts), categories, and subcategories.

Description

MINDAllIterator extends BaseIterator to provide mini-batch data loading for the NAML model's multi-view architecture. Unlike the simpler MINDIterator which only handles news titles, this iterator processes the complete set of article features required by NAML: title words, body (abstract) words, verticals (categories), and sub-verticals (subcategories).

The iterator loads four pickled dictionaries at initialization: a word dictionary for mapping tokens to indices, a vertical dictionary for category mapping, a sub-vertical dictionary for subcategory mapping, and a user dictionary for user ID indexing. News articles are parsed from a tab-separated file where each article's title and body are tokenized and converted to fixed-length integer index arrays. Behavior logs record each user's click history and impression-level interactions.
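The conversion from text to a fixed-length index array can be sketched as follows. This is a minimal illustration, not the library's code: the toy vocabulary, the token pattern, the helper names, and the use of index 0 for both padding and unknown words are all assumptions made for the example.

```python
import re

# Hypothetical toy vocabulary; the real iterator loads its word dictionary
# from a pickled file. Index 0 is assumed to mean "padding or unknown word".
WORD_DICT = {"stocks": 1, "rally": 2, "after": 3, "earnings": 4}

# Assumed token pattern: runs of word characters, or single punctuation marks.
TOKEN_PATTERN = re.compile(r"[\w]+|[.,!?;|]")


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def to_fixed_length_indices(text, word_dict, size):
    """Map tokens to indices, truncating or zero-padding to exactly `size`."""
    indices = [0] * size
    for position, token in enumerate(tokenize(text)[:size]):
        indices[position] = word_dict.get(token, 0)
    return indices


title_indices = to_fixed_length_indices(
    "Stocks rally after surprise earnings", WORD_DICT, 8
)
print(title_indices)  # [1, 2, 3, 0, 4, 0, 0, 0]
```

The same helper would apply to article bodies with a different `size`, which is why the hyper-parameters carry separate title and body lengths.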

During training, the iterator supports negative sampling with a configurable negative-to-positive ratio (npratio). For each positive click in an impression, it samples npratio negative articles and yields the combined batch. When npratio is set to -1, no negative sampling is performed and each news article in an impression is yielded individually. Data is loaded per mini-batch rather than loading the entire dataset into memory, enabling efficient processing of large files.
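The two modes described above can be sketched in plain Python. The function names and the sample data are hypothetical, and padding with index 0 when an impression has fewer negatives than npratio is an assumption of this sketch rather than something the page states.

```python
import random


def sample_negatives(negatives, npratio, rng):
    """Pick `npratio` negatives; pad with index 0 if too few exist (assumed)."""
    if npratio > len(negatives):
        return negatives + [0] * (npratio - len(negatives))
    return rng.sample(negatives, npratio)


def expand_impression(news_ids, labels, npratio, rng):
    """Yield (candidate_ids, candidate_labels) pairs for one impression."""
    if npratio > 0:
        positives = [n for n, y in zip(news_ids, labels) if y == 1]
        negatives = [n for n, y in zip(news_ids, labels) if y == 0]
        for positive in positives:
            sampled = sample_negatives(negatives, npratio, rng)
            # The clicked article comes first, followed by its negatives.
            yield [positive] + sampled, [1] + [0] * npratio
    else:
        # No negative sampling: one candidate per yielded example.
        for news_id, label in zip(news_ids, labels):
            yield [news_id], [label]


rng = random.Random(0)
impression_news = [11, 12, 13, 14, 15, 16]
impression_labels = [0, 1, 0, 0, 0, 0]

train_examples = list(expand_impression(impression_news, impression_labels, 4, rng))
eval_examples = list(expand_impression(impression_news, impression_labels, -1, rng))
```

With npratio=4 the single click produces one example of five candidates; with npratio=-1 the same impression produces six single-candidate examples.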

The iterator provides separate loading methods for training and for each evaluation stage: load_data_from_file for training batches, load_user_from_file for user encoder inference, load_news_from_file for news encoder inference, and load_impression_from_file for impression-level evaluation.
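The page does not say how the outputs of the three inference loaders are combined. One plausible pattern, assumed here, is to encode every news article and every user once, then score each impression with a dot product between the user vector and its candidate news vectors. The vectors, the helper names, and the choice to key user vectors by impression index are all hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def score_impressions(news_vectors, user_vectors, impressions):
    """Return a (labels, scores) pair for every impression.

    news_vectors: news index -> encoded news vector
    user_vectors: impression index -> encoded user vector (assumed keying)
    impressions:  iterable of (impression_index, news_indices,
                  user_index, labels) tuples
    """
    results = []
    for impression_index, news_indices, _user_index, labels in impressions:
        user_vector = user_vectors[impression_index]
        scores = [dot(news_vectors[i], user_vector) for i in news_indices]
        results.append((list(labels), scores))
    return results


# Hypothetical encoder outputs standing in for real model predictions.
news_vectors = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.5, 0.5]}
user_vectors = {0: [2.0, 1.0]}
impressions = [(0, [1, 2, 3], 42, [1, 0, 0])]

results = score_impressions(news_vectors, user_vectors, impressions)
print(results)  # [([1, 0, 0], [2.0, 1.0, 1.5])]
```

Encoding each article once avoids re-running the news encoder for every impression in which the article appears.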

Usage

Use MINDAllIterator when working with the NAML model or any model that requires the full set of MIND article features (title, body, category, subcategory). It is specifically required by the NAML architecture due to its multi-view news encoder design. For models that only require title features (such as NRMS, LSTUR, or NPA), use the simpler MINDIterator instead.

Code Reference

Source Location

Signature

class MINDAllIterator(BaseIterator):
    def __init__(self, hparams, npratio=-1, col_spliter="\t", ID_spliter="%")
    def load_dict(self, file_path)
    def init_news(self, news_file)
    def init_behaviors(self, behaviors_file)
    def parser_one_line(self, line)
    def load_data_from_file(self, news_file, behavior_file)
    def _convert_data(self, label_list, imp_indexes, user_indexes, candidate_title_indexes, candidate_ab_indexes, candidate_vert_indexes, candidate_subvert_indexes, click_title_indexes, click_ab_indexes, click_vert_indexes, click_subvert_indexes)
    def load_user_from_file(self, news_file, behavior_file)
    def _convert_user_data(self, user_indexes, impr_indexes, click_title_indexes, click_ab_indexes, click_vert_indexes, click_subvert_indexes)
    def load_news_from_file(self, news_file)
    def _convert_news_data(self, news_indexes, candidate_title_indexes, candidate_ab_indexes, candidate_vert_indexes, candidate_subvert_indexes)
    def load_impression_from_file(self, behaivors_file)

Import

from recommenders.models.newsrec.io.mind_all_iterator import MINDAllIterator

I/O Contract

Inputs

Name | Type | Required | Description
hparams | object | Yes | Global hyper-parameters containing batch_size, title_size, body_size, his_size, wordDict_file, vertDict_file, subvertDict_file, and userDict_file
npratio | int | No | Negative-to-positive sampling ratio. Default is -1 (no negative sampling). Set to a positive integer for training.
col_spliter | str | No | Column delimiter in data files. Default is tab character.
ID_spliter | str | No | ID delimiter in data files. Default is "%".
news_file | str | Yes (for load methods) | Path to the news file containing article metadata (ID, category, subcategory, title, abstract, URL).
behavior_file | str | Yes (for load methods) | Path to the behaviors file containing user impression logs.
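The table does not spell out the layout of a behaviors line. The sketch below assumes the public MIND layout (impression ID, user ID, time, click history, and impressions written as `newsID-label` pairs) and shows one way to turn a line into indices. The helper name, the lookup tables, and the left-padding of the history with zeros up to his_size are assumptions for illustration.

```python
def parse_behavior_line(line, nid2index, uid2index, his_size, col_spliter="\t"):
    """Parse one behaviors line into index lists (assumed MIND layout)."""
    # The last four columns are: user ID, time, history, impressions.
    user_id, _time, history, impressions = line.rstrip("\n").split(col_spliter)[-4:]

    # Click history: map news IDs to indices, then left-pad with zeros
    # so the result always has exactly `his_size` entries.
    history_indices = [nid2index[news_id] for news_id in history.split()]
    history_indices = [0] * (his_size - len(history_indices)) + history_indices[:his_size]

    # Impressions: each item is "<newsID>-<label>", label 1 meaning clicked.
    pairs = [item.rsplit("-", 1) for item in impressions.split()]
    return {
        "user_index": uid2index.get(user_id, 0),  # unknown users map to 0
        "history": history_indices,
        "impression_news": [nid2index[news_id] for news_id, _ in pairs],
        "labels": [int(label) for _, label in pairs],
    }


# Hypothetical lookup tables and a made-up behaviors line.
nid2index = {"N3": 1, "N7": 2, "N5": 3, "N9": 4}
uid2index = {"U100": 1}
line = "1\tU100\t11/15/2019 8:55:22 AM\tN3 N7\tN5-1 N9-0\n"

parsed = parse_behavior_line(line, nid2index, uid2index, his_size=4)
print(parsed["history"])  # [0, 0, 1, 2]
```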

Outputs

Name | Type | Description
training batch (from load_data_from_file) | dict | Dictionary with keys: "impression_index_batch", "user_index_batch", "clicked_title_batch", "clicked_ab_batch", "clicked_vert_batch", "clicked_subvert_batch", "candidate_title_batch", "candidate_ab_batch", "candidate_vert_batch", "candidate_subvert_batch", "labels" -- all as numpy arrays.
user batch (from load_user_from_file) | dict | Dictionary with keys: "user_index_batch", "impr_index_batch", "clicked_title_batch", "clicked_ab_batch", "clicked_vert_batch", "clicked_subvert_batch" -- all as numpy arrays.
news batch (from load_news_from_file) | dict | Dictionary with keys: "news_index_batch", "candidate_title_batch", "candidate_ab_batch", "candidate_vert_batch", "candidate_subvert_batch" -- all as numpy arrays.
impression data (from load_impression_from_file) | tuple | Tuple of (impression_index, impression_news_indices, user_index, impression_labels).

Usage Examples

Basic Usage

from recommenders.models.newsrec.io.mind_all_iterator import MINDAllIterator

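# Initialize the iterator. `hparams`, `news_file`, and `behavior_file` are
# assumed to be defined already; hparams must carry the fields listed under Inputs.
# npratio=4 samples four negative articles per positive click.
iterator = MINDAllIterator(hparams, npratio=4)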

# Load training batches from news and behavior files
for batch in iterator.load_data_from_file(news_file, behavior_file):
    # batch is a dict of numpy arrays ready for model consumption
    labels = batch["labels"]
    candidate_titles = batch["candidate_title_batch"]
    clicked_titles = batch["clicked_title_batch"]
    # ... process batch through NAML model

# Load user features for inference
for user_batch in iterator.load_user_from_file(news_file, behavior_file):
    user_indices = user_batch["user_index_batch"]
    clicked_history = user_batch["clicked_title_batch"]

# Load news features for inference
for news_batch in iterator.load_news_from_file(news_file):
    news_indices = news_batch["news_index_batch"]
    news_titles = news_batch["candidate_title_batch"]
