Implementation:Recommenders team Recommenders MINDAll Iterator
| Knowledge Sources | |
|---|---|
| Domains | News Recommendation, Data Loading, MIND Dataset |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
MINDAllIterator is a data loader for the NAML news recommendation model. It reads and parses the full MIND dataset format and builds multi-field news representations from each article's title, body (abstract), category, and subcategory.
Description
MINDAllIterator extends BaseIterator to provide mini-batch data loading for the NAML model's multi-view architecture. Unlike the simpler MINDIterator which only handles news titles, this iterator processes the complete set of article features required by NAML: title words, body (abstract) words, verticals (categories), and sub-verticals (subcategories).
The iterator loads four pickled dictionaries at initialization: a word dictionary for mapping tokens to indices, a vertical dictionary for category mapping, a sub-vertical dictionary for subcategory mapping, and a user dictionary for user ID indexing. News articles are parsed from a tab-separated file where each article's title and body are tokenized and converted to fixed-length integer index arrays. Behavior logs record each user's click history and impression-level interactions.
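The conversion of a title or body into a fixed-length index array can be sketched as follows. This is a minimal illustration rather than the library's code: the helper names `word_tokenize` and `to_fixed_length_indices`, the regular expression, and the choice of index 0 for both padding and unknown words are assumptions, and `word_dict` is a small in-memory stand-in for the pickled word dictionary.

```python
import re


def word_tokenize(sentence):
    """Split a sentence into lowercase word and punctuation tokens."""
    pattern = re.compile(r"[\w]+|[.,!?;|]")
    if isinstance(sentence, str):
        return pattern.findall(sentence.lower())
    return []


def to_fixed_length_indices(text, word_dict, size):
    """Map tokens to indices, truncating to `size` and padding with 0.

    Assumption: index 0 is used for padding and for unknown tokens.
    """
    indices = [0] * size
    for position, token in enumerate(word_tokenize(text)[:size]):
        indices[position] = word_dict.get(token, 0)
    return indices


# Hypothetical stand-in for the pickled word dictionary.
word_dict = {"stocks": 1, "rally": 2, "as": 3, "rates": 4, "fall": 5}

title_index = to_fixed_length_indices(
    "Stocks rally as rates fall!", word_dict, size=8
)
short_index = to_fixed_length_indices(
    "Stocks rally as rates fall!", word_dict, size=3
)
```

With a size of 8 the six tokens are followed by two padding zeros; with a size of 3 only the first three tokens are kept.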
During training, the iterator supports negative sampling with a configurable negative-to-positive ratio (npratio). For each positive click in an impression, it samples npratio negative articles and yields them together with the positive as a single instance. When npratio is set to -1, no negative sampling is performed and each news article in an impression is yielded individually. Data is loaded per mini-batch rather than all at once, so large files can be processed without holding the entire dataset in memory.
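The two sampling modes described above can be sketched as follows. This is an illustration under stated assumptions, not the iterator's actual implementation: the function names are hypothetical, the positive article is assumed to be placed first in each candidate list, and padding with index 0 when an impression has too few negatives is also an assumption.

```python
import random


def sample_negatives(negatives, npratio, pad_index=0):
    """Return exactly npratio negatives, padding when too few are available."""
    if npratio > len(negatives):
        return negatives + [pad_index] * (npratio - len(negatives))
    return random.sample(negatives, npratio)


def build_training_instances(impression_news, impression_labels, npratio):
    """Yield (candidates, labels) pairs for one impression."""
    pairs = list(zip(impression_news, impression_labels))
    if npratio > 0:
        positives = [news for news, label in pairs if label == 1]
        negatives = [news for news, label in pairs if label == 0]
        for positive in positives:
            # One instance per click: the positive plus npratio negatives.
            candidates = [positive] + sample_negatives(negatives, npratio)
            labels = [1] + [0] * npratio
            yield candidates, labels
    else:
        # No sampling: every article in the impression is its own instance.
        for news, label in pairs:
            yield [news], [label]


random.seed(0)
impression_news = [11, 12, 13, 14, 15, 16]
impression_labels = [0, 1, 0, 0, 1, 0]

sampled = list(
    build_training_instances(impression_news, impression_labels, npratio=4)
)
unsampled = list(
    build_training_instances(impression_news, impression_labels, npratio=-1)
)
```

Here the impression contains two clicks, so `sampled` holds two instances of five candidates each, while `unsampled` holds six single-candidate instances.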
The iterator provides separate loading methods for the different stages of training and evaluation: load_data_from_file for training batches, load_user_from_file for user encoder inference, load_news_from_file for news encoder inference, and load_impression_from_file for impression-level evaluation.
Usage
Use MINDAllIterator when working with the NAML model or any model that requires the full set of MIND article features (title, body, category, subcategory). It is specifically required by the NAML architecture due to its multi-view news encoder design. For models that only require title features (such as NRMS, LSTUR, or NPA), use the simpler MINDIterator instead.
Code Reference
Source Location
- Repository: Recommenders
- File: recommenders/models/newsrec/io/mind_all_iterator.py
- Lines: 1-602
Signature
class MINDAllIterator(BaseIterator):
    def __init__(self, hparams, npratio=-1, col_spliter="\t", ID_spliter="%")
    def load_dict(self, file_path)
    def init_news(self, news_file)
    def init_behaviors(self, behaviors_file)
    def parser_one_line(self, line)
    def load_data_from_file(self, news_file, behavior_file)
    def _convert_data(self, label_list, imp_indexes, user_indexes, candidate_title_indexes, candidate_ab_indexes, candidate_vert_indexes, candidate_subvert_indexes, click_title_indexes, click_ab_indexes, click_vert_indexes, click_subvert_indexes)
    def load_user_from_file(self, news_file, behavior_file)
    def _convert_user_data(self, user_indexes, impr_indexes, click_title_indexes, click_ab_indexes, click_vert_indexes, click_subvert_indexes)
    def load_news_from_file(self, news_file)
    def _convert_news_data(self, news_indexes, candidate_title_indexes, candidate_ab_indexes, candidate_vert_indexes, candidate_subvert_indexes)
    def load_impression_from_file(self, behaivors_file)
Import
from recommenders.models.newsrec.io.mind_all_iterator import MINDAllIterator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hparams | object | Yes | Global hyper-parameters containing batch_size, title_size, body_size, his_size, wordDict_file, vertDict_file, subvertDict_file, and userDict_file |
| npratio | int | No | Negative-to-positive sampling ratio. Default is -1 (no negative sampling). Set to a positive integer for training. |
| col_spliter | str | No | Column delimiter in data files. Default is tab character. |
| ID_spliter | str | No | ID delimiter in data files. Default is "%". |
| news_file | str | Yes (for load methods) | Path to the news file containing article metadata (ID, category, subcategory, title, abstract, URL). |
| behavior_file | str | Yes (for load methods) | Path to the behaviors file containing user impression logs. |
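To make the expected shape of hparams concrete, the sketch below builds a plain stand-in object with the fields listed in the table and checks that none are missing. It is only an illustration: `SimpleNamespace` substitutes for whatever hyper-parameter object the library normally supplies, the helper `missing_hparams` is hypothetical, and every path and size shown is a placeholder.

```python
from types import SimpleNamespace

# Fields the iterator reads from hparams, per the Inputs table.
REQUIRED_FIELDS = (
    "batch_size",
    "title_size",
    "body_size",
    "his_size",
    "wordDict_file",
    "vertDict_file",
    "subvertDict_file",
    "userDict_file",
)


def missing_hparams(hparams):
    """Return the names of required fields that hparams does not define."""
    return [name for name in REQUIRED_FIELDS if not hasattr(hparams, name)]


# Placeholder values and paths; replace them with your own settings.
hparams = SimpleNamespace(
    batch_size=32,
    title_size=30,
    body_size=50,
    his_size=50,
    wordDict_file="path/to/word_dict.pkl",
    vertDict_file="path/to/vert_dict.pkl",
    subvertDict_file="path/to/subvert_dict.pkl",
    userDict_file="path/to/user_dict.pkl",
)

missing = missing_hparams(hparams)
incomplete = missing_hparams(SimpleNamespace(batch_size=32))
```

Running such a check before constructing the iterator surfaces a missing dictionary path early instead of failing partway through initialization.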
Outputs
| Name | Type | Description |
|---|---|---|
| training batch (from load_data_from_file) | dict | Dictionary with keys: "impression_index_batch", "user_index_batch", "clicked_title_batch", "clicked_ab_batch", "clicked_vert_batch", "clicked_subvert_batch", "candidate_title_batch", "candidate_ab_batch", "candidate_vert_batch", "candidate_subvert_batch", "labels" -- all as numpy arrays. |
| user batch (from load_user_from_file) | dict | Dictionary with keys: "user_index_batch", "impr_index_batch", "clicked_title_batch", "clicked_ab_batch", "clicked_vert_batch", "clicked_subvert_batch" -- all as numpy arrays. |
| news batch (from load_news_from_file) | dict | Dictionary with keys: "news_index_batch", "candidate_title_batch", "candidate_ab_batch", "candidate_vert_batch", "candidate_subvert_batch" -- all as numpy arrays. |
| impression data (from load_impression_from_file) | tuple | Tuple of (impression_index, impression_news_indices, user_index, impression_labels). |
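The impression tuple in the last row can be illustrated by parsing a single behaviors line. The sketch assumes the standard MIND behaviors layout (impression ID, user ID, time, click history, impressions), where each impression entry is a news ID and a 0/1 label joined by a hyphen. The function name, the `nid2index` and `uid2index` dictionaries, and the fallback of 0 for unknown users are hypothetical.

```python
def parse_behavior_line(line, impression_index, nid2index, uid2index,
                        col_spliter="\t"):
    """Turn one behaviors line into
    (impression_index, news_indices, user_index, labels)."""
    # Assumption: the last four columns are user ID, time, history, impressions.
    columns = line.rstrip("\n").split(col_spliter)[-4:]
    uid, _time, _history, impressions = columns

    news_indices = []
    labels = []
    for item in impressions.split():
        # Each entry looks like "N123-1": news ID, hyphen, click label.
        news_id, label = item.rsplit("-", 1)
        news_indices.append(nid2index[news_id])
        labels.append(int(label))

    # Assumption: users absent from the dictionary map to index 0.
    user_index = uid2index.get(uid, 0)
    return impression_index, news_indices, user_index, labels


# Hypothetical lookup tables and a made-up behaviors line.
nid2index = {"N1": 1, "N2": 2, "N3": 3}
uid2index = {"U7": 5}
line = "1\tU7\t11/15/2019 8:55:22 AM\tN1\tN2-1 N3-0\n"

impression = parse_behavior_line(line, 0, nid2index, uid2index)
```

The resulting tuple pairs each candidate's news index with its click label, which is the information an impression-level metric needs.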
Usage Examples
Basic Usage
from recommenders.models.newsrec.io.mind_all_iterator import MINDAllIterator

# hparams, news_file, and behavior_file are assumed to be defined beforehand.

# Initialize the iterator with hyper-parameters
iterator = MINDAllIterator(hparams, npratio=4)

# Load training batches from news and behavior files
for batch in iterator.load_data_from_file(news_file, behavior_file):
    # batch is a dict of numpy arrays ready for model consumption
    labels = batch["labels"]
    candidate_titles = batch["candidate_title_batch"]
    clicked_titles = batch["clicked_title_batch"]
    # ... process batch through NAML model

# Load user features for inference
for user_batch in iterator.load_user_from_file(news_file, behavior_file):
    user_indices = user_batch["user_index_batch"]
    clicked_history = user_batch["clicked_title_batch"]

# Load news features for inference
for news_batch in iterator.load_news_from_file(news_file):
    news_indices = news_batch["news_index_batch"]
    news_titles = news_batch["candidate_title_batch"]