
Principle: Recommenders News Recommendation Evaluation

From Leeroopedia


Knowledge Sources
Domains News Recommendation, Evaluation Metrics, Information Retrieval
Last Updated 2026-02-10 00:00 GMT

Overview

Evaluating news recommendation models requires impression-level metrics: predictions are grouped by impression and compared against binary click labels using AUC, MRR, NDCG@5, and NDCG@10.

Description

News recommendation evaluation differs from standard classification evaluation because predictions must be assessed per impression rather than globally. Each impression represents a single user visit where multiple candidate news articles were displayed. The model's task is to rank the clicked articles above the non-clicked ones within each impression.

The evaluation process follows these steps:

  1. Prediction Generation — The model scores each candidate news article within every impression in the validation/test behaviors file.
  2. Grouping — Predictions, labels, and impression indices are grouped by impression ID, so that metrics are computed within each impression context.
  3. Metric Computation — Four standard metrics are computed per impression and then averaged:
    • AUC (Area Under ROC Curve) — Measures the probability that a clicked article is ranked higher than a non-clicked article.
    • MRR (Mean Reciprocal Rank) — The average of 1/rank for the first clicked article in each impression.
    • NDCG@5 (Normalized Discounted Cumulative Gain at 5) — Measures ranking quality in the top 5 positions.
    • NDCG@10 (Normalized Discounted Cumulative Gain at 10) — Measures ranking quality in the top 10 positions.
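The grouping step above can be sketched as follows. The function name and list-based return shape are illustrative, not the library's actual API; the point is that flat per-candidate predictions become per-impression lists before any metric is computed.

```python
from collections import defaultdict

def group_by_impression(impression_ids, labels, preds):
    """Group flat label/prediction lists by impression ID (illustrative helper)."""
    group_labels = defaultdict(list)
    group_preds = defaultdict(list)
    for imp_id, y, p in zip(impression_ids, labels, preds):
        group_labels[imp_id].append(y)
        group_preds[imp_id].append(p)
    keys = sorted(group_labels)
    return [group_labels[k] for k in keys], [group_preds[k] for k in keys]

# Two impressions: IDs 0 and 1, with 3 and 2 candidates respectively.
g_labels, g_preds = group_by_impression(
    [0, 0, 0, 1, 1],
    [1, 0, 0, 0, 1],
    [0.9, 0.2, 0.1, 0.3, 0.8],
)
# g_labels == [[1, 0, 0], [0, 1]]
```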

The evaluation supports two execution modes:

  • Slow eval — Runs the full scorer model for each impression (accurate but slower).
  • Fast eval — Pre-computes news and user embeddings, then scores via dot product (much faster for large datasets, enabled by support_quick_scoring=True).
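A minimal sketch of the fast-eval idea, using hypothetical embedding values: once news and user vectors are precomputed, scoring each candidate reduces to a single dot product instead of a full forward pass through the scorer model.

```python
import numpy as np

# Hypothetical precomputed embeddings, indexed by news ID.
news_vecs = {"N1": np.array([0.1, 0.9]), "N2": np.array([0.8, 0.2])}
user_vec = np.array([0.6, 0.4])  # one user's precomputed representation

# Fast eval: one dot product per candidate article.
scores = {nid: float(vec @ user_vec) for nid, vec in news_vecs.items()}
# scores == {"N1": 0.42, "N2": 0.56}
```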

Usage

Run news recommendation evaluation on the validation set at the end of each training epoch (the fit() loop uses it to monitor validation performance) and on the test set once the final model is trained.

Theoretical Basis

AUC (Area Under ROC Curve)

For each impression with labels y and scores s:
  AUC = P(s_positive > s_negative)

Computed as group_auc: average AUC across all impressions.
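The per-impression AUC can be computed directly from pairwise comparisons, then averaged across impressions as group_auc. This is a sketch with illustrative function names, not the library's implementation:

```python
def impression_auc(labels, scores):
    """Fraction of (clicked, non-clicked) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def group_auc(group_labels, group_preds):
    """Average per-impression AUC across all impressions."""
    aucs = [impression_auc(y, s) for y, s in zip(group_labels, group_preds)]
    return sum(aucs) / len(aucs)

# Clicked article outscores both non-clicked ones -> AUC = 1.0.
print(impression_auc([1, 0, 0], [0.9, 0.2, 0.4]))  # 1.0
```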

MRR (Mean Reciprocal Rank)

For each impression:
  Sort candidates by predicted score descending.
  Find rank r of the first clicked article.
  RR = 1 / r

MRR = mean(RR) across all impressions.
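The per-impression reciprocal rank can be sketched as below (an illustrative helper following the first-clicked-article definition above):

```python
def reciprocal_rank(labels, scores):
    """1/rank of the first clicked article after sorting by score (descending)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            return 1.0 / rank
    return 0.0  # no clicked article in this impression

# The clicked article lands at rank 3 -> RR = 1/3.
print(reciprocal_rank([0, 1, 0], [0.3, 0.2, 0.9]))
```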

NDCG@K (Normalized Discounted Cumulative Gain)

For each impression:
  Sort candidates by predicted score descending.
  DCG@K = sum_{i=1}^{K} (2^{y_i} - 1) / log2(i + 1)
  IDCG@K = DCG@K with ideal (sorted by true labels) ordering.
  NDCG@K = DCG@K / IDCG@K

Reported for K=5 and K=10.
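The NDCG@K formulas above translate directly to code. This is an illustrative sketch, not the library's API; with binary labels the gain 2^y - 1 is simply 1 for clicked and 0 for non-clicked articles:

```python
import numpy as np

def ndcg_at_k(labels, scores, k):
    """NDCG@K per the formulas above: DCG over predicted order / DCG over ideal order."""
    labels = np.asarray(labels, dtype=float)

    def dcg(y):
        y = y[:k]
        gains = 2.0 ** y - 1.0
        discounts = np.log2(np.arange(len(y)) + 2.0)  # log2(i + 1) for 1-based rank i
        return float(np.sum(gains / discounts))

    order = np.argsort(scores)[::-1]   # predicted ranking, best score first
    ideal = np.sort(labels)[::-1]      # ideal ranking, clicked articles first
    return dcg(labels[order]) / dcg(ideal)

# Perfect ranking: clicked article first -> NDCG = 1.0.
print(ndcg_at_k([1, 0, 0], [0.9, 0.1, 0.2], k=5))
```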

Evaluation Flow

behaviors_file -> load impressions
  -> for each impression:
       score all candidates (slow or fast mode)
  -> group predictions by impression_id
  -> cal_metric(group_labels, group_preds, metrics)
  -> return {"group_auc": ..., "mean_mrr": ..., "ndcg@5": ..., "ndcg@10": ...}

Related Pages

Implemented By
