NRMS Architecture
| Knowledge Sources | |
|---|---|
| Domains | News Recommendation, Deep Learning, Self-Attention, NLP |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
NRMS (Neural News Recommendation with Multi-Head Self-Attention) is a neural news recommendation architecture that uses multi-head self-attention for both news encoding (from word sequences) and user encoding (from clicked news history), producing user and news vectors scored via dot product.
Description
The NRMS model, proposed by Wu et al. at EMNLP-IJCNLP 2019, addresses the news recommendation problem through a two-level attention architecture:
News Encoder:
- Takes a news article title as a sequence of word indices.
- Converts word indices to dense vectors using a pre-trained word embedding layer (GloVe).
- Applies multi-head self-attention to capture contextual relationships between words.
- Applies additive attention to aggregate the contextualized word representations into a single news vector.
User Encoder:
- Takes a user's click history as a sequence of news articles.
- Encodes each clicked article using the shared news encoder (via TimeDistributed).
- Applies multi-head self-attention over the clicked news vectors to model inter-article dependencies in the user's reading behavior.
- Applies additive attention to aggregate the clicked news representations into a single user vector.
Candidate Scoring:
- During training, the model scores npratio + 1 candidate articles (1 positive + npratio negatives) against the user vector via dot product, followed by softmax.
- During inference, each candidate is scored individually via dot product, followed by sigmoid.
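The two scoring modes can be sketched in NumPy as follows (a minimal illustration with our own function names, not the library's API; vectors are assumed to come from the news and user encoders):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_scores(candidate_vectors, user_vector):
    """Training-time scoring.

    candidate_vectors: (npratio + 1, d) -- 1 positive + npratio negatives.
    user_vector: (d,).
    Returns a probability distribution over the candidates.
    """
    return softmax(candidate_vectors @ user_vector)

def infer_score(news_vector, user_vector):
    """Inference-time scoring: one candidate at a time, squashed to (0, 1)."""
    return sigmoid(news_vector @ user_vector)
```

Training thus frames the problem as (npratio + 1)-way classification, while inference produces an independent relevance score per candidate.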
The key advantage of NRMS over CNN-based news recommenders (e.g., NAML, NPA) is that multi-head self-attention can capture long-range dependencies between words and between clicked articles without the locality constraints of convolution.
Usage
Use the NRMS architecture when building a news recommendation system that requires modeling word-level and article-level interactions through attention mechanisms. NRMS is the model of choice when you need strong performance on the MIND benchmark with a relatively simple architecture (no category or entity features required, only titles).
Theoretical Basis
The NRMS architecture is built on two core attention mechanisms:
Multi-Head Self-Attention
Given input sequence H = [h_1, h_2, ..., h_N]:
Q = H * W_Q, K = H * W_K, V = H * W_V
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
MultiHead(H) = Concat(head_1, ..., head_h) * W_O
where head_i = Attention(H * W_Qi, H * W_Ki, H * W_Vi)
Parameters: head_num (number of heads), head_dim (dimension per head).
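The equations above can be implemented directly in NumPy. This is a minimal sketch (function and parameter names are ours), with the per-head projections packed into single matrices of width head_num * head_dim:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(H, W_Q, W_K, W_V, W_O, head_num, head_dim):
    """H: (N, d_model); W_Q/W_K/W_V: (d_model, head_num * head_dim);
    W_O: (head_num * head_dim, d_out). Returns (N, d_out)."""
    N = H.shape[0]

    def split(X):
        # (N, head_num * head_dim) -> (head_num, N, head_dim)
        return X.reshape(N, head_num, head_dim).transpose(1, 0, 2)

    Q, K, V = split(H @ W_Q), split(H @ W_K), split(H @ W_V)
    # Scaled dot-product attention per head: (head_num, N, N)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(head_dim)
    heads = softmax(scores, axis=-1) @ V          # (head_num, N, head_dim)
    # Concatenate heads and apply the output projection W_O.
    concat = heads.transpose(1, 0, 2).reshape(N, head_num * head_dim)
    return concat @ W_O
```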
Additive Attention
Given contextualized sequence M = [m_1, m_2, ..., m_N]:
a_i = q^T * tanh(W_a * m_i + b_a)
alpha_i = exp(a_i) / sum_j(exp(a_j))
output = sum_i(alpha_i * m_i)
Parameter: attention_hidden_dim (dimension of the attention projection).
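A minimal NumPy sketch of the additive attention pooling above (names are ours; W_a, b_a, and q are the learned projection, bias, and query vector from the equations):

```python
import numpy as np

def additive_attention(M, W_a, b_a, q):
    """M: (N, d); W_a: (d, attention_hidden_dim); b_a: (attention_hidden_dim,);
    q: (attention_hidden_dim,). Returns a single (d,) pooled vector."""
    a = np.tanh(M @ W_a + b_a) @ q       # unnormalized scores a_i, shape (N,)
    alpha = np.exp(a - a.max())
    alpha = alpha / alpha.sum()          # softmax over sequence positions
    return alpha @ M                     # weighted sum of the m_i
```

Because the weights alpha_i are non-negative and sum to 1, the output is a convex combination of the input vectors.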
Full Pipeline
News Encoder:
title words -> Embedding(word_emb_dim) -> Dropout
-> SelfAttention(head_num, head_dim) -> Dropout
-> AdditiveAttention(attention_hidden_dim) -> news_vector
User Encoder:
clicked_news -> TimeDistributed(NewsEncoder) -> clicked_vectors
-> SelfAttention(head_num, head_dim)
-> AdditiveAttention(attention_hidden_dim) -> user_vector
Training Score:
dot(candidate_news_vectors, user_vector) -> softmax -> probabilities
Inference Score:
dot(single_news_vector, user_vector) -> sigmoid -> score
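The full pipeline above can be sketched end to end in NumPy. This is a simplified toy forward pass under stated assumptions: all names and initializations are ours, word embeddings are random stand-ins for GloVe, dropout is omitted, and the heads are concatenated without a separate output projection:

```python
import numpy as np

rng = np.random.default_rng(0)
word_emb_dim, head_num, head_dim, attn_dim = 8, 2, 4, 6
d_model = head_num * head_dim  # = word_emb_dim here, for simplicity

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, P):
    """Multi-head self-attention; P = (W_Q, W_K, W_V), each (d_in, d_model)."""
    W_Q, W_K, W_V = P
    N = H.shape[0]
    split = lambda X: X.reshape(N, head_num, head_dim).transpose(1, 0, 2)
    Q, K, V = split(H @ W_Q), split(H @ W_K), split(H @ W_V)
    heads = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(head_dim)) @ V
    return heads.transpose(1, 0, 2).reshape(N, d_model)

def additive_attention(M, P):
    """Additive attention pooling; P = (W_a, b_a, q)."""
    W_a, b_a, q = P
    alpha = softmax(np.tanh(M @ W_a + b_a) @ q)
    return alpha @ M

def init_sa(d_in):
    return [rng.standard_normal((d_in, d_model)) * 0.1 for _ in range(3)]

def init_aa(d_in):
    return (rng.standard_normal((d_in, attn_dim)) * 0.1,
            np.zeros(attn_dim),
            rng.standard_normal(attn_dim) * 0.1)

# News encoder parameters are shared between clicked and candidate news.
sa_news, aa_news = init_sa(word_emb_dim), init_aa(d_model)
sa_user, aa_user = init_sa(d_model), init_aa(d_model)

def news_encoder(title_embs):
    """(title_len, word_emb_dim) -> (d_model,) news vector."""
    return additive_attention(self_attention(title_embs, sa_news), aa_news)

def user_encoder(clicked_titles):
    """List of title embedding matrices -> (d_model,) user vector."""
    clicked_vecs = np.stack([news_encoder(t) for t in clicked_titles])
    return additive_attention(self_attention(clicked_vecs, sa_user), aa_user)

# Toy forward pass: 3 clicked news, 5 candidates (1 positive + 4 negatives).
clicked = [rng.standard_normal((10, word_emb_dim)) for _ in range(3)]
candidates = [rng.standard_normal((10, word_emb_dim)) for _ in range(5)]
u = user_encoder(clicked)
probs = softmax(np.stack([news_encoder(c) for c in candidates]) @ u)
```

Applying news_encoder to each clicked article inside user_encoder plays the role of the TimeDistributed wrapper in the Keras implementation.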