NRMS Architecture
| Knowledge Sources | |
|---|---|
| Domains | News Recommendation, Deep Learning, Self-Attention, NLP |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
NRMS (Neural News Recommendation with Multi-Head Self-Attention) is a neural news recommendation architecture that uses multi-head self-attention for both news encoding (from word sequences) and user encoding (from clicked news history), producing user and news vectors scored via dot product.
Description
The NRMS model, proposed by Wu et al. at EMNLP-IJCNLP 2019, addresses the news recommendation problem through a two-level attention architecture:
News Encoder:
- Takes a news article title as a sequence of word indices.
- Converts word indices to dense vectors using a pre-trained word embedding layer (GloVe).
- Applies multi-head self-attention to capture contextual relationships between words.
- Applies additive attention to aggregate the contextualized word representations into a single news vector.
User Encoder:
- Takes a user's click history as a sequence of news articles.
- Encodes each clicked article using the shared news encoder (via TimeDistributed).
- Applies multi-head self-attention over the clicked news vectors to model inter-article dependencies in the user's reading behavior.
- Applies additive attention to aggregate the clicked news representations into a single user vector.
Candidate Scoring:
- During training, the model scores npratio + 1 candidate articles (1 positive + npratio negatives) against the user vector via dot product, followed by softmax.
- During inference, each candidate is scored individually via dot product, followed by sigmoid.
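The two scoring modes can be sketched in NumPy as follows (a minimal illustration with our own function names, not the library's API; vectors are assumed to come from the news and user encoders):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_scores(candidate_vectors, user_vector):
    """Training-time scoring.

    candidate_vectors: (npratio + 1, d) -- 1 positive + npratio negatives.
    user_vector: (d,).
    Returns a probability distribution over the candidates.
    """
    return softmax(candidate_vectors @ user_vector)

def infer_score(news_vector, user_vector):
    """Inference-time scoring: one candidate at a time, squashed to (0, 1)."""
    return sigmoid(news_vector @ user_vector)
```

Training thus frames the problem as (npratio + 1)-way classification, while inference produces an independent relevance score per candidate.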
The key advantage of NRMS over CNN-based news recommenders (e.g., NAML, NPA) is that multi-head self-attention can capture long-range dependencies between words and between clicked articles without the locality constraints of convolution.
Usage
Use the NRMS architecture when building a news recommendation system that requires modeling word-level and article-level interactions through attention mechanisms. NRMS is the model of choice when you need strong performance on the MIND benchmark with a relatively simple architecture (no category or entity features required, only titles).
Theoretical Basis
The NRMS architecture is built on two core attention mechanisms:
Multi-Head Self-Attention
Given input sequence H = [h_1, h_2, ..., h_N]:
Q = H * W_Q, K = H * W_K, V = H * W_V
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
MultiHead(H) = Concat(head_1, ..., head_h) * W_O
where head_i = Attention(H * W_Qi, H * W_Ki, H * W_Vi)
Parameters: head_num (number of heads), head_dim (dimension per head).
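The equations above can be implemented directly in NumPy. This is a minimal sketch (function and parameter names are ours), with the per-head projections packed into single matrices of width head_num * head_dim:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(H, W_Q, W_K, W_V, W_O, head_num, head_dim):
    """H: (N, d_model); W_Q/W_K/W_V: (d_model, head_num * head_dim);
    W_O: (head_num * head_dim, d_out). Returns (N, d_out)."""
    N = H.shape[0]

    def split(X):
        # (N, head_num * head_dim) -> (head_num, N, head_dim)
        return X.reshape(N, head_num, head_dim).transpose(1, 0, 2)

    Q, K, V = split(H @ W_Q), split(H @ W_K), split(H @ W_V)
    # Scaled dot-product attention per head: (head_num, N, N)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(head_dim)
    heads = softmax(scores, axis=-1) @ V          # (head_num, N, head_dim)
    # Concatenate heads and apply the output projection W_O.
    concat = heads.transpose(1, 0, 2).reshape(N, head_num * head_dim)
    return concat @ W_O
```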
Additive Attention
Given contextualized sequence M = [m_1, m_2, ..., m_N]:
a_i = q^T * tanh(W_a * m_i + b_a)
alpha_i = exp(a_i) / sum_j(exp(a_j))
output = sum_i(alpha_i * m_i)
Parameter: attention_hidden_dim (dimension of the attention projection).
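A minimal NumPy sketch of the additive attention pooling above (names are ours; W_a, b_a, and q are the learned projection, bias, and query vector from the equations):

```python
import numpy as np

def additive_attention(M, W_a, b_a, q):
    """M: (N, d); W_a: (d, attention_hidden_dim); b_a: (attention_hidden_dim,);
    q: (attention_hidden_dim,). Returns a single (d,) pooled vector."""
    a = np.tanh(M @ W_a + b_a) @ q       # unnormalized scores a_i, shape (N,)
    alpha = np.exp(a - a.max())
    alpha = alpha / alpha.sum()          # softmax over sequence positions
    return alpha @ M                     # weighted sum of the m_i
```

Because the weights alpha_i are non-negative and sum to 1, the output is a convex combination of the input vectors.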
Full Pipeline
News Encoder:
title words -> Embedding(word_emb_dim) -> Dropout
-> SelfAttention(head_num, head_dim) -> Dropout
-> AdditiveAttention(attention_hidden_dim) -> news_vector
User Encoder:
clicked_news -> TimeDistributed(NewsEncoder) -> clicked_vectors
-> SelfAttention(head_num, head_dim)
-> AdditiveAttention(attention_hidden_dim) -> user_vector
Training Score:
dot(candidate_news_vectors, user_vector) -> softmax -> probabilities
Inference Score:
dot(single_news_vector, user_vector) -> sigmoid -> score
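The full pipeline above can be sketched end to end in NumPy. This is a simplified toy forward pass under stated assumptions: all names and initializations are ours, word embeddings are random stand-ins for GloVe, dropout is omitted, and the heads are concatenated without a separate output projection:

```python
import numpy as np

rng = np.random.default_rng(0)
word_emb_dim, head_num, head_dim, attn_dim = 8, 2, 4, 6
d_model = head_num * head_dim  # = word_emb_dim here, for simplicity

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, P):
    """Multi-head self-attention; P = (W_Q, W_K, W_V), each (d_in, d_model)."""
    W_Q, W_K, W_V = P
    N = H.shape[0]
    split = lambda X: X.reshape(N, head_num, head_dim).transpose(1, 0, 2)
    Q, K, V = split(H @ W_Q), split(H @ W_K), split(H @ W_V)
    heads = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(head_dim)) @ V
    return heads.transpose(1, 0, 2).reshape(N, d_model)

def additive_attention(M, P):
    """Additive attention pooling; P = (W_a, b_a, q)."""
    W_a, b_a, q = P
    alpha = softmax(np.tanh(M @ W_a + b_a) @ q)
    return alpha @ M

def init_sa(d_in):
    return [rng.standard_normal((d_in, d_model)) * 0.1 for _ in range(3)]

def init_aa(d_in):
    return (rng.standard_normal((d_in, attn_dim)) * 0.1,
            np.zeros(attn_dim),
            rng.standard_normal(attn_dim) * 0.1)

# News encoder parameters are shared between clicked and candidate news.
sa_news, aa_news = init_sa(word_emb_dim), init_aa(d_model)
sa_user, aa_user = init_sa(d_model), init_aa(d_model)

def news_encoder(title_embs):
    """(title_len, word_emb_dim) -> (d_model,) news vector."""
    return additive_attention(self_attention(title_embs, sa_news), aa_news)

def user_encoder(clicked_titles):
    """List of title embedding matrices -> (d_model,) user vector."""
    clicked_vecs = np.stack([news_encoder(t) for t in clicked_titles])
    return additive_attention(self_attention(clicked_vecs, sa_user), aa_user)

# Toy forward pass: 3 clicked news, 5 candidates (1 positive + 4 negatives).
clicked = [rng.standard_normal((10, word_emb_dim)) for _ in range(3)]
candidates = [rng.standard_normal((10, word_emb_dim)) for _ in range(5)]
u = user_encoder(clicked)
probs = softmax(np.stack([news_encoder(c) for c in candidates]) @ u)
```

Applying news_encoder to each clicked article inside user_encoder plays the role of the TimeDistributed wrapper in the Keras implementation.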