Principle:LaurentMazare Tch rs Seq2Seq Attention Translation

Knowledge Sources	LaurentMazare_Tch_rs; Bahdanau et al., 2014; Sutskever et al., 2014
Domains	Deep Learning, Natural Language Processing, Machine Translation
Last Updated	2026-02-08 00:00 GMT

Overview

Sequence-to-sequence models with attention encode variable-length input sequences and decode output sequences by dynamically attending to relevant encoder states at each generation step.

Description

The encoder-decoder architecture maps between sequences of different lengths. The encoder processes the input sequence (e.g., a sentence in the source language) one token at a time, producing a sequence of hidden states that capture contextual information at each position. In a basic seq2seq model, only the final encoder hidden state is passed to the decoder, so the entire input must be compressed into one fixed-size vector, creating an information bottleneck.

The attention mechanism resolves this bottleneck by allowing the decoder to look at all encoder hidden states when generating each output token. At each decoding step, attention computes a relevance score between the current decoder state and every encoder hidden state. These scores are normalized into a probability distribution (attention weights), which is used to form a context vector as a weighted sum of encoder states. This context vector is then combined with the decoder's own hidden state to predict the next output token.

This approach enables the model to learn soft alignments between source and target positions, effectively learning which parts of the input are most relevant for generating each part of the output.

Usage

Apply the encoder-decoder with attention principle when:

Translating between natural languages where input and output lengths differ
Performing other sequence-to-sequence tasks such as summarization or question answering
Building generative models that must condition on variable-length input
Handling input sequences long enough that a fixed-size encoding would lose information

Theoretical Basis

Encoder

The encoder processes input tokens $x_{1}, x_{2}, \dots, x_{T}$ through a recurrent network, producing hidden states:

$h_{t} = f (x_{t}, h_{t - 1})$

where $f$ is typically an LSTM or GRU cell.
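
The recurrence above can be sketched in plain Rust. This is only an illustration under stated assumptions: it uses a simple tanh (Elman) cell as a stand-in for the LSTM or GRU cell $f$, and the names `rnn_step`, `encode`, `w_x`, `w_h`, and `b` are hypothetical, not part of tch-rs.

```rust
/// One step of a simple tanh cell: h_t = tanh(W_x x_t + W_h h_{t-1} + b).
/// Assumption: this replaces the LSTM/GRU cell purely to keep the sketch short.
fn rnn_step(
    x: &[f64],
    h_prev: &[f64],
    w_x: &[Vec<f64>],
    w_h: &[Vec<f64>],
    b: &[f64],
) -> Vec<f64> {
    (0..h_prev.len())
        .map(|i| {
            let from_input: f64 = w_x[i].iter().zip(x).map(|(w, v)| w * v).sum();
            let from_state: f64 = w_h[i].iter().zip(h_prev).map(|(w, v)| w * v).sum();
            (from_input + from_state + b[i]).tanh()
        })
        .collect()
}

/// Runs the cell over the whole input and returns every hidden state h_1..h_T.
fn encode(
    inputs: &[Vec<f64>],
    w_x: &[Vec<f64>],
    w_h: &[Vec<f64>],
    b: &[f64],
) -> Vec<Vec<f64>> {
    let mut h = vec![0.0; b.len()]; // h_0 is initialized to zeros
    let mut states = Vec::with_capacity(inputs.len());
    for x in inputs {
        h = rnn_step(x, &h, w_x, w_h, b);
        states.push(h.clone()); // keep every state so attention can use them later
    }
    states
}
```

Keeping all hidden states, rather than only the last one, is what makes attention over the encoder possible.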

Attention Mechanism

At each decoder time step $s$, attention scores are computed:

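$e_{s, t} = a (d_{s - 1}, h_{t})$

where $d_{s - 1}$ is the previous decoder hidden state and $a$ is a learned alignment function (e.g., a small feedforward network). The previous state is used because $d_{s}$ itself depends on the context vector $c_{s}$ that these scores produce.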
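
One common choice for the alignment function is the additive form $a (d, h) = v^{\top} \tanh (W_{d} d + W_{h} h)$ from Bahdanau et al., 2014. The sketch below assumes that form; the parameter names `w_d`, `w_h`, and `v` and the helper `mat_vec` are hypothetical and chosen only for illustration.

```rust
/// Multiplies a matrix (stored as a slice of rows) by a vector.
fn mat_vec(m: &[Vec<f64>], v: &[f64]) -> Vec<f64> {
    m.iter()
        .map(|row| row.iter().zip(v).map(|(a, b)| a * b).sum::<f64>())
        .collect()
}

/// Additive alignment score: a(d, h) = v . tanh(W_d d + W_h h).
/// `d` is a decoder state and `h` is one encoder hidden state.
fn additive_score(
    d: &[f64],
    h: &[f64],
    w_d: &[Vec<f64>],
    w_h: &[Vec<f64>],
    v: &[f64],
) -> f64 {
    let projected_d = mat_vec(w_d, d);
    let projected_h = mat_vec(w_h, h);
    projected_d
        .iter()
        .zip(&projected_h)
        .zip(v)
        .map(|((a, b), vi)| vi * (a + b).tanh())
        .sum()
}

/// Scores one decoder state against every encoder hidden state.
fn alignment_scores(
    d: &[f64],
    encoder_states: &[Vec<f64>],
    w_d: &[Vec<f64>],
    w_h: &[Vec<f64>],
    v: &[f64],
) -> Vec<f64> {
    encoder_states
        .iter()
        .map(|h| additive_score(d, h, w_d, w_h, v))
        .collect()
}
```

The result is one unnormalized score per encoder position; the scores only become weights after the softmax step.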

Attention weights are obtained via softmax normalization:

$\alpha_{s, t} = \frac{\exp (e_{s, t})}{\sum_{k = 1}^{T} \exp (e_{s, k})}$
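
The normalization can be sketched as a small function over the scores for one decoder step. Subtracting the maximum score before exponentiating is a standard numerical-stability step that leaves the result unchanged; the function name `softmax` here is just an illustrative helper, not a tch-rs call.

```rust
/// Turns raw attention scores e_{s,1..T} into weights that sum to 1.
fn softmax(scores: &[f64]) -> Vec<f64> {
    // Shifting by the maximum avoids overflow in exp() and does not change the ratios.
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|e| (e - max).exp()).collect();
    let total: f64 = exps.iter().sum();
    exps.iter().map(|x| x / total).collect()
}
```

Equal scores yield uniform weights, and a much larger score pushes nearly all of the weight onto a single encoder position.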

The context vector is the weighted sum of encoder states:

$c_{s} = \sum_{t = 1}^{T} \alpha_{s, t} h_{t}$
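
As a minimal sketch, the weighted sum can be written directly over plain vectors. The function name `context_vector` is hypothetical, and the weights are assumed to be already normalized.

```rust
/// Computes c_s as the sum over t of alpha_{s,t} * h_t.
/// `weights` holds the attention weights for one decoder step and
/// `encoder_states` holds one hidden-state vector per source position.
fn context_vector(weights: &[f64], encoder_states: &[Vec<f64>]) -> Vec<f64> {
    assert_eq!(weights.len(), encoder_states.len());
    let dim = encoder_states.first().map_or(0, |h| h.len());
    let mut context = vec![0.0; dim];
    for (alpha, h) in weights.iter().zip(encoder_states) {
        for (c, value) in context.iter_mut().zip(h) {
            *c += alpha * value;
        }
    }
    context
}
```

For example, weights of 0.25 and 0.75 over the states (1, 0) and (0, 1) give the context (0.25, 0.75): the context has the same dimension as one encoder state regardless of the source length $T$.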

Decoder

The decoder generates output tokens one at a time, conditioning on the previously generated token, its own previous state, and the context vector:

$d_{s} = g (y_{s - 1}, d_{s - 1}, c_{s})$

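$P (y_{s} \mid y_{< s}, x) = \operatorname{softmax} (W [d_{s}; c_{s}] + b)$

where $[d_{s}; c_{s}]$ denotes the concatenation of the two vectors and $W$, $b$ are learned output-projection parameters.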
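
The output projection can be sketched as follows, assuming plain vectors and a weight matrix with one row per vocabulary entry. The names `output_logits` and `greedy_token` are hypothetical. Greedy selection is shown on the logits directly: because softmax is monotonic, the largest logit is also the most probable token.

```rust
/// Computes the logits W [d; c] + b, where [d; c] concatenates the
/// decoder state with the context vector.
fn output_logits(d: &[f64], c: &[f64], w: &[Vec<f64>], b: &[f64]) -> Vec<f64> {
    let concat: Vec<f64> = d.iter().chain(c).cloned().collect();
    w.iter()
        .zip(b)
        .map(|(row, bias)| row.iter().zip(&concat).map(|(a, x)| a * x).sum::<f64>() + bias)
        .collect()
}

/// Greedy decoding choice: the index of the largest logit, or None if there are no logits.
fn greedy_token(logits: &[f64]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(index, _)| index)
}
```

Greedy selection is only one possible decoding strategy; it is used here because it is the shortest to express.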

Training

The model is trained to maximize the log-likelihood of the target sequence given the source sequence, $\sum_{s} \log P (y_{s} \mid y_{< s}, x)$. Teacher forcing feeds the ground-truth previous target token $y_{s - 1}$ as decoder input during training rather than the model's own prediction, which stabilizes learning while early predictions are still unreliable.
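
A minimal sketch of both ideas, assuming tokens are represented as vocabulary indices and that a start-of-sequence index `sos` exists (both are assumptions for illustration, as are the function names). Maximizing the log-likelihood is expressed here as minimizing its negative.

```rust
/// Teacher forcing: the decoder input at step s is the ground-truth token from
/// step s - 1, with a start-of-sequence token feeding the first step.
fn teacher_forced_inputs(targets: &[usize], sos: usize) -> Vec<usize> {
    std::iter::once(sos)
        .chain(targets.iter().copied())
        .take(targets.len())
        .collect()
}

/// Negative log-likelihood of the target sequence.
/// `step_probs[s]` is the predicted distribution over the vocabulary at step s.
fn sequence_nll(step_probs: &[Vec<f64>], targets: &[usize]) -> f64 {
    step_probs
        .iter()
        .zip(targets)
        .map(|(probs, &y)| -probs[y].ln())
        .sum()
}
```

With targets `[5, 6, 7]` and `sos = 0`, the decoder inputs are `[0, 5, 6]`: each step sees the correct previous token even if the model would have predicted something else.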

Related Pages

Implementation:LaurentMazare_Tch_rs_Seq2Seq_Translation
