Principle:LaurentMazare Tch rs Seq2Seq Attention Translation

From Leeroopedia


Knowledge Sources
Domains: Deep Learning, Natural Language Processing, Machine Translation
Last Updated: 2026-02-08 00:00 GMT

Overview

Sequence-to-sequence models with attention encode variable-length input sequences and decode output sequences by dynamically attending to relevant encoder states at each generation step.

Description

The encoder-decoder architecture addresses the problem of mapping between sequences of different lengths. The encoder processes the input sequence (e.g., a sentence in the source language) one token at a time, producing a sequence of hidden states that capture contextual information at each position. In a basic seq2seq model, only the final encoder hidden state is passed to the decoder, so the entire input must be compressed into one fixed-size vector, which creates an information bottleneck.

The attention mechanism resolves this bottleneck by allowing the decoder to look at all encoder hidden states when generating each output token. At each decoding step, attention computes a relevance score between the current decoder state and every encoder hidden state. These scores are normalized into a probability distribution (attention weights), which is used to form a context vector as a weighted sum of encoder states. This context vector is then combined with the decoder's own hidden state to predict the next output token.

This approach enables the model to learn soft alignments between source and target positions, effectively learning which parts of the input are most relevant for generating each part of the output.
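As a rough illustration of soft alignment, the sketch below reads a hard alignment off an attention matrix by picking the highest-weight source position for each target token. The function name `hard_alignment`, the sentence pair, and the weights are all hypothetical: the numbers are written by hand for illustration and do not come from a trained model.

```python
def hard_alignment(attention, source_tokens, target_tokens):
    """For each target token, return the source token with the largest weight.

    attention[s][t] is the weight that target step s places on source position t.
    """
    pairs = []
    for target_token, row in zip(target_tokens, attention):
        best = max(range(len(row)), key=lambda t: row[t])
        pairs.append((target_token, source_tokens[best]))
    return pairs


# Hand-written example weights (NOT model output); each row sums to 1.
source_tokens = ["le", "chat", "noir"]
target_tokens = ["the", "black", "cat"]
attention = [
    [0.8, 0.1, 0.1],  # "the"   attends mostly to "le"
    [0.1, 0.2, 0.7],  # "black" attends mostly to "noir"
    [0.1, 0.7, 0.2],  # "cat"   attends mostly to "chat"
]

alignment = hard_alignment(attention, source_tokens, target_tokens)
```

With these weights the second and third target tokens map to the third and second source tokens, which is the kind of reordering a soft alignment can express and a single fixed encoding cannot make explicit.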

Usage

Apply the encoder-decoder with attention principle when:

  • Translating between natural languages where input and output lengths differ
  • Performing any sequence-to-sequence task such as summarization or question answering
  • Building generative models that must condition on variable-length input
  • The input sequence is long enough that a fixed-size encoding would lose information

Theoretical Basis

Encoder

The encoder processes input tokens $x_1, x_2, \ldots, x_T$ through a recurrent network, producing hidden states:

$$h_t = f(x_t, h_{t-1})$$

where f is typically an LSTM or GRU cell.
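The recurrence can be sketched in a few lines of plain Python. This is a minimal illustration, not the tch-rs implementation: it assumes a simple tanh cell in place of an LSTM or GRU, a zero initial state, and toy hand-picked weights. The names `rnn_cell` and `encode` are hypothetical.

```python
import math


def rnn_cell(x_t, h_prev, W_x, W_h, b):
    """One step h_t = tanh(W_x x_t + W_h h_{t-1} + b), a simple stand-in for f."""
    hidden_size = len(b)
    h_t = []
    for i in range(hidden_size):
        total = b[i]
        total += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        total += sum(W_h[i][j] * h_prev[j] for j in range(hidden_size))
        h_t.append(math.tanh(total))
    return h_t


def encode(inputs, W_x, W_h, b):
    """Run the recurrence over x_1..x_T and return every hidden state h_1..h_T."""
    h = [0.0] * len(b)  # assumption: h_0 is the zero vector
    states = []
    for x_t in inputs:
        h = rnn_cell(x_t, h, W_x, W_h, b)
        states.append(h)
    return states


# Toy parameters: 2-dimensional inputs and a 2-dimensional hidden state.
W_x = [[0.5, -0.3], [0.1, 0.8]]
W_h = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T = 3 embedded tokens

encoder_states = encode(inputs, W_x, W_h, b)
```

Returning all states rather than only the last one is the design choice that matters here, since attention needs every $h_t$.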

Attention Mechanism

At each decoder time step s, attention scores are computed:

$$e_{s,t} = a(d_s, h_t)$$

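where $d_s$ is the decoder hidden state and $a$ is a learned alignment function (e.g., a small feedforward network). Because $c_s$ is itself an input to the decoder update given in the Decoder section, implementations of this formulation compute the score from the most recent available state, $d_{s-1}$.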

Attention weights are obtained via softmax normalization:

$$\alpha_{s,t} = \frac{\exp(e_{s,t})}{\sum_{k=1}^{T} \exp(e_{s,k})}$$

The context vector is the weighted sum of encoder states:

$$c_s = \sum_{t=1}^{T} \alpha_{s,t} h_t$$
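The score, softmax, and weighted-sum steps above can be sketched together as follows. This is an illustration under stated assumptions: the alignment function is replaced by a plain dot product (the text describes a learned function such as a small feedforward network), and the names `dot_score` and `attend` as well as the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Normalize scores into a probability distribution (max-shifted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_score(d, h):
    """Assumed stand-in for the alignment function a(d, h): a dot product."""
    return sum(d_i * h_i for d_i, h_i in zip(d, h))


def attend(decoder_state, encoder_states, score=dot_score):
    """Return (attention weights, context vector) for one decoder step."""
    scores = [score(decoder_state, h_t) for h_t in encoder_states]  # e_{s,t}
    weights = softmax(scores)                                       # alpha_{s,t}
    dim = len(encoder_states[0])
    context = [
        sum(w * h_t[i] for w, h_t in zip(weights, encoder_states))  # c_s
        for i in range(dim)
    ]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
```

In this toy case the first and third encoder states receive equal scores and therefore equal weights, while the second state, which is orthogonal to the decoder state, receives the smallest weight.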

Decoder

The decoder generates output tokens one at a time, conditioning on the context vector and its own previous state:

$$d_s = g(y_{s-1}, d_{s-1}, c_s)$$

$$P(y_s \mid y_{<s}, x) = \mathrm{softmax}(W[d_s; c_s] + b)$$
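A single decoder step following these two equations might look like the sketch below. It takes the context vector as a given input, assumes a tanh of an affine map in place of an LSTM or GRU for the update function, and uses toy weights chosen by hand. The names `affine` and `decoder_step` are hypothetical.

```python
import math


def softmax(logits):
    """Normalize logits into a probability distribution (max-shifted for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def affine(W, v, b):
    """Return W v + b for a matrix W stored as a list of rows."""
    return [
        sum(w_ij * v_j for w_ij, v_j in zip(row, v)) + b_i
        for row, b_i in zip(W, b)
    ]


def decoder_step(y_prev, d_prev, c_s, W_g, b_g, W_out, b_out):
    """One decoding step: update the state, then score the vocabulary."""
    # d_s = g(y_{s-1}, d_{s-1}, c_s); g is assumed to be tanh(affine(concat)).
    d_s = [math.tanh(v) for v in affine(W_g, y_prev + d_prev + c_s, b_g)]
    # P(y_s | y_<s, x) = softmax(W [d_s; c_s] + b)
    probs = softmax(affine(W_out, d_s + c_s, b_out))
    return d_s, probs


# Toy sizes: 2-d embedding, 2-d decoder state, 2-d context, 3-word vocabulary.
y_prev = [1.0, 0.0]   # embedding of the previous output token
d_prev = [0.0, 0.0]   # previous decoder state
c_s = [0.5, -0.5]     # context vector from the attention step
W_g = [[0.1, 0.2, 0.0, 0.0, 0.3, 0.0],
       [0.0, 0.1, 0.0, 0.0, 0.0, 0.3]]
b_g = [0.0, 0.0]
W_out = [[1.0, 0.0, 0.5, 0.0],
         [0.0, 1.0, 0.0, 0.5],
         [0.0, 0.0, 0.0, 0.0]]
b_out = [0.0, 0.0, 0.0]

d_s, probs = decoder_step(y_prev, d_prev, c_s, W_g, b_g, W_out, b_out)
next_token = max(range(len(probs)), key=lambda k: probs[k])  # greedy choice
```

Greedy selection is used here only to keep the example short; sampling or beam search would consume the same `probs` vector.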

Training

The model is trained to maximize the log-likelihood of the target sequence given the source sequence, $\sum_{s} \log P(y_s \mid y_{<s}, x)$. Teacher forcing feeds the ground-truth previous target token $y_{s-1}$ to the decoder during training rather than the model's own prediction, so an early mistake does not derail the remaining steps and learning is more stable.
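To make the objective and teacher forcing concrete, the sketch below computes the negative log-likelihood of a target sequence with and without teacher forcing. The decoder is replaced by a hypothetical `toy_step` function with fixed probabilities, so the numbers only illustrate the mechanics and say nothing about a trained model. The names `sequence_nll` and `toy_step` are assumptions for this example.

```python
import math


def sequence_nll(step_fn, initial_state, target, sos_token, teacher_forcing=True):
    """Negative log-likelihood of `target` under a step-wise decoder.

    step_fn(prev_token, state) must return (probabilities, new_state).
    """
    state = initial_state
    prev = sos_token
    loss = 0.0
    for gold in target:
        probs, state = step_fn(prev, state)
        loss -= math.log(probs[gold])
        if teacher_forcing:
            prev = gold  # feed the ground-truth token
        else:
            prev = max(range(len(probs)), key=lambda k: probs[k])  # feed own prediction
    return loss


def toy_step(prev_token, state):
    """Hypothetical decoder: favours the token after `prev_token` in a 3-word vocabulary."""
    vocab_size = 3
    probs = [0.1] * vocab_size
    probs[(prev_token + 1) % vocab_size] = 0.8
    return probs, state


target = [1, 0, 1]
loss_teacher_forced = sequence_nll(toy_step, None, target, sos_token=0)
loss_free_running = sequence_nll(toy_step, None, target, sos_token=0,
                                 teacher_forcing=False)
```

In this toy setup the decoder mispredicts the second token. With teacher forcing the third step still sees the correct history and assigns the gold token probability 0.8; when the decoder is fed its own prediction, the error carries over and the third step assigns it only 0.1, so the free-running loss is larger.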
