
Principle: lm-sys/FastChat Seq2Seq SFT Training

From Leeroopedia


Field Value
Page Type Principle
Title Seq2Seq SFT Training
Repository lm-sys/FastChat
Workflow Finetuning
Domains Training, NLP
Knowledge Sources fastchat/train/train_flant5.py, Hugging Face Seq2SeqTrainer documentation
Last Updated 2026-02-07 14:00 GMT

Overview

This principle covers the theory and methodology of supervised fine-tuning (SFT) applied to sequence-to-sequence (encoder-decoder) models such as Flan-T5. Unlike causal language models that generate text left-to-right from a single input stream, encoder-decoder architectures use separate components for understanding input and producing output. This structural difference requires distinct training strategies for tokenization, loss computation, and data formatting.

Description

Encoder-Decoder Architecture Differences

Sequence-to-sequence models differ fundamentally from causal (decoder-only) language models in their architecture. An encoder processes the full input sequence bidirectionally, producing a set of contextualized hidden states. A decoder then attends to these encoder outputs via cross-attention while autoregressively generating the target sequence. This separation means that:

  • The input (instruction/prompt) and output (response) are tokenized independently and passed to different components of the model.
  • The encoder can leverage bidirectional attention over the input, capturing richer contextual representations than a causal mask permits.
  • The decoder generates tokens conditioned on both the encoder's output and previously generated decoder tokens.
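The division of labor above can be sketched with a deliberately tiny model: the "encoder" mixes every input position with the whole sequence (standing in for bidirectional attention), and each "decoder" step combines the previous target token with a cross-attention read of the encoder output. All functions and the scoring scheme are simplified stand-ins for illustration, not the real Transformer computation.

```python
import math

def encode(input_states):
    """Toy bidirectional encoder: every output position sees the whole
    input (each state is mixed with the mean of all states)."""
    mean = sum(input_states) / len(input_states)
    return [0.5 * h + 0.5 * mean for h in input_states]

def cross_attend(query, encoder_states):
    """Toy cross-attention: softmax over query-state products, then a
    weighted sum of the encoder states."""
    scores = [query * h for h in encoder_states]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum((e / total) * h for e, h in zip(exps, encoder_states))

def decode_step(prev_token_state, encoder_states):
    """One decoder step: condition on the previous target token and on
    the encoder output via cross-attention."""
    return prev_token_state + cross_attend(prev_token_state, encoder_states)

# Encoder and decoder operate on separate sequences:
enc_out = encode([1.0, 2.0, 3.0])   # input (instruction) side
step = decode_step(0.5, enc_out)    # output (response) side
```

Note how `decode_step` never sees the raw input, only the encoder's contextualized states; this is the structural separation that lets the two sides be tokenized and length-limited independently.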

Conditional Generation with Teacher Forcing

During SFT, the decoder is trained using teacher forcing: at each decoding step, the ground-truth token from the training data is fed as input (rather than the model's own prediction). This stabilizes training by preventing error accumulation across time steps. The training objective is to maximize the conditional log-likelihood of the target sequence given the input:

L = -sum_{t=1}^{T} log P(y_t | y_{<t}, encoder(x))

where x is the input sequence, y is the target sequence, and T is the target length.
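Numerically, teacher forcing makes this loss a plain sum of per-step negative log-probabilities, since each step conditions on the ground-truth prefix rather than on sampled tokens. The sketch below assumes the model's probabilities for the correct target tokens are already given:

```python
import math

def seq2seq_nll(step_probs):
    """Negative log-likelihood of the target under teacher forcing:
    step_probs[t] is P(y_t | y_<t, encoder(x)) for the ground-truth y_t."""
    return -sum(math.log(p) for p in step_probs)

# Three decoding steps; the model assigns these probabilities to the
# correct target tokens:
loss = seq2seq_nll([0.5, 0.8, 0.9])  # ≈ 0.9163
```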

Label Masking for Decoder-Only Loss

In encoder-decoder SFT, the loss is computed only over the decoder's output tokens; the encoder's input tokens do not contribute to it. This is implemented when building the labels tensor: padding tokens (and any special prefix tokens) are replaced with -100 so the cross-entropy loss ignores those positions. The model is therefore optimized solely for generating correct responses, not for reconstructing the input.
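A minimal sketch of this convention (the -100 sentinel matches the `ignore_index` default used by PyTorch cross-entropy and the Hugging Face trainers; the pad token id of 0 is an assumption for illustration):

```python
IGNORE_INDEX = -100
PAD_TOKEN_ID = 0  # assumed pad id, for illustration only

def mask_labels(target_ids):
    """Replace padding positions with -100 so cross-entropy skips them."""
    return [IGNORE_INDEX if t == PAD_TOKEN_ID else t for t in target_ids]

def masked_mean_nll(per_token_nll, labels):
    """Average the loss only over unmasked (real response) tokens."""
    kept = [v for v, lab in zip(per_token_nll, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)

labels = mask_labels([42, 7, 0, 0])                  # two real tokens, two pads
avg = masked_mean_nll([1.0, 3.0, 9.0, 9.0], labels)  # pads ignored -> 2.0
```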

Tokenization with Separate Input/Target Sequences

Unlike causal LM fine-tuning where the prompt and response are concatenated into a single token sequence, seq2seq training requires separate tokenization of the input and target:

  • The input sequence (instruction/conversation) is tokenized and passed to the encoder via input_ids.
  • The target sequence (desired response) is tokenized separately and provided as labels for the decoder.
  • Truncation and padding are handled independently for each side, often with different maximum lengths to reflect the asymmetry between prompt and response lengths.
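The independent handling of the two sides can be sketched with a toy whitespace tokenizer; the vocabulary and the length limits below are illustrative stand-ins (a real setup would call the model's tokenizer twice, with a separate `max_length` per side):

```python
PAD_ID = 0

def toy_tokenize(text, vocab, max_len):
    """Whitespace-split, map words to ids, truncate, right-pad to max_len."""
    ids = [vocab.setdefault(w, len(vocab) + 1) for w in text.split()]
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))

vocab = {}
# Input and target are tokenized independently, with asymmetric lengths:
input_ids = toy_tokenize("summarize: the quick brown fox", vocab, max_len=8)
labels = toy_tokenize("a fox runs", vocab, max_len=4)
```

Because the sequences never share a buffer, truncating a long instruction cannot clip the response, and vice versa, which is exactly the asymmetry the bullets above describe.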

Theoretical Basis

Sequence-to-sequence models were formalized by Sutskever et al. (2014), who demonstrated that an encoder LSTM could compress a variable-length input into a fixed-dimensional vector, which a decoder LSTM then used to generate a variable-length output. The Transformer architecture (Vaswani et al., 2017) replaced recurrence with self-attention, enabling parallelized training and better long-range dependency modeling. Models like T5 and Flan-T5 extend this paradigm by pre-training the encoder-decoder jointly on a mixture of unsupervised and supervised tasks. SFT adapts these pre-trained seq2seq models by fine-tuning on instruction-response pairs: the encoder receives the instruction and the decoder generates the response. The encoder's bidirectional attention provides a richer understanding of the instruction than causal models can achieve, making seq2seq architectures particularly effective for tasks requiring deep input comprehension before generation.
