
Principle: lm-sys/FastChat Seq2Seq SFT Training

From Leeroopedia


Field Value
Page Type Principle
Title Seq2Seq SFT Training
Repository lm-sys/FastChat
Workflow Finetuning
Domains Training, NLP
Knowledge Sources fastchat/train/train_flant5.py, Hugging Face Seq2SeqTrainer documentation
Last Updated 2026-02-07 14:00 GMT

Overview

This principle covers the theory and methodology of supervised fine-tuning (SFT) applied to sequence-to-sequence (encoder-decoder) models such as Flan-T5. Unlike causal language models that generate text left-to-right from a single input stream, encoder-decoder architectures use separate components for understanding input and producing output. This structural difference requires distinct training strategies for tokenization, loss computation, and data formatting.

Description

Encoder-Decoder Architecture Differences

Sequence-to-sequence models differ fundamentally from causal (decoder-only) language models in their architecture. An encoder processes the full input sequence bidirectionally, producing a set of contextualized hidden states. A decoder then attends to these encoder outputs via cross-attention while autoregressively generating the target sequence. This separation means that:

  • The input (instruction/prompt) and output (response) are tokenized independently and passed to different components of the model.
  • The encoder can leverage bidirectional attention over the input, capturing richer contextual representations than a causal mask permits.
  • The decoder generates tokens conditioned on both the encoder's output and previously generated decoder tokens.
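The division of labor above can be sketched with a deliberately tiny model: the "encoder" mixes every input position with the whole sequence (standing in for bidirectional attention), and each "decoder" step combines the previous target token with a cross-attention read of the encoder output. All functions and the scoring scheme are simplified stand-ins for illustration, not the real Transformer computation.

```python
import math

def encode(input_states):
    """Toy bidirectional encoder: every output position sees the whole
    input (each state is mixed with the mean of all states)."""
    mean = sum(input_states) / len(input_states)
    return [0.5 * h + 0.5 * mean for h in input_states]

def cross_attend(query, encoder_states):
    """Toy cross-attention: softmax over query-state products, then a
    weighted sum of the encoder states."""
    scores = [query * h for h in encoder_states]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum((e / total) * h for e, h in zip(exps, encoder_states))

def decode_step(prev_token_state, encoder_states):
    """One decoder step: condition on the previous target token and on
    the encoder output via cross-attention."""
    return prev_token_state + cross_attend(prev_token_state, encoder_states)

# Encoder and decoder operate on separate sequences:
enc_out = encode([1.0, 2.0, 3.0])   # input (instruction) side
step = decode_step(0.5, enc_out)    # output (response) side
```

Note how `decode_step` never sees the raw input, only the encoder's contextualized states; this is the structural separation that lets the two sides be tokenized and length-limited independently.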

Conditional Generation with Teacher Forcing

During SFT, the decoder is trained using teacher forcing: at each decoding step, the ground-truth token from the training data is fed as input (rather than the model's own prediction). This stabilizes training by preventing error accumulation across time steps. The training objective is to maximize the conditional log-likelihood of the target sequence given the input:

L = -sum_{t=1}^{T} log P(y_t | y_{<t}, encoder(x))

where x is the input sequence, y is the target sequence, and T is the target length.
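Numerically, teacher forcing makes this loss a plain sum of per-step negative log-probabilities, since each step conditions on the ground-truth prefix rather than on sampled tokens. The sketch below assumes the model's probabilities for the correct target tokens are already given:

```python
import math

def seq2seq_nll(step_probs):
    """Negative log-likelihood of the target under teacher forcing:
    step_probs[t] is P(y_t | y_<t, encoder(x)) for the ground-truth y_t."""
    return -sum(math.log(p) for p in step_probs)

# Three decoding steps; the model assigns these probabilities to the
# correct target tokens:
loss = seq2seq_nll([0.5, 0.8, 0.9])  # ≈ 0.9163
```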

Label Masking for Decoder-Only Loss

In encoder-decoder SFT, the loss is computed only over the decoder's output tokens; the encoder's input tokens do not contribute to it. This is implemented when building the labels tensor: padding tokens (and any special prefix tokens) are replaced with -100 so the cross-entropy loss ignores those positions. The model is therefore optimized solely for generating correct responses, not for reconstructing the input.
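A minimal sketch of this convention (the -100 sentinel matches the `ignore_index` default used by PyTorch cross-entropy and the Hugging Face trainers; the pad token id of 0 is an assumption for illustration):

```python
IGNORE_INDEX = -100
PAD_TOKEN_ID = 0  # assumed pad id, for illustration only

def mask_labels(target_ids):
    """Replace padding positions with -100 so cross-entropy skips them."""
    return [IGNORE_INDEX if t == PAD_TOKEN_ID else t for t in target_ids]

def masked_mean_nll(per_token_nll, labels):
    """Average the loss only over unmasked (real response) tokens."""
    kept = [v for v, lab in zip(per_token_nll, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)

labels = mask_labels([42, 7, 0, 0])                  # two real tokens, two pads
avg = masked_mean_nll([1.0, 3.0, 9.0, 9.0], labels)  # pads ignored -> 2.0
```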

Tokenization with Separate Input/Target Sequences

Unlike causal LM fine-tuning where the prompt and response are concatenated into a single token sequence, seq2seq training requires separate tokenization of the input and target:

  • The input sequence (instruction/conversation) is tokenized and passed to the encoder via input_ids.
  • The target sequence (desired response) is tokenized separately and provided as labels for the decoder.
  • Truncation and padding are handled independently for each side, often with different maximum lengths to reflect the asymmetry between prompt and response lengths.
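The independent handling of the two sides can be sketched with a toy whitespace tokenizer; the vocabulary and the length limits below are illustrative stand-ins (a real setup would call the model's tokenizer twice, with a separate `max_length` per side):

```python
PAD_ID = 0

def toy_tokenize(text, vocab, max_len):
    """Whitespace-split, map words to ids, truncate, right-pad to max_len."""
    ids = [vocab.setdefault(w, len(vocab) + 1) for w in text.split()]
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))

vocab = {}
# Input and target are tokenized independently, with asymmetric lengths:
input_ids = toy_tokenize("summarize: the quick brown fox", vocab, max_len=8)
labels = toy_tokenize("a fox runs", vocab, max_len=4)
```

Because the sequences never share a buffer, truncating a long instruction cannot clip the response, and vice versa, which is exactly the asymmetry the bullets above describe.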

Theoretical Basis

Sequence-to-sequence models were formalized by Sutskever et al. (2014), who demonstrated that an encoder LSTM could compress a variable-length input into a fixed-dimensional vector, which a decoder LSTM then used to generate a variable-length output. The Transformer architecture (Vaswani et al., 2017) replaced recurrence with self-attention, enabling parallelized training and better long-range dependency modeling. Models like T5 and Flan-T5 extend this paradigm by pre-training the encoder-decoder jointly on a mixture of unsupervised and supervised tasks. SFT adapts these pre-trained seq2seq models by fine-tuning on instruction-response pairs: the encoder receives the instruction and the decoder generates the response. The encoder's bidirectional attention provides a richer understanding of the instruction than causal models can achieve, making seq2seq architectures particularly effective for tasks requiring deep input comprehension before generation.
