Principle:LaurentMazare Tch rs Seq2Seq Dataset Loading
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Data Engineering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Sequence-to-sequence dataset loading prepares paired input-output text sequences for training by performing tokenization, vocabulary construction, length filtering, and optional input reversal.
Description
Sequence-to-sequence (Seq2Seq) models require specially prepared training data where each example consists of a pair of sequences: a source sequence (input) and a target sequence (output). The dataset loading pipeline transforms raw text data through several stages:
- Tokenization: Raw text is split into discrete tokens. For character-level models, each character becomes a token. For word-level models, text is split on whitespace and punctuation. Tokenization must be consistent between training and inference and may involve Unicode normalization, lowercasing, or other text cleaning operations.
- Vocabulary building: A mapping from tokens to integer indices is constructed from the training corpus. The vocabulary records the frequency of each token, enabling vocabulary pruning where infrequent tokens are replaced with a special unknown token. The vocabulary must include special tokens:
  - SOS (Start of Sequence): Signals the beginning of the target sequence during decoding.
  - EOS (End of Sequence): Signals the end of a sequence, telling the decoder when to stop generating.
- Length filtering: Sequences exceeding a maximum length are discarded or truncated. This serves both practical purposes (memory constraints, computational efficiency) and modeling purposes (extremely long sequences are rare and can destabilize training). Filtering is typically applied based on the length of both source and target sequences.
- Input reversal: Optionally, the source sequence is reversed before feeding it to the encoder. This technique, introduced in early Seq2Seq work, places the beginning of the source sequence closer to the beginning of the target sequence in terms of the number of recurrent steps, improving gradient flow and alignment for recurrent encoders.
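As a rough illustration of the tokenization stage, the sketch below shows word-level and character-level splitting in plain Rust using only the standard library. The function names `normalize`, `word_tokens`, and `char_tokens` are hypothetical and not taken from tch-rs, and the cleaning rule (lowercase, keep `.`, `!`, `?` as separate tokens, drop other punctuation) is just one assumed choice.

```rust
/// Lowercase the text and surround sentence punctuation with spaces so that
/// a whitespace split yields punctuation as separate tokens.
/// (Hypothetical helper; the cleaning rules are an assumption.)
fn normalize(text: &str) -> String {
    let lowered = text.trim().to_lowercase();
    let mut out = String::with_capacity(lowered.len());
    for c in lowered.chars() {
        if c == '.' || c == '!' || c == '?' {
            // Sentence punctuation becomes its own token.
            out.push(' ');
            out.push(c);
            out.push(' ');
        } else if c.is_alphanumeric() || c.is_whitespace() {
            out.push(c);
        }
        // Any other character is dropped.
    }
    out
}

/// Word-level tokenization: normalize, then split on whitespace.
fn word_tokens(text: &str) -> Vec<String> {
    normalize(text)
        .split_whitespace()
        .map(str::to_string)
        .collect()
}

/// Character-level tokenization: every character is a token.
fn char_tokens(text: &str) -> Vec<char> {
    text.chars().collect()
}
```

Whatever rule is chosen, the same `normalize` function should be applied at training and inference time so that both see identical tokens.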
Usage
Seq2Seq dataset loading is applied in machine translation, text summarization, dialogue generation, question answering, and any task that transforms one sequence into another. The quality of data preparation directly impacts model performance, since the model only ever sees the tokenized, filtered, and indexed form of the data.
Theoretical Basis
Dataset Structure:
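A Seq2Seq dataset consists of $N$ paired sequences:
$$\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$$
where $x^{(i)} = (x^{(i)}_1, \dots, x^{(i)}_{n_i})$ is the source sequence and $y^{(i)} = (y^{(i)}_1, \dots, y^{(i)}_{m_i})$ is the target sequence.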
Tokenization and Indexing:
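A vocabulary function maps each token to a unique integer:
$$\phi : \mathcal{V} \to \{0, 1, \dots, |\mathcal{V}| - 1\}$$
where $\mathcal{V}$ is the set of tokens, including the special tokens. The indexed target sequence for $y = (y_1, \dots, y_m)$ is augmented with special tokens:
$$\tilde{y} = (\phi(\text{SOS}), \phi(y_1), \dots, \phi(y_m), \phi(\text{EOS}))$$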
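The vocabulary mapping and the SOS/EOS augmentation can be sketched as follows. This is a minimal standard-library illustration, not the tch-rs implementation: the `Vocab` type, the reserved indices 0, 1, 2, the `<sos>`/`<eos>`/`<unk>` spellings, and the `min_count` pruning threshold are all assumptions made for the example.

```rust
use std::collections::HashMap;

// Assumed reserved indices; the concrete values are an illustrative choice.
const SOS: usize = 0;
const EOS: usize = 1;
const UNK: usize = 2;

/// Token-to-index mapping built from a tokenized corpus (hypothetical type).
struct Vocab {
    index_of: HashMap<String, usize>,
    tokens: Vec<String>,
}

impl Vocab {
    /// Count token frequencies and keep tokens seen at least `min_count` times.
    fn build(corpus: &[Vec<&str>], min_count: usize) -> Vocab {
        let mut counts: HashMap<&str, usize> = HashMap::new();
        // First-seen order keeps the index assignment deterministic.
        let mut order: Vec<&str> = Vec::new();
        for sentence in corpus {
            for &tok in sentence {
                let c = counts.entry(tok).or_insert(0);
                if *c == 0 {
                    order.push(tok);
                }
                *c += 1;
            }
        }
        let mut tokens: Vec<String> =
            vec!["<sos>".to_string(), "<eos>".to_string(), "<unk>".to_string()];
        for tok in order {
            if counts[tok] >= min_count {
                tokens.push(tok.to_string());
            }
        }
        let index_of = tokens
            .iter()
            .enumerate()
            .map(|(i, t)| (t.clone(), i))
            .collect();
        Vocab { index_of, tokens }
    }

    /// Vocabulary size, including the three special tokens.
    fn len(&self) -> usize {
        self.tokens.len()
    }

    /// Index a target sentence as SOS, token ids, EOS; pruned or unseen
    /// tokens fall back to UNK.
    fn encode_target(&self, sentence: &[&str]) -> Vec<usize> {
        let mut ids = vec![SOS];
        ids.extend(sentence.iter().map(|t| *self.index_of.get(*t).unwrap_or(&UNK)));
        ids.push(EOS);
        ids
    }
}
```

With `min_count = 2`, a token that appears only once in the corpus is left out of the vocabulary and is encoded as `UNK`.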
Length Filtering:
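Given maximum lengths $L_x$ and $L_y$ for the source and target, the filtered dataset keeps only the pairs $(x, y)$ of the original dataset $\mathcal{D}$ that satisfy both limits:
$$\mathcal{D}' = \{(x, y) \in \mathcal{D} : |x| \le L_x \ \text{and} \ |y| \le L_y\}$$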
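A length filter of this kind might look like the sketch below. The function name `filter_by_length` and the inclusive comparison (a sequence of exactly the maximum length is kept) are assumptions for illustration; an implementation may equally use a strict bound or truncate instead of discarding.

```rust
/// Keep only the pairs whose source and target both fit within the limits.
/// (Hypothetical helper; pairs that exceed either limit are discarded.)
fn filter_by_length<T: Clone>(
    pairs: &[(Vec<T>, Vec<T>)],
    max_src: usize,
    max_tgt: usize,
) -> Vec<(Vec<T>, Vec<T>)> {
    pairs
        .iter()
        .filter(|pair| pair.0.len() <= max_src && pair.1.len() <= max_tgt)
        .cloned()
        .collect()
}
```

Checking both sides matters: a short source with an overly long target is dropped just like the reverse case.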
Input Reversal:
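For a source sequence $x = (x_1, x_2, \dots, x_n)$, the reversed source sequence is:
$$x^{\text{rev}} = (x_n, x_{n-1}, \dots, x_1)$$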
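Motivation: For an RNN encoder-decoder reading a source of length $n$, the distance between source token $x_1$ (which typically aligns with target token $y_1$) and the decoder's first step is $n + 1$ without reversal but only 2 with reversal, counting both endpoint steps. This reduces the effective path length for gradient propagation between the first source and target tokens:
$$n + 1 \;\longrightarrow\; 2, \quad \text{i.e.} \quad O(n) \to O(1)$$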
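The reversal and its effect on the distance can be checked with a small sketch. Both function names are hypothetical, and the step count assumes one particular convention: it includes both the encoder step that reads the token and the decoder's first step.

```rust
/// Return the source sequence in reverse order; the target is left untouched.
fn reverse_source<T: Clone>(src: &[T]) -> Vec<T> {
    src.iter().rev().cloned().collect()
}

/// Number of recurrent steps from the encoder step that reads the token at
/// 0-based position `pos` up to the decoder's first step, counting both
/// endpoints, for an encoder that reads `len` tokens in total.
fn steps_to_first_decoder_step(pos: usize, len: usize) -> usize {
    assert!(pos < len, "position must lie inside the sequence");
    len - pos + 1
}
```

For a source of length 5, the first token sits at position 0 without reversal (6 steps) and at position 4 after reversal (2 steps). The last source token moves the other way, so the benefit is concentrated on the early tokens.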
Batching:
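Sequences within a batch $B$ are padded to the maximum length in the batch. With $L_{\max} = \max_{j \in B} n_j$, each sequence $x^{(i)}$ of length $n_i$ becomes:
$$\bar{x}^{(i)} = (x^{(i)}_1, \dots, x^{(i)}_{n_i}, \underbrace{\text{PAD}, \dots, \text{PAD}}_{L_{\max} - n_i})$$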
A padding mask indicates which positions are real tokens versus padding, preventing the model from attending to or computing loss on padded positions.
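Padding and mask construction can be sketched in plain Rust as below. The function name `pad_batch` and the choice of a caller-supplied `pad_id` are assumptions for illustration; in a tensor library the two results would typically be converted to tensors afterwards.

```rust
/// Pad every sequence to the longest length in the batch.
/// Returns (padded sequences, mask), where the mask is `true` on real
/// tokens and `false` on padding. (Hypothetical helper.)
fn pad_batch(batch: &[Vec<usize>], pad_id: usize) -> (Vec<Vec<usize>>, Vec<Vec<bool>>) {
    // An empty batch has maximum length 0.
    let max_len = batch.iter().map(|s| s.len()).max().unwrap_or(0);
    let mut padded = Vec::with_capacity(batch.len());
    let mut mask = Vec::with_capacity(batch.len());
    for seq in batch {
        // Copy the real tokens, then fill the tail with the padding id.
        let mut row = seq.clone();
        row.resize(max_len, pad_id);
        // Mark real positions, then mark the padded tail as false.
        let mut m = vec![true; seq.len()];
        m.resize(max_len, false);
        padded.push(row);
        mask.push(m);
    }
    (padded, mask)
}
```

For example, `pad_batch(&[vec![5, 6, 7], vec![8]], 0)` pads the second sequence to `[8, 0, 0]` with mask `[true, false, false]`. The mask, not the padding id, should decide which positions contribute to the loss, since the padding id may coincide with a real token index.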