Principle:LaurentMazare Tch rs Seq2Seq Dataset Loading

Knowledge Sources	LaurentMazare_Tch_rs, Sequence to Sequence Learning with Neural Networks
Domains	Natural Language Processing, Data Engineering
Last Updated	2026-02-08 00:00 GMT

Overview

Sequence-to-sequence dataset loading prepares paired input-output text sequences for training by performing tokenization, vocabulary construction, length filtering, and optional input reversal.

Description

Sequence-to-sequence (Seq2Seq) models require specially prepared training data where each example consists of a pair of sequences: a source sequence (input) and a target sequence (output). The dataset loading pipeline transforms raw text data through several stages:

Tokenization: Raw text is split into discrete tokens. For character-level models, each character becomes a token. For word-level models, text is split on whitespace and punctuation. Tokenization must be consistent between training and inference and may involve Unicode normalization, lowercasing, or other text cleaning operations.
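The two tokenization granularities can be sketched as follows. This is a minimal illustration, not the tokenizer used by any particular implementation: the function names `normalize`, `tokenize_chars`, and `tokenize_words` are hypothetical, and the only cleaning steps assumed are trimming and lowercasing.

```rust
/// Normalize text before tokenization.
/// Assumption: trimming and lowercasing are the only cleaning steps here.
fn normalize(text: &str) -> String {
    text.trim().to_lowercase()
}

/// Character-level tokenization: every character becomes one token.
fn tokenize_chars(text: &str) -> Vec<String> {
    normalize(text).chars().map(|c| c.to_string()).collect()
}

/// Word-level tokenization: split on whitespace and emit each ASCII
/// punctuation character as its own token.
fn tokenize_words(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    for c in normalize(text).chars() {
        if c.is_whitespace() {
            // Whitespace ends the current word and is not kept.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
        } else if c.is_ascii_punctuation() {
            // Punctuation ends the current word and becomes a token itself.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            tokens.push(c.to_string());
        } else {
            current.push(c);
        }
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}
```

Because both functions route through the same `normalize` step, applying them at training and inference time yields the same token stream for the same text. Note that this simple rule also splits apostrophes, so a contraction such as "don't" becomes three tokens.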

Vocabulary building: A mapping from tokens to integer indices is constructed from the training corpus. The vocabulary records the frequency of each token, enabling vocabulary pruning where infrequent tokens are replaced with a special unknown token. The vocabulary must include special tokens:
- SOS (Start of Sequence): Signals the beginning of the target sequence during decoding.
- EOS (End of Sequence): Signals the end of a sequence, telling the decoder when to stop generating.
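A vocabulary with frequency counts, special tokens, and pruning might look like the sketch below. All names (`Vocab`, `build`, `index_of`, the `<sos>`/`<eos>`/`<unk>` strings) are hypothetical choices for illustration, and indices are zero-based here, as is common for embedding lookups.

```rust
use std::collections::HashMap;

// Hypothetical string forms of the special tokens.
const SOS_TOKEN: &str = "<sos>";
const EOS_TOKEN: &str = "<eos>";
const UNK_TOKEN: &str = "<unk>";

/// Token-to-index mapping plus per-token frequency counts.
struct Vocab {
    token_to_index: HashMap<String, usize>,
    index_to_token: Vec<String>,
    counts: HashMap<String, usize>,
}

impl Vocab {
    /// Create a vocabulary that already contains the special tokens.
    fn new() -> Self {
        let mut vocab = Vocab {
            token_to_index: HashMap::new(),
            index_to_token: Vec::new(),
            counts: HashMap::new(),
        };
        for special in &[SOS_TOKEN, EOS_TOKEN, UNK_TOKEN] {
            vocab.insert(special);
        }
        vocab
    }

    /// Add a token if it is new and return its index.
    fn insert(&mut self, token: &str) -> usize {
        if let Some(&index) = self.token_to_index.get(token) {
            return index;
        }
        let index = self.index_to_token.len();
        self.token_to_index.insert(token.to_string(), index);
        self.index_to_token.push(token.to_string());
        index
    }

    /// Build from a tokenized corpus, keeping only tokens that occur
    /// at least `min_count` times (vocabulary pruning).
    fn build(corpus: &[Vec<String>], min_count: usize) -> Self {
        let mut vocab = Vocab::new();
        for sentence in corpus {
            for token in sentence {
                *vocab.counts.entry(token.clone()).or_insert(0) += 1;
            }
        }
        // Second pass in corpus order, so index assignment is deterministic.
        for sentence in corpus {
            for token in sentence {
                if vocab.counts[token] >= min_count {
                    vocab.insert(token);
                }
            }
        }
        vocab
    }

    /// Look up a token, falling back to the unknown token's index.
    fn index_of(&self, token: &str) -> usize {
        self.token_to_index
            .get(token)
            .copied()
            .unwrap_or(self.token_to_index[UNK_TOKEN])
    }

    fn len(&self) -> usize {
        self.index_to_token.len()
    }
}
```

Inserting the special tokens first gives them fixed, known indices regardless of the corpus, which makes it easy for decoding code to refer to SOS and EOS directly.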

Length filtering: Sequences exceeding a maximum length are discarded or truncated. This serves both practical purposes (memory constraints, computational efficiency) and modeling purposes (extremely long sequences are rare and can destabilize training). Filtering is typically applied based on the length of both source and target sequences.
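The two policies mentioned above, discarding and truncating, can be sketched as follows. The `Pair` alias and both function names are hypothetical; lengths are measured in tokens, before any special tokens are added.

```rust
/// A tokenized (source, target) pair.
type Pair = (Vec<String>, Vec<String>);

/// Discard policy: keep only pairs whose source and target both fit.
fn filter_by_length(pairs: Vec<Pair>, max_src: usize, max_tgt: usize) -> Vec<Pair> {
    pairs
        .into_iter()
        .filter(|(src, tgt)| src.len() <= max_src && tgt.len() <= max_tgt)
        .collect()
}

/// Truncate policy: keep every pair but cut over-long sequences.
fn truncate_to_length(pairs: Vec<Pair>, max_src: usize, max_tgt: usize) -> Vec<Pair> {
    pairs
        .into_iter()
        .map(|(mut src, mut tgt)| {
            src.truncate(max_src);
            tgt.truncate(max_tgt);
            (src, tgt)
        })
        .collect()
}
```

Discarding keeps every remaining pair intact, whereas truncating preserves the dataset size but can leave a source and target that no longer correspond, so discarding is usually the safer default for translation-style data.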

Input reversal: Optionally, the source sequence is reversed before feeding it to the encoder. This technique, introduced by Sutskever et al. (2014) in "Sequence to Sequence Learning with Neural Networks", places the beginning of the source sequence closer to the beginning of the target sequence in terms of the number of recurrent steps, improving gradient flow and alignment for recurrent encoders.

Usage

Seq2Seq dataset loading is applied in machine translation, text summarization, dialogue generation, question answering, and any task that transforms one sequence into another. The quality of data preparation directly impacts model performance.

Theoretical Basis

Dataset Structure:

A Seq2Seq dataset $D$ consists of paired sequences:

$D = \{(x^{(i)}, y^{(i)})\}_{i = 1}^{N}$

where $x^{(i)} = (x_{1}^{(i)}, x_{2}^{(i)}, \dots, x_{T_{x}}^{(i)})$ is the source sequence and $y^{(i)} = (y_{1}^{(i)}, y_{2}^{(i)}, \dots, y_{T_{y}}^{(i)})$ is the target sequence.

Tokenization and Indexing:

A vocabulary function $V : \text{token} \to \mathbb{Z}^{+}$ maps each token to a unique integer:

$\bar{x}_{t}^{(i)} = V(x_{t}^{(i)})$

The indexed target sequence is augmented with special tokens:

$\bar{y}^{(i)} = (\text{SOS}, V(y_{1}^{(i)}), V(y_{2}^{(i)}), \dots, V(y_{T_{y}}^{(i)}), \text{EOS})$
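The indexing step, including the SOS and EOS augmentation of the target, can be sketched as below. The index values for `SOS`, `EOS`, and `UNK` and the function names are assumptions made for this example; the fallback to `UNK` for out-of-vocabulary tokens is likewise an illustrative choice.

```rust
use std::collections::HashMap;

// Hypothetical index assignments for the special tokens.
const SOS: usize = 0;
const EOS: usize = 1;
const UNK: usize = 2;

/// Map source tokens to indices; no special tokens are added.
fn encode_source(tokens: &[&str], vocab: &HashMap<String, usize>) -> Vec<usize> {
    tokens
        .iter()
        .map(|t| *vocab.get(*t).unwrap_or(&UNK))
        .collect()
}

/// Map target tokens to indices and wrap them as (SOS, ..., EOS).
fn encode_target(tokens: &[&str], vocab: &HashMap<String, usize>) -> Vec<usize> {
    let mut indices = Vec::with_capacity(tokens.len() + 2);
    indices.push(SOS);
    indices.extend(tokens.iter().map(|t| *vocab.get(*t).unwrap_or(&UNK)));
    indices.push(EOS);
    indices
}
```

An encoded target is always two elements longer than its token list, which matters when a maximum target length is enforced after encoding rather than before.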

Length Filtering:

Given maximum lengths $L_{\max}^{\text{src}}$ and $L_{\max}^{\text{tgt}}$:

$D_{\text{filtered}} = \{(x, y) \in D : |x| \leq L_{\max}^{\text{src}} \text{ and } |y| \leq L_{\max}^{\text{tgt}}\}$

Input Reversal:

The reversed source sequence is:

$\bar{x}_{\text{rev}}^{(i)} = (\bar{x}_{T_{x}}^{(i)}, \bar{x}_{T_{x} - 1}^{(i)}, \dots, \bar{x}_{1}^{(i)})$

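Motivation: For an RNN encoder-decoder, source token $x_{1}$ typically aligns with target token $y_{1}$. Counting both endpoints, $x_{1}$ is $T_{x} + 1$ recurrent steps away from the decoder's first step without reversal but only 2 steps away with reversal. The total number of recurrent steps is the same in both cases, so reversal does not shorten every path: under a roughly monotone alignment it shortens the gradient paths between the first source tokens and the first target tokens, while lengthening those for the last source tokens:

$\text{Total path length without reversal: } T_{x} + T_{y}$

$\text{Total path length with reversal: } \max(T_{x}, T_{y}) + \min(T_{x}, T_{y}) = T_{x} + T_{y} \text{ (unchanged, but the first tokens are close)}$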
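The step counts in the motivation can be checked with a small calculation. The sketch below assumes a simple timing model that is not spelled out in the source: encoder steps run at times $1, \dots, T_{x}$, decoder step $j$ runs at time $T_{x} + j$, and distances count both endpoints. The function names are hypothetical.

```rust
/// Time (1-based) at which source token `i` (1-based) enters the encoder.
/// Precondition: 1 <= i <= src_len.
fn encoder_step(i: usize, src_len: usize, reversed: bool) -> usize {
    if reversed {
        src_len - i + 1
    } else {
        i
    }
}

/// Recurrent steps between reading source token `i` and decoder step `j`,
/// counting both endpoints. Decoder step `j` runs at time `src_len + j`.
fn distance(i: usize, j: usize, src_len: usize, reversed: bool) -> usize {
    let decoder_time = src_len + j;
    decoder_time - encoder_step(i, src_len, reversed) + 1
}

/// Mean distance over aligned pairs (source token k, decoder step k),
/// assuming equal lengths and a one-to-one monotone alignment.
fn mean_aligned_distance(len: usize, reversed: bool) -> f64 {
    let total: usize = (1..=len).map(|k| distance(k, k, len, reversed)).sum();
    total as f64 / len as f64
}
```

For a source of length 5 under this model, the first token is 6 steps from the decoder's first step without reversal and 2 steps with it, while the last aligned pair moves from 6 steps to 10. The mean over aligned pairs stays the same, which is why the benefit comes from the short early paths rather than from a shorter average.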

Batching:

Sequences within a batch are padded to the maximum length in the batch:

$\bar{x}_{\text{padded}}^{(i)} = (\bar{x}_{1}^{(i)}, \dots, \bar{x}_{T_{x}}^{(i)}, \text{PAD}, \dots, \text{PAD})$

A padding mask indicates which positions are real tokens versus padding, preventing the model from attending to or computing loss on padded positions.
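Padding and mask construction for one batch can be sketched as follows. The function name `pad_batch` is hypothetical, and the padding index is passed in as a parameter because the source does not fix a value for PAD; it should be an index that no real token uses.

```rust
/// Pad every sequence in the batch to the longest length in that batch.
/// Returns the padded sequences and a mask that is `true` at real tokens
/// and `false` at padding positions.
fn pad_batch(batch: &[Vec<usize>], pad: usize) -> (Vec<Vec<usize>>, Vec<Vec<bool>>) {
    let max_len = batch.iter().map(|seq| seq.len()).max().unwrap_or(0);
    let mut padded = Vec::with_capacity(batch.len());
    let mut mask = Vec::with_capacity(batch.len());
    for seq in batch {
        // Copy the real tokens, then extend with the padding index.
        let mut row = seq.clone();
        row.resize(max_len, pad);
        // Mark real positions as true, padded positions as false.
        let mut row_mask = vec![true; seq.len()];
        row_mask.resize(max_len, false);
        padded.push(row);
        mask.push(row_mask);
    }
    (padded, mask)
}
```

Padding to the longest sequence in each batch, rather than to a global maximum, keeps wasted computation low, especially when batches group sequences of similar length.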

Related Pages

Implementation:LaurentMazare_Tch_rs_Translation_Dataset
