Principle:Tensorflow Tfjs Sequence Preprocessing
Summary
Sequence preprocessing prepares tokenized text sequences for transformer model input by padding, truncating, and adding special tokens. This is a library-agnostic concept: sequence preprocessing transforms variable-length token sequences into fixed-length tensors with appropriate start/end markers and attention masks, enabling efficient batched computation in transformer architectures.
Theory
Transformer models require fixed-length input sequences for efficient batched processing. Sequence preprocessing bridges the gap between variable-length natural language text and the rigid tensor shapes required by neural network computation.
The preprocessing pipeline involves the following steps:
- Tokenize raw text into token IDs using a tokenizer (e.g., BPE).
- Prepend a start token (e.g., <|endoftext|> for GPT-2) to mark the beginning of the sequence.
- Append an end token to signal the end of meaningful content.
- Pad short sequences with pad tokens to reach the target sequence length.
- Truncate long sequences that exceed the maximum sequence length.
- Generate padding mask (1 for real tokens, 0 for padding) so that the attention mechanism ignores pad positions.
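The steps above (excluding tokenization itself) can be sketched library-agnostically. The special-token IDs below are illustrative placeholders, not values from any particular vocabulary, though GPT-2 does reuse <|endoftext|> (ID 50256) as both its start and end marker:

```typescript
// Hypothetical special-token IDs; real values come from the tokenizer's vocabulary.
const START_ID = 50256; // GPT-2 reuses <|endoftext|> for start/end
const END_ID = 50256;
const PAD_ID = 0;

interface Preprocessed {
  tokenIds: number[];    // fixed length L
  paddingMask: number[]; // 1 = real token, 0 = padding
}

function preprocess(tokens: number[], seqLen: number): Preprocessed {
  // Reserve two slots for the start and end tokens; truncate overflow.
  const content = tokens.slice(0, seqLen - 2);
  const withSpecial = [START_ID, ...content, END_ID];

  // Pad up to the target length and build the matching mask.
  const padCount = seqLen - withSpecial.length;
  const tokenIds = [...withSpecial, ...Array(padCount).fill(PAD_ID)];
  const paddingMask = [
    ...Array(withSpecial.length).fill(1),
    ...Array(padCount).fill(0),
  ];
  return { tokenIds, paddingMask };
}
```

Both the padded and the truncated paths produce exactly `seqLen` token IDs, which is what allows sequences to be stacked into a single batch tensor.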
Padding Mask
The padding mask is a binary tensor of the same shape as the token ID tensor:
| Position | Token | Mask Value |
|---|---|---|
| Real token | Any valid token ID | 1 |
| Padding | Pad token ID | 0 |
The attention mechanism applies the mask to the attention scores, most commonly by adding a large negative value at mask-0 positions before the softmax, so that padded positions receive near-zero attention weight and do not influence the model's computations.
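A toy example of the additive-mask formulation over a single row of attention scores (a sketch of the common technique, not any specific library's implementation):

```typescript
// Masked softmax over one query row of attention scores.
// Positions with mask === 0 get a large negative bias before the softmax,
// which drives their attention weight to ~0.
function maskedSoftmax(scores: number[], mask: number[]): number[] {
  const NEG_BIAS = -1e9;
  const biased = scores.map((s, i) => (mask[i] === 1 ? s : s + NEG_BIAS));
  // Subtract the max for numerical stability before exponentiating.
  const max = Math.max(...biased);
  const exps = biased.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}
```

With scores `[2.0, 1.0, 3.0, 0.5]` and mask `[1, 1, 1, 0]`, the last position's weight is effectively zero while the first three still sum to 1, so the padded position contributes nothing to the weighted sum of values.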
Sequence Layout
A preprocessed sequence of target length L has the following layout:
[START] [tok_1] [tok_2] ... [tok_n] [END] [PAD] [PAD] ... [PAD]
Where:
- [START] is the start-of-sequence token
- [tok_1] ... [tok_n] are the content tokens
- [END] is the end-of-sequence token
- [PAD] tokens fill the remaining positions up to length L
Key Properties
- Fixed-length output: All sequences in a batch have the same length, enabling efficient tensor operations.
- Masking: The padding mask ensures that padded positions do not influence model computation.
- Configurable: Start/end token insertion can be toggled on or off depending on the use case.
- Bidirectional: Preprocessing handles both encoding (text → tensors) and the inverse (tensors → text via detokenization).
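The inverse direction can be sketched as a postprocessing step that drops padded and special positions before handing the remaining IDs to the detokenizer. The special-token ID here is a hypothetical placeholder; a real implementation would take it from the tokenizer's vocabulary:

```typescript
// Recover content token IDs from a fixed-length preprocessed sequence.
// Uses the padding mask (rather than comparing against a pad ID) so that
// a legitimate token whose ID equals the pad ID is never dropped.
function postprocess(
  tokenIds: number[],
  paddingMask: number[],
  specialId: number, // hypothetical start/end marker ID
): number[] {
  return tokenIds
    .filter((_, i) => paddingMask[i] === 1) // keep real positions only
    .filter((id) => id !== specialId);      // strip start/end markers
}
```

The returned IDs can then be passed to the tokenizer's decode step to reconstruct the original text.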
Implementation
Implementation:Tensorflow_Tfjs_GPT2Preprocessor_Constructor