Principle:Lucidrains X transformers Masked Prediction Data Preparation
| Field | Value |
|---|---|
| Repo | x-transformers |
| Domains | Data_Engineering, Generative_Models |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Data preparation pattern for creating fixed-length token sequence datasets suitable for non-autoregressive masked prediction training.
Description
Non-autoregressive training requires datasets that yield complete (unmasked) token sequences of exactly `max_seq_len` tokens. Masking is applied internally by the `NonAutoregressiveWrapper`. Unlike autoregressive data preparation, which typically yields one extra token so that inputs and targets can be offset by one position, no extra token is needed here because there is no shifted-by-one target relationship.
The key requirements are:
- Sequences must be exactly `max_seq_len` tokens long (no padding, no variable length).
- Sequences must contain unmasked integer token IDs. The wrapper handles all masking internally using a schedule-based masking strategy.
- Token values must be in the range `[0, num_tokens - 1]`, and must not include the `mask_id` token (which is reserved for the masking mechanism).
- No reference training script exists in the repository; the interface is derived from `NonAutoregressiveWrapper.forward()` requirements (specifically the assertion `assert n == self.max_seq_len` at the start of the forward method).
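The requirements above can be checked before a batch reaches the wrapper. The following is a minimal sketch, not part of x-transformers: `check_batch` is a hypothetical helper, and the `torch.long` dtype check is an assumption based on PyTorch's cross-entropy expecting 64-bit integer class targets.

```python
import torch


def check_batch(tokens: torch.Tensor, max_seq_len: int, num_tokens: int, mask_id: int) -> None:
    """Hypothetical helper: raise ValueError if `tokens` breaks a requirement listed above."""
    # Requirement: every sequence is exactly max_seq_len tokens long.
    if tokens.ndim != 2 or tokens.shape[1] != max_seq_len:
        raise ValueError(f"expected shape (batch, {max_seq_len}), got {tuple(tokens.shape)}")
    # Assumption: token IDs are stored as torch.long.
    if tokens.dtype != torch.long:
        raise ValueError(f"expected dtype torch.long, got {tokens.dtype}")
    if tokens.numel() == 0:
        return
    # Requirement: values lie in [0, num_tokens - 1].
    if tokens.min() < 0 or tokens.max() >= num_tokens:
        raise ValueError(f"token IDs must lie in [0, {num_tokens - 1}]")
    # Requirement: the reserved mask token never appears in the data.
    if (tokens == mask_id).any():
        raise ValueError(f"data contains the reserved mask_id {mask_id}")
```

Calling such a check once on a sample batch during setup is usually enough; running it on every step would add avoidable overhead.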
Usage
Use this pattern when preparing data for NonAutoregressiveWrapper training. Specifically:
- Tokenize your corpus into integer token IDs.
- Create a `Dataset` that returns sequences of exactly `max_seq_len` tokens.
- Do not apply any masking to the data; the wrapper handles this.
- Ensure that the special `mask_id` token does not appear in your data.
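The steps above can be sketched as a small PyTorch `Dataset`. This is an illustrative sketch under stated assumptions rather than code from the repository: the class name `FixedLengthTokenDataset` is hypothetical, the corpus is assumed to be already tokenized into a 1-D tensor of IDs, and ID 255 is arbitrarily chosen as the reserved `mask_id`.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class FixedLengthTokenDataset(Dataset):
    """Hypothetical dataset: splits a 1-D token tensor into full-length, unmasked chunks."""

    def __init__(self, token_ids: torch.Tensor, max_seq_len: int, mask_id: int):
        if token_ids.ndim != 1:
            raise ValueError("token_ids must be a 1-D tensor of token IDs")
        if (token_ids == mask_id).any():
            raise ValueError("corpus contains the reserved mask_id")
        # Drop the trailing remainder so every chunk is exactly max_seq_len long (no padding).
        num_chunks = token_ids.numel() // max_seq_len
        self.chunks = token_ids[: num_chunks * max_seq_len].reshape(num_chunks, max_seq_len).long()

    def __len__(self) -> int:
        return self.chunks.shape[0]

    def __getitem__(self, idx: int) -> torch.Tensor:
        # Return the clean sequence; masking is left to the wrapper.
        return self.chunks[idx]


# Stand-in for a tokenized corpus: IDs 0..254, with 255 kept free as mask_id (assumption).
corpus = torch.randint(0, 255, (1000,))
dataset = FixedLengthTokenDataset(corpus, max_seq_len=64, mask_id=255)
loader = DataLoader(dataset, batch_size=4, shuffle=True)
batch = next(iter(loader))  # tensor of shape (4, 64), ready to pass to the wrapper
```

Chunking without overlap is only one option; random-offset sampling would also satisfy the requirements as long as every returned sequence has exactly `max_seq_len` tokens.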
Theoretical Basis
Non-autoregressive masked prediction models (such as MaskGIT and related approaches) learn to predict masked tokens given the surrounding context. During training, a random subset of tokens is replaced with a mask token, and the model predicts the original values. The masking schedule (e.g., linear, cosine) controls the fraction of tokens masked at each step.
Because masking is a stochastic process applied at training time, the dataset should provide clean, unmasked sequences. This separation of data from augmentation allows the wrapper to:
- Apply different masking ratios per sample within a batch.
- Implement self-conditioning by re-masking model predictions.
- Support both generator training and optional critic training.
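To make the per-sample masking idea concrete, here is a minimal sketch of training-time masking with a cosine schedule. It is written for illustration and is not the wrapper's actual implementation: `random_mask` and its arguments are hypothetical names, and the choice to mask at least one token per sequence is an assumption of this sketch.

```python
import math
import torch


def cosine_schedule(t: torch.Tensor) -> torch.Tensor:
    # Maps a time t in [0, 1] to a mask fraction in [0, 1].
    return torch.cos(t * math.pi / 2)


def random_mask(tokens: torch.Tensor, mask_id: int, schedule=cosine_schedule):
    """Hypothetical helper: mask a different random fraction of each sequence in the batch."""
    batch, seq_len = tokens.shape
    # One random time per sample, so each sequence gets its own masking ratio.
    times = torch.rand(batch)
    # Assumption of this sketch: always mask at least one token.
    num_to_mask = (schedule(times) * seq_len).clamp(min=1.0)
    # argsort of random noise gives a random permutation of positions per row.
    ranks = torch.rand(batch, seq_len).argsort(dim=-1)
    mask = ranks < num_to_mask.unsqueeze(-1)
    masked = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
    return masked, mask


tokens = torch.randint(0, 9, (8, 32))  # clean data, with ID 9 reserved as mask_id
masked, mask = random_mask(tokens, mask_id=9)
```

Because the mask is drawn fresh on every call, the same clean sequence yields a different training example each time it is seen, which is why the dataset itself should stay unmasked.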