Principle:Lucidrains X transformers Masked Prediction Data Preparation

Field	Value
Repo	x-transformers
Domains	Data_Engineering, Generative_Models
Last Updated	2026-02-08 18:00 GMT

Overview

A data preparation pattern for building datasets of fixed-length token sequences for non-autoregressive masked prediction training.

Description

Non-autoregressive training requires datasets that yield complete, unmasked token sequences of exactly max_seq_len tokens. The NonAutoregressiveWrapper applies masking internally. Unlike autoregressive data preparation, where one extra token is needed so that inputs and targets can be offset by one position, no extra token is needed here: the targets are the original tokens at the masked positions, so there is no shifted-by-one relationship.

The key requirements are:

Sequences must be exactly max_seq_len tokens long (no padding, no variable length).
Sequences must contain unmasked integer token IDs. The wrapper handles all masking internally using a schedule-based masking strategy.
Token values must be in the range [0, num_tokens - 1], and must not include the mask_id token (which is reserved for the masking mechanism).

Note: no reference training script exists in the repository. These requirements are derived from NonAutoregressiveWrapper.forward(), specifically the assertion n == self.max_seq_len at the start of the method.
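
The requirements above can be checked before training with a small helper. The following is a minimal sketch in plain Python; validate_sequence and its parameter names are hypothetical and not part of x-transformers, and it assumes each sequence is a list of Python ints.

```python
def validate_sequence(seq, max_seq_len, num_tokens, mask_id):
    """Return `seq` unchanged if it meets the requirements, else raise ValueError.

    Hypothetical helper for illustration; not part of x-transformers.
    """
    # Requirement 1: exact length, no padding and no variable length.
    if len(seq) != max_seq_len:
        raise ValueError(f"expected {max_seq_len} tokens, got {len(seq)}")

    for pos, tok in enumerate(seq):
        # Requirement 2: integer token IDs (bool is rejected explicitly,
        # because bool is a subclass of int in Python).
        if isinstance(tok, bool) or not isinstance(tok, int):
            raise ValueError(f"non-integer token {tok!r} at position {pos}")
        # Requirement 3a: IDs must lie in [0, num_tokens - 1].
        if not 0 <= tok < num_tokens:
            raise ValueError(f"token {tok} at position {pos} is out of range")
        # Requirement 3b: the reserved mask token must not appear in the data.
        if tok == mask_id:
            raise ValueError(f"reserved mask_id {mask_id} found at position {pos}")

    return seq


# Example call with assumed values: a 4-token sequence, vocabulary of 10, mask_id 9.
checked = validate_sequence([1, 2, 3, 4], max_seq_len=4, num_tokens=10, mask_id=9)
```

Running such a check once over the dataset surfaces length or vocabulary problems early, rather than as an assertion failure inside the wrapper's forward pass.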

Usage

Use this pattern when preparing data for NonAutoregressiveWrapper training. Specifically:

Tokenize your corpus into integer token IDs.
Create a Dataset that returns sequences of exactly max_seq_len tokens.
Do not apply any masking to the data; the wrapper handles this.
Ensure that the special mask_id token does not appear in your data.
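
The steps above can be sketched as a map-style dataset. This is an illustrative sketch under several assumptions: FixedLengthTokenDataset is a hypothetical name, the byte-level "tokenizer" (UTF-8 bytes as IDs 0 to 255) and mask_id = 256 are stand-ins chosen for the example, and windows are non-overlapping with any trailing remainder dropped. The class is written without a PyTorch dependency; in a PyTorch pipeline one would likely subclass torch.utils.data.Dataset and return a long tensor from __getitem__.

```python
class FixedLengthTokenDataset:
    """Yields clean (unmasked) windows of exactly `max_seq_len` token IDs.

    Hypothetical class for illustration; not part of x-transformers.
    """

    def __init__(self, token_ids, max_seq_len, mask_id):
        if max_seq_len <= 0:
            raise ValueError("max_seq_len must be positive")
        self.token_ids = list(token_ids)
        # The reserved mask token must never appear in the raw data.
        if mask_id in self.token_ids:
            raise ValueError(f"data contains the reserved mask_id {mask_id}")
        self.max_seq_len = max_seq_len
        # Drop the trailing remainder so every window has full length
        # (no padding, and no extra token as in autoregressive setups).
        self.num_windows = len(self.token_ids) // max_seq_len

    def __len__(self):
        return self.num_windows

    def __getitem__(self, index):
        if not 0 <= index < self.num_windows:
            raise IndexError(index)
        start = index * self.max_seq_len
        # No masking here: the wrapper is expected to mask at training time.
        return self.token_ids[start:start + self.max_seq_len]


# Assumed toy tokenizer: UTF-8 bytes, so IDs fall in [0, 255] and 256 is free
# to serve as the mask token in this example.
corpus = list("a small corpus used only for illustration".encode("utf-8"))
dataset = FixedLengthTokenDataset(corpus, max_seq_len=8, mask_id=256)
```

Dropping the remainder is one design choice among several; overlapping or randomly sampled windows would also satisfy the fixed-length requirement.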

Theoretical Basis

Non-autoregressive masked prediction models (such as MaskGIT and related approaches) learn to predict masked tokens given the surrounding context. During training, a random subset of tokens is replaced with a mask token, and the model predicts the original values. The masking schedule (e.g., linear, cosine) controls the fraction of tokens masked at each step.

Because masking is a stochastic process applied at training time, the dataset should provide clean, unmasked sequences. This separation of data from augmentation allows the wrapper to:

Apply different masking ratios per sample within a batch.
Implement self-conditioning by re-masking model predictions.
Support both generator training and optional critic training.
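
To make the idea of training-time masking concrete, the sketch below masks each sequence in a batch with its own ratio drawn through a cosine schedule. It is a simplified illustration in plain Python, not the NonAutoregressiveWrapper's implementation: the function names, the exact form of the schedule, and the choice to always mask at least one token are assumptions made for the example.

```python
import math
import random


def cosine_schedule(t):
    """Map t in [0, 1) to a mask fraction in (0, 1]; t = 0 masks everything."""
    return math.cos(t * math.pi / 2)


def mask_batch(batch, mask_id, rng, schedule=cosine_schedule):
    """Return (masked_batch, masked_positions) for a list of clean sequences.

    Hypothetical helper for illustration; each sequence gets its own ratio.
    """
    masked_batch, masked_positions = [], []
    for seq in batch:
        # Draw an independent masking fraction for this sample.
        frac = schedule(rng.random())
        # Assumption: mask at least one token so there is always a target.
        num_to_mask = max(1, round(frac * len(seq)))
        positions = set(rng.sample(range(len(seq)), num_to_mask))
        masked_batch.append(
            [mask_id if i in positions else tok for i, tok in enumerate(seq)]
        )
        masked_positions.append(sorted(positions))
    return masked_batch, masked_positions


# The clean batch is left untouched; a new masking is drawn on every call.
clean = [[1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1]]
masked, positions = mask_batch(clean, mask_id=99, rng=random.Random(0))
```

Because the masking is redrawn on every call, the same clean sequence produces different training examples across epochs, which is why the dataset itself should stay unmasked.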

Related Pages

Implementation:Lucidrains_X_transformers_Masked_Dataset_Pattern
