Principle:Lucidrains X transformers Masked Prediction Data Preparation

From Leeroopedia


Repo: x-transformers
Domains: Data_Engineering, Generative_Models
Last Updated: 2026-02-08 18:00 GMT

Overview

Data preparation pattern for creating fixed-length token sequence datasets suitable for non-autoregressive masked prediction training.

Description

Non-autoregressive training requires datasets that yield complete (unmasked) token sequences of exactly max_seq_len tokens. Masking is applied internally by the NonAutoregressiveWrapper. Autoregressive data preparation typically yields one extra token per sequence (max_seq_len + 1) so that inputs and targets can be offset by one position; here no extra token is needed because there is no shifted-by-one target relationship.

The key requirements are:

  • Sequences must be exactly max_seq_len tokens long (no padding, no variable length).
  • Sequences must contain unmasked integer token IDs. The wrapper handles all masking internally using a schedule-based masking strategy.
  • Token values must be in the range [0, num_tokens - 1], and must not include the mask_id token (which is reserved for the masking mechanism).
  • No reference training script exists in the repository; the interface is derived from the requirements of NonAutoregressiveWrapper.forward(), specifically the check assert n == self.max_seq_len at the start of the method, where n is the sequence length of the input batch.
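
The requirements above can be sketched as a small PyTorch dataset. This is a minimal illustration rather than code from the repository: the class name FixedLengthTokenDataset, the choice to cut one long token stream into non-overlapping windows, and the toy vocabulary (IDs 0 to 255 with 256 reserved as mask_id) are all assumptions.

```python
import torch
from torch.utils.data import Dataset


class FixedLengthTokenDataset(Dataset):
    """Hypothetical helper: slices one long token stream into fixed-length windows."""

    def __init__(self, token_ids: torch.Tensor, max_seq_len: int, mask_id: int):
        if token_ids.dim() != 1:
            raise ValueError("token_ids must be a 1-D tensor of token IDs")
        token_ids = token_ids.long()
        if (token_ids == mask_id).any():
            raise ValueError("data must not contain the reserved mask_id")
        # Drop the trailing remainder so every window is exactly max_seq_len long.
        num_windows = token_ids.numel() // max_seq_len
        self.windows = token_ids[: num_windows * max_seq_len].reshape(num_windows, max_seq_len)

    def __len__(self) -> int:
        return self.windows.size(0)

    def __getitem__(self, index: int) -> torch.Tensor:
        # Returns clean (unmasked) tokens; no shifted-by-one target is produced.
        return self.windows[index]


# Toy usage: 1000 random IDs in [0, 255], with ID 256 assumed reserved as mask_id.
stream = torch.randint(0, 256, (1000,))
dataset = FixedLengthTokenDataset(stream, max_seq_len=128, mask_id=256)
```

Dropping the remainder instead of padding keeps every item at exactly max_seq_len, which is what the length assertion in the forward method expects. Overlapping windows or random crops would satisfy the same constraint.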

Usage

Use this pattern when preparing data for NonAutoregressiveWrapper training. Specifically:

  • Tokenize your corpus into integer token IDs.
  • Create a Dataset that returns sequences of exactly max_seq_len tokens.
  • Do not apply any masking to the data; the wrapper handles this.
  • Ensure that the special mask_id token does not appear in your data.
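
A standard-library sketch of these steps might look like the following. The byte-level tokenizer, the helper names (byte_tokenize, chunk_exact, validate_chunks), and the vocabulary layout with mask_id as the last ID are illustrative assumptions, not part of x-transformers.

```python
def byte_tokenize(text: str) -> list[int]:
    """Toy tokenizer (an assumption for illustration): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def chunk_exact(token_ids: list[int], max_seq_len: int) -> list[list[int]]:
    """Split into non-overlapping chunks of exactly max_seq_len, dropping the remainder."""
    usable = len(token_ids) - len(token_ids) % max_seq_len
    return [token_ids[i : i + max_seq_len] for i in range(0, usable, max_seq_len)]


def validate_chunks(chunks, max_seq_len: int, num_tokens: int, mask_id: int) -> None:
    """Raise ValueError if any chunk violates the data requirements."""
    for chunk in chunks:
        if len(chunk) != max_seq_len:
            raise ValueError(f"expected length {max_seq_len}, got {len(chunk)}")
        if any(t < 0 or t >= num_tokens for t in chunk):
            raise ValueError("token ID outside [0, num_tokens - 1]")
        if mask_id in chunk:
            raise ValueError("reserved mask_id found in data")


NUM_TOKENS = 257  # 256 byte values plus one reserved ID (assumed vocabulary layout)
MASK_ID = 256     # never produced by the byte tokenizer
MAX_SEQ_LEN = 8

chunks = chunk_exact(byte_tokenize("masked prediction needs clean data"), MAX_SEQ_LEN)
validate_chunks(chunks, MAX_SEQ_LEN, NUM_TOKENS, MASK_ID)
```

Running the validation once over the prepared chunks, before training starts, surfaces a stray mask_id or an out-of-range ID as a clear error rather than as a silent training problem.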

Theoretical Basis

Non-autoregressive masked prediction models (such as MaskGIT and related approaches) learn to predict masked tokens given the surrounding context. During training, a random subset of tokens is replaced with a mask token, and the model predicts the original values. The masking schedule (e.g., linear, cosine) controls the fraction of tokens masked at each step.
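
To make the idea concrete, here is a simplified standard-library sketch of schedule-based masking for a single sequence. The schedule formulas shown are common choices and are assumed here for illustration; the function names and the "mask at least one token" rule are likewise assumptions and not a reproduction of the wrapper's internals.

```python
import math
import random


def linear_schedule(t: float) -> float:
    """Mask fraction falls linearly from 1 to 0 as t goes from 0 to 1."""
    return 1.0 - t


def cosine_schedule(t: float) -> float:
    """Mask fraction follows a quarter cosine from 1 (t=0) to 0 (t=1)."""
    return math.cos(t * math.pi / 2)


def mask_sequence(tokens, mask_id, mask_fraction, rng):
    """Replace a random subset of positions with mask_id.

    Returns the corrupted sequence and the masked positions (the prediction
    targets). At least one position is masked so there is something to predict.
    """
    num_to_mask = max(1, round(mask_fraction * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), num_to_mask))
    corrupted = list(tokens)
    for pos in positions:
        corrupted[pos] = mask_id
    return corrupted, positions


rng = random.Random(0)
clean = [5, 9, 2, 7, 1, 3, 8, 4]
t = rng.random()  # a random point on the schedule for this training example
corrupted, positions = mask_sequence(
    clean, mask_id=256, mask_fraction=cosine_schedule(t), rng=rng
)
```

The model would then be trained to recover clean[p] for each p in positions, given corrupted as input.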

Because masking is a stochastic process applied at training time, the dataset should provide clean, unmasked sequences. This separation of data from augmentation allows the wrapper to:

  • Apply different masking ratios per sample within a batch.
  • Implement self-conditioning by re-masking model predictions.
  • Support both generator training and optional critic training.
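
The first point, per-sample masking ratios, can be illustrated with a short PyTorch sketch. It assumes the ratio for each row is drawn uniformly at random, which is a simplification; the function name mask_batch and the rank-based selection are illustrative choices rather than the wrapper's actual implementation.

```python
import torch


def mask_batch(tokens: torch.Tensor, mask_id: int, generator=None):
    """Mask each row of a (batch, seq_len) tensor with its own random ratio."""
    batch, seq_len = tokens.shape
    # One masking ratio per sample, drawn uniformly from [0, 1).
    ratios = torch.rand(batch, generator=generator)
    # Convert ratios to per-row counts, masking at least one token per row.
    num_to_mask = (ratios * seq_len).round().clamp(min=1).long()
    # Rank positions by random scores; the lowest-ranked positions get masked.
    scores = torch.rand(batch, seq_len, generator=generator)
    ranks = scores.argsort(dim=-1).argsort(dim=-1)
    mask = ranks < num_to_mask.unsqueeze(-1)
    # masked_fill is out-of-place, so the clean batch is left untouched.
    corrupted = tokens.masked_fill(mask, mask_id)
    return corrupted, mask


gen = torch.Generator().manual_seed(0)
clean = torch.randint(0, 256, (4, 16), generator=gen)
corrupted, mask = mask_batch(clean, mask_id=256, generator=gen)
```

Because the dataset supplies clean sequences, the same batch can be corrupted differently on every training step, and the boolean mask identifies which positions contribute to the loss.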
