Principle:Facebookresearch Audiocraft Masked Parallel Token Generation

From Leeroopedia
Knowledge Sources
Domains Audio_Generation, Masked_Modeling
Last Updated 2026-02-14 01:00 GMT

Overview

A non-autoregressive generation strategy that iteratively unmasks discrete audio tokens in parallel using a cosine schedule, enabling faster inference than sequential autoregressive decoding.

Description

Masked parallel token generation, the decoding scheme used by MAGNeT, replaces traditional left-to-right autoregressive decoding with a parallel iterative approach. Starting from a fully masked sequence, the model predicts all tokens simultaneously, then retains the most confident predictions and re-masks the rest. This process repeats for a configurable number of steps, with the masking ratio following a cosine annealing schedule. The codebook levels of the residual vector quantizer are decoded one level at a time, from the coarsest to the finest, allowing the model to capture both coarse and fine audio structure.

Usage

Use this principle when designing or understanding non-autoregressive audio generation models that need faster inference than autoregressive approaches. It is the core generation algorithm behind MAGNeT models for text-to-music and text-to-sound generation.

Theoretical Basis

The masking schedule follows a cosine function:

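$$\gamma(t) = \cos\left(\frac{\pi t}{2T}\right)$$

Here T is the total number of decoding steps. At decoding step t in [0, T], the fraction of tokens remaining masked is approximately γ(t): everything is masked at t = 0 (γ = 1) and nothing is masked at t = T (γ = 0). Tokens are scored by their prediction confidence, and the most confident fraction 1 - γ(t) is revealed.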
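To make the schedule concrete, the following minimal sketch tabulates how many tokens are still masked after each step. It assumes the schedule is normalized by the total step count, as in the pseudo-code on this page, and that steps are counted from 1 so the last step reveals everything. The helper name `masked_counts` is hypothetical and not part of Audiocraft.

```python
import math


def masked_counts(seq_len: int, num_steps: int) -> list[int]:
    """Tokens still masked after each step t = 1..num_steps (illustrative helper)."""
    counts = []
    for t in range(1, num_steps + 1):
        # Cosine schedule: close to 1 early on, reaching 0 at t = num_steps.
        gamma = math.cos(math.pi * t / (2 * num_steps))
        counts.append(int(gamma * seq_len))
    return counts


print(masked_counts(seq_len=100, num_steps=4))  # [92, 70, 38, 0]
```

The counts fall slowly at first and quickly near the end, so early steps commit only a few high-confidence tokens while later steps fill in the bulk of the sequence.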

Pseudo-code:

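# Abstract decoding algorithm (NOT actual implementation)
tokens = MASK * ones(sequence_length)
for step in range(num_steps):
    is_masked = (tokens == MASK)
    logits = model(tokens, conditions)
    probs = softmax(logits)
    predictions = sample(probs)
    # Confidence is the probability assigned to each sampled token;
    # tokens committed in earlier steps are never re-masked.
    confidence = probs[predictions]
    confidence[~is_masked] = +inf
    # Fill only the masked positions, keeping earlier commitments.
    tokens[is_masked] = predictions[is_masked]
    # Fraction left masked after this step; reaches 0 on the final step.
    mask_ratio = cos(pi * (step + 1) / (2 * num_steps))
    num_to_mask = int(mask_ratio * sequence_length)
    least_confident = argsort(confidence)[:num_to_mask]
    tokens[least_confident] = MASK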
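The loop above can be exercised end to end with a toy stand-in for the network. The sketch below is illustrative only and is not the Audiocraft implementation: `toy_model`, `masked_parallel_decode`, and the `MASK` sentinel are hypothetical names, the "model" returns random probabilities, and a single codebook is decoded. It makes two assumptions explicit: tokens committed in earlier steps are protected from re-masking, and the schedule is evaluated at `step + 1` so nothing is left masked after the final step.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a masked position


def toy_model(tokens, vocab_size, rng):
    """Stand-in for a trained network: one random distribution per position."""
    probs = []
    for _ in tokens:
        weights = [rng.random() for _ in range(vocab_size)]
        total = sum(weights)
        probs.append([w / total for w in weights])
    return probs


def masked_parallel_decode(seq_len, vocab_size, num_steps, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(num_steps):
        probs = toy_model(tokens, vocab_size, rng)
        confidence = []
        for i in range(seq_len):
            if tokens[i] == MASK:
                # Sample a candidate and score it by its own probability.
                tokens[i] = rng.choices(range(vocab_size), weights=probs[i])[0]
                confidence.append(probs[i][tokens[i]])
            else:
                # Already committed: infinite confidence, never re-masked.
                confidence.append(math.inf)
        # Cosine schedule; num_to_mask drops to 0 on the last step.
        mask_ratio = math.cos(math.pi * (step + 1) / (2 * num_steps))
        num_to_mask = int(mask_ratio * seq_len)
        # Re-mask the least confident candidates from this step.
        order = sorted(range(seq_len), key=lambda i: confidence[i])
        for i in order[:num_to_mask]:
            tokens[i] = MASK
    return tokens


decoded = masked_parallel_decode(seq_len=16, vocab_size=8, num_steps=5)
print(decoded)
```

Because the number of re-masked tokens shrinks at every step, the positions chosen for re-masking always come from the candidates sampled in the current step, and the returned sequence contains no `MASK` entries.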
