Principle:Facebookresearch Audiocraft Masked Parallel Token Generation
| Knowledge Sources | |
|---|---|
| Domains | Audio_Generation, Masked_Modeling |
| Last Updated | 2026-02-14 01:00 GMT |
Overview
A non-autoregressive generation strategy that iteratively unmasks discrete audio tokens in parallel using a cosine schedule, enabling faster inference than sequential autoregressive decoding.
Description
Masked parallel token generation, the decoding strategy used by MAGNeT, replaces traditional left-to-right autoregressive decoding with a parallel iterative approach. Starting from a fully masked sequence, the model predicts all tokens simultaneously, then retains the most confident predictions and re-masks the rest. This process repeats for a configurable number of steps, with the masking ratio following a cosine schedule that decays from 1 to 0. Each codebook level in the residual vector quantizer is decoded in its own pass, allowing the model to capture both coarse and fine audio structure.
Usage
Use this principle when designing or understanding non-autoregressive audio generation models that need faster inference than autoregressive approaches. It is the core generation algorithm behind MAGNeT models for text-to-music and text-to-sound generation.
Theoretical Basis
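The masking schedule follows a cosine function:
$$\gamma(t) = \cos\left(\frac{\pi t}{2T}\right)$$
where T is the total number of decoding steps. At each decoding step t in [0, T], the fraction of tokens that remain masked is approximately \gamma(t), which decays from 1 at t = 0 to 0 at t = T. Tokens are scored by their prediction confidence; the most confident fraction, 1 - \gamma(t), is revealed and the rest are re-masked.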
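To make the schedule concrete, the following minimal sketch evaluates the cosine function for a short run and converts it into token counts. The helper names `masked_fraction` and `masked_counts`, and the choice of 100 tokens over 4 steps, are illustrative assumptions rather than part of any library.

```python
import math

def masked_fraction(t, total_steps):
    """Cosine schedule: cos(pi * t / (2 * T)), the fraction still masked at step t."""
    return math.cos(math.pi * t / (2 * total_steps))

def masked_counts(sequence_length, total_steps):
    """Number of tokens still masked after each step t = 1..T (hypothetical helper)."""
    return [
        int(masked_fraction(t, total_steps) * sequence_length)
        for t in range(1, total_steps + 1)
    ]

print(masked_counts(sequence_length=100, total_steps=4))  # prints [92, 70, 38, 0]
```

Few tokens are committed in the early steps, when the model has little context, and most are committed near the end, when the surrounding tokens are already fixed.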
Pseudo-code:
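# Abstract decoding algorithm (NOT actual implementation)
tokens = MASK * ones(sequence_length)
for step in range(num_steps):
    is_masked = (tokens == MASK)
    logits = model(tokens, conditions)
    predictions = sample(logits)
    # confidence of each prediction (max over the vocabulary axis)
    confidence = max(softmax(logits), axis=-1)
    # tokens revealed in earlier steps are never re-masked
    confidence[~is_masked] = +inf
    # fraction still masked after this step; reaches 0 on the last step
    mask_ratio = cos(pi * (step + 1) / (2 * num_steps))
    num_to_mask = int(mask_ratio * sequence_length)
    least_confident = argsort(confidence)[:num_to_mask]
    # fill only the masked positions, then re-mask the least confident ones
    tokens = where(is_masked, predictions, tokens)
    tokens[least_confident] = MASK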
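The loop can be exercised end to end with a stand-in model. The sketch below is a toy under stated assumptions, not the AudioCraft implementation: `toy_model` is a hypothetical placeholder that returns random probabilities, `MASK = -1` is an arbitrary sentinel, predictions are picked greedily instead of sampled, the schedule is evaluated at `step + 1` so that the final step leaves nothing masked, and tokens revealed in earlier steps are kept fixed.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a masked position

def toy_model(tokens, vocab_size, rng):
    """Stand-in for a real network: one random probability vector per position."""
    probs = []
    for _ in tokens:
        weights = [rng.random() for _ in range(vocab_size)]
        total = sum(weights)
        probs.append([w / total for w in weights])
    return probs

def masked_parallel_decode(sequence_length, vocab_size, num_steps, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * sequence_length
    for step in range(num_steps):
        probs = toy_model(tokens, vocab_size, rng)
        confidence = []
        for i, p in enumerate(probs):
            if tokens[i] == MASK:
                # Greedy prediction for a masked position; its probability is the confidence.
                best = max(range(vocab_size), key=p.__getitem__)
                tokens[i] = best
                confidence.append(p[best])
            else:
                # Already revealed: infinite confidence, so it is never re-masked.
                confidence.append(math.inf)
        # Cosine schedule: fraction of the sequence left masked after this step.
        mask_ratio = math.cos(math.pi * (step + 1) / (2 * num_steps))
        num_to_mask = int(mask_ratio * sequence_length)
        # Re-mask the least confident positions.
        order = sorted(range(sequence_length), key=confidence.__getitem__)
        for i in order[:num_to_mask]:
            tokens[i] = MASK
    return tokens

decoded = masked_parallel_decode(sequence_length=16, vocab_size=8, num_steps=4)
print(decoded)
```

Because the number of positions to re-mask shrinks at every step and revealed positions carry infinite confidence, only freshly predicted tokens are ever re-masked, and the sequence is fully decoded after `num_steps` model calls rather than one call per token.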