Implementation:Facebookresearch Audiocraft DiffusionUnet

From Leeroopedia
Knowledge Sources
Domains Audio_Generation, Diffusion, Model_Architecture
Last Updated 2026-02-14 01:00 GMT

Overview

DiffusionUnet is the 1D U-Net denoiser, built with Transformer blocks, that the Multi-Band Diffusion system uses to denoise audio signals.

Description

DiffusionUnet implements a 1D convolutional U-Net architecture designed for score-based diffusion models operating on audio. It features an encoder-decoder structure with skip connections, where each resolution level contains a StreamingTransformer block for self-attention and optional cross-attention (conditioned on codec embeddings). The diffusion timestep is embedded via a sinusoidal positional encoding and injected as a FiLM-style modulation at each level.
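The timestep conditioning described above can be sketched in plain Python. This is a minimal illustration, not the AudioCraft code: the function names `sinusoidal_embedding` and `film`, the `max_period` constant, and the `(1 + scale) * x + shift` form are assumptions chosen for clarity. In a real network the scale and shift would come from a learned projection of the embedding; here they are passed in directly.

```python
import math
from typing import List


def sinusoidal_embedding(step: int, dim: int, max_period: float = 10000.0) -> List[float]:
    """Embed a scalar diffusion step as `dim` sine/cosine features (dim must be even)."""
    half = dim // 2
    # Frequencies decay geometrically from 1 towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(step * f) for f in freqs] + [math.cos(step * f) for f in freqs]


def film(features: List[List[float]], scale: List[float], shift: List[float]) -> List[List[float]]:
    """FiLM-style modulation: one (scale, shift) pair per channel, broadcast over time.

    `features` is a [C][T] nested list. The scale and shift are hypothetical
    stand-ins for the output of a learned projection of the step embedding.
    """
    return [
        [(1.0 + s) * value + b for value in channel]
        for channel, s, b in zip(features, scale, shift)
    ]


emb = sinusoidal_embedding(step=0, dim=8)
modulated = film([[1.0, 2.0]], scale=[1.0], shift=[0.5])
```

With a zero scale and zero shift the modulation is the identity, which is why the `1 + scale` parameterisation is a common choice for conditioning layers.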

Usage

Import this class when building diffusion-based audio enhancement or generation models within AudioCraft. It serves as the score model (denoiser) in the Multi-Band Diffusion pipeline that converts discrete audio tokens to high-fidelity waveforms.
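To show where a denoiser sits in a diffusion pipeline, here is a generic DDPM-style reverse loop in plain Python. It is a sketch under stated assumptions, not the sampler used by Multi-Band Diffusion: the function name `ddpm_reverse`, the noise-prediction parameterisation, the `betas` schedule, and the callable `predict_noise(x, t)` (standing in for a call such as `model(x, step)`) are all hypothetical.

```python
import math
import random
from typing import Callable, List, Optional


def ddpm_reverse(
    x: List[float],
    betas: List[float],
    predict_noise: Callable[[List[float], int], List[float]],
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Run a DDPM reverse process from the last step down to step 0.

    If `rng` is None, no noise is re-injected and only the posterior mean is
    followed, which makes the loop deterministic.
    """
    alphas = [1.0 - b for b in betas]
    alpha_bars = []
    running = 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)  # cumulative product of alphas

    for t in reversed(range(len(betas))):
        eps = predict_noise(x, t)  # the denoiser call
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(alphas[t]) for xi, ei in zip(x, eps)]
        if t > 0 and rng is not None:
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x


# A dummy denoiser that always predicts zero noise.
result = ddpm_reverse([1.0], betas=[0.1, 0.2], predict_noise=lambda x, t: [0.0] * len(x))
```

The only model-specific piece is `predict_noise`; everything else is schedule arithmetic, which is why the same network can be reused across samplers.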

Code Reference

Source Location

Signature

class DiffusionUnet(nn.Module):
    def __init__(self, chin: int = 1, hidden: int = 512, depth: int = 7,
                 growth: float = 1, max_channels: int = 10000,
                 num_heads: int = 4, cross_attention: bool = False,
                 bilevel: bool = False, codec_dim: tp.Optional[int] = None,
                 **kwargs):
        """
        Args:
            chin: Number of input audio channels.
            hidden: Initial hidden dimension.
            depth: Number of encoder/decoder levels.
            growth: Channel growth factor per level.
            max_channels: Maximum channels at any level.
            num_heads: Attention heads in transformer blocks.
            cross_attention: Enable cross-attention for conditioning.
            bilevel: Use two-level scheduling.
            codec_dim: Dimension of codec conditioning embeddings.
        """

Import

from audiocraft.models.unet import DiffusionUnet
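The interaction between `hidden`, `growth`, `depth`, and `max_channels` in the signature can be illustrated with a small helper. This is a sketch that assumes each level multiplies the previous width by `growth` and clamps it at `max_channels`, as the docstring suggests; `channel_schedule` is a hypothetical name and is not part of AudioCraft.

```python
from typing import List, Tuple


def channel_schedule(
    chin: int, hidden: int, depth: int, growth: float, max_channels: int
) -> List[Tuple[int, int]]:
    """Return (in_channels, out_channels) for each encoder level."""
    levels = []
    for _ in range(depth):
        levels.append((chin, hidden))
        chin = hidden
        # Assumed rule: grow the width, but never beyond max_channels.
        hidden = min(int(chin * growth), max_channels)
    return levels


# Defaults listed in the signature: the width stays at 512 because growth is 1.
default_levels = channel_schedule(chin=1, hidden=512, depth=7, growth=1, max_channels=10000)

# A growing configuration that hits the cap at the third level.
capped_levels = channel_schedule(chin=1, hidden=32, depth=4, growth=2, max_channels=100)
```

Under this assumed rule a decoder would mirror the list in reverse order, so the skip connection at each level joins tensors of matching width.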

I/O Contract

Inputs

Name | Type | Required | Description
x | torch.Tensor | Yes | Noisy audio input [B, C, T]
step | torch.Tensor | Yes | Diffusion timestep [B] or scalar
condition | torch.Tensor | No | Codec conditioning embeddings [B, D, T']

Outputs

Name | Type | Description
sample | torch.Tensor | Denoised output [B, C, T]
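The shape contract above can be expressed as a small validation helper. This is a hedged sketch that works on shape tuples rather than tensors so it runs without PyTorch; `check_inputs` is a hypothetical name, and the real module may accept or reject inputs differently.

```python
from typing import List, Optional, Sequence, Tuple, Union


def check_inputs(
    x_shape: Sequence[int],
    step: Union[int, Sequence[int]],
    condition_shape: Optional[Sequence[int]] = None,
) -> Tuple[List[int], Tuple[int, ...]]:
    """Validate shapes against the I/O contract.

    Returns the per-item steps (a scalar step is broadcast over the batch)
    and the expected output shape, which equals the input shape [B, C, T].
    """
    if len(x_shape) != 3:
        raise ValueError("x must have shape [B, C, T]")
    batch = x_shape[0]

    if isinstance(step, int):
        steps = [step] * batch  # scalar step: same value for every item
    else:
        steps = list(step)
        if len(steps) != batch:
            raise ValueError("step must be a scalar or have shape [B]")

    if condition_shape is not None:
        # The condition has its own length T', so only rank and batch are checked.
        if len(condition_shape) != 3 or condition_shape[0] != batch:
            raise ValueError("condition must have shape [B, D, T']")

    return steps, tuple(x_shape)


steps, out_shape = check_inputs((2, 1, 16000), step=5, condition_shape=(2, 128, 50))
```

Note that `T'` is allowed to differ from `T`: the conditioning embeddings run at the codec frame rate, not the audio sample rate.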

Usage Examples

As Part of Multi-Band Diffusion

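from audiocraft.models import MusicGen, MultiBandDiffusion

# Produce codec tokens with MusicGen; any source of compatible
# codec tokens would do (the checkpoint name is only an example)
model = MusicGen.get_pretrained('facebook/musicgen-small')
model.set_generation_params(duration=8)
wav, tokens = model.generate_unconditional(1, return_tokens=True)

# DiffusionUnet is loaded internally by MultiBandDiffusion
mbd = MultiBandDiffusion.get_mbd_musicgen()

# Decode the codec tokens to a waveform with the diffusion decoder
wav_diffusion = mbd.tokens_to_wav(tokens)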
