Implementation:Facebookresearch Audiocraft DiffusionUnet
| Knowledge Sources | |
|---|---|
| Domains | Audio_Generation, Diffusion, Model_Architecture |
| Last Updated | 2026-02-14 01:00 GMT |
Overview
A 1D U-Net with Transformer blocks that denoises audio signals. It is the denoising network used in the Multi-Band Diffusion system.
Description
DiffusionUnet implements a 1D convolutional U-Net architecture designed for score-based diffusion models operating on audio. It features an encoder-decoder structure with skip connections, where each resolution level contains a StreamingTransformer block for self-attention and optional cross-attention (conditioned on codec embeddings). The diffusion timestep is embedded via a sinusoidal positional encoding and injected as a FiLM-style modulation at each level.
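The timestep conditioning described above can be illustrated with a small standard-library sketch. It assumes a sinusoidal encoding and a per-channel affine (FiLM-style) modulation of the form `(1 + scale) * x + shift`. The function names, the `max_period` value, and the fixed scale/shift values are hypothetical and are not taken from `unet.py`, whose actual embedding and modulation may differ.

```python
import math
from typing import List


def sinusoidal_embedding(step: float, dim: int, max_period: float = 10000.0) -> List[float]:
    """Map a scalar diffusion step to a `dim`-sized vector of sines and cosines."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Frequencies fall geometrically from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(step * f) for f in freqs] + [math.cos(step * f) for f in freqs]


def film(features: List[List[float]], scale: List[float], shift: List[float]) -> List[List[float]]:
    """Apply a per-channel affine modulation to a [C][T] feature map."""
    return [
        [(1.0 + s) * v + b for v in channel]
        for channel, s, b in zip(features, scale, shift)
    ]


emb = sinusoidal_embedding(step=0, dim=8)

# In a real model, scale and shift would come from a learned projection of `emb`;
# here they are fixed placeholder values (an assumption for illustration).
features = [[1.0, 2.0], [3.0, 4.0]]  # C=2 channels, T=2 frames
modulated = film(features, scale=[0.5, 0.0], shift=[0.0, -1.0])
```

Writing the scale as `1 + scale` keeps the modulation close to the identity when the projection outputs are near zero, which is one common design choice for this kind of conditioning.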
Usage
Import this class when building diffusion-based audio enhancement or generation models within AudioCraft. It serves as the score model (denoiser) in the Multi-Band Diffusion pipeline that converts discrete audio tokens to high-fidelity waveforms.
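To show where a denoiser sits in a sampling loop, here is a minimal sketch of the textbook DDPM ancestral sampling step, written with the standard library only. It assumes the model predicts the added noise. The linear beta schedule, the `toy_denoiser` placeholder, and all function names are hypothetical; AudioCraft's own sampling loop and noise schedule may differ.

```python
import math
import random
from typing import Callable, List

Signal = List[float]


def ddpm_reverse_step(
    x_t: Signal,
    t: int,
    betas: List[float],
    denoiser: Callable[[Signal, int], Signal],
    rng: random.Random,
) -> Signal:
    """One ancestral sampling step x_t -> x_{t-1} for a noise-predicting model."""
    alpha_t = 1.0 - betas[t]
    # Cumulative product of (1 - beta) up to and including step t.
    alpha_bar_t = math.prod(1.0 - b for b in betas[: t + 1])
    eps_hat = denoiser(x_t, t)
    coef = betas[t] / math.sqrt(1.0 - alpha_bar_t)
    mean = [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(x_t, eps_hat)]
    if t == 0:
        return mean  # no noise is added on the final step
    sigma_t = math.sqrt(betas[t])
    return [m + sigma_t * rng.gauss(0.0, 1.0) for m in mean]


def toy_denoiser(x: Signal, t: int) -> Signal:
    """Placeholder that predicts zero noise; a trained network would go here."""
    return [0.0 for _ in x]


betas = [0.01 * (i + 1) for i in range(10)]  # hypothetical linear schedule
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(4)]  # start from Gaussian noise
for t in reversed(range(len(betas))):
    x = ddpm_reverse_step(x, t, betas, toy_denoiser, rng)
```

In the real pipeline the placeholder would be replaced by the trained network, and the signal would be a batched tensor rather than a flat list.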
Code Reference
Source Location
- Repository: Facebookresearch_Audiocraft
- File: audiocraft/models/unet.py
- Lines: 1-214
Signature
class DiffusionUnet(nn.Module):
    def __init__(self, chin: int = 1, hidden: int = 512, depth: int = 7,
                 growth: float = 1, max_channels: int = 10000,
                 num_heads: int = 4, cross_attention: bool = False,
                 bilevel: bool = False, codec_dim: tp.Optional[int] = None,
                 **kwargs):
        """
        Args:
            chin: Number of input audio channels.
            hidden: Initial hidden dimension.
            depth: Number of encoder/decoder levels.
            growth: Channel growth factor per level.
            max_channels: Maximum channels at any level.
            num_heads: Attention heads in transformer blocks.
            cross_attention: Enable cross-attention for conditioning.
            bilevel: Use two-level scheduling.
            codec_dim: Dimension of codec conditioning embeddings.
        """
Import
from audiocraft.models.unet import DiffusionUnet
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| x | torch.Tensor | Yes | Noisy audio input [B, C, T] |
| step | torch.Tensor | Yes | Diffusion timestep [B] or scalar |
| condition | torch.Tensor | No | Codec conditioning embeddings [B, D, T'] |
Outputs
| Name | Type | Description |
|---|---|---|
| sample | torch.Tensor | Denoised output [B, C, T] |
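The shape contract in the tables above can be expressed as a small check on plain shape tuples. This is a sketch only: `check_io_shapes` is a hypothetical helper, not part of AudioCraft, and the requirement that the condition's channel dimension equals `codec_dim` is an assumption drawn from the parameter description.

```python
from typing import Optional, Sequence, Tuple


def check_io_shapes(
    x_shape: Sequence[int],
    step_shape: Sequence[int],
    condition_shape: Optional[Sequence[int]] = None,
    codec_dim: Optional[int] = None,
) -> Tuple[int, ...]:
    """Validate input shapes and return the expected output shape [B, C, T]."""
    if len(x_shape) != 3:
        raise ValueError(f"x must be [B, C, T], got {tuple(x_shape)}")
    batch = x_shape[0]
    # step is either a scalar (empty shape) or one value per batch item.
    if tuple(step_shape) not in ((), (batch,)):
        raise ValueError(f"step must be scalar or [B], got {tuple(step_shape)}")
    if condition_shape is not None:
        if len(condition_shape) != 3 or condition_shape[0] != batch:
            raise ValueError(f"condition must be [B, D, T'], got {tuple(condition_shape)}")
        # Assumption: D matches the codec_dim passed to the constructor.
        if codec_dim is not None and condition_shape[1] != codec_dim:
            raise ValueError(f"condition has D={condition_shape[1]}, expected {codec_dim}")
    # The denoised sample has the same shape as the noisy input.
    return tuple(x_shape)


out_shape = check_io_shapes((2, 1, 16000), (2,), (2, 128, 50), codec_dim=128)
```

Note that the condition's time axis `T'` is not required to match `T`.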
Usage Examples
As Part of Multi-Band Diffusion
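from audiocraft.models import MultiBandDiffusion

# DiffusionUnet is loaded internally by MultiBandDiffusion
mbd = MultiBandDiffusion.get_mbd_musicgen()

# `tokens` must already hold discrete codec tokens, for example the second
# value returned by MusicGen.generate(..., return_tokens=True)
wav = mbd.tokens_to_wav(tokens)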