Principle:Facebookresearch Audiocraft Diffusion UNet Architecture

Knowledge Sources	Multi-Band Diffusion, DDPM
Domains	Diffusion, Model_Architecture
Last Updated	2026-02-14 01:00 GMT

Overview

A U-Net neural network architecture adapted for 1D audio that serves as the score estimator in diffusion-based audio generation and enhancement models.

Description

The Diffusion U-Net Architecture is a convolutional encoder-decoder with skip connections, adapted from image diffusion models to operate on 1D audio signals. At each resolution level, a Transformer block provides self-attention for capturing long-range temporal dependencies. The diffusion timestep is embedded via sinusoidal encoding and injected as a modulation signal. Optional cross-attention layers enable conditioning on external signals such as codec latent embeddings.
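The sinusoidal timestep encoding mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Audiocraft implementation: the function name `sinusoidal_embedding`, the `max_period` default of 10000, and the sine-then-cosine layout are assumptions borrowed from the common Transformer/DDPM convention.

```python
import math
from typing import List


def sinusoidal_embedding(t: int, dim: int, max_period: float = 10000.0) -> List[float]:
    """Return a length-`dim` sinusoidal encoding of diffusion timestep `t`.

    Hypothetical helper for illustration; the layout (all sines, then all
    cosines) is an assumption, not taken from the Audiocraft code.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometrically spaced frequencies, starting at 1 and decaying toward 1/max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    # First half holds sines, second half holds cosines of the same phases.
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


if __name__ == "__main__":
    emb = sinusoidal_embedding(t=10, dim=8)
    print(len(emb))
```

In a model of this kind, the resulting vector would typically pass through a small learned projection before being added to, or used to scale, the feature maps at each level.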

Usage

Use this principle when designing score-based diffusion models for audio. It is the standard denoising architecture for Multi-Band Diffusion systems that convert discrete audio tokens to high-fidelity waveforms.

Theoretical Basis

The U-Net estimates the score function, i.e. the gradient of the log-density $p_{t}$ of the noised data with respect to the noisy sample $x_{t}$ at diffusion timestep $t$:

$s_{\theta}(x_{t}, t) \approx \nabla_{x_{t}} \log p_{t}(x_{t})$
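
In DDPM-style training the network is usually trained to predict the added noise rather than the score itself; under the standard Gaussian forward process the two differ only by a rescaling. The sketch below uses the usual DDPM symbols $\bar{\alpha}_{t}$ (cumulative signal coefficient) and $\epsilon_{\theta}$ (noise prediction), which are assumptions of this note and are not defined elsewhere on this page:

$x_{t} = \sqrt{\bar{\alpha}_{t}}\, x_{0} + \sqrt{1 - \bar{\alpha}_{t}}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$

$\nabla_{x_{t}} \log p(x_{t} \mid x_{0}) = -\frac{x_{t} - \sqrt{\bar{\alpha}_{t}}\, x_{0}}{1 - \bar{\alpha}_{t}} = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_{t}}}$

$s_{\theta}(x_{t}, t) \approx -\frac{\epsilon_{\theta}(x_{t}, t)}{\sqrt{1 - \bar{\alpha}_{t}}}$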

The architecture passes the noisy input through progressively lower temporal resolutions (the encoder), applies attention at each level, then upsamples back to the input resolution (the decoder), combining each decoder stage with the output of the matching encoder level through a skip connection.
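
The encoder, attention, decoder, and skip-connection data flow can be illustrated with a toy that has no learned weights. This is only a sketch of the wiring under stated assumptions: `downsample`, `upsample`, `unet_forward`, and the `attend` callback are hypothetical names, pair-averaging and sample repetition stand in for strided and transposed convolutions, and skips are merged by addition (some U-Nets concatenate instead).

```python
from typing import Callable, List

Signal = List[float]


def downsample(x: Signal) -> Signal:
    """Halve the temporal resolution by averaging adjacent pairs."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]


def upsample(x: Signal) -> Signal:
    """Double the temporal resolution by repeating each sample."""
    return [v for v in x for _ in range(2)]


def unet_forward(
    x: Signal,
    depth: int,
    attend: Callable[[Signal], Signal] = lambda s: s,
) -> Signal:
    """Toy U-Net pass; `attend` is a placeholder for a per-level attention block."""
    if len(x) % (2 ** depth) != 0:
        raise ValueError("input length must be divisible by 2**depth")
    skips: List[Signal] = []
    h = x
    # Encoder: process at the current resolution, save a skip, then downsample.
    for _ in range(depth):
        h = attend(h)
        skips.append(h)
        h = downsample(h)
    # Bottleneck at the lowest resolution.
    h = attend(h)
    # Decoder: upsample, merge the matching encoder skip (by addition), process.
    for _ in range(depth):
        h = upsample(h)
        skip = skips.pop()
        h = [a + b for a, b in zip(h, skip)]
        h = attend(h)
    return h


if __name__ == "__main__":
    out = unet_forward([1.0] * 8, depth=2)
    print(len(out))
```

Because the skips are popped in reverse order, each decoder stage receives the encoder output with the same length, which is why the output length always matches the input length.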

Related Pages

Implementation:Facebookresearch_Audiocraft_DiffusionUnet
