Principle:Facebookresearch Audiocraft Diffusion UNet Architecture
| Knowledge Sources | |
|---|---|
| Domains | Diffusion, Model_Architecture |
| Last Updated | 2026-02-14 01:00 GMT |
Overview
A U-Net neural network architecture adapted for 1D audio that serves as the score estimator in diffusion-based audio generation and enhancement models.
Description
The Diffusion U-Net Architecture is a convolutional encoder-decoder with skip connections, adapted from image diffusion models to operate on 1D audio signals. At each resolution level, a Transformer block provides self-attention for capturing long-range temporal dependencies. The diffusion timestep is embedded via sinusoidal encoding and injected as a modulation signal. Optional cross-attention layers enable conditioning on external signals such as codec latent embeddings.
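As a rough illustration of the sinusoidal timestep encoding mentioned above, the following is a minimal sketch in plain Python. The function name `sinusoidal_embedding`, the `max_period` default, and the sine-then-cosine layout are assumptions borrowed from the common Transformer-style positional encoding; they are not taken from the Audiocraft source, which may embed the timestep differently.

```python
import math


def sinusoidal_embedding(step: int, dim: int, max_period: float = 10000.0) -> list[float]:
    """Encode an integer diffusion step as a `dim`-length vector (hypothetical helper)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even so sines and cosines pair up")
    half = dim // 2
    # Geometrically spaced frequencies, decreasing from 1 toward 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    # First half of the vector holds sines, second half holds cosines.
    sines = [math.sin(step * f) for f in freqs]
    cosines = [math.cos(step * f) for f in freqs]
    return sines + cosines


emb = sinusoidal_embedding(step=10, dim=8)
print(len(emb))  # one vector of length `dim` per timestep
```

In a full model this vector would typically pass through a small learned projection before modulating the convolutional features; that step is omitted here because the page does not specify its form.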
Usage
Use this principle when designing score-based diffusion models for audio. It is the denoising architecture used in Multi-Band Diffusion systems, which convert discrete audio tokens to high-fidelity waveforms.
Theoretical Basis
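The U-Net estimates the score function, the gradient of the log probability density of the noisy data with respect to the input:

$$ s_\theta(x_t, t) \approx \nabla_{x_t} \log p_t(x_t) $$

where $x_t$ is the noisy audio at diffusion timestep $t$ and $\theta$ denotes the network weights.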
The architecture processes the noisy input through progressively lower resolutions (encoder), applies attention at each level, then upsamples (decoder) with skip connections from matching encoder levels.
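The encoder, decoder, and skip-connection data flow described above can be traced with a toy example. This is only a sketch under simplifying assumptions: pair averaging stands in for strided convolutions, sample repetition stands in for transposed convolutions, skips are merged by addition, and the attention blocks and timestep conditioning are left out. The names `downsample`, `upsample`, and `toy_unet` are hypothetical and do not come from Audiocraft.

```python
def downsample(x: list[float]) -> list[float]:
    """Halve the length by averaging adjacent pairs (stand-in for a strided conv)."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]


def upsample(x: list[float]) -> list[float]:
    """Double the length by repeating each sample (stand-in for a transposed conv)."""
    return [v for v in x for _ in range(2)]


def toy_unet(x: list[float], depth: int = 3) -> tuple[list[float], list[int]]:
    """Run a weight-free U-shaped pass and record the signal length at every stage."""
    if len(x) % (2 ** depth) != 0:
        raise ValueError("input length must be divisible by 2 ** depth")
    skips: list[list[float]] = []
    lengths = [len(x)]
    h = x
    # Encoder: store the features at each resolution, then move to a lower one.
    for _ in range(depth):
        skips.append(h)
        h = downsample(h)
        lengths.append(len(h))
    # A real model would apply its attention blocks here and/or inside each level.
    # Decoder: upsample, then add the skip from the matching encoder level.
    for _ in range(depth):
        h = upsample(h)
        skip = skips.pop()  # last stored skip has the same length as h
        h = [a + b for a, b in zip(h, skip)]
        lengths.append(len(h))
    return h, lengths


out, lengths = toy_unet([float(i) for i in range(16)], depth=3)
print(lengths)  # signal length at each encoder and decoder stage
```

Popping skips in reverse order is what pairs each decoder level with the encoder level of the same resolution, so the output has the same length as the input.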