Workflow:Facebookresearch Audiocraft MusicGen Text To Music Inference

Knowledge Sources	AudioCraft, MusicGen Docs, MusicGen Paper
Domains	Audio_Generation, Music_Generation, Inference
Last Updated	2026-02-13 23:00 GMT

Overview

End-to-end process for generating music audio from text descriptions using pretrained MusicGen models with optional melody or style conditioning.

Description

This workflow covers the standard inference pipeline for MusicGen, the flagship text-to-music generation model in AudioCraft. It loads a pretrained model from HuggingFace Hub (or a local checkpoint), configures generation parameters (sampling strategy, duration, classifier-free guidance), runs autoregressive token generation conditioned on text descriptions, decodes tokens back to audio waveforms via EnCodec, and saves the output with loudness normalization. The workflow supports three generation modes: text-only, text+melody (chromagram conditioning), and text+style (audio style conditioning via MusicGen-Style).

Usage

Execute this workflow when you have text descriptions of desired music (e.g., "happy rock", "sad jazz") and need to generate corresponding audio waveforms. Optionally provide a melody audio file for melodic guidance or a style audio excerpt for stylistic conditioning. Requires a GPU with at least 16 GB of memory for medium-sized (1.5B parameter) models.

Execution Steps

Step 1: Environment Setup

Install the AudioCraft package and its dependencies including PyTorch 2.1.0, torchaudio, and ffmpeg. The package can be installed from PyPI as a stable release or directly from the GitHub repository for the latest version.

Key considerations:

Requires Python 3.9 and PyTorch 2.1.0
A GPU is strongly recommended for inference; the model can run on CPU, but generation is much slower
ffmpeg must be available on the system for audio I/O
The xformers library is recommended for memory-efficient attention
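
A quick pre-flight check can confirm these prerequisites before any model is loaded. The sketch below uses only the standard library. The helper name check_environment is hypothetical, and the import names it probes (torch, torchaudio, audiocraft, xformers) are assumed to match the dependencies listed above.

```python
import importlib.util
import shutil
import sys


def check_environment() -> dict:
    """Report which Step 1 prerequisites appear to be present."""
    report = {
        # Interpreter version as "major.minor", e.g. "3.9".
        "python": "{}.{}".format(*sys.version_info[:2]),
        # ffmpeg is needed for audio I/O, so look for it on PATH.
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
    }
    # find_spec returns None when a top-level package is not installed.
    for name in ("torch", "torchaudio", "audiocraft", "xformers"):
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

The check only reports presence; it does not verify that the installed versions match the ones required above.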

Step 2: Load Pretrained Model

Load a pretrained MusicGen model using the high-level API. The loader fetches the language model checkpoint and the EnCodec compression model from HuggingFace Hub, builds both model components, and places them in eval mode on the target device.

Available models:

facebook/musicgen-small (300M) - text to music
facebook/musicgen-medium (1.5B) - text to music
facebook/musicgen-melody (1.5B) - text + melody to music
facebook/musicgen-large (3.3B) - text to music
facebook/musicgen-style (1.5B) - text + style to music

Key considerations:

Model weights are cached locally after first download
Cache location can be controlled via the AUDIOCRAFT_CACHE_DIR environment variable
The compression model (EnCodec) is automatically loaded alongside the language model
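
Loading can be sketched as a thin wrapper around the high-level API. The model names come from the list above; the wrapper load_model and its cache_dir argument are hypothetical conveniences, and the audiocraft import is deferred so the module can be imported on machines without the package installed.

```python
import os

# Model names and generation modes as listed in this step.
MODELS = {
    "facebook/musicgen-small": "text",
    "facebook/musicgen-medium": "text",
    "facebook/musicgen-melody": "text+melody",
    "facebook/musicgen-large": "text",
    "facebook/musicgen-style": "text+style",
}


def load_model(name: str = "facebook/musicgen-small", cache_dir=None):
    """Load a pretrained MusicGen model by name (hypothetical wrapper)."""
    if name not in MODELS:
        raise ValueError(f"unknown model name: {name!r}")
    if cache_dir is not None:
        # Assumption: the variable must be set before the first download.
        os.environ["AUDIOCRAFT_CACHE_DIR"] = str(cache_dir)
    # Deferred import: requires the audiocraft package at call time.
    from audiocraft.models import MusicGen
    return MusicGen.get_pretrained(name)
```

A typical call would be model = load_model("facebook/musicgen-small"), which downloads the weights on first use and reuses the cached copy afterwards.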

Step 3: Configure Generation Parameters

Set the generation parameters that control the sampling strategy and output characteristics. These include duration, sampling method (top-k, top-p, temperature), and classifier-free guidance coefficient.

Key parameters:

duration: length of generated audio in seconds (default 30; durations beyond the model's roughly 30-second window are produced by the sliding-window extension in Step 5)
use_sampling: whether to sample or use argmax decoding
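top_k: number of highest-probability tokens to sample from (default 250)
top_p: nucleus sampling threshold (default 0.0; when 0, top_k is used instead)
temperature: softmax temperature controlling randomness (default 1.0)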
cfg_coef: classifier-free guidance strength (default 3.0)
cfg_coef_beta: double CFG coefficient for MusicGen-Style (text+style mode only)
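
One way to keep these settings explicit is to assemble them in a small validated dictionary and pass it to set_generation_params. The helper build_generation_params is hypothetical; the defaults for use_sampling and temperature are assumptions, while the other defaults follow the list above.

```python
def build_generation_params(duration=30.0, use_sampling=True, top_k=250,
                            temperature=1.0, cfg_coef=3.0, cfg_coef_beta=None):
    """Validate and collect keyword arguments for set_generation_params."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k < 0:
        raise ValueError("top_k must be non-negative")
    params = {
        "duration": duration,
        "use_sampling": use_sampling,
        "top_k": top_k,
        "temperature": temperature,
        "cfg_coef": cfg_coef,
    }
    # Only relevant for MusicGen-Style, so omit it unless requested.
    if cfg_coef_beta is not None:
        params["cfg_coef_beta"] = cfg_coef_beta
    return params


# Usage with a loaded model (not executed here):
#   model.set_generation_params(**build_generation_params(duration=8))
```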

Step 4: Prepare Conditioning Inputs

Prepare the conditioning inputs based on the desired generation mode. For text-only generation, provide a list of text descriptions. For melody-conditioned generation, additionally load melody audio and convert it to the model's sample rate. For style-conditioned generation, configure the style conditioner parameters and provide a style audio excerpt.

Generation modes:

Text-only: call generate() with a list of description strings
Text+melody: call generate_with_chroma() with descriptions and melody waveform
Text+style: configure style conditioner, then call generate_with_chroma() with style audio
Unconditional: call generate_unconditional() for samples without any conditioning
Continuation: call generate_continuation() with a prompt audio waveform
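
The modes above map onto method calls roughly as follows. The dispatcher run_generation and its mode strings are hypothetical; the positional argument order passed to each method is an assumption about the AudioCraft API and should be checked against the installed version.

```python
def run_generation(model, mode, descriptions=None, audio=None,
                   sample_rate=None, num_samples=1):
    """Route a request to the matching generation method (sketch)."""
    if mode == "text":
        return model.generate(descriptions)
    if mode in ("melody", "style"):
        # For "style", the style conditioner is assumed to have been
        # configured on the model before this call.
        return model.generate_with_chroma(descriptions, audio, sample_rate)
    if mode == "unconditional":
        return model.generate_unconditional(num_samples)
    if mode == "continuation":
        return model.generate_continuation(audio, sample_rate, descriptions)
    raise ValueError(f"unknown generation mode: {mode!r}")


# Usage with a loaded model (not executed here):
#   wav = run_generation(model, "text", ["happy rock", "sad jazz"])
```

Keeping the dispatch in one place makes it harder to pass melody or style audio to a model that was loaded for text-only generation.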

Step 5: Run Token Generation

Execute the autoregressive generation loop. The language model generates discrete audio tokens conditioned on the prepared inputs. For durations exceeding the model's maximum window (typically 30 seconds), the generator uses a sliding window approach with configurable stride to produce extended sequences.
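
The windowing arithmetic can be illustrated without a model. The sketch below is a simplified planner, not the library's implementation: the function plan_windows is hypothetical, and the 30-second window, 18-second stride, and 50 Hz frame rate are assumed values. Each window after the first reuses the tail of the previous window as its prompt.

```python
def plan_windows(duration_s, max_window_s=30.0, stride_s=18.0, frame_rate=50):
    """Return (start_frame, end_frame) pairs covering duration_s seconds."""
    if not 0 < stride_s < max_window_s:
        raise ValueError("stride must be positive and shorter than the window")
    total = int(duration_s * frame_rate)
    window = int(max_window_s * frame_rate)
    stride = int(stride_s * frame_rate)
    windows = []
    offset = 0      # first frame of the current window
    prompt_len = 0  # frames carried over from the previous window
    while offset + prompt_len < total:
        end = min(total, offset + window)
        windows.append((offset, end))
        # Frames past the stride become the prompt of the next window.
        prompt_len = max(0, (end - offset) - stride)
        offset += stride
    return windows


print(plan_windows(60))
```

With these assumed values, a 60-second request is covered by three overlapping windows, while anything up to 30 seconds needs only one.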

What happens:

Text descriptions are encoded via the T5 text encoder into conditioning embeddings
Melody/style audio (if provided) is processed by the appropriate conditioner
The transformer language model autoregressively generates codec tokens across 4 codebooks
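The codebook interleaving (delay) pattern offsets each codebook by one step so that all codebooks can be predicted in parallel at every decoding step
Classifier-free guidance combines the conditioned and unconditioned predictions, pushing the logits toward the conditioned ones by a factor of cfg_coef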
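
Two of these steps are easy to show on plain Python lists. The sketch below is illustrative only: the function names and the PAD sentinel are hypothetical, the delay layout is a simplified version of the pattern (codebook k shifted right by k steps), and the guidance formula is the common form uncond + cfg_coef * (cond - uncond).

```python
PAD = -1  # hypothetical sentinel for positions that hold no token


def apply_delay(codes):
    """Shift codebook k right by k steps; codes is K lists of length T."""
    num_codebooks = len(codes)
    steps = len(codes[0])
    delayed = []
    for k, row in enumerate(codes):
        right_pad = num_codebooks - 1 - k
        delayed.append([PAD] * k + list(row) + [PAD] * right_pad)
    return delayed


def undo_delay(delayed):
    """Invert apply_delay, restoring the parallel K x T layout."""
    num_codebooks = len(delayed)
    steps = len(delayed[0]) - (num_codebooks - 1)
    return [row[k:k + steps] for k, row in enumerate(delayed)]


def apply_cfg(cond_logits, uncond_logits, cfg_coef=3.0):
    """Classifier-free guidance on two equally sized logit lists."""
    return [u + cfg_coef * (c - u) for c, u in zip(cond_logits, uncond_logits)]


codes = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
assert undo_delay(apply_delay(codes)) == codes
```

With cfg_coef equal to 1 the guided logits equal the conditioned logits; larger values move further away from the unconditioned prediction.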

Step 6: Decode Tokens to Audio

Convert the generated discrete tokens back into a continuous audio waveform using the EnCodec decoder. The tokens are first re-arranged from the codebook pattern back to the parallel layout, then fed through the EnCodec decoder to produce the final waveform.

Key considerations:

The EnCodec decoder maps 4 codebooks at 50 Hz back to 32 kHz audio
Output shape is [B, C, T] where B is batch size, C is channels (1 for mono, 2 for stereo)
Optional: MultiBand Diffusion can be used as an alternative decoder for enhanced audio quality
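
The frame and sample rates above fix the relationship between token count and waveform length. The sketch below works that arithmetic out for a mono model; the function expected_waveform_shape is hypothetical and performs no decoding itself.

```python
FRAME_RATE_HZ = 50       # token frames per second, per the text above
SAMPLE_RATE_HZ = 32_000  # output audio sample rate
NUM_CODEBOOKS = 4

# Each token frame expands to this many audio samples.
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ // FRAME_RATE_HZ


def expected_waveform_shape(token_shape):
    """Map a [B, K, T] token shape to the mono [B, 1, samples] shape."""
    batch, codebooks, frames = token_shape
    if codebooks != NUM_CODEBOOKS:
        raise ValueError(f"expected {NUM_CODEBOOKS} codebooks, got {codebooks}")
    return (batch, 1, frames * SAMPLES_PER_FRAME)


# 30 seconds at 50 Hz is 1500 frames per codebook.
print(expected_waveform_shape((2, 4, 1500)))
```

Comparing this expected shape with the decoder output is a cheap sanity check that the tokens were rearranged correctly before decoding.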

Step 7: Save Output Audio

Write the generated audio waveforms to disk with appropriate normalization. The audio_write utility supports multiple normalization strategies including loudness normalization to -14 dB LUFS and peak normalization.

Key considerations:

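The documented examples use loudness normalization with compression (strategy="loudness", loudness_compressor=True); audio_write itself defaults to peak normalization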
Output format is WAV by default
Each sample in the batch is saved as a separate file
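
Saving a batch can be sketched as a loop over samples. The helpers output_stems and save_batch and the file naming scheme are hypothetical; the audio_write call passes strategy="loudness" and loudness_compressor=True explicitly, and its import is deferred so the module loads without audiocraft installed.

```python
from pathlib import Path


def output_stems(num_samples, out_dir="outputs", prefix="sample"):
    """Build one extension-less output path per sample in the batch."""
    return [str(Path(out_dir) / f"{prefix}_{i}") for i in range(num_samples)]


def save_batch(wavs, sample_rate, out_dir="outputs", prefix="sample"):
    """Write each [C, T] waveform in wavs to its own file (sketch)."""
    # Deferred import: requires the audiocraft package at call time.
    from audiocraft.data.audio import audio_write
    stems = output_stems(len(wavs), out_dir, prefix)
    for stem, one_wav in zip(stems, wavs):
        # audio_write is assumed to append the format suffix itself.
        audio_write(stem, one_wav.cpu(), sample_rate,
                    strategy="loudness", loudness_compressor=True)
    return stems


# Usage with generated audio (not executed here):
#   save_batch(wav, model.sample_rate)
```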
