Workflow:Facebookresearch Audiocraft MusicGen Text To Music Inference

From Leeroopedia
Knowledge Sources
Domains Audio_Generation, Music_Generation, Inference
Last Updated 2026-02-13 23:00 GMT

Overview

End-to-end process for generating music audio from text descriptions using pretrained MusicGen models with optional melody or style conditioning.

Description

This workflow covers the standard inference pipeline for MusicGen, the flagship text-to-music generation model in AudioCraft. It loads a pretrained model from HuggingFace Hub (or a local checkpoint), configures generation parameters (sampling strategy, duration, classifier-free guidance), runs autoregressive token generation conditioned on text descriptions, decodes tokens back to audio waveforms via EnCodec, and saves the output with loudness normalization. The workflow supports three generation modes: text-only, text+melody (chromagram conditioning), and text+style (audio style conditioning via MusicGen-Style).

Usage

Execute this workflow when you have text descriptions of desired music (e.g., "happy rock", "sad jazz") and need to generate corresponding audio waveforms. Optionally provide a melody audio file for melodic guidance or a style audio excerpt for stylistic conditioning. Requires a GPU with at least 16 GB of memory for medium-sized (1.5B parameter) models.

Execution Steps

Step 1: Environment Setup

Install the AudioCraft package and its dependencies including PyTorch 2.1.0, torchaudio, and ffmpeg. The package can be installed from PyPI as a stable release or directly from the GitHub repository for the latest version.

Key considerations:

  • Requires Python 3.9 and PyTorch 2.1.0
  • A GPU is needed for practical inference speeds; without CUDA the model can still run on CPU, but generation is very slow
  • ffmpeg must be available on the system for audio I/O
  • The xformers library is recommended for memory-efficient attention
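The prerequisites above can be checked before loading any model. The following is a minimal sketch using only the Python standard library; the helper name check_environment is hypothetical, and the default package list simply mirrors the requirements named in this step.

```python
import importlib.util
import shutil
import sys


def check_environment(packages=("torch", "torchaudio", "audiocraft", "xformers")):
    """Report which prerequisites are visible to the current interpreter.

    Hypothetical helper: it only checks that things can be found, not that
    their versions match the ones recommended above.
    """
    report = {
        # Major.minor version of the running interpreter, e.g. "3.9".
        "python": "{}.{}".format(*sys.version_info[:2]),
        # ffmpeg is a system binary, so look for it on PATH.
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
    }
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

A False entry for xformers is not fatal, since that library is only recommended; a False entry for torch, audiocraft, or ffmpeg_on_path should be fixed before continuing.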

Step 2: Load Pretrained Model

Load a pretrained MusicGen model using the high-level API. The loader fetches the language model checkpoint and the EnCodec compression model from HuggingFace Hub, builds both model components, and places them in eval mode on the target device.

Available models:

  • facebook/musicgen-small (300M) - text to music
  • facebook/musicgen-medium (1.5B) - text to music
  • facebook/musicgen-melody (1.5B) - text + melody to music
  • facebook/musicgen-large (3.3B) - text to music
  • facebook/musicgen-style (1.5B) - text + style to music

Key considerations:

  • Model weights are cached locally after first download
  • Cache location can be controlled via the AUDIOCRAFT_CACHE_DIR environment variable
  • The compression model (EnCodec) is automatically loaded alongside the language model
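A loading call might look like the sketch below. It assumes the high-level API is MusicGen.get_pretrained from audiocraft.models; the wrapper load_musicgen and the MUSICGEN_MODELS table are hypothetical names introduced here, and the table only repeats the checkpoint list above.

```python
# Checkpoint names and descriptions copied from the list above.
MUSICGEN_MODELS = {
    "facebook/musicgen-small": "300M, text to music",
    "facebook/musicgen-medium": "1.5B, text to music",
    "facebook/musicgen-melody": "1.5B, text + melody to music",
    "facebook/musicgen-large": "3.3B, text to music",
    "facebook/musicgen-style": "1.5B, text + style to music",
}


def load_musicgen(name="facebook/musicgen-small", device=None):
    """Hypothetical wrapper: validate the checkpoint name, then load it."""
    if name not in MUSICGEN_MODELS:
        raise ValueError(f"Unknown checkpoint {name!r}; expected one of {sorted(MUSICGEN_MODELS)}")
    # Imported lazily so the name check works even without AudioCraft installed.
    from audiocraft.models import MusicGen

    if device is None:
        # Let the loader pick the device itself.
        return MusicGen.get_pretrained(name)
    return MusicGen.get_pretrained(name, device=device)
```

The first call downloads the weights; later calls should reuse the local cache. Starting with the small checkpoint is a cheap way to validate the pipeline before moving to a larger model.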

Step 3: Configure Generation Parameters

Set the generation parameters that control the sampling strategy and output characteristics. These include duration, sampling method (top-k, top-p, temperature), and classifier-free guidance coefficient.

Key parameters:

  • duration: length of generated audio in seconds (default 30, max depends on model)
  • use_sampling: whether to sample or use argmax decoding
  • top_k: number of top tokens to sample from (default 250)
  • temperature: softmax temperature for controlling randomness
  • cfg_coef: classifier-free guidance strength (default 3.0)
  • cfg_coef_beta: double CFG coefficient for MusicGen-Style (text+style mode only)
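The parameters above could be applied as in the following sketch. It assumes the loaded model exposes a set_generation_params method whose keyword names match the list above; configure_generation and DEFAULT_PARAMS are hypothetical names, and the temperature value of 1.0 is an assumption, since the list does not state a default for it.

```python
# Defaults taken from the parameter list above; temperature=1.0 is an assumption.
DEFAULT_PARAMS = {
    "use_sampling": True,
    "top_k": 250,
    "temperature": 1.0,
    "duration": 30.0,
    "cfg_coef": 3.0,
}


def configure_generation(model, **overrides):
    """Hypothetical helper: merge overrides into the defaults and apply them."""
    params = {**DEFAULT_PARAMS, **overrides}
    # cfg_coef_beta is only passed when the caller supplies it (text+style mode).
    model.set_generation_params(**params)
    return params
```

For example, configure_generation(model, duration=8) requests an 8 second clip with the other settings unchanged, and configure_generation(model, cfg_coef_beta=5.0) would add the second guidance coefficient for a MusicGen-Style model (the value 5.0 is illustrative only).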

Step 4: Prepare Conditioning Inputs

Prepare the conditioning inputs based on the desired generation mode. For text-only generation, provide a list of text descriptions. For melody-conditioned generation, additionally load melody audio and convert it to the model's sample rate. For style-conditioned generation, configure the style conditioner parameters and provide a style audio excerpt.

Generation modes:

  • Text-only: call generate() with a list of description strings
  • Text+melody: call generate_with_chroma() with descriptions and melody waveform
  • Text+style: configure style conditioner, then call generate_with_chroma() with style audio
  • Unconditional: call generate_unconditional() for samples without any conditioning
  • Continuation: call generate_continuation() with a prompt audio waveform
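The choice between these modes can be sketched as a small dispatcher. The method names are the ones listed above; the argument order shown for each call is an assumption about the AudioCraft API, and run_generation is a hypothetical helper. For text+style, the style conditioner is assumed to have been configured on the model beforehand.

```python
def run_generation(model, descriptions=None, melody=None, melody_sr=None,
                   prompt=None, prompt_sr=None, num_samples=1):
    """Hypothetical dispatcher that picks a generation call from the inputs given."""
    if prompt is not None:
        if prompt_sr is None:
            raise ValueError("prompt_sr is required when a prompt waveform is given")
        # Continuation: extend an existing audio prompt, optionally guided by text.
        return model.generate_continuation(prompt, prompt_sr, descriptions)
    if melody is not None:
        if melody_sr is None:
            raise ValueError("melody_sr is required when melody or style audio is given")
        # Text+melody and text+style both go through generate_with_chroma.
        return model.generate_with_chroma(descriptions, melody, melody_sr)
    if descriptions:
        # Text-only: one output per description string.
        return model.generate(descriptions)
    # No conditioning at all.
    return model.generate_unconditional(num_samples)
```

With a loaded model, run_generation(model, ["happy rock", "sad jazz"]) would produce one waveform per description.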

Step 5: Run Token Generation

Execute the autoregressive generation loop. The language model generates discrete audio tokens conditioned on the prepared inputs. For durations exceeding the model's maximum window (typically 30 seconds), the generator uses a sliding window approach with configurable stride to produce extended sequences.
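The sliding window arithmetic can be illustrated with a simplified planner that works in seconds. This is only a sketch: the real generator operates on token frames, plan_windows is a hypothetical name, and the 18 second stride is an assumed value chosen for illustration.

```python
def plan_windows(duration, max_duration=30.0, stride=18.0):
    """Return (start, end) times in seconds for each generation window.

    Each window after the first starts `stride` seconds later than the
    previous one, so consecutive windows overlap by (max_duration - stride)
    seconds. The overlapping audio acts as the prompt for the next window.
    """
    if not 0 < stride < max_duration:
        raise ValueError("stride must be positive and smaller than max_duration")
    windows = []
    start = 0.0
    while True:
        end = min(start + max_duration, duration)
        windows.append((start, end))
        if end >= duration:
            break
        start += stride
    return windows


print(plan_windows(60))
# [(0.0, 30.0), (18.0, 48.0), (36.0, 60)]
```

A smaller stride keeps more context between windows at the cost of more generation passes; a request at or below the maximum window needs a single pass.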

What happens:

  • Text descriptions are encoded via the T5 text encoder into conditioning embeddings
  • Melody/style audio (if provided) is processed by the appropriate conditioner
  • The transformer language model autoregressively generates codec tokens across 4 codebooks
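  • The codebook interleaving pattern (delay pattern) shifts each codebook by one additional step relative to the previous one, so all 4 codebooks can be predicted in parallel at every step
  • Classifier-free guidance combines the two predictions as uncond + cfg_coef * (cond - uncond); with cfg_coef above 1 this extrapolates past the conditioned prediction rather than interpolating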
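Two of the mechanisms above, the delay pattern and classifier-free guidance, can be shown on plain Python lists. This is a toy sketch rather than the AudioCraft implementation: PAD, apply_delay, undo_delay, and cfg_logits are hypothetical names, and the guidance formula uncond + cfg_coef * (cond - uncond) is the standard form, assumed here to match what the model applies.

```python
PAD = -1  # Placeholder for positions that hold no real token (illustrative value).


def apply_delay(codes, pad=PAD):
    """Shift codebook k right by k steps.

    `codes` is a list of K rows (one per codebook), each with T tokens.
    The result has K rows of length T + K - 1.
    """
    num_codebooks = len(codes)
    num_steps = len(codes[0])
    delayed = [[pad] * (num_steps + num_codebooks - 1) for _ in range(num_codebooks)]
    for k, row in enumerate(codes):
        for t, token in enumerate(row):
            delayed[k][t + k] = token
    return delayed


def undo_delay(delayed):
    """Invert apply_delay, recovering the parallel [K][T] layout."""
    num_codebooks = len(delayed)
    num_steps = len(delayed[0]) - num_codebooks + 1
    return [row[k:k + num_steps] for k, row in enumerate(delayed)]


def cfg_logits(cond, uncond, cfg_coef=3.0):
    """Combine conditioned and unconditioned logits with guidance strength cfg_coef."""
    return [u + cfg_coef * (c - u) for c, u in zip(cond, uncond)]


codes = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
delayed = apply_delay(codes)
for row in delayed:
    print(row)
assert undo_delay(delayed) == codes
```

Reading the delayed layout column by column shows why one transformer step can emit a token for every codebook: at each step, codebook k is one frame behind codebook k - 1. Setting cfg_coef to 1.0 in cfg_logits returns the conditioned logits unchanged.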

Step 6: Decode Tokens to Audio

Convert the generated discrete tokens back into a continuous audio waveform using the EnCodec decoder. The tokens are first re-arranged from the codebook pattern back to the parallel layout, then fed through the EnCodec decoder to produce the final waveform.

Key considerations:

  • The EnCodec decoder maps 4 codebooks at 50 Hz back to 32 kHz audio
  • Output shape is [B, C, T] where B is batch size, C is channels (1 for mono, 2 for stereo)
  • Optional: MultiBand Diffusion can be used as an alternative decoder for enhanced audio quality
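The frame and sample rates above fix the relationship between token count and waveform length. The sketch below works through that arithmetic; the function names are hypothetical and the constants are the 50 Hz and 32 kHz figures from the list above.

```python
FRAME_RATE_HZ = 50        # Token frames per second, per codebook.
SAMPLE_RATE_HZ = 32_000   # Audio samples per second at the decoder output.


def samples_per_frame(sample_rate=SAMPLE_RATE_HZ, frame_rate=FRAME_RATE_HZ):
    """Number of audio samples that one token frame decodes to."""
    return sample_rate // frame_rate


def frames_for_duration(seconds, frame_rate=FRAME_RATE_HZ):
    """Number of token frames needed for a clip of the given length."""
    return int(seconds * frame_rate)


def expected_waveform_shape(batch, num_frames, channels=1):
    """Shape [B, C, T] of the decoded waveform for [B, 4, num_frames] tokens."""
    return (batch, channels, num_frames * samples_per_frame())


print(samples_per_frame())                 # 640
print(frames_for_duration(30))             # 1500
print(expected_waveform_shape(2, 1500))    # (2, 1, 960000)
```

A 30 second mono clip therefore corresponds to 1500 frames across 4 codebooks, or 6000 tokens, decoded into 960000 samples.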

Step 7: Save Output Audio

Write the generated audio waveforms to disk with appropriate normalization. The audio_write utility supports multiple normalization strategies including loudness normalization to -14 dB LUFS and peak normalization.

Key considerations:

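  • The recommended strategy is loudness normalization with compression, requested explicitly via strategy="loudness" and loudness_compressor=True; when no strategy is passed, audio_write falls back to peak normalization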
  • Output format is WAV by default
  • audio_write writes one waveform per call, so iterate over the batch and save each sample as a separate file
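Saving a batch can be sketched as a loop over the samples. The example assumes audio_write lives in audiocraft.data.audio and accepts a file stem, a [C, T] waveform, a sample rate, and the strategy and loudness_compressor keywords; save_batch and its writer parameter are hypothetical, the latter added so the loop can be exercised without AudioCraft installed.

```python
def save_batch(wavs, sample_rate, stem_prefix="sample", writer=None):
    """Hypothetical helper: write each waveform in a batch to its own file."""
    if writer is None:
        # Imported lazily so the function can be tested with a stand-in writer.
        from audiocraft.data.audio import audio_write as writer

    paths = []
    for idx, one_wav in enumerate(wavs):
        # Move tensors to the CPU before writing; plain sequences pass through.
        if hasattr(one_wav, "cpu"):
            one_wav = one_wav.cpu()
        # The stem has no extension; the writer is assumed to append one.
        paths.append(
            writer(
                f"{stem_prefix}_{idx}",
                one_wav,
                sample_rate,
                strategy="loudness",
                loudness_compressor=True,
            )
        )
    return paths
```

With a generated batch wav and a loaded model, save_batch(wav, model.sample_rate) would write sample_0, sample_1, and so on, assuming the model exposes its output rate as sample_rate.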
