Principle:Facebookresearch Audiocraft Audio Dataset Preparation

From Leeroopedia

Overview

Audio Dataset Preparation is the process of organizing, loading, and augmenting large-scale audio datasets with rich metadata for music generation training. In the MusicGen pipeline, raw audio files and their associated metadata (titles, artists, descriptions, genres, instruments, BPM, key) must be transformed into a format suitable for training an autoregressive language model on discrete audio tokens. This involves segment sampling from variable-length audio files, metadata-driven text augmentation, and probabilistic dropout of conditioning information to support classifier-free guidance during inference.

Theoretical Foundations

Segment Sampling from Variable-Length Audio

Music datasets contain audio files with widely varying durations -- from a few seconds to several minutes. Training a language model on fixed-length sequences requires extracting fixed-duration segments. The dataset preparation layer handles this by:

  • Duration-weighted sampling -- Files can be sampled with probability proportional to their duration, ensuring longer tracks contribute more training data.
  • Weight-based sampling -- Each file can carry an explicit weight for oversampling or undersampling specific subsets.
  • Random seek positioning -- Within a selected file, the start position is chosen at random to maximize data diversity. A minimum segment ratio caps how late the start can fall, so that a segment taken near the end of a file is not mostly padding.
  • Epoch-based determinism -- Randomization is seeded per-epoch so that the same data order can be reproduced for debugging, while still varying across epochs for training diversity.
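The four sampling rules above can be sketched with the standard library alone. This is a minimal illustration, not Audiocraft's code: the function name `sample_segment`, the `base_seed` argument, the seed-mixing arithmetic, and the dict keys (`path`, `duration`, `weight`) are assumptions made for the example.

```python
import random

def sample_segment(files, segment_duration, epoch, index,
                   sample_on_duration=True, min_segment_ratio=0.5, base_seed=0):
    """Pick a file and a start offset, reproducibly for a given (epoch, index).

    files: list of dicts with 'path', 'duration' (seconds) and optional 'weight'.
    All names here are hypothetical; they only mirror the concepts in the text.
    """
    # Epoch-based determinism: same (epoch, index) gives the same draw,
    # while a different epoch gives a different one.
    rng = random.Random(base_seed + epoch * 1_000_003 + index)

    # Explicit weight, optionally multiplied by duration.
    weights = []
    for f in files:
        w = f.get("weight", 1.0)
        if sample_on_duration:
            w *= f["duration"]
        weights.append(w)
    chosen = rng.choices(files, weights=weights, k=1)[0]

    # Latest start that still leaves min_segment_ratio of the segment
    # covered by real audio; files shorter than that start at 0.
    max_seek = max(0.0, chosen["duration"] - segment_duration * min_segment_ratio)
    seek_time = rng.uniform(0.0, max_seek)
    return chosen["path"], seek_time
```

Calling `sample_segment(files, 30.0, epoch=0, index=3)` twice returns the same pair, which is what makes a given epoch replayable for debugging.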

Text Augmentation for Conditioning

MusicGen uses text descriptions as conditioning input. To improve robustness and generalization, the dataset layer provides several augmentation strategies:

  • Metadata merging (merge_text_p) -- With a given probability, structured metadata fields (genre, BPM, key, moods, instrument, keywords) are merged into the text description, creating richer conditioning inputs like "A happy pop song. genre: pop. bpm: 120. key: C major".
  • Description dropout (drop_desc_p) -- The original description can be probabilistically dropped when metadata is merged, forcing the model to learn from structured fields alone.
  • Other field dropout (drop_other_p) -- Individual metadata fields can be dropped during merging, preventing the model from relying on any single field.
  • Paraphrasing -- Pre-computed paraphrases of descriptions can be substituted at a configurable probability, increasing text diversity without manual annotation.

These augmentations expose the model to many phrasings of the same conditioning information, which helps it generalize to the varied prompts users write at inference time.
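The interaction between merge_text_p, drop_desc_p, and drop_other_p can be sketched as follows. This is an illustrative reading of the bullets above, not the library's implementation: the function name `augment_description`, the default probabilities, the `field: value` formatting, and the fallback when nothing is kept are all assumptions.

```python
import random

def augment_description(description, metadata, rng,
                        merge_text_p=0.25, drop_desc_p=0.5, drop_other_p=0.5):
    """Probabilistically merge structured metadata into a text description.

    metadata: dict of field name -> value (None means the field is missing).
    rng: a seeded random.Random, so the augmentation is reproducible.
    """
    # With probability 1 - merge_text_p, leave the description untouched.
    if rng.random() >= merge_text_p:
        return description

    # Drop each available metadata field independently with drop_other_p.
    kept = [(name, value) for name, value in metadata.items()
            if value is not None and rng.random() >= drop_other_p]
    parts = [f"{name}: {value}" for name, value in kept]

    # Keep the original description with probability 1 - drop_desc_p.
    if description and rng.random() >= drop_desc_p:
        parts.insert(0, description.rstrip("."))

    # Assumption: if everything was dropped, fall back to the original text.
    return ". ".join(parts) if parts else description
```

With `merge_text_p=1.0` and both dropout probabilities at `0.0`, the input "A happy pop song." with genre, bpm, and key fields yields the merged string quoted in the first bullet.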

Classifier-Free Guidance Support

The dataset does not directly implement classifier-free guidance (CFG) dropout -- that is handled by the model's cfg_dropout and att_dropout methods. The dataset's text augmentation (dropping descriptions, merging metadata) complements CFG: during training the model sees conditioning that ranges from a full description with metadata down to structured fields alone, in addition to the fully unconditioned examples that CFG dropout produces.
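To make the division of labor concrete, here is a generic sketch of what condition dropout for CFG does at the batch level. It is not the model's cfg_dropout implementation; `drop_all_conditions` and the dict-of-strings representation are hypothetical stand-ins.

```python
import random

def drop_all_conditions(batch_conditions, p, rng):
    """With probability p, null out every condition for the whole batch.

    batch_conditions: list of dicts mapping condition name -> text (or None).
    Returning None values stands in for 'unconditioned' training examples.
    """
    if rng.random() < p:
        return [{name: None for name in cond} for cond in batch_conditions]
    return batch_conditions
```

The dataset-side augmentation runs before this step and changes what the text says; this step decides whether the model sees any text at all.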

Key Principles

  • Manifest-driven loading -- Audio metadata is stored in JSONL manifest files (data.jsonl or data.jsonl.gz) containing paths, durations, sample rates, and optional music metadata. This decouples data discovery from data loading.
  • Layered inheritance -- MusicDataset extends InfoAudioDataset, which extends AudioDataset. Base audio loading, resampling, and segment extraction are handled by the parent classes; music-specific metadata loading and augmentation are added by MusicDataset.
  • Sidecar JSON metadata -- For each audio file, a companion .json file contains music-specific metadata (title, artist, genre, BPM, etc.). This is loaded on-the-fly during __getitem__.
  • Stochastic augmentation -- All text augmentation is probabilistic and seeded, ensuring reproducibility while maximizing diversity.
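Manifest-driven loading and sidecar metadata lookup can be sketched with the standard library. The helper names `read_manifest` and `load_sidecar`, the example manifest fields, and the empty-dict fallback for a missing sidecar are assumptions for illustration; only the file naming (data.jsonl, data.jsonl.gz, companion .json) comes from the text above.

```python
import gzip
import json
from pathlib import Path

def read_manifest(manifest_path):
    """Read a JSONL manifest (optionally gzipped) into a list of dicts."""
    path = Path(manifest_path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as fh:
        # One JSON object per non-empty line.
        return [json.loads(line) for line in fh if line.strip()]

def load_sidecar(audio_path):
    """Load the companion .json next to an audio file, if it exists."""
    sidecar = Path(audio_path).with_suffix(".json")
    if not sidecar.exists():
        return {}  # assumption: missing metadata is treated as empty
    with open(sidecar, "r", encoding="utf-8") as fh:
        return json.load(fh)
```

Reading the manifest once up front and deferring the sidecar read to item access keeps startup cheap: the manifest provides the paths and durations needed for sampling, and the per-file metadata is only read for files that are actually drawn.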

Role in the MusicGen Training Pipeline

Dataset preparation is the second stage of the pipeline (after environment configuration). The prepared dataset provides:

  1. Audio tensors -- Resampled, channel-converted, fixed-duration segments ready for tokenization by the compression model.
  2. MusicInfo metadata -- Rich structured metadata converted to ConditioningAttributes for the conditioning system. Each MusicInfo object carries text fields (description, title, artist, genre), numeric fields (BPM), and optional wav conditions (self_wav for melody/style conditioning).

The dataloader yields tuples of (torch.Tensor, List[MusicInfo]) that are consumed by the solver's run_step method.
