Principle:Facebookresearch Audiocraft Audio Dataset Preparation
Overview
Audio Dataset Preparation is the process of organizing, loading, and augmenting large-scale audio datasets with rich metadata for music generation training. In the MusicGen pipeline, raw audio files and their associated metadata (titles, artists, descriptions, genres, instruments, BPM, key) must be transformed into a format suitable for training an autoregressive language model on discrete audio tokens. This involves segment sampling from variable-length audio files, metadata-driven text augmentation, and probabilistic dropout of description and metadata text, which complements the classifier-free guidance dropout applied inside the model.
Theoretical Foundations
Segment Sampling from Variable-Length Audio
Music datasets contain audio files with widely varying durations -- from a few seconds to several minutes. Training a language model on fixed-length sequences requires extracting fixed-duration segments. The dataset preparation layer handles this by:
- Duration-weighted sampling -- Files can be sampled with probability proportional to their duration, ensuring longer tracks contribute more training data.
- Weight-based sampling -- Each file can carry an explicit weight for oversampling or undersampling specific subsets.
- Random seek positioning -- Within a selected file, the start position is chosen at random to maximize data diversity, subject to a minimum segment ratio that keeps a segment taken near the end of a file from being mostly padding.
- Epoch-based determinism -- Randomization is seeded per-epoch so that the same data order can be reproduced for debugging, while still varying across epochs for training diversity.
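The sampling steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: FileMeta, sample_segment, and all parameter names are hypothetical, and the way the seed, epoch, and index are combined is only one possible choice.

```python
import random
from dataclasses import dataclass


@dataclass
class FileMeta:
    """Hypothetical manifest entry (not the library's own class)."""
    path: str
    duration: float  # seconds
    weight: float = 1.0


def sample_segment(files, segment_duration, min_segment_ratio,
                   epoch, index, seed=0, on_duration=True, on_weight=True):
    """Pick a file and a start offset for one training example."""
    # Seeding from (seed, epoch, index) makes the draw reproducible for a
    # given epoch while still changing from one epoch to the next.
    rng = random.Random(f"{seed}-{epoch}-{index}")

    # Unnormalised sampling probability per file.
    probs = []
    for meta in files:
        p = 1.0
        if on_weight:
            p *= meta.weight
        if on_duration:
            p *= meta.duration
        probs.append(p)
    chosen = rng.choices(files, weights=probs, k=1)[0]

    # The latest allowed start keeps at least min_segment_ratio of the
    # segment filled with real audio; anything beyond that would be padding.
    max_seek = max(0.0, chosen.duration - segment_duration * min_segment_ratio)
    seek_time = rng.uniform(0.0, max_seek)
    return chosen, seek_time
```

Calling sample_segment twice with the same epoch and index returns the same file and offset, which is the property that makes a training run replayable for debugging.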
Text Augmentation for Conditioning
MusicGen uses text descriptions as conditioning input. To improve robustness and generalization, the dataset layer provides several augmentation strategies:
- Metadata merging (merge_text_p) -- With a given probability, structured metadata fields (genre, BPM, key, moods, instrument, keywords) are merged into the text description, creating richer conditioning inputs like "A happy pop song. genre: pop. bpm: 120. key: C major".
- Description dropout (drop_desc_p) -- The original description can be probabilistically dropped when metadata is merged, forcing the model to learn from structured fields alone.
- Other field dropout (drop_other_p) -- Individual metadata fields can be dropped during merging, preventing the model from relying on any single field.
- Paraphrasing -- Pre-computed paraphrases of descriptions can be substituted at a configurable probability, increasing text diversity without manual annotation.
These augmentations are critical for training models that generalize well to diverse user prompts during inference.
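A rough sketch of how the three probabilities interact is shown below. The function name augment_description and its signature are assumptions made for illustration, and the field ordering is simply the dictionary order; the actual implementation may order, format, or filter fields differently.

```python
import random


def augment_description(description, fields, merge_text_p, drop_desc_p,
                        drop_other_p, rng):
    """Return a possibly merged description (illustrative sketch only)."""
    # With probability 1 - merge_text_p, leave the description untouched.
    if rng.random() >= merge_text_p:
        return description

    # Keep each non-empty metadata field with probability 1 - drop_other_p.
    parts = [
        f"{name}: {value}"
        for name, value in fields.items()
        if value is not None and rng.random() >= drop_other_p
    ]

    # Keep the free-text description with probability 1 - drop_desc_p.
    if description is not None and rng.random() >= drop_desc_p:
        parts.insert(0, description.rstrip("."))

    merged = ". ".join(parts)
    return merged if merged else None
```

With merge_text_p=1.0 and both dropout probabilities at 0.0, the inputs "A happy pop song." and {"genre": "pop", "bpm": 120, "key": "C major"} produce the merged string quoted in the list above.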
Classifier-Free Guidance Support
The dataset does not directly implement classifier-free guidance (CFG) dropout -- that is handled in the model by its cfg_dropout and att_dropout modules. However, the dataset's text augmentation (dropping descriptions, merging metadata) complements CFG: during training the model sees conditioning text that ranges from a full description plus metadata down to a few structured fields, in addition to the fully unconditioned case that CFG dropout provides.
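To make the division of labour concrete, the following sketch shows the kind of model-side dropout that the dataset deliberately leaves out. It assumes that CFG dropout blanks every condition of a batch with some probability p; drop_all_conditions is a hypothetical helper operating on plain dictionaries, whereas the real modules operate on the model's conditioning objects and may behave differently.

```python
import random


def drop_all_conditions(batch, p, rng):
    """With probability p, replace every condition in the batch with None."""
    if rng.random() < p:
        return [{name: None for name in sample} for sample in batch]
    return batch
```

The dataset's description and field dropout happens earlier, per sample and per field, so the two mechanisms act at different points of the training step.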
Key Principles
- Manifest-driven loading -- Audio metadata is stored in JSONL manifest files (data.jsonl or data.jsonl.gz) containing paths, durations, sample rates, and optional music metadata. This decouples data discovery from data loading.
- Layered inheritance -- MusicDataset extends InfoAudioDataset, which extends AudioDataset. Base audio loading, resampling, and segment extraction are handled by the parent classes; music-specific metadata loading and augmentation are added by MusicDataset.
- Sidecar JSON metadata -- For each audio file, a companion .json file contains music-specific metadata (title, artist, genre, BPM, etc.). This is loaded on-the-fly during __getitem__.
- Stochastic augmentation -- All text augmentation is probabilistic and seeded, ensuring reproducibility while maximizing diversity.
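The manifest and sidecar conventions can be illustrated with a short standard-library sketch. The helpers load_manifest and load_sidecar are hypothetical names, and returning an empty dictionary when no sidecar exists is an assumption made here for simplicity.

```python
import gzip
import json
from pathlib import Path


def load_manifest(path):
    """Read one JSON object per line from a .jsonl or .jsonl.gz manifest."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def load_sidecar(audio_path):
    """Load the companion .json next to an audio file, or {} if absent."""
    sidecar = Path(audio_path).with_suffix(".json")
    if not sidecar.exists():
        return {}
    with open(sidecar, "r", encoding="utf-8") as f:
        return json.load(f)
```

Reading the manifest once up front and deferring the sidecar read to item access keeps start-up cheap: the manifest only has to carry what sampling needs (path and duration), while the richer metadata is fetched per item.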
Role in the MusicGen Training Pipeline
Dataset preparation is the second stage of the pipeline (after environment configuration). The prepared dataset provides:
- Audio tensors -- Resampled, channel-converted, fixed-duration segments ready for tokenization by the compression model.
- MusicInfo metadata -- Rich structured metadata converted to ConditioningAttributes for the conditioning system. Each MusicInfo object carries text fields (description, title, artist, genre), numeric fields (BPM), and optional wav conditions (self_wav for melody/style conditioning).
The dataloader yields tuples of (torch.Tensor, List[MusicInfo]) that are consumed by the solver's run_step method.
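The shape of a batch can be pictured with the sketch below, which uses plain Python lists in place of tensors so that it runs without any dependencies. MusicInfoSketch, pad_to_length, and collate are hypothetical stand-ins; the real collation stacks tensors and carries the full MusicInfo objects.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MusicInfoSketch:
    """Hypothetical stand-in for MusicInfo with a few of its fields."""
    description: Optional[str] = None
    genre: Optional[str] = None
    bpm: Optional[float] = None


def pad_to_length(wav: List[float], target_len: int) -> List[float]:
    """Right-pad (or trim) a mono segment so it has target_len samples."""
    return (wav + [0.0] * max(0, target_len - len(wav)))[:target_len]


def collate(samples: List[Tuple[List[float], MusicInfoSketch]],
            target_len: int):
    """Group per-item (wav, info) pairs into one (wavs, infos) batch."""
    wavs = [pad_to_length(wav, target_len) for wav, _ in samples]
    infos = [info for _, info in samples]
    return wavs, infos
```

The audio side of the batch is uniform in length, while the metadata side stays a list of per-item objects because its fields are optional and variable in content.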