Implementation:Facebookresearch Audiocraft MusicDataset init
Overview
MusicDataset is the primary dataset class for MusicGen training. It inherits from InfoAudioDataset (which inherits from AudioDataset) and adds music-specific metadata loading, text augmentation, paraphrasing, and joint embedding support. The __init__ method configures augmentation parameters, while __getitem__ loads audio segments alongside structured MusicInfo metadata.
Source Location
| Property | Value |
|---|---|
| Source file | audiocraft/data/music_dataset.py lines 187-249 |
| Base class source | audiocraft/data/audio_dataset.py lines 244-559 |
| Import | from audiocraft.data.music_dataset import MusicDataset |
| Module | audiocraft.data.music_dataset |
API
Constructor
MusicDataset.__init__(
    *args,
    info_fields_required: bool = True,
    merge_text_p: float = 0.,
    drop_desc_p: float = 0.,
    drop_other_p: float = 0.,
    joint_embed_attributes: List[str] = [],
    paraphrase_source: Optional[str] = None,
    paraphrase_p: float = 0,
    **kwargs
)
Factory Method
AudioDataset.from_meta(
root: Union[str, Path],
**kwargs
) -> AudioDataset
Instantiates the dataset from a directory containing a data.jsonl or data.jsonl.gz manifest file.
Item Access
def __getitem__(self, index) -> Tuple[torch.Tensor, MusicInfo]
Returns a tuple of the audio waveform tensor and a MusicInfo dataclass populated from the sidecar JSON.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| info_fields_required | bool | True | Whether required metadata fields must be present in sidecar JSON |
| merge_text_p | float | 0.0 | Probability of merging structured metadata into the description |
| drop_desc_p | float | 0.0 | Probability of dropping the original description during merge |
| drop_other_p | float | 0.0 | Probability of dropping individual metadata fields during merge |
| joint_embed_attributes | List[str] | [] | Attribute names for which joint embedding conditions are created |
| paraphrase_source | Optional[str] | None | Path to JSON/JSON.GZ file with paraphrased descriptions |
| paraphrase_p | float | 0 | Probability of using a paraphrase instead of original description |
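The three merge-related probabilities interact, which is easier to see in code. The following is a minimal sketch of the behaviour described in the table, not the library's augment_music_info_description(). The function name merge_description, the rng argument, and the "field: value" text format are assumptions made for illustration.

```python
import random
from typing import Dict, Optional


def merge_description(
    description: Optional[str],
    metadata: Dict[str, str],
    merge_text_p: float = 0.0,
    drop_desc_p: float = 0.0,
    drop_other_p: float = 0.0,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Hypothetical helper: fold metadata fields into a text description."""
    rng = rng or random.Random()
    # With probability (1 - merge_text_p) the description is left untouched.
    if rng.random() >= merge_text_p:
        return description
    # Each metadata field is dropped independently with probability drop_other_p.
    kept = [f"{name}: {value}" for name, value in metadata.items()
            if rng.random() >= drop_other_p]
    rng.shuffle(kept)
    # The original description is dropped with probability drop_desc_p.
    parts = [] if description is None or rng.random() < drop_desc_p else [description]
    parts.extend(kept)
    return ". ".join(parts) if parts else None
```

With all probabilities at their default of 0, the description passes through unchanged, so augmentation is opt-in.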
Inherited Key Parameters (from AudioDataset)
| Parameter | Type | Default | Description |
|---|---|---|---|
| segment_duration | Optional[float] | None | Duration of audio segments to sample (typically 30s for MusicGen) |
| sample_rate | int | 48000 | Target sample rate (32000 for MusicGen base) |
| channels | int | 2 | Target channels (1 for mono MusicGen) |
| num_samples | int | 10000 | Number of samples per epoch |
| sample_on_duration | bool | True | Sample files proportional to duration |
| sample_on_weight | bool | True | Sample files proportional to weight |
| min_segment_ratio | float | 0.5 | Minimum ratio of actual audio in a padded segment |
| shuffle | bool | True | Shuffle data each epoch |
| pad | bool | True | Pad short segments to target duration |
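As an illustration of how the constructor and inherited parameters combine, the sketch below collects keyword arguments for a mono 32 kHz setup with 30-second segments, the values the table mentions for MusicGen. The helper name musicgen_dataset_kwargs and its range check are hypothetical and not part of audiocraft.

```python
from typing import Any, Dict


def musicgen_dataset_kwargs(**overrides: Any) -> Dict[str, Any]:
    """Hypothetical helper: keyword arguments for a mono 32 kHz MusicGen setup."""
    kwargs: Dict[str, Any] = {
        "segment_duration": 30.0,   # seconds per sampled segment
        "sample_rate": 32000,       # instead of the 48000 default
        "channels": 1,              # mono instead of the stereo default
        "num_samples": 10000,
        "shuffle": True,
        "pad": True,
        "min_segment_ratio": 0.5,
    }
    kwargs.update(overrides)
    # Sanity check added by this sketch: probabilities should lie in [0, 1].
    for name in ("merge_text_p", "drop_desc_p", "drop_other_p", "paraphrase_p"):
        p = kwargs.get(name, 0.0)
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {p}")
    return kwargs
```

Assuming audiocraft is installed and root points at a manifest directory, the result could be passed along as MusicDataset.from_meta(root, **musicgen_dataset_kwargs(merge_text_p=0.25)).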
Inputs and Outputs
Inputs:
- JSONL manifest files with audio metadata (path, duration, sample_rate, and optionally title, artist, description, genre, key, bpm, moods, keywords, instrument, name)
- Sidecar .json files alongside each audio file containing music-specific metadata
- Optional paraphrase JSON file
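The on-disk layout implied by these inputs can be sketched with the standard library alone. The helper below writes a data.jsonl manifest and one sidecar file per track. The function name write_manifest and the "music" key of the input dictionaries are hypothetical, and it assumes the sidecar shares the audio file's name with a .json suffix.

```python
import json
from pathlib import Path
from typing import Dict, List


def write_manifest(root: Path, tracks: List[Dict]) -> Path:
    """Write data.jsonl under root plus one sidecar .json per track."""
    root.mkdir(parents=True, exist_ok=True)
    manifest = root / "data.jsonl"
    with manifest.open("w") as fh:
        for track in tracks:
            # Manifest line: the basic audio fields only.
            entry = {k: track[k] for k in ("path", "duration", "sample_rate")}
            fh.write(json.dumps(entry) + "\n")
            # Sidecar: assumed to sit next to the audio file with a .json suffix.
            sidecar = Path(track["path"]).with_suffix(".json")
            sidecar.parent.mkdir(parents=True, exist_ok=True)
            sidecar.write_text(json.dumps(track["music"]))
    return manifest
```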
Outputs:
- Tuple[torch.Tensor, MusicInfo] per sample, where:
  - torch.Tensor -- audio waveform of shape [C, T] (channels, time samples)
  - MusicInfo -- dataclass with all metadata fields and a to_condition_attributes() method that converts to ConditioningAttributes for the model
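For a sense of scale, the waveform shape follows from the dataset settings. This small sketch assumes segments are padded to the full target duration (pad left at True), so T is the sample rate times the segment duration. The function name expected_shape is hypothetical.

```python
from typing import Tuple


def expected_shape(channels: int, sample_rate: int, segment_duration: float) -> Tuple[int, int]:
    """Shape [C, T] of a full-length (or padded) segment."""
    return (channels, int(sample_rate * segment_duration))


print(expected_shape(1, 32000, 30.0))  # (1, 960000)
```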
MusicInfo Dataclass
@dataclass
class MusicInfo(AudioInfo):
    title: Optional[str] = None
    artist: Optional[str] = None
    key: Optional[str] = None
    bpm: Optional[float] = None
    genre: Optional[str] = None
    moods: Optional[list] = None
    keywords: Optional[list] = None
    description: Optional[str] = None
    name: Optional[str] = None
    instrument: Optional[str] = None
    self_wav: Optional[WavCondition] = None
    joint_embed: Dict[str, JointEmbedCondition] = field(default_factory=dict)
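One way to picture the conversion done by to_condition_attributes() is to route each field by its role: the waveform condition, the joint embedding conditions, and everything else as text. The sketch below uses stand-in classes (SketchAttributes and SketchMusicInfo, both hypothetical) with a subset of the fields. The routing shown, including joining list values into one string, is an assumption inferred from the field types rather than a copy of the library code.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, Optional


@dataclass
class SketchAttributes:
    """Stand-in for ConditioningAttributes (hypothetical, simplified)."""
    text: Dict[str, Optional[str]] = field(default_factory=dict)
    wav: Dict[str, Any] = field(default_factory=dict)
    joint_embed: Dict[str, Any] = field(default_factory=dict)


@dataclass
class SketchMusicInfo:
    """Stand-in for MusicInfo with a subset of its fields."""
    description: Optional[str] = None
    genre: Optional[str] = None
    moods: Optional[list] = None
    self_wav: Optional[Any] = None
    joint_embed: Dict[str, Any] = field(default_factory=dict)

    def to_condition_attributes(self) -> SketchAttributes:
        out = SketchAttributes()
        for f in fields(self):
            value = getattr(self, f.name)
            if f.name == "self_wav":
                out.wav[f.name] = value          # waveform condition
            elif f.name == "joint_embed":
                out.joint_embed.update(value)    # one entry per attribute
            else:
                # List-valued fields are flattened to a single string.
                out.text[f.name] = " ".join(value) if isinstance(value, list) else value
        return out
```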
Internal Flow
The __getitem__ method performs these steps:
- Calls parent InfoAudioDataset.__getitem__ to load audio segment and basic info
- Loads sidecar .json music metadata via MusicInfo.from_dict()
- Optionally applies paraphrasing to the description
- Applies text augmentation via augment_music_info_description() if merge_text_p > 0
- Attaches self_wav as a WavCondition (the audio itself, used for melody/style conditioning)
- Creates JointEmbedCondition entries for any requested joint embed attributes
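The metadata half of this flow (the first three steps) can be sketched without torch or audiocraft. The function load_music_metadata below is hypothetical: it works on plain dictionaries instead of MusicInfo, looks up paraphrases by audio path, and lets the basic audio info win on key clashes, all of which are assumptions. The text augmentation step and the tensor-based steps (self_wav and the joint embed conditions) are left out.

```python
import json
import random
from pathlib import Path
from typing import Dict, List, Optional


def load_music_metadata(
    audio_path: Path,
    base_info: Dict,
    paraphrases: Optional[Dict[str, List[str]]] = None,
    paraphrase_p: float = 0.0,
    rng: Optional[random.Random] = None,
) -> Dict:
    """Hypothetical sketch of the metadata half of __getitem__."""
    rng = rng or random.Random()
    info = dict(base_info)                       # step 1: info from the parent class
    sidecar = audio_path.with_suffix(".json")    # step 2: sidecar next to the audio
    if sidecar.exists():
        music = json.loads(sidecar.read_text())
        music.update(info)                       # assumption: basic info wins on clashes
        info = music
    # Step 3: swap in a paraphrase with probability paraphrase_p.
    options = (paraphrases or {}).get(str(audio_path), [])
    if options and rng.random() < paraphrase_p:
        info["description"] = rng.choice(options)
    return info
```

When no sidecar file exists, this sketch simply returns the basic info unchanged.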
Dependencies
- torch -- tensor operations, padding
- audiocraft.data.info_audio_dataset.InfoAudioDataset -- parent class
- audiocraft.modules.conditioners -- ConditioningAttributes, WavCondition, JointEmbedCondition