Principle: facebookresearch/audiocraft Training Environment Configuration
Overview
Training Environment Configuration addresses the challenge of running large-scale distributed machine learning training across heterogeneous compute clusters. In the context of the MusicGen training pipeline, the training environment must be correctly configured before any training can commence -- this includes detecting the cluster type (e.g., SLURM-managed clusters, local workstations, macOS development machines), resolving filesystem paths for experiment tracking (via Dora), and mapping dataset file paths so that the same manifest files can be reused across different storage systems.
The core idea is to decouple environment-specific configuration (where checkpoints live, where datasets are stored, which SLURM partitions to use) from experiment-specific configuration (model architecture, learning rate, batch size). This separation allows researchers to write a single experiment definition that runs unchanged on a local workstation, a small GPU cluster, or a large-scale SLURM-managed datacenter.
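The split can be pictured as two configuration layers merged at launch time. The sketch below is illustrative only; the dictionary keys are assumptions for the example, not AudioCraft's actual config schema:

```python
# Environment config answers "where" (differs per cluster); experiment
# config answers "what" (identical everywhere). All keys are illustrative.
env_cfg = {
    "dora_dir": "/checkpoint/me/audiocraft",  # cluster-specific
    "partition": "my_team_partition",         # cluster-specific
}
xp_cfg = {
    "model": "musicgen_small",  # portable across clusters
    "lr": 1e-4,
    "batch_size": 64,
}
# The launcher merges the two layers; the experiment definition itself
# never hard-codes paths or scheduler settings.
run_cfg = {**env_cfg, **xp_cfg}
print(run_cfg["dora_dir"], run_cfg["model"])
```

Because the experiment layer carries no paths, the same `xp_cfg` runs unchanged wherever the environment layer points it.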
Theoretical Foundations
Cluster Abstraction
Modern ML training often spans multiple environments: a researcher may develop locally, test on a small cluster, then launch full training on a large SLURM-managed cluster. Each environment has different:
- Filesystem layouts -- checkpoint directories, dataset mount points, and reference file locations differ across clusters.
- Job schedulers -- SLURM parameters (partitions, exclusion lists, memory limits) vary by team and cluster.
- Team-level policies -- different research teams may share a cluster but have separate storage quotas and partition allocations.
The environment configuration pattern solves this by:
- Auto-detecting the cluster type at runtime (inspecting environment variables like `SLURM_JOB_ID`, or using platform detection).
- Loading team-specific YAML configs that map cluster types to concrete paths and partition names.
- Exposing a singleton interface so all downstream components (Dora, the solver, the dataset loader) query the same resolved paths.
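The auto-detection step can be sketched as follows. This is a minimal illustration of the pattern, not AudioCraft's exact logic; the returned cluster names are assumptions:

```python
import os
import sys

def detect_cluster() -> str:
    """Best-effort cluster detection (illustrative sketch)."""
    # An explicit override always wins, for unusual setups.
    override = os.environ.get("AUDIOCRAFT_CLUSTER")
    if override:
        return override
    # SLURM exports SLURM_JOB_ID inside allocated jobs.
    if "SLURM_JOB_ID" in os.environ:
        return "slurm"
    # macOS development machines.
    if sys.platform == "darwin":
        return "darwin"
    return "local"
```

The detected name is then used as a key into the team YAML config to pick the matching paths and partitions.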
Dora Experiment Management
Dora is Meta's experiment management framework built on top of Hydra. It assigns each unique configuration a deterministic signature (sig), stores artifacts in a structured directory tree, and supports history replay for experiment resumption. The environment configuration feeds Dora by providing:
- The `dora_dir` -- the root directory where Dora stores all experiment folders (`xps/`), shared artifacts, and grid results.
- The `reference_dir` -- a shared directory for pretrained models, evaluation checkpoints (e.g., VGGish for FAD), and other reference files referenced via the `//reference` path prefix.
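The resulting directory layout can be sketched as below. The signature value and helper name are illustrative (a real Dora `sig` is a hash derived from the experiment configuration):

```python
from pathlib import Path

# Hypothetical sketch of a Dora-style layout: each experiment signature
# gets its own folder under xps/ inside the resolved dora_dir.
def xp_folder(dora_dir: str, sig: str) -> Path:
    return Path(dora_dir) / "xps" / sig

print(xp_folder("/checkpoint/me/audiocraft", "abc123de"))
```

Because the signature is deterministic for a given configuration, re-launching the same experiment resolves to the same folder, which is what enables resumption.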
Dataset Path Mapping
When training data is stored at different paths across clusters (e.g., /datasets/music/ on one cluster vs. /mnt/shared/music/ on another), the environment provides dataset mappers -- regular expression substitution rules declared in the team YAML config that transparently rewrite file paths in manifest files at load time.
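A dataset mapper reduces to a list of regex substitution rules applied to every manifest path. The rules and paths below are illustrative, standing in for what the team YAML would declare:

```python
import re

# Illustrative mapper rules: on this cluster, manifests written against
# /datasets/music/ are rewritten to the local mount point at load time.
DATASET_MAPPERS = [
    (re.compile(r"^/datasets/music/"), "/mnt/shared/music/"),
]

def map_dataset_path(path: str) -> str:
    for pattern, replacement in DATASET_MAPPERS:
        path = pattern.sub(replacement, path)
    return path

print(map_dataset_path("/datasets/music/track_001.wav"))
# -> /mnt/shared/music/track_001.wav
```

Paths that match no rule pass through unchanged, so a single manifest can mix entries that need rewriting with entries that do not.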
Key Principles
- Singleton pattern -- The environment is instantiated once and cached. All access goes through the class-level `instance()` method, ensuring consistent configuration across the entire training process.
- Convention over configuration -- Sensible defaults are loaded from `config/teams/default.yaml`. Environment variables (e.g., `AUDIOCRAFT_TEAM`, `AUDIOCRAFT_DORA_DIR`) can override any default, but are not required.
- Cluster auto-detection -- The cluster type is inferred automatically and only needs to be overridden in unusual setups via `AUDIOCRAFT_CLUSTER`.
- Path indirection via `//reference` -- Paths in configuration files can use the `//reference` prefix, which is resolved at runtime to the cluster-specific reference directory. This keeps config files portable across clusters.
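The singleton and the `//reference` indirection fit together as in the sketch below. Class, method, and path names are illustrative stand-ins, not AudioCraft's actual implementation:

```python
class TrainingEnv:
    """Minimal sketch of the singleton + //reference pattern."""
    _instance = None

    def __init__(self, reference_dir: str = "/shared/reference"):
        # In the real system this value would come from the team YAML,
        # optionally overridden by an environment variable.
        self.reference_dir = reference_dir

    @classmethod
    def instance(cls):
        # Instantiate once, cache, and hand every caller the same object.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def resolve_reference_path(self, path: str) -> str:
        # '//reference/foo/bar' -> '<reference_dir>/foo/bar'
        prefix = "//reference/"
        if path.startswith(prefix):
            return self.reference_dir + "/" + path[len(prefix):]
        return path

env = TrainingEnv.instance()
assert env is TrainingEnv.instance()  # every caller sees the same config
print(env.resolve_reference_path("//reference/fad/vggish.pt"))
# -> /shared/reference/fad/vggish.pt
```

Because the prefix is resolved only at runtime, the same config file containing `//reference/...` paths works on any cluster whose environment supplies a valid reference directory.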
Role in the MusicGen Training Pipeline
The environment configuration is the first component initialized in the MusicGen training pipeline. Before any training code runs:
- `AudioCraftEnvironment` is initialized (typically triggered by `train.py` importing it to set `main.dora.dir`).
- Dora uses the resolved `dora_dir` to create or locate the experiment folder.
- Dataset paths in manifests are transparently rewritten via dataset mappers.
- Reference paths (e.g., for pretrained EnCodec checkpoints, FAD model checkpoints) are resolved.
Without correct environment configuration, subsequent pipeline stages (dataset loading, model instantiation, checkpoint saving) cannot find the files they need.