
Principle:Facebookresearch Audiocraft Training Environment Configuration

From Leeroopedia

Overview

Training Environment Configuration addresses the challenge of running large-scale distributed machine learning training across heterogeneous compute clusters. In the context of the MusicGen training pipeline, the training environment must be correctly configured before any training can commence -- this includes detecting the cluster type (e.g., SLURM-managed clusters, local workstations, macOS development machines), resolving filesystem paths for experiment tracking (via Dora), and mapping dataset file paths so that the same manifest files can be reused across different storage systems.

The core idea is to decouple environment-specific configuration (where checkpoints live, where datasets are stored, which SLURM partitions to use) from experiment-specific configuration (model architecture, learning rate, batch size). This separation allows researchers to write a single experiment definition that runs unchanged on a local workstation, a small GPU cluster, or a large-scale SLURM-managed datacenter.

Theoretical Foundations

Cluster Abstraction

Modern ML training often spans multiple environments: a researcher may develop locally, test on a small cluster, then launch full training on a large SLURM-managed cluster. Each environment differs in:

  • Filesystem layouts -- checkpoint directories, dataset mount points, and reference file locations differ across clusters.
  • Job schedulers -- SLURM parameters (partitions, exclusion lists, memory limits) vary by team and cluster.
  • Team-level policies -- different research teams may share a cluster but have separate storage quotas and partition allocations.

The environment configuration pattern solves this by:

  1. Auto-detecting the cluster type at runtime (inspecting environment variables like SLURM_JOB_ID or platform detection).
  2. Loading team-specific YAML configs that map cluster types to concrete paths and partition names.
  3. Exposing a singleton interface so all downstream components (Dora, the solver, the dataset loader) query the same resolved paths.
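The auto-detection step above can be sketched as follows. This is a minimal illustration, not the library's actual implementation: the `ClusterType` values and the exact environment variables consulted are assumptions, though `AUDIOCRAFT_CLUSTER` is the override variable named later in this page and `SLURM_JOB_ID` is a standard SLURM variable.

```python
import os
import sys
from enum import Enum

class ClusterType(Enum):
    SLURM = "slurm"
    LOCAL = "local"
    DARWIN = "darwin"

def detect_cluster_type() -> ClusterType:
    """Infer the cluster type from the runtime environment.

    An explicit override (AUDIOCRAFT_CLUSTER) always wins; otherwise we
    fall back to SLURM environment variables and platform detection.
    """
    override = os.environ.get("AUDIOCRAFT_CLUSTER")
    if override:
        return ClusterType(override)
    # SLURM exports SLURM_JOB_ID (among others) inside allocated jobs.
    if "SLURM_JOB_ID" in os.environ or "SLURM_NODELIST" in os.environ:
        return ClusterType.SLURM
    if sys.platform == "darwin":  # macOS development machine
        return ClusterType.DARWIN
    return ClusterType.LOCAL
```

The override-first ordering matters: it lets unusual setups (e.g., a SLURM login node used for local debugging) force a specific cluster type without editing any config file.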

Dora Experiment Management

Dora is Meta's experiment management framework built on top of Hydra. It assigns each unique configuration a deterministic signature (sig), stores artifacts in a structured directory tree, and supports history replay for experiment resumption. The environment configuration feeds Dora by providing:

  • The dora_dir -- the root directory where Dora stores all experiment folders (xps/), shared artifacts, and grid results.
  • The reference_dir -- a shared directory for pretrained models, evaluation checkpoints (e.g., VGGish for FAD), and other reference files, all addressed via the //reference path prefix.
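A sketch of how such directories might be resolved from a team config, with an environment-variable override. The `TEAM_CONFIG` contents and the specific paths are hypothetical; `AUDIOCRAFT_DORA_DIR` is the override variable named under Key Principles below.

```python
import os

# Hypothetical team config, mirroring what a config/teams/<team>.yaml entry
# might contain: each cluster type maps to concrete directories.
TEAM_CONFIG = {
    "slurm": {
        "dora_dir": "/checkpoint/shared/audiocraft/outputs",
        "reference_dir": "/checkpoint/shared/audiocraft/reference",
    },
    "local": {
        "dora_dir": os.path.expanduser("~/audiocraft/outputs"),
        "reference_dir": os.path.expanduser("~/audiocraft/reference"),
    },
}

def get_dora_dir(cluster: str) -> str:
    """Resolve Dora's root directory: the AUDIOCRAFT_DORA_DIR environment
    variable, if set, overrides the team default for the given cluster."""
    return os.environ.get("AUDIOCRAFT_DORA_DIR", TEAM_CONFIG[cluster]["dora_dir"])
```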

Dataset Path Mapping

When training data is stored at different paths across clusters (e.g., /datasets/music/ on one cluster vs. /mnt/shared/music/ on another), the environment provides dataset mappers -- regular expression substitution rules declared in the team YAML config that transparently rewrite file paths in manifest files at load time.
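The regex-substitution mechanism can be illustrated with a small sketch, using the two example mount points from the paragraph above. The rule list and function name here are hypothetical stand-ins for the mappers a team YAML config would declare.

```python
import re

# Hypothetical mapper rules: (pattern, replacement) pairs applied to every
# file path read from a dataset manifest at load time.
DATASET_MAPPERS = [
    (re.compile(r"^/datasets/music/"), "/mnt/shared/music/"),
]

def map_dataset_path(path: str) -> str:
    """Rewrite a manifest path for the current cluster's filesystem.

    Paths that match no rule pass through unchanged, so the same manifest
    file works on the cluster whose paths it was originally written for.
    """
    for pattern, replacement in DATASET_MAPPERS:
        path = pattern.sub(replacement, path)
    return path
```

Because the rewrite happens at load time, the manifests themselves stay byte-identical across clusters, which keeps them easy to version and share.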

Key Principles

  • Singleton pattern -- The environment is instantiated once and cached. All access goes through the class-level instance() method, ensuring consistent configuration across the entire training process.
  • Convention over configuration -- Sensible defaults are loaded from config/teams/default.yaml. Environment variables (e.g., AUDIOCRAFT_TEAM, AUDIOCRAFT_DORA_DIR) can override any default, but are not required.
  • Cluster auto-detection -- The cluster type is inferred automatically and only needs to be overridden in unusual setups via AUDIOCRAFT_CLUSTER.
  • Path indirection via //reference -- Paths in configuration files can use the //reference prefix, which is resolved at runtime to the cluster-specific reference directory. This allows portable config files.
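The //reference indirection amounts to a simple prefix expansion, sketched below. The concrete reference directory is a made-up example; in practice it would come from the resolved team config for the detected cluster.

```python
from pathlib import Path

# Hypothetical cluster-specific reference directory (normally supplied by
# the resolved team config, not hard-coded).
REFERENCE_DIR = Path("/checkpoint/shared/audiocraft/reference")

def resolve_reference_path(path: str) -> Path:
    """Expand the //reference prefix to the cluster's reference directory.

    Paths without the prefix are returned untouched, so config files can
    freely mix portable //reference paths and absolute local paths.
    """
    prefix = "//reference"
    if path.startswith(prefix):
        return REFERENCE_DIR / path[len(prefix):].lstrip("/")
    return Path(path)
```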

Role in the MusicGen Training Pipeline

The environment configuration is the first component initialized in the MusicGen training pipeline. Before any training code runs:

  1. AudioCraftEnvironment is initialized (typically triggered by train.py importing it to set main.dora.dir).
  2. Dora uses the resolved dora_dir to create or locate the experiment folder.
  3. Dataset paths in manifests are transparently rewritten via dataset mappers.
  4. Reference paths (e.g., for pretrained EnCodec checkpoints, FAD model checkpoints) are resolved.

Without correct environment configuration, subsequent pipeline stages (dataset loading, model instantiation, checkpoint saving) cannot find the files they need.
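The initialization order above hinges on the singleton pattern from Key Principles: the environment is resolved once, and every later stage queries the same cached instance. A minimal sketch, assuming made-up defaults (the real AudioCraftEnvironment in audiocraft differs in detail):

```python
import os
import threading

class EnvSketch:
    """Singleton sketch of the environment object (hypothetical; stands in
    for AudioCraftEnvironment). Resolved once, shared by all components."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self) -> None:
        # Step 1: pick the cluster (explicit override, then auto-detection).
        self.cluster = os.environ.get(
            "AUDIOCRAFT_CLUSTER",
            "slurm" if "SLURM_JOB_ID" in os.environ else "local")
        # Step 2: resolve the Dora root used for experiment folders.
        self.dora_dir = os.environ.get(
            "AUDIOCRAFT_DORA_DIR", "/tmp/audiocraft/outputs")

    @classmethod
    def instance(cls) -> "EnvSketch":
        # Double-checked locking keeps instantiation safe under threads.
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls()
        return cls._instance
```

Because Dora, the solver, and the dataset loader all call `instance()`, they are guaranteed to see one consistent set of resolved paths for the lifetime of the training process.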
