Environment:Facebookresearch Audiocraft Python PyTorch CUDA Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning, Audio_Generation |
| Last Updated | 2026-02-13 23:00 GMT |
Overview
Linux environment with Python 3.8+, PyTorch 2.1.0 (CUDA-enabled), torchaudio, and system-level audio libraries (ffmpeg, libsndfile) for running AudioCraft inference and training.
Description
This environment provides the core runtime for all AudioCraft operations, including MusicGen inference, EnCodec compression training, JASCO generation, and model export. It is built on PyTorch 2.1.0, and a CUDA-enabled build is expected: the first line of requirements.txt reads "please make sure you have already a pytorch install that is cuda enabled". The environment includes audio processing libraries (torchaudio, av/ffmpeg, soundfile, librosa), deep learning utilities (einops, flashy, hydra-core, transformers), and evaluation tools (torchmetrics, pesq, pystoi).
The CI environment uses Python 3.9 on Ubuntu with system packages libsndfile1-dev and ffmpeg installed via apt-get. CPU-only execution is partially supported (inference only, with automatic dtype fallback to float32), but training requires NVIDIA GPU hardware.
Usage
Use this environment for all AudioCraft operations. It is the mandatory prerequisite for every Implementation page in this wiki. Without this environment configured, no AudioCraft code can execute.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+ recommended) | macOS (Darwin) supported for local development only |
| Hardware | NVIDIA GPU with CUDA support | CPU-only for inference with float32 dtype; training requires GPU |
| Python | >= 3.8.0 | CI uses Python 3.9; defined in setup.py REQUIRES_PYTHON |
| System Packages | ffmpeg, libsndfile1-dev | Required for audio I/O; installed via apt-get |
| Disk | 10GB+ for packages, 50GB+ for models | Pretrained models downloaded from HuggingFace Hub |
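The requirements in the table can be partly checked before installing anything. The following is a minimal sketch using only the standard library; `preflight` is a hypothetical helper, not part of AudioCraft, and it covers only the Python version and the presence of `ffmpeg` on `PATH` (it does not detect libsndfile or a GPU).

```python
import shutil
import sys

# Mirrors REQUIRES_PYTHON = '>=3.8.0' from setup.py.
MIN_PYTHON = (3, 8)


def preflight(version_info=sys.version_info, which=shutil.which):
    """Return a list of problems found; an empty list means the checks passed.

    Both arguments are injectable so the function can be exercised
    without changing the real interpreter or PATH.
    """
    problems = []
    major_minor = tuple(version_info[:2])
    if major_minor < MIN_PYTHON:
        problems.append("Python %d.%d found, need >= 3.8" % major_minor)
    if which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    return problems


# Report anything that looks wrong on the current machine.
for problem in preflight():
    print("WARNING:", problem)
```

An empty result only means these two checks passed; the package installation steps still apply.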
Dependencies
System Packages
- ffmpeg: Required for audio decoding/encoding via PyAV
- libsndfile1-dev: Required by soundfile for .flac/.ogg reading
Python Packages
- torch == 2.1.0
- torchaudio >= 2.0.0, < 2.1.2
- torchvision == 0.16.0
- torchtext == 0.16.0
- av == 11.0.0
- einops
- flashy >= 0.0.1
- hydra-core >= 1.1
- hydra_colorlog
- julius
- num2words
- numpy < 2.0.0
- sentencepiece
- spacy == 3.7.6
- huggingface_hub
- tqdm
- transformers >= 4.31.0
- xformers < 0.0.23
- demucs
- librosa
- soundfile
- gradio
- torchmetrics
- encodec
- protobuf
- pesq
- pystoi
- torchdiffeq
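After installation, a few of the stricter pins above can be compared against what is actually installed. This is a rough sketch, not an AudioCraft utility: `parse`, `check_pins`, and `installed_versions` are hypothetical helpers, the parser is deliberately simplistic (it keeps only the leading numeric release segments and drops suffixes such as `+cu121`), and only four of the pinned packages are covered.

```python
import re
from importlib import metadata


def parse(version):
    """Turn '2.1.0+cu121' into (2, 1, 0); non-numeric suffixes are dropped."""
    numbers = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    # Pad so that '2.1' compares equal to '2.1.0'.
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


# A subset of the pins listed on this page.
PINS = {
    "torch": lambda v: v == (2, 1, 0),
    "torchaudio": lambda v: (2, 0, 0) <= v < (2, 1, 2),
    "numpy": lambda v: v < (2, 0, 0),
    "transformers": lambda v: v >= (4, 31, 0),
}


def check_pins(installed):
    """Return the sorted names in `installed` whose version violates a pin."""
    return sorted(
        name
        for name, is_ok in PINS.items()
        if name in installed and not is_ok(parse(installed[name]))
    )


def installed_versions(names):
    """Look up installed versions, skipping packages that are absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found


print(check_pins(installed_versions(PINS)))
```

For anything beyond a quick sanity check, a real version parser such as the one in the `packaging` library is a safer choice than this tuple comparison.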
Optional Extras
- audioseal: Required for watermarking (pip install audiocraft[wm])
- coverage, flake8, mypy, pdoc3, pytest: Dev tools (pip install audiocraft[dev])
Credentials
No API keys or credentials are required for the core environment. Model weights are downloaded from public HuggingFace Hub repositories (e.g., facebook/musicgen-small).
Optional:
AUDIOCRAFT_CACHE_DIR: Override default cache directory for downloaded model checkpoints.
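As an illustration of the override, the variable can be set from Python before any model is loaded. The sketch below is an assumption about usage rather than AudioCraft's own code: `resolve_cache_dir` is a hypothetical helper that only reports what the variable currently points to, and the `/tmp/audiocraft_cache` path is an arbitrary example.

```python
import os
from pathlib import Path


def resolve_cache_dir(env=os.environ):
    """Return the override as a Path, or None when the variable is unset.

    None here simply means "no override"; whatever default location the
    library uses in that case is not modelled by this sketch.
    """
    value = env.get("AUDIOCRAFT_CACHE_DIR")
    return Path(value).expanduser() if value else None


# Example: point the cache at a scratch directory for this process only.
example_env = dict(os.environ, AUDIOCRAFT_CACHE_DIR="/tmp/audiocraft_cache")
print(resolve_cache_dir(example_env))
```

Exporting the variable in the shell before launching Python has the same effect and avoids any ordering concerns within the script.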
Quick Install
# System dependencies (Ubuntu/Debian)
sudo apt-get update && sudo apt-get install -y libsndfile1-dev ffmpeg
# Install PyTorch with CUDA first
pip install 'numpy<2' torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0
# Install xformers (matching PyTorch version)
pip install xformers==0.0.22.post7
# Install AudioCraft
pip install -e '.[dev,wm]'
Code Evidence
CUDA requirement comment from requirements.txt:1:
# please make sure you have already a pytorch install that is cuda enabled!
Python version requirement from setup.py:18:
REQUIRES_PYTHON = '>=3.8.0'
Transformers version constraint from requirements.txt:16:
transformers>=4.31.0 # need Encodec there.
Device-specific dtype fallback from audiocraft/models/loaders.py:115-118:
if cfg.device == 'cpu':
    cfg.dtype = 'float32'
else:
    cfg.dtype = 'float16'
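The same device-to-dtype rule can be reproduced in a standalone script when deciding how to run inference. This is a minimal sketch that mirrors the excerpt above; `pick_device` and `pick_dtype` are hypothetical names, and the fallback to `'cpu'` when PyTorch is not importable is an assumption added so the snippet runs anywhere.

```python
def pick_device():
    """Return 'cuda' when a CUDA-enabled PyTorch sees a GPU, else 'cpu'."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: without PyTorch, report CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def pick_dtype(device):
    """Mirror the loader rule: float32 on CPU, float16 otherwise."""
    return "float32" if device == "cpu" else "float16"


device = pick_device()
print(device, pick_dtype(device))
```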
CPU float16 weight handling from audiocraft/models/lm.py:80-83:
if m.weight.device.type == 'cpu' and m.weight.dtype == torch.float16:
    weight = m.weight.float()
    init_fn(weight)
    m.weight.data[:] = weight.half()
CI build environment from .github/actions/audiocraft_build/action.yml:8,20-26:
python-version: 3.9
# ...
sudo apt-get install libsndfile1-dev ffmpeg
pip install 'numpy<2' torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0
pip install xformers==0.0.22.post7
pip install -e '.[dev,wm]'
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| RuntimeError: Couldn't find appropriate backend to handle uri | ffmpeg or libsndfile not installed | sudo apt-get install ffmpeg libsndfile1-dev |
| ImportError: No module named 'encodec' | Missing encodec package | pip install encodec |
| RuntimeError: expected CUDA device | Running on CPU without float32 override | Use device='cpu' which auto-selects float32 dtype |
| numpy.dtype size changed | NumPy 2.x installed (incompatible) | pip install 'numpy<2.0.0' |
Compatibility Notes
- CPU inference: Supported but slow; dtype automatically set to float32 when device is CPU.
- macOS (Darwin): Supported for local development; cluster type detected as LOCAL_DARWIN.
- FSDP training: Requires multiple NVIDIA GPUs; local_rank < torch.cuda.device_count() is asserted.
- Audio formats: .flac and .ogg files use soundfile (not ffmpeg/av) due to known ffmpeg edge cases.
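The audio-format note above amounts to routing by file extension. The sketch below illustrates that idea only; `pick_reader` is a hypothetical function, not AudioCraft's reader, and lowercasing the suffix is a choice made for this example rather than documented behavior.

```python
from pathlib import Path

# Extensions that this page says are read through soundfile.
SOUNDFILE_SUFFIXES = {".flac", ".ogg"}


def pick_reader(path):
    """Return 'soundfile' for .flac/.ogg paths and 'av' for everything else."""
    suffix = Path(path).suffix.lower()  # case-insensitive match: sketch-only choice
    return "soundfile" if suffix in SOUNDFILE_SUFFIXES else "av"


for name in ("clip.flac", "clip.ogg", "clip.mp3", "clip.wav"):
    print(name, "->", pick_reader(name))
```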
Related Pages
- Implementation:Facebookresearch_Audiocraft_Audiocraft_Installation
- Implementation:Facebookresearch_Audiocraft_MusicGen_get_pretrained
- Implementation:Facebookresearch_Audiocraft_AudioCraftEnvironment
- Implementation:Facebookresearch_Audiocraft_MusicGenSolver_run_step
- Implementation:Facebookresearch_Audiocraft_CompressionSolver_run_step
- Implementation:Facebookresearch_Audiocraft_JASCO_get_pretrained
- Implementation:Facebookresearch_Audiocraft_Evaluation_Metrics
- Implementation:Facebookresearch_Audiocraft_StandardSolver_Checkpoints