Environment:Facebookresearch Audiocraft Python PyTorch CUDA Environment

Knowledge Sources	AudioCraft requirements.txt setup.py
Domains	Infrastructure, Deep_Learning, Audio_Generation
Last Updated	2026-02-13 23:00 GMT

Overview

Linux environment with Python 3.8+, PyTorch 2.1.0 (CUDA-enabled), torchaudio, and system-level audio libraries (ffmpeg, libsndfile) for running AudioCraft inference and training.

Description

This environment provides the core runtime for all AudioCraft operations, including MusicGen inference, EnCodec compression training, JASCO generation, and model export. It is built on PyTorch 2.1.0, and a CUDA-enabled build is expected: requirements.txt opens with the comment "please make sure you have already a pytorch install that is cuda enabled". The environment includes a full stack of audio processing libraries (torchaudio, av/ffmpeg, soundfile, librosa), deep learning utilities (einops, flashy, hydra-core, transformers), and evaluation tools (torchmetrics, pesq, pystoi).

The CI environment uses Python 3.9 on Ubuntu with system packages libsndfile1-dev and ffmpeg installed via apt-get. CPU-only execution is partially supported (inference only, with automatic dtype fallback to float32), but training requires NVIDIA GPU hardware.

Usage

Use this environment for all AudioCraft operations. It is the mandatory prerequisite for every Implementation page in this wiki. Without this environment configured, no AudioCraft code can execute.

System Requirements

Category	Requirement	Notes
OS	Linux (Ubuntu 20.04+ recommended)	macOS (Darwin) supported for local development only
Hardware	NVIDIA GPU with CUDA support	CPU-only for inference with float32 dtype; training requires GPU
Python	>= 3.8.0	CI uses Python 3.9; defined in `setup.py` `REQUIRES_PYTHON`
System Packages	`ffmpeg`, `libsndfile1-dev`	Required for audio I/O; installed via `apt-get`
Disk	10GB+ for packages, 50GB+ for models	Pretrained models downloaded from HuggingFace Hub

Dependencies

System Packages

ffmpeg — Required for audio decoding/encoding via PyAV
libsndfile1-dev — Required by soundfile for .flac/.ogg reading
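
A quick preflight check can confirm that both system packages are visible before any audio I/O is attempted. The sketch below is a hypothetical helper (not part of AudioCraft) that uses only the standard library: `shutil.which` looks for the `ffmpeg` executable on PATH, and `ctypes.util.find_library` looks for the libsndfile shared library. The function name and its defaults are assumptions made for illustration.

```python
import ctypes.util
import shutil


def missing_system_deps(executables=("ffmpeg",), libraries=("sndfile",)):
    """Return the names of executables and shared libraries that cannot be located.

    Hypothetical helper, not part of AudioCraft.
    """
    # Executables are looked up on PATH.
    missing = [exe for exe in executables if shutil.which(exe) is None]
    # Shared libraries are looked up by their short name, e.g. "sndfile".
    missing += [f"lib{lib}" for lib in libraries
                if ctypes.util.find_library(lib) is None]
    return missing


if __name__ == "__main__":
    problems = missing_system_deps()
    if problems:
        print("Missing system dependencies:", ", ".join(problems))
    else:
        print("ffmpeg and libsndfile were found.")
```

An empty list means both lookups succeeded. It does not prove that PyAV or soundfile were built against these libraries.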

Python Packages

torch == 2.1.0
torchaudio >= 2.0.0, < 2.1.2
torchvision == 0.16.0
torchtext == 0.16.0
av == 11.0.0
einops
flashy >= 0.0.1
hydra-core >= 1.1
hydra_colorlog
julius
num2words
numpy < 2.0.0
sentencepiece
spacy == 3.7.6
huggingface_hub
tqdm
transformers >= 4.31.0
xformers < 0.0.23
demucs
librosa
soundfile
gradio
torchmetrics
encodec
protobuf
pesq
pystoi
torchdiffeq
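
The pins above can be checked against an existing environment before AudioCraft is installed. The following is a minimal sketch using only the standard library. The helper names (`release_tuple`, `pin_violations`, `installed_versions`) are hypothetical, only a subset of the pins is encoded, and version comparison is simplified to the first three numeric components, so local build tags such as `+cu121` are ignored.

```python
import re
from importlib import metadata

# Subset of the pins listed above: (package, operator, version).
PINS = [
    ("torch", "==", "2.1.0"),
    ("torchvision", "==", "0.16.0"),
    ("av", "==", "11.0.0"),
    ("numpy", "<", "2.0.0"),
    ("transformers", ">=", "4.31.0"),
]


def release_tuple(version):
    """Turn '2.1.0+cu121' into (2, 1, 0); suffixes after the third number are ignored."""
    public = version.split("+", 1)[0]
    parts = [int(n) for n in re.findall(r"\d+", public)[:3]]
    return tuple(parts + [0] * (3 - len(parts)))  # pad '4.31' to (4, 31, 0)


def pin_violations(installed, pins=PINS):
    """Return messages for every pin that `installed` does not satisfy.

    `installed` maps package name -> version string, or None when absent.
    """
    checks = {
        "==": lambda a, b: a == b,
        ">=": lambda a, b: a >= b,
        "<": lambda a, b: a < b,
    }
    problems = []
    for name, op, wanted in pins:
        found = installed.get(name)
        if found is None:
            problems.append(f"{name}: not installed (want {op} {wanted})")
        elif not checks[op](release_tuple(found), release_tuple(wanted)):
            problems.append(f"{name}: found {found}, want {op} {wanted}")
    return problems


def installed_versions(names):
    """Look up installed versions via importlib.metadata (None if missing)."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    found = installed_versions(name for name, _, _ in PINS)
    for line in pin_violations(found) or ["All checked pins are satisfied."]:
        print(line)
```

Keeping the comparison logic separate from the metadata lookup makes it possible to check a hand-written mapping of versions as well as the live environment.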

Optional Extras

audioseal — Required for watermarking (pip install audiocraft[wm])
coverage, flake8, mypy, pdoc3, pytest — Dev tools (pip install audiocraft[dev])

Credentials

No API keys or credentials are required for the core environment. Model weights are downloaded from public HuggingFace Hub repositories (e.g., facebook/musicgen-small).

Optional:

AUDIOCRAFT_CACHE_DIR: Override default cache directory for downloaded model checkpoints.
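
As an illustration, the variable can be set from Python before any model is loaded. This is a sketch under assumptions: `use_cache_dir` is a hypothetical helper, not an AudioCraft function, and the directory passed to it is whatever location suits the machine.

```python
import os
from pathlib import Path


def use_cache_dir(path):
    """Create `path` if needed and export it as AUDIOCRAFT_CACHE_DIR.

    Hypothetical helper; calling it before loading any model is the safe
    order, since the variable is then in place whenever it is read.
    """
    cache_dir = Path(path).expanduser().resolve()
    cache_dir.mkdir(parents=True, exist_ok=True)
    os.environ["AUDIOCRAFT_CACHE_DIR"] = str(cache_dir)
    return cache_dir
```

The same effect can be had by exporting the variable in the shell before starting Python; the helper only adds directory creation.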

Quick Install

# System dependencies (Ubuntu/Debian)
sudo apt-get update && sudo apt-get install -y libsndfile1-dev ffmpeg

# Install PyTorch with CUDA first
pip install 'numpy<2' torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0

# Install xformers (matching PyTorch version)
pip install xformers==0.0.22.post7

# Install AudioCraft
pip install -e '.[dev,wm]'

Code Evidence

CUDA requirement comment from requirements.txt:1:

# please make sure you have already a pytorch install that is cuda enabled!

Python version requirement from setup.py:18:

REQUIRES_PYTHON = '>=3.8.0'
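
A script that depends on this environment can fail fast when the interpreter is too old. The sketch below mirrors the `>=3.8.0` bound; `MIN_PYTHON` and `python_is_supported` are hypothetical names, not taken from setup.py.

```python
import sys

MIN_PYTHON = (3, 8)  # mirrors REQUIRES_PYTHON = '>=3.8.0'


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    current = sys.version.split()[0]
    if not python_is_supported():
        raise SystemExit(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required, found {current}"
        )
    print("Python version OK:", current)
```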

Transformers version constraint from requirements.txt:16:

transformers>=4.31.0  # need Encodec there.

Device-specific dtype fallback from audiocraft/models/loaders.py:115-118:

if cfg.device == 'cpu':
    cfg.dtype = 'float32'
else:
    cfg.dtype = 'float16'
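
The same rule can be applied when choosing a device in user code. This is a minimal sketch, not the loader itself: `pick_device_and_dtype` is a hypothetical helper, and the `torch` import is guarded so the snippet still runs where PyTorch is absent.

```python
def pick_device_and_dtype(cuda_available):
    """Mirror the fallback above: float32 on CPU, float16 otherwise."""
    device = "cuda" if cuda_available else "cpu"
    dtype = "float32" if device == "cpu" else "float16"
    return device, dtype


if __name__ == "__main__":
    try:
        import torch
        cuda_available = torch.cuda.is_available()
    except ImportError:
        # Without PyTorch there is no CUDA runtime to query.
        cuda_available = False
    print(pick_device_and_dtype(cuda_available))
```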

CPU float16 weight handling from audiocraft/models/lm.py:80-83:

if m.weight.device.type == 'cpu' and m.weight.dtype == torch.float16:
    weight = m.weight.float()
    init_fn(weight)
    m.weight.data[:] = weight.half()

CI build environment from .github/actions/audiocraft_build/action.yml:8,20-26:

python-version: 3.9
# ...
sudo apt-get install libsndfile1-dev ffmpeg
pip install 'numpy<2' torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0
pip install xformers==0.0.22.post7
pip install -e '.[dev,wm]'

Common Errors

Error Message	Cause	Solution
`RuntimeError: Couldn't find appropriate backend to handle uri`	ffmpeg or libsndfile not installed	`sudo apt-get install ffmpeg libsndfile1-dev`
`ImportError: No module named 'encodec'`	Missing encodec package	`pip install encodec`
`RuntimeError: expected CUDA device`	Running on CPU without the float32 override	Use `device='cpu'`, which auto-selects the float32 dtype
`numpy.dtype size changed`	NumPy 2.x installed (incompatible)	`pip install 'numpy<2.0.0'`

Compatibility Notes

CPU inference: Supported but slow; dtype automatically set to float32 when device is CPU.
macOS (Darwin): Supported for local development; cluster type detected as LOCAL_DARWIN.
FSDP training: Requires multiple NVIDIA GPUs; local_rank < torch.cuda.device_count() is asserted.
Audio formats: .flac and .ogg files use soundfile (not ffmpeg/av) due to known ffmpeg edge cases.
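
The audio-format rule in the last note amounts to routing on the file suffix. The sketch below illustrates that routing only; it is not AudioCraft's reader. `reader_for` and `SOUNDFILE_SUFFIXES` are hypothetical names, and lower-casing the suffix is this sketch's own choice rather than documented behavior.

```python
from pathlib import Path

SOUNDFILE_SUFFIXES = {".flac", ".ogg"}


def reader_for(path):
    """Return 'soundfile' for .flac/.ogg files and 'av' for everything else."""
    # Assumption of this sketch: suffix matching is case-insensitive.
    suffix = Path(path).suffix.lower()
    return "soundfile" if suffix in SOUNDFILE_SUFFIXES else "av"
```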
