Environment:Facebookresearch Audiocraft Python PyTorch CUDA Environment

From Leeroopedia
Domains: Infrastructure, Deep_Learning, Audio_Generation
Last Updated: 2026-02-13 23:00 GMT

Overview

Linux environment with Python 3.8+, PyTorch 2.1.0 (CUDA-enabled), torchaudio, and system-level audio libraries (ffmpeg, libsndfile) for running AudioCraft inference and training.

Description

This environment provides the core runtime for all AudioCraft operations, including MusicGen inference, EnCodec compression training, JASCO generation, and model export. It is built on PyTorch 2.1.0, and a CUDA-enabled build is expected: the requirements.txt explicitly states "please make sure you have already a pytorch install that is cuda enabled". The environment includes a full stack of audio processing libraries (torchaudio, av/ffmpeg, soundfile, librosa), deep learning utilities (einops, flashy, hydra-core, transformers), and evaluation tools (torchmetrics, pesq, pystoi).

The CI environment uses Python 3.9 on Ubuntu with system packages libsndfile1-dev and ffmpeg installed via apt-get. CPU-only execution is partially supported (inference only, with automatic dtype fallback to float32), but training requires NVIDIA GPU hardware.

Usage

Use this environment for all AudioCraft operations. It is the mandatory prerequisite for every Implementation page in this wiki. Without this environment configured, no AudioCraft code can execute.

System Requirements

Category | Requirement | Notes
OS | Linux (Ubuntu 20.04+ recommended) | macOS (Darwin) supported for local development only
Hardware | NVIDIA GPU with CUDA support | CPU-only for inference with float32 dtype; training requires GPU
Python | >= 3.8.0 | CI uses Python 3.9; defined in setup.py REQUIRES_PYTHON
System Packages | ffmpeg, libsndfile1-dev | Required for audio I/O; installed via apt-get
Disk | 10GB+ for packages, 50GB+ for models | Pretrained models downloaded from HuggingFace Hub
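
The Python and hardware rows above can be checked with a short script. This is a minimal sketch, not part of AudioCraft: the helper names `python_ok` and `describe_torch` are hypothetical, and the only library calls assumed are `torch.__version__` and `torch.cuda.is_available()`.

```python
import sys

# Mirrors REQUIRES_PYTHON = '>=3.8.0' from setup.py.
MIN_PYTHON = (3, 8)


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the interpreter meets the minimum (major, minor) version."""
    return tuple(version_info[:2]) >= minimum


def describe_torch():
    """Report whether torch is importable and whether it can see a CUDA device."""
    try:
        import torch  # imported lazily so the check also runs where torch is absent
    except ImportError:
        return {"installed": False, "version": None, "cuda": False}
    return {
        "installed": True,
        "version": torch.__version__,
        "cuda": bool(torch.cuda.is_available()),
    }


if __name__ == "__main__":
    print("python ok:", python_ok())
    print("torch:", describe_torch())
```

A result with `cuda` set to False on a machine with an NVIDIA GPU usually points to a CPU-only PyTorch wheel or a driver mismatch rather than to AudioCraft itself.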

Dependencies

System Packages

  • ffmpeg — Required for audio decoding/encoding via PyAV
  • libsndfile1-dev — Required by soundfile for .flac/.ogg reading
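
One way to confirm that both system packages are visible before installing the Python stack is sketched below. The function names are hypothetical. Finding `ffmpeg` on PATH and `sndfile` through the dynamic linker is only a rough proxy, since some Python wheels ship their own copies of these libraries.

```python
import ctypes.util
import shutil


def check_system_audio_deps():
    """Return a dict mapping each system dependency to where it was found, or None."""
    return {
        # ffmpeg is looked up as an executable on PATH.
        "ffmpeg": shutil.which("ffmpeg"),
        # libsndfile is looked up as a shared library; 'sndfile' is its linker name.
        "libsndfile": ctypes.util.find_library("sndfile"),
    }


def missing_deps(found):
    """List the dependency names whose lookup returned None, in sorted order."""
    return sorted(name for name, location in found.items() if location is None)


if __name__ == "__main__":
    missing = missing_deps(check_system_audio_deps())
    print("missing:", missing if missing else "none")
```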

Python Packages

  • torch == 2.1.0
  • torchaudio >= 2.0.0, < 2.1.2
  • torchvision == 0.16.0
  • torchtext == 0.16.0
  • av == 11.0.0
  • einops
  • flashy >= 0.0.1
  • hydra-core >= 1.1
  • hydra_colorlog
  • julius
  • num2words
  • numpy < 2.0.0
  • sentencepiece
  • spacy == 3.7.6
  • huggingface_hub
  • tqdm
  • transformers >= 4.31.0
  • xformers < 0.0.23
  • demucs
  • librosa
  • soundfile
  • gradio
  • torchmetrics
  • encodec
  • protobuf
  • pesq
  • pystoi
  • torchdiffeq

Optional Extras

  • audioseal — Required for watermarking (pip install audiocraft[wm])
  • coverage, flake8, mypy, pdoc3, pytest — Dev tools (pip install audiocraft[dev])

Credentials

No API keys or credentials are required for the core environment. Model weights are downloaded from public HuggingFace Hub repositories (e.g., facebook/musicgen-small).

Optional:

  • AUDIOCRAFT_CACHE_DIR: Override default cache directory for downloaded model checkpoints.
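
The cache override can also be set from Python for the current process. The sketch below assumes the variable is read when checkpoints are resolved, so it should be set before any model is loaded; `configure_cache_dir` is a hypothetical helper and the example path is illustrative.

```python
import os
from pathlib import Path


def configure_cache_dir(path):
    """Create `path` if needed and export it as AUDIOCRAFT_CACHE_DIR for this process."""
    cache_dir = Path(path).expanduser().resolve()
    cache_dir.mkdir(parents=True, exist_ok=True)
    os.environ["AUDIOCRAFT_CACHE_DIR"] = str(cache_dir)
    return cache_dir


# Example (illustrative path): call this before loading any AudioCraft model.
# configure_cache_dir("~/audiocraft_cache")
```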

Quick Install

# System dependencies (Ubuntu/Debian)
sudo apt-get update && sudo apt-get install -y libsndfile1-dev ffmpeg

# Install PyTorch with CUDA first
pip install 'numpy<2' torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0

# Install xformers (matching PyTorch version)
pip install xformers==0.0.22.post7

# Install AudioCraft
pip install -e '.[dev,wm]'

Code Evidence

CUDA requirement comment from requirements.txt:1:

# please make sure you have already a pytorch install that is cuda enabled!

Python version requirement from setup.py:18:

REQUIRES_PYTHON = '>=3.8.0'

Transformers version constraint from requirements.txt:16:

transformers>=4.31.0  # need Encodec there.

Device-specific dtype fallback from audiocraft/models/loaders.py:115-118:

if cfg.device == 'cpu':
    cfg.dtype = 'float32'
else:
    cfg.dtype = 'float16'
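
The same fallback rule can be expressed as a standalone helper for scripts that choose a device themselves. This is a sketch of the rule only: `resolve_device` and `resolve_dtype` are hypothetical names, and the caller is assumed to supply the CUDA availability flag (for example from `torch.cuda.is_available()`).

```python
def resolve_device(cuda_available, requested=None):
    """Pick 'cuda' when available unless the caller explicitly requests a device."""
    if requested is not None:
        return requested
    return "cuda" if cuda_available else "cpu"


def resolve_dtype(device):
    """Mirror the fallback above: float32 on CPU, float16 on any other device."""
    return "float32" if device == "cpu" else "float16"


if __name__ == "__main__":
    device = resolve_device(cuda_available=False)
    print(device, resolve_dtype(device))
```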

CPU float16 weight handling from audiocraft/models/lm.py:80-83:

if m.weight.device.type == 'cpu' and m.weight.dtype == torch.float16:
    weight = m.weight.float()
    init_fn(weight)
    m.weight.data[:] = weight.half()

CI build environment from .github/actions/audiocraft_build/action.yml:8,20-26:

python-version: 3.9
# ...
sudo apt-get install libsndfile1-dev ffmpeg
pip install 'numpy<2' torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0
pip install xformers==0.0.22.post7
pip install -e '.[dev,wm]'

Common Errors

Error Message | Cause | Solution
RuntimeError: Couldn't find appropriate backend to handle uri | ffmpeg or libsndfile not installed | sudo apt-get install ffmpeg libsndfile1-dev
ImportError: No module named 'encodec' | Missing encodec package | pip install encodec
RuntimeError: expected CUDA device | Running on CPU without float32 override | Use device='cpu' which auto-selects float32 dtype
numpy.dtype size changed | NumPy 2.x installed (incompatible) | pip install 'numpy<2.0.0'

Compatibility Notes

  • CPU inference: Supported but slow; dtype automatically set to float32 when device is CPU.
  • macOS (Darwin): Supported for local development; cluster type detected as LOCAL_DARWIN.
  • FSDP training: Requires multiple NVIDIA GPUs; local_rank < torch.cuda.device_count() is asserted.
  • Audio formats: .flac and .ogg files use soundfile (not ffmpeg/av) due to known ffmpeg edge cases.
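
The audio format note amounts to a dispatch on file suffix. The sketch below illustrates that routing only; `pick_audio_backend` is a hypothetical helper rather than AudioCraft's reader, and lowercasing the suffix is an assumption made here for convenience.

```python
from pathlib import Path

# Suffixes the note above says are read with soundfile instead of ffmpeg/av.
SOUNDFILE_SUFFIXES = {".flac", ".ogg"}


def pick_audio_backend(filepath):
    """Return 'soundfile' for .flac/.ogg inputs and 'av' for everything else."""
    suffix = Path(filepath).suffix.lower()
    return "soundfile" if suffix in SOUNDFILE_SUFFIXES else "av"
```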
