Heuristic: facebookresearch/audiocraft Chroma Conditioning Cache Requirement
| Knowledge Sources | |
|---|---|
| Domains | Audio_Generation, Debugging, Optimization |
| Last Updated | 2026-02-13 23:00 GMT |
Overview
Chroma/melody conditioning during MusicGen training requires a precomputed cache; computing conditions on-the-fly causes NaN propagation and training collapse.
Description
When training MusicGen with melody (chroma) conditioning, the chromagram features extracted from audio must be precomputed and cached before training begins. If the chroma cache is not available, the training code attempts to compute chromagram features from the audio segment, but this leads to a numerical mismatch between the conditioned tokens and the cached codec tokens, resulting in NaN values that propagate through the entire training loss.
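The boundary mismatch can be sketched in a few lines. The frame rate and function names below are illustrative placeholders, not AudioCraft's actual values: the point is that chroma frames and codec tokens only line up if both were cut at exactly the same segment start.

```python
CODEC_FRAME_RATE = 50  # frames per second; illustrative, not AudioCraft's actual rate

def segment_frames(start_s: float, duration_s: float, rate: int = CODEC_FRAME_RATE) -> list:
    """Frame indices covering [start_s, start_s + duration_s)."""
    first = round(start_s * rate)
    return list(range(first, first + round(duration_s * rate)))

def aligned(codec_start_s: float, chroma_start_s: float, duration_s: float) -> bool:
    """True only if both feature streams were cut at the same segment boundary."""
    return segment_frames(codec_start_s, duration_s) == segment_frames(chroma_start_s, duration_s)
```

Even a 20 ms drift in the on-the-fly segment start shifts every chroma frame by one index relative to the cached codec tokens, which is why precomputing both from the same segment metadata is the safe path.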
Additionally, a critical behavior change was made in AudioCraft v1.1.0: the _prepare_tokens_and_attributes method was previously wrapped in torch.no_grad(), which was inconsistent with how models were trained for the MusicGen paper. Removing the no_grad wrapper allows gradients to flow through the conditioning pipeline, which is the correct behavior.
Usage
Apply this heuristic when setting up MusicGen melody (chroma-conditioned) training. Always generate the conditioning cache before starting training, and treat NaN loss values as an indicator of a missing or stale cache.
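A cheap guard in the training loop makes this failure mode explicit instead of letting NaNs propagate silently. This is a generic sketch; the function name and error message are ours, not part of AudioCraft:

```python
import math

def check_finite_loss(loss_value: float, step: int) -> float:
    """Fail fast with a pointed message instead of training on NaN/inf losses."""
    if not math.isfinite(loss_value):
        raise RuntimeError(
            f"Non-finite loss ({loss_value}) at step {step}: check that the "
            "chroma conditioning cache exists and is not stale.")
    return loss_value
```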
The Insight (Rule of Thumb)
- Action: Always precompute and cache chroma conditioning embeddings before training. Use the caching system in `audiocraft/utils/cache.py` to generate caches.
- Value: N/A (boolean requirement: the cache must exist).
- Trade-off: Precomputing the cache requires an initial pass over the entire dataset (can be slow for large datasets), but eliminates NaN errors and speeds up subsequent training epochs.
Reasoning
The NaN issue arises because chroma features must be extracted from the original audio waveform at the exact same segment boundaries used for codec token extraction. When computed on-the-fly during training, the audio segment boundaries may not align perfectly with the cached codec tokens, leading to a mismatch. The chroma conditioner processes this misaligned input and produces features that, when combined with the codec tokens in the transformer, result in undefined gradients.
The v1.0.1 → v1.1.0 gradient flow fix is important because the conditioning pipeline includes learnable parameters (e.g., projection layers in the text encoder). These parameters should receive gradients during training to fine-tune alongside the LM.
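The effect of removing the wrapper can be demonstrated with a toy stand-in for the conditioning projection (plain PyTorch, not AudioCraft's classes): under `torch.no_grad()` the layer's output is detached, so the projection can never receive gradients.

```python
import torch
import torch.nn as nn

proj = nn.Linear(4, 4)  # stands in for a learnable conditioning projection

def prepare(x: torch.Tensor, use_no_grad: bool) -> torch.Tensor:
    """Toy _prepare_tokens_and_attributes: toggle the pre-1.1.0 no_grad wrapper."""
    if use_no_grad:
        with torch.no_grad():   # pre-v1.1.0 behaviour: conditioning is detached
            return proj(x)
    return proj(x)              # v1.1.0 behaviour: gradients flow into proj

x = torch.randn(2, 4)

# Pre-1.1.0: output does not track gradients, so proj is frozen in practice.
assert not prepare(x, use_no_grad=True).requires_grad

# v1.1.0: the projection participates in the backward pass.
prepare(x, use_no_grad=False).sum().backward()
assert proj.weight.grad is not None
```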
Code Evidence
Chroma cache warning from audiocraft/solvers/musicgen.py:299-307:
# Careful here, if you want to use this condition_wav (e.b. chroma conditioning),
# then you must be using the chroma cache! otherwise the code will try
# to use this segment and fail (by that I mean you will see NaN everywhere).
Gradient flow deprecation warning from audiocraft/solvers/musicgen.py:276-280:
warnings.warn(
"Up to version 1.0.1, the _prepare_tokens_and_attributes was evaluated "
"with `torch.no_grad()`. This is inconsistent with how model were trained "
"in the MusicGen paper. We removed the `torch.no_grad()` in version 1.1.0. "
"Small changes to the final performance are expected. Really sorry about that.")
LM-compression model compatibility check from audiocraft/solvers/musicgen.py:151-160:
assert self.cfg.transformer_lm.card == self.compression_model.cardinality, (
"Cardinalities of the brains and brawn must match."
)
assert self.cfg.transformer_lm.n_q == self.compression_model.num_codebooks, (
"Number of codebooks of the brains and brawn must match."
)