# Heuristic: NVIDIA NeMo Aligner PPO NCCL Algorithm Setting
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, Debugging, PPO |
| Last Updated | 2026-02-07 22:00 GMT |
## Overview
Critical stability fix requiring `export NCCL_ALGO=Tree` when running PPO training to prevent NCCL communication hangs and deadlocks.
## Description
PPO training in NeMo-Aligner involves complex multi-process communication between actor, critic, and reward model servers. The default NCCL algorithm selection can cause deadlocks or hangs during collective operations (all-reduce, broadcast) in this multi-server setup. Setting `NCCL_ALGO=Tree` forces NCCL to use the tree-based algorithm for all collective operations, which is more stable in the presence of the mixed communication patterns that PPO requires.
## Usage
Apply this setting whenever running PPO or REINFORCE training. It is a required stability fix documented in the official RLHF user guide and the CHANGELOG. Without it, training may hang indefinitely during distributed communication.
## The Insight (Rule of Thumb)
- Action: Add `export NCCL_ALGO=Tree` to your training launch script before running PPO or REINFORCE training.
- Value: `Tree` (other options include `Ring`, but `Tree` is required for stability).
- Trade-off: Minimal performance impact. The tree algorithm may be slightly slower than ring for certain communication patterns, but it prevents hangs.
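As a sketch, a launch script sets the variable before invoking any training command, so that NCCL reads it when the first communicator is created (the launch command below is a commented placeholder; substitute your actual entry point):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Required stability setting: force NCCL's tree algorithm for all
# collectives. Must be exported before any process group is created.
export NCCL_ALGO=Tree

# Placeholder launch command -- replace with your actual PPO script, e.g.:
# python examples/nlp/gpt/train_gpt_ppo_actor.py "$@"

echo "Launching PPO with NCCL_ALGO=${NCCL_ALGO}"
```

Because NCCL only reads the variable at communicator initialization, exporting it after training has started has no effect.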
## Reasoning
PPO training involves multiple independent process groups (actor DP group, critic server, reward server) that perform interleaved collective operations. The default NCCL algorithm auto-selection can choose different algorithms for different operations, creating deadlock conditions when process groups overlap. The tree algorithm provides a consistent communication pattern that avoids these deadlocks. This was identified as a bug fix in the NeMo-Aligner CHANGELOG and is now a documented requirement.
## Code Evidence
CHANGELOG bug fix entry from `CHANGELOG.md:52`:
> It is now required, for stability, to add `export NCCL_ALGO=...` to scripts launching PPO training loop. Please see the [RLHF docs](./docs/user-guide/rlhf.rst) for information.
RLHF documentation requirement from `docs/user-guide/rlhf.rst:260-262`:
```shell
# recommended to set NCCL_ALGO. See https://github.com/NVIDIA/Megatron-LM/blob/
# b3375a0e38c10e2300ef4be031f7dcabab52b448/megatron/training/arguments.py#L593-L595
export NCCL_ALGO=Tree
```
Functional test script setting NCCL_ALGO from `tests/functional/ppo.sh:21`:
```shell
export NCCL_ALGO=Tree
```