Environment: Kronos DDP Multi-GPU Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Training |
| Last Updated | 2026-02-09 13:47 GMT |
Overview
Multi-GPU distributed training environment using PyTorch DistributedDataParallel (DDP) with NCCL backend, launched via `torchrun`.
Description
This environment extends the base PyTorch CUDA environment with distributed data parallel (DDP) training capabilities. It requires one or more NVIDIA GPUs (one per process) with the NCCL communication backend, and must be launched with `torchrun` (or an equivalent launcher), which sets the required environment variables (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`). The Qlib finetuning scripts (`train_tokenizer.py`, `train_predictor.py`) mandate this environment, while the CSV finetuning script (`train_sequential.py`) supports it optionally, with graceful single-GPU fallback.
Usage
Use this environment when running Qlib finetuning workflows (tokenizer training and predictor training). The Qlib training scripts raise a `RuntimeError` if not launched with `torchrun`. The CSV finetuning pipeline enables DDP only when `world_size > 1`.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | NCCL requires Linux |
| Hardware | 1+ NVIDIA GPUs | Each process bound to one GPU via LOCAL_RANK |
| Network | High-bandwidth interconnect | NVLink or PCIe for multi-GPU; InfiniBand for multi-node |
| CUDA | CUDA-capable GPUs | Required for NCCL backend |
Dependencies
System Packages
- CUDA Toolkit (matching PyTorch build)
- NCCL library (bundled with PyTorch CUDA builds)
Python Packages
- `torch` >= 2.0.0 (with CUDA support)
- `torch.distributed` module (included in standard PyTorch)
Environment Variables
The following environment variables are set by the launcher (`torchrun`):
- `RANK`: Global rank of the current process (0 to world_size-1)
- `WORLD_SIZE`: Total number of processes across all nodes
- `LOCAL_RANK`: Local rank on the current node (determines GPU assignment)
- `DIST_BACKEND` (optional): Distributed backend, defaults to `nccl`. Can be set to `gloo` for CPU-only fallback in CSV finetuning.
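For illustration, the launcher-provided variables can be read defensively so a plain `python` launch fails with a clear message. A minimal stdlib-only sketch; the variable names follow the `torchrun` contract above, but the helper name is hypothetical and not part of the repo:

```python
import os

def read_ddp_env():
    """Read the env vars torchrun sets; fail clearly if they are missing.

    Hypothetical helper for illustration only.
    """
    try:
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        local_rank = int(os.environ["LOCAL_RANK"])
    except KeyError as missing:
        raise RuntimeError(
            f"Missing {missing}: launch with `torchrun`, not plain `python`."
        ) from None
    # Optional override; defaults to NCCL for GPU training.
    backend = os.environ.get("DIST_BACKEND", "nccl").lower()
    return rank, world_size, local_rank, backend
```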
Quick Install
```shell
# PyTorch with CUDA support (select the CUDA version matching your driver)
pip install "torch>=2.0.0" --index-url https://download.pytorch.org/whl/cu118

# Launch training with torchrun (example: 2 GPUs)
torchrun --nproc_per_node=2 finetune/train_tokenizer.py
torchrun --nproc_per_node=2 finetune/train_predictor.py
```
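For multi-node runs (the InfiniBand case noted above), `torchrun` also needs a rendezvous endpoint. A hedged sketch assuming two nodes with 2 GPUs each; the hostname and port are placeholders, not values from the repo:

```shell
# Node 0 (rendezvous host)
torchrun --nnodes=2 --nproc_per_node=2 --node_rank=0 \
  --master_addr=node0.example.com --master_port=29500 \
  finetune/train_predictor.py

# Node 1 (same command, different node_rank)
torchrun --nnodes=2 --nproc_per_node=2 --node_rank=1 \
  --master_addr=node0.example.com --master_port=29500 \
  finetune/train_predictor.py
```

With this launch, `WORLD_SIZE` becomes 4 while `LOCAL_RANK` still ranges over 0-1 on each node.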
Code Evidence
DDP setup with NCCL from `finetune/utils/training_utils.py:9-32`:
```python
def setup_ddp():
    if not dist.is_available():
        raise RuntimeError("torch.distributed is not available.")
    dist.init_process_group(backend="nccl")
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank
```
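After `setup_ddp()` returns, the model is typically moved to the local GPU and wrapped in `DistributedDataParallel`. A minimal sketch of that wrapping pattern; the model and the single-process `gloo`-on-CPU process group are illustrative stand-ins (not the repo's code) so the example can run without a GPU:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Stand-in for the torchrun launch: one CPU process, gloo backend.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 2)   # stand-in for the real model
# On GPU this would be: DDP(model.cuda(local_rank), device_ids=[local_rank])
ddp_model = DDP(model)

out = ddp_model(torch.randn(4, 8))  # forward pass goes through the DDP wrapper
print(tuple(out.shape))             # -> (4, 2)

dist.destroy_process_group()
```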
Mandatory torchrun check from `finetune/train_predictor.py:238-241`:
```python
if "WORLD_SIZE" not in os.environ:
    raise RuntimeError("This script must be launched with `torchrun`.")
```
Optional DDP in CSV finetuning from `finetune_csv/train_sequential.py:40-46`:
```python
if self.world_size > 1 and torch.cuda.is_available():
    backend = os.environ.get("DIST_BACKEND", "nccl").lower()
    if not dist.is_initialized():
        dist.init_process_group(backend=backend)
```
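The graceful single-GPU fallback reduces to a world-size check before any process-group setup. A stdlib-only sketch of that decision; the function name is illustrative:

```python
import os

def ddp_enabled() -> bool:
    """True only when a multi-process launcher exported WORLD_SIZE > 1."""
    return int(os.environ.get("WORLD_SIZE", "1")) > 1
```

A plain `python` run (no `WORLD_SIZE` in the environment) therefore skips DDP entirely and trains on a single device.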
Seed management per rank from `finetune/utils/training_utils.py:41-59`:
```python
def set_seed(seed: int, rank: int = 0):
    actual_seed = seed + rank
    random.seed(actual_seed)
    np.random.seed(actual_seed)
    torch.manual_seed(actual_seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(actual_seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
```
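Offsetting the seed by rank gives each process a distinct yet reproducible random stream, so ranks do not all draw identical augmentations or shuffles. A stdlib-only illustration of the `seed + rank` scheme (the helper is hypothetical, not from the repo):

```python
import random

def rank_stream(seed: int, rank: int, n: int = 3):
    """First n draws from the per-rank stream seeded with seed + rank."""
    rng = random.Random(seed + rank)
    return [rng.random() for _ in range(n)]

# Different ranks see different numbers...
assert rank_stream(42, rank=0) != rank_stream(42, rank=1)
# ...but each rank's stream is reproducible across restarts.
assert rank_stream(42, rank=1) == rank_stream(42, rank=1)
```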
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: This script must be launched with torchrun.` | Script run with `python` instead of `torchrun` | Use `torchrun --nproc_per_node=N script.py` |
| `RuntimeError: torch.distributed is not available.` | PyTorch built without distributed support | Reinstall PyTorch with distributed support enabled |
| `NCCL error: unhandled system error` | GPU communication failure | Check CUDA driver version compatibility and GPU interconnect |
| `RuntimeError: NCCL communicator was aborted` | Process crashed during training | Check all GPU processes for OOM; reduce batch_size |
Compatibility Notes
- Qlib finetuning: DDP is mandatory. Scripts will crash without `torchrun`.
- CSV finetuning: DDP is optional. Falls back to single-GPU when `WORLD_SIZE=1`.
- NCCL backend: Required for GPU training. The `gloo` backend is available as fallback for CPU-only distributed training in CSV finetuning via `DIST_BACKEND=gloo`.
- Windows: NCCL is not supported on Windows. Use Linux for distributed training.