Environment: Kronos DDP Multi-GPU Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Training |
| Last Updated | 2026-02-09 13:47 GMT |
Overview
Multi-GPU distributed training environment using PyTorch DistributedDataParallel (DDP) with NCCL backend, launched via `torchrun`.
Description
This environment extends the base PyTorch CUDA environment with distributed data parallel (DDP) training capabilities. It requires one or more NVIDIA GPUs (one per process) with the NCCL communication backend, and must be launched with `torchrun` (or an equivalent launcher), which sets the required environment variables (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`). The Qlib finetuning scripts (`train_tokenizer.py`, `train_predictor.py`) mandate this environment, while the CSV finetuning script (`train_sequential.py`) supports it optionally, with graceful single-GPU fallback.
Usage
Use this environment when running Qlib finetuning workflows (tokenizer training and predictor training). The Qlib training scripts raise a `RuntimeError` if not launched with `torchrun`. The CSV finetuning pipeline enables DDP only when `world_size > 1`.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | NCCL requires Linux |
| Hardware | 1+ NVIDIA GPUs | Each process bound to one GPU via LOCAL_RANK |
| Network | High-bandwidth interconnect | NVLink or PCIe for multi-GPU; InfiniBand for multi-node |
| CUDA | CUDA-capable GPUs | Required for NCCL backend |
Dependencies
System Packages
- CUDA Toolkit (matching PyTorch build)
- NCCL library (bundled with PyTorch CUDA builds)
Python Packages
- `torch` >= 2.0.0 (with CUDA support)
- `torch.distributed` module (included in standard PyTorch)
Environment Variables
The following environment variables are set by the launcher (`torchrun`):
- `RANK`: Global rank of the current process (0 to world_size-1)
- `WORLD_SIZE`: Total number of processes across all nodes
- `LOCAL_RANK`: Local rank on the current node (determines GPU assignment)
- `DIST_BACKEND` (optional): Distributed backend, defaults to `nccl`. Can be set to `gloo` for CPU-only fallback in CSV finetuning.
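For illustration, the launcher-provided variables can be read defensively so a plain `python` launch fails with a clear message. A minimal stdlib-only sketch; the variable names follow the `torchrun` contract above, but the helper name is hypothetical and not part of the repo:

```python
import os

def read_ddp_env():
    """Read the env vars torchrun sets; fail clearly if they are missing.

    Hypothetical helper for illustration only.
    """
    try:
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        local_rank = int(os.environ["LOCAL_RANK"])
    except KeyError as missing:
        raise RuntimeError(
            f"Missing {missing}: launch with `torchrun`, not plain `python`."
        ) from None
    # Optional override; defaults to NCCL for GPU training.
    backend = os.environ.get("DIST_BACKEND", "nccl").lower()
    return rank, world_size, local_rank, backend
```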
Quick Install
```shell
# PyTorch with CUDA support (select the CUDA version matching your driver)
pip install "torch>=2.0.0" --index-url https://download.pytorch.org/whl/cu118

# Launch training with torchrun (example: 2 GPUs)
torchrun --nproc_per_node=2 finetune/train_tokenizer.py
torchrun --nproc_per_node=2 finetune/train_predictor.py
```
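For multi-node runs (the InfiniBand case noted above), `torchrun` also needs a rendezvous endpoint. A hedged sketch assuming two nodes with 2 GPUs each; the hostname and port are placeholders, not values from the repo:

```shell
# Node 0 (rendezvous host)
torchrun --nnodes=2 --nproc_per_node=2 --node_rank=0 \
  --master_addr=node0.example.com --master_port=29500 \
  finetune/train_predictor.py

# Node 1 (same command, different node_rank)
torchrun --nnodes=2 --nproc_per_node=2 --node_rank=1 \
  --master_addr=node0.example.com --master_port=29500 \
  finetune/train_predictor.py
```

With this launch, `WORLD_SIZE` becomes 4 while `LOCAL_RANK` still ranges over 0-1 on each node.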
Code Evidence
DDP setup with NCCL from `finetune/utils/training_utils.py:9-32`:
```python
def setup_ddp():
    if not dist.is_available():
        raise RuntimeError("torch.distributed is not available.")
    dist.init_process_group(backend="nccl")
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank
```
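After `setup_ddp()` returns, the model is typically moved to the local GPU and wrapped in `DistributedDataParallel`. A minimal sketch of that wrapping pattern; the model and the single-process `gloo`-on-CPU process group are illustrative stand-ins (not the repo's code) so the example can run without a GPU:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Stand-in for the torchrun launch: one CPU process, gloo backend.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 2)   # stand-in for the real model
# On GPU this would be: DDP(model.cuda(local_rank), device_ids=[local_rank])
ddp_model = DDP(model)

out = ddp_model(torch.randn(4, 8))  # forward pass goes through the DDP wrapper
print(tuple(out.shape))             # -> (4, 2)

dist.destroy_process_group()
```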
Mandatory torchrun check from `finetune/train_predictor.py:238-241`:
```python
if "WORLD_SIZE" not in os.environ:
    raise RuntimeError("This script must be launched with `torchrun`.")
```
Optional DDP in CSV finetuning from `finetune_csv/train_sequential.py:40-46`:
```python
if self.world_size > 1 and torch.cuda.is_available():
    backend = os.environ.get("DIST_BACKEND", "nccl").lower()
    if not dist.is_initialized():
        dist.init_process_group(backend=backend)
```
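The graceful single-GPU fallback reduces to a world-size check before any process-group setup. A stdlib-only sketch of that decision; the function name is illustrative:

```python
import os

def ddp_enabled() -> bool:
    """True only when a multi-process launcher exported WORLD_SIZE > 1."""
    return int(os.environ.get("WORLD_SIZE", "1")) > 1
```

A plain `python` run (no `WORLD_SIZE` in the environment) therefore skips DDP entirely and trains on a single device.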
Seed management per rank from `finetune/utils/training_utils.py:41-59`:
```python
def set_seed(seed: int, rank: int = 0):
    actual_seed = seed + rank
    random.seed(actual_seed)
    np.random.seed(actual_seed)
    torch.manual_seed(actual_seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(actual_seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
```
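Offsetting the seed by rank gives each process a distinct yet reproducible random stream, so ranks do not all draw identical augmentations or shuffles. A stdlib-only illustration of the `seed + rank` scheme (the helper is hypothetical, not from the repo):

```python
import random

def rank_stream(seed: int, rank: int, n: int = 3):
    """First n draws from the per-rank stream seeded with seed + rank."""
    rng = random.Random(seed + rank)
    return [rng.random() for _ in range(n)]

# Different ranks see different numbers...
assert rank_stream(42, rank=0) != rank_stream(42, rank=1)
# ...but each rank's stream is reproducible across restarts.
assert rank_stream(42, rank=1) == rank_stream(42, rank=1)
```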
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: This script must be launched with torchrun.` | Script run with `python` instead of `torchrun` | Use `torchrun --nproc_per_node=N script.py` |
| `RuntimeError: torch.distributed is not available.` | PyTorch built without distributed support | Reinstall PyTorch with distributed support enabled |
| `NCCL error: unhandled system error` | GPU communication failure | Check CUDA driver version compatibility and GPU interconnect |
| `RuntimeError: NCCL communicator was aborted` | Process crashed during training | Check all GPU processes for OOM; reduce batch_size |
Compatibility Notes
- Qlib finetuning: DDP is mandatory. Scripts will crash without `torchrun`.
- CSV finetuning: DDP is optional. Falls back to single-GPU when `WORLD_SIZE=1`.
- NCCL backend: Required for GPU training. The `gloo` backend is available as fallback for CPU-only distributed training in CSV finetuning via `DIST_BACKEND=gloo`.
- Windows: NCCL is not supported on Windows. Use Linux for distributed training.