
Environment: Eric Mitchell Direct Preference Optimization PyTorch CUDA

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Deep_Learning
Last Updated: 2026-02-08 02:00 GMT

Overview

Linux environment with CUDA-capable GPUs, PyTorch 2.0.1, and NCCL backend for single-GPU and multi-GPU DPO/SFT training.

Description

This environment provides the core GPU-accelerated compute context for all training, evaluation, and checkpoint operations in the DPO repository. It requires PyTorch 2.0.1 with CUDA support, including the `torch.distributed` module with NCCL backend for multi-GPU FSDP and TensorParallel training. The codebase enables TF32 matmul precision globally and relies on CUDA device management for model sharding, gradient synchronization, and mixed-precision training.

Usage

Use this environment for all model training, evaluation, loss computation, and checkpoint saving operations. It is the mandatory prerequisite for running any of the trainer classes (BasicTrainer, FSDPTrainer, TensorParallelTrainer) and all tensor operations including the DPO loss computation, log probability extraction, and concatenated forward passes.
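The DPO loss mentioned above can be sketched in scalar form. The following is an illustrative sketch, not the repository's batched tensor implementation: it assumes the per-sequence log probabilities of the chosen and rejected responses have already been extracted from the policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Scalar DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    # Log-ratio of policy to reference for each response
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-logits))

# With no preference shift relative to the reference, the loss is log(2);
# it falls below log(2) once the policy favors the chosen response more
# strongly than the reference does.
```
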

System Requirements

Category | Requirement | Notes
OS | Linux | NCCL backend requires Linux; FSDP tested on Ubuntu
Hardware | NVIDIA GPU with CUDA support | 4x 80GB A100s used in reference experiments; minimum 1 GPU required
Hardware (FSDP) | Multiple NVIDIA GPUs | FSDP shards the model across all GPUs reported by `torch.cuda.device_count()`
Disk | 50GB+ SSD | For model checkpoints (policy.pt, optimizer.pt, scheduler.pt per step)
File Descriptors | 64000+ | FSDP requires `ulimit -n 64000` (set in train.py:L108-110 via RLIMIT_NOFILE)

Dependencies

System Packages

  • CUDA toolkit (compatible with PyTorch 2.0.1)
  • NCCL (for distributed training)

Python Packages

  • `torch` == 2.0.1
  • `numpy` == 1.24.3
  • `tqdm` == 4.65.0
  • `tensor-parallel` == 1.2.4

Credentials

The following environment variables may be set at runtime:

  • `WANDB_CACHE_DIR`: Set automatically by the code to a local cache directory for W&B logging.
  • `XDG_CACHE_HOME`: Set automatically by the code to a local cache directory for HuggingFace model downloads.
  • `MASTER_ADDR`: Defaults to `localhost` for distributed training (utils.py:L149).
  • `MASTER_PORT`: Set automatically to an open port for FSDP (utils.py:L150).
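The automatic `MASTER_PORT` selection can be approximated with the standard-library trick of binding to port 0 and letting the kernel choose; this is an illustrative sketch, not the exact helper in utils.py.

```python
import socket

def find_open_port() -> int:
    """Ask the OS for a free ephemeral port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 -> kernel picks an unused port
        return s.getsockname()[1]

# Typical use before torch.distributed initialization:
# os.environ["MASTER_PORT"] = str(find_open_port())
```
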

Quick Install

# Install PyTorch with CUDA support (adjust CUDA version as needed)
pip install torch==2.0.1 numpy==1.24.3 tqdm==4.65.0 tensor-parallel==1.2.4

# For FSDP training, increase file descriptor limit
ulimit -n 64000
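The descriptor limit can also be raised from inside Python before spawning FSDP workers, mirroring what train.py does with `RLIMIT_NOFILE`. A minimal sketch using the standard `resource` module (Linux only):

```python
import resource

def raise_nofile_limit() -> int:
    """Raise the soft RLIMIT_NOFILE to the hard limit (no extra privileges needed)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft < hard:
        # An unprivileged process may raise its soft limit up to the hard limit
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```
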

Code Evidence

TF32 matmul precision enabled globally in `train.py:2` and `trainers.py:2`:

import torch
torch.backends.cuda.matmul.allow_tf32 = True

CUDA device management for FSDP in `utils.py:148-153`:

def init_distributed(rank: int, world_size: int, master_addr: str = 'localhost', port: int = 12355, backend: str = 'nccl'):
    print(rank, 'initializing distributed')
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(port)
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

Multi-GPU detection and FSDP spawning in `train.py:105-111`:

if 'FSDP' in config.trainer:
    world_size = torch.cuda.device_count()
    print('starting', world_size, 'processes for FSDP training')
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    print(f'setting RLIMIT_NOFILE soft limit to {hard} from {soft}')
    mp.spawn(worker_main, nprocs=world_size, args=(world_size, config, policy, reference_model), join=True)

GPU memory diagnostics in `utils.py:106-117`:

def print_gpu_memory(rank: int = None, message: str = ''):
    if torch.cuda.is_available():
        device_count = torch.cuda.device_count()
        for i in range(device_count):
            device = torch.device(f'cuda:{i}')
            allocated_bytes = torch.cuda.memory_allocated(device)
            # ... (excerpt abridged; the remaining lines format and print the per-device allocation)
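The raw byte counts returned by `torch.cuda.memory_allocated` are easier to read in gigabytes. A small hypothetical helper (not part of the repository) for formatting them:

```python
def format_bytes(num_bytes: int) -> str:
    """Render a raw byte count as binary gigabytes, e.g. for GPU memory logs."""
    return f"{num_bytes / 2**30:.2f} GiB"

# e.g. format_bytes(85899345920) -> "80.00 GiB"
```
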

Common Errors

Error Message | Cause | Solution
`RuntimeError: NCCL error` | NCCL not installed or GPU communication failure | Ensure NCCL is installed and all GPUs are visible via `CUDA_VISIBLE_DEVICES`
`RuntimeError: CUDA out of memory` | Insufficient GPU VRAM for model + optimizer states | Enable activation checkpointing (`activation_checkpointing=true`), use mixed precision (`model.fsdp_policy_mp=bfloat16`), or reduce batch size
`OSError: [Errno 24] Too many open files` | File descriptor limit too low for FSDP | Run `ulimit -n 64000` before training
`ValueError: Could not find block class X in model` | Incorrect `model.block_name` for FSDP wrapping | Verify the transformer block class name matches the model architecture (e.g., `GPT2Block`, `GPTNeoXLayer`, `LlamaDecoderLayer`)

Compatibility Notes

  • BasicTrainer: Uses `device_map='balanced'` to naively split model layers across available GPUs. No FSDP or distributed init required.
  • FSDPTrainer: Requires NCCL backend (Linux only). Uses `torch.distributed` with `mp.spawn`. Mixed precision supported via `MixedPrecision` policy.
  • TensorParallelTrainer: Uses `tensor_parallel` library. Experimental; sampling is extremely slow (see BlackSamorez/tensor_parallel#66).
  • TF32 Precision: Enabled globally. Provides faster matmul on Ampere+ GPUs with minimal precision loss.
