
Environment: Eric Mitchell Direct Preference Optimization PyTorch CUDA

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Deep_Learning
Last Updated: 2026-02-08 02:00 GMT

Overview

Linux environment with CUDA-capable GPUs, PyTorch 2.0.1, and NCCL backend for single-GPU and multi-GPU DPO/SFT training.

Description

This environment provides the core GPU-accelerated compute context for all training, evaluation, and checkpoint operations in the DPO repository. It requires PyTorch 2.0.1 with CUDA support, including the `torch.distributed` module with NCCL backend for multi-GPU FSDP and TensorParallel training. The codebase enables TF32 matmul precision globally and relies on CUDA device management for model sharding, gradient synchronization, and mixed-precision training.

Usage

Use this environment for all model training, evaluation, loss computation, and checkpoint saving operations. It is the mandatory prerequisite for running any of the trainer classes (BasicTrainer, FSDPTrainer, TensorParallelTrainer) and all tensor operations including the DPO loss computation, log probability extraction, and concatenated forward passes.
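The DPO loss mentioned above can be sketched in scalar form. The following is an illustrative sketch, not the repository's batched tensor implementation: it assumes the per-sequence log probabilities of the chosen and rejected responses have already been extracted from the policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Scalar DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    # Log-ratio of policy to reference for each response
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-logits))

# With no preference shift relative to the reference, the loss is log(2);
# it falls below log(2) once the policy favors the chosen response more
# strongly than the reference does.
```
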

System Requirements

Category | Requirement | Notes
OS | Linux | NCCL backend requires Linux; FSDP tested on Ubuntu
Hardware | NVIDIA GPU with CUDA support | 4x 80GB A100s used in reference experiments; minimum 1 GPU required
Hardware (FSDP) | Multiple NVIDIA GPUs | FSDP shards the model across all GPUs reported by `torch.cuda.device_count()`
Disk | 50GB+ SSD | For model checkpoints (policy.pt, optimizer.pt, scheduler.pt per step)
File Descriptors | 64000+ | FSDP requires `ulimit -n 64000` (set in train.py:L108-110 via RLIMIT_NOFILE)

Dependencies

System Packages

  • CUDA toolkit (compatible with PyTorch 2.0.1)
  • NCCL (for distributed training)

Python Packages

  • `torch` == 2.0.1
  • `numpy` == 1.24.3
  • `tqdm` == 4.65.0
  • `tensor-parallel` == 1.2.4

Credentials

The following environment variables may be set at runtime:

  • `WANDB_CACHE_DIR`: Set automatically by the code to a local cache directory for W&B logging.
  • `XDG_CACHE_HOME`: Set automatically by the code to a local cache directory for HuggingFace model downloads.
  • `MASTER_ADDR`: Defaults to `localhost` for distributed training (utils.py:L149).
  • `MASTER_PORT`: Set automatically to an open port for FSDP (utils.py:L150).
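The automatic `MASTER_PORT` selection can be approximated with the standard-library trick of binding to port 0 and letting the kernel choose; this is an illustrative sketch, not the exact helper in utils.py.

```python
import socket

def find_open_port() -> int:
    """Ask the OS for a free ephemeral port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 -> kernel picks an unused port
        return s.getsockname()[1]

# Typical use before torch.distributed initialization:
# os.environ["MASTER_PORT"] = str(find_open_port())
```
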

Quick Install

# Install PyTorch with CUDA support (adjust CUDA version as needed)
pip install torch==2.0.1 numpy==1.24.3 tqdm==4.65.0 tensor-parallel==1.2.4

# For FSDP training, increase file descriptor limit
ulimit -n 64000
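The descriptor limit can also be raised from inside Python before spawning FSDP workers, mirroring what train.py does with `RLIMIT_NOFILE`. A minimal sketch using the standard `resource` module (Linux only):

```python
import resource

def raise_nofile_limit() -> int:
    """Raise the soft RLIMIT_NOFILE to the hard limit (no extra privileges needed)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft < hard:
        # An unprivileged process may raise its soft limit up to the hard limit
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```
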

Code Evidence

TF32 matmul precision enabled globally in `train.py:2` and `trainers.py:2`:

import torch
torch.backends.cuda.matmul.allow_tf32 = True

CUDA device management for FSDP in `utils.py:148-153`:

def init_distributed(rank: int, world_size: int, master_addr: str = 'localhost', port: int = 12355, backend: str = 'nccl'):
    print(rank, 'initializing distributed')
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(port)
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

Multi-GPU detection and FSDP spawning in `train.py:105-111`:

if 'FSDP' in config.trainer:
    world_size = torch.cuda.device_count()
    print('starting', world_size, 'processes for FSDP training')
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    print(f'setting RLIMIT_NOFILE soft limit to {hard} from {soft}')
    mp.spawn(worker_main, nprocs=world_size, args=(world_size, config, policy, reference_model), join=True)

GPU memory diagnostics in `utils.py:106-117`:

def print_gpu_memory(rank: int = None, message: str = ''):
    if torch.cuda.is_available():
        device_count = torch.cuda.device_count()
        for i in range(device_count):
            device = torch.device(f'cuda:{i}')
            allocated_bytes = torch.cuda.memory_allocated(device)
            # ... (excerpt abridged; the remaining lines format and print the per-device allocation)
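The raw byte counts returned by `torch.cuda.memory_allocated` are easier to read in gigabytes. A small hypothetical helper (not part of the repository) for formatting them:

```python
def format_bytes(num_bytes: int) -> str:
    """Render a raw byte count as binary gigabytes, e.g. for GPU memory logs."""
    return f"{num_bytes / 2**30:.2f} GiB"

# e.g. format_bytes(85899345920) -> "80.00 GiB"
```
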

Common Errors

Error Message | Cause | Solution
`RuntimeError: NCCL error` | NCCL not installed or GPU communication failure | Ensure NCCL is installed and all GPUs are visible via `CUDA_VISIBLE_DEVICES`
`RuntimeError: CUDA out of memory` | Insufficient GPU VRAM for model + optimizer states | Enable activation checkpointing (`activation_checkpointing=true`), use mixed precision (`model.fsdp_policy_mp=bfloat16`), or reduce batch size
`OSError: [Errno 24] Too many open files` | File descriptor limit too low for FSDP | Run `ulimit -n 64000` before training
`ValueError: Could not find block class X in model` | Incorrect `model.block_name` for FSDP wrapping | Verify the transformer block class name matches the model architecture (e.g., `GPT2Block`, `GPTNeoXLayer`, `LlamaDecoderLayer`)

Compatibility Notes

  • BasicTrainer: Uses `device_map='balanced'` to naively split model layers across available GPUs. No FSDP or distributed init required.
  • FSDPTrainer: Requires NCCL backend (Linux only). Uses `torch.distributed` with `mp.spawn`. Mixed precision supported via `MixedPrecision` policy.
  • TensorParallelTrainer: Uses `tensor_parallel` library. Experimental; sampling is extremely slow (see BlackSamorez/tensor_parallel#66).
  • TF32 Precision: Enabled globally. Provides faster matmul on Ampere+ GPUs with minimal precision loss.
