Environment:Eric_mitchell_Direct_preference_optimization_PyTorch_CUDA
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning |
| Last Updated | 2026-02-08 02:00 GMT |
Overview
Linux environment with CUDA-capable GPUs, PyTorch 2.0.1, and NCCL backend for single-GPU and multi-GPU DPO/SFT training.
Description
This environment provides the core GPU-accelerated compute context for all training, evaluation, and checkpoint operations in the DPO repository. It requires PyTorch 2.0.1 with CUDA support, including the `torch.distributed` module with NCCL backend for multi-GPU FSDP and TensorParallel training. The codebase enables TF32 matmul precision globally and relies on CUDA device management for model sharding, gradient synchronization, and mixed-precision training.
Usage
Use this environment for all model training, evaluation, loss computation, and checkpoint saving operations. It is the mandatory prerequisite for running any of the trainer classes (BasicTrainer, FSDPTrainer, TensorParallelTrainer) and all tensor operations including the DPO loss computation, log probability extraction, and concatenated forward passes.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | NCCL backend requires Linux; FSDP tested on Ubuntu |
| Hardware | NVIDIA GPU with CUDA support | 4x 80GB A100s used in reference experiments; minimum 1 GPU required |
| Hardware (FSDP) | Multiple NVIDIA GPUs | FSDP shards model across all available `torch.cuda.device_count()` GPUs |
| Disk | 50GB+ SSD | For model checkpoints (policy.pt, optimizer.pt, scheduler.pt per step) |
| File Descriptors | 64000+ | FSDP requires `ulimit -n 64000` (set in train.py:L108-110 via RLIMIT_NOFILE) |
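The file-descriptor requirement above can also be satisfied from inside Python rather than via the shell. A minimal sketch using the standard `resource` module (Linux only), with the same effect as `ulimit -n`:

```python
import resource

# FSDP's spawned workers each hold many open file handles, so raise the
# soft RLIMIT_NOFILE to the hard limit before training (the same
# adjustment train.py performs before calling mp.spawn).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
print(f'RLIMIT_NOFILE soft limit raised from {soft} to {hard}')
```

Note that an unprivileged process can only raise the soft limit up to the existing hard limit; raising the hard limit itself requires root.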
Dependencies
System Packages
- CUDA toolkit (compatible with PyTorch 2.0.1)
- NCCL (for distributed training)
Python Packages
- `torch` == 2.0.1
- `numpy` == 1.24.3
- `tqdm` == 4.65.0
- `tensor-parallel` == 1.2.4
Credentials
The following environment variables may be set at runtime:
- `WANDB_CACHE_DIR`: Set automatically by code to local cache directory for W&B logging.
- `XDG_CACHE_HOME`: Set automatically by code to local cache directory for HuggingFace model downloads.
- `MASTER_ADDR`: Set to `localhost` by default for distributed training (utils.py:L149).
- `MASTER_PORT`: Set automatically to an open port for FSDP (utils.py:L150).
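Choosing an open `MASTER_PORT` can be sketched with the standard trick of binding a socket to port 0 so the OS assigns a free port; the repo's own helper in `utils.py` may differ in detail, so treat this as an illustration:

```python
import socket

def get_open_port() -> int:
    # Bind to port 0 so the OS picks a free port, then release it.
    # (Sketch of the usual approach for choosing MASTER_PORT.)
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]

print(f'MASTER_PORT candidate: {get_open_port()}')
```

There is a small race window between releasing the port and the process group binding it, which is usually acceptable for single-node training.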
Quick Install
```bash
# Install PyTorch with CUDA support (adjust CUDA version as needed)
pip install torch==2.0.1 numpy==1.24.3 tqdm==4.65.0 tensor-parallel==1.2.4

# For FSDP training, increase file descriptor limit
ulimit -n 64000
```
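After installing, it can help to confirm the stack is usable before launching a long training run. A small sketch (the `cuda_stack_summary` helper is hypothetical, not part of the repo):

```python
import importlib.util

def cuda_stack_summary() -> dict:
    # Hypothetical helper: report whether torch is importable and
    # whether CUDA (and how many GPUs) are visible.
    info = {'torch_installed': importlib.util.find_spec('torch') is not None,
            'cuda_available': False,
            'gpu_count': 0}
    if info['torch_installed']:
        import torch
        info['cuda_available'] = torch.cuda.is_available()
        if info['cuda_available']:
            info['gpu_count'] = torch.cuda.device_count()
    return info

print(cuda_stack_summary())
```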
Code Evidence
TF32 matmul precision enabled globally in `train.py:2` and `trainers.py:2`:
```python
import torch
torch.backends.cuda.matmul.allow_tf32 = True
```
CUDA device management for FSDP in `utils.py:148-153`:
```python
def init_distributed(rank: int, world_size: int, master_addr: str = 'localhost', port: int = 12355, backend: str = 'nccl'):
    print(rank, 'initializing distributed')
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(port)
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
```
Multi-GPU detection and FSDP spawning in `train.py:105-111`:
```python
if 'FSDP' in config.trainer:
    world_size = torch.cuda.device_count()
    print('starting', world_size, 'processes for FSDP training')
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    print(f'setting RLIMIT_NOFILE soft limit to {hard} from {soft}')
    mp.spawn(worker_main, nprocs=world_size, args=(world_size, config, policy, reference_model), join=True)
```
GPU memory diagnostics in `utils.py:106-117`:
```python
def print_gpu_memory(rank: int = None, message: str = ''):
    if torch.cuda.is_available():
        device_count = torch.cuda.device_count()
        for i in range(device_count):
            device = torch.device(f'cuda:{i}')
            allocated_bytes = torch.cuda.memory_allocated(device)
            # ... (remainder of the function prints per-device usage)
```
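`torch.cuda.memory_allocated` returns a raw byte count, which is usually reported in mebibytes. A tiny hypothetical formatter (not part of the repo) for readable diagnostics:

```python
def fmt_mb(n_bytes: int) -> str:
    # Hypothetical helper: format a byte count as mebibytes,
    # the unit GPU memory diagnostics are typically reported in.
    return f'{n_bytes / 1024**2:.2f} MB'

print(fmt_mb(3 * 1024**2))  # → 3.00 MB
```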
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: NCCL error` | NCCL not installed or GPU communication failure | Ensure NCCL is installed and all GPUs are visible via `CUDA_VISIBLE_DEVICES` |
| `RuntimeError: CUDA out of memory` | Insufficient GPU VRAM for model + optimizer states | Enable activation checkpointing (`activation_checkpointing=true`), use mixed precision (`model.fsdp_policy_mp=bfloat16`), or reduce batch size |
| `OSError: [Errno 24] Too many open files` | File descriptor limit too low for FSDP | Run `ulimit -n 64000` before training |
| `ValueError: Could not find block class X in model` | Incorrect `model.block_name` for FSDP wrapping | Verify the transformer block class name matches the model architecture (e.g., `GPT2Block`, `GPTNeoXLayer`, `LlamaDecoderLayer`) |
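The `ValueError` in the last row comes from resolving `model.block_name` to an actual class for FSDP's auto-wrap policy. A hypothetical sketch of that lookup (the repo's own helper may resolve the class differently, e.g. by inspecting the model's module):

```python
import importlib

def get_block_class(module_path: str, block_name: str):
    # Hypothetical helper: resolve a transformer block class by name and
    # fail loudly when the name does not match the architecture.
    module = importlib.import_module(module_path)
    cls = getattr(module, block_name, None)
    if cls is None:
        raise ValueError(f'Could not find block class {block_name} in {module_path}')
    return cls

# e.g. get_block_class('transformers.models.gpt2.modeling_gpt2', 'GPT2Block')
```

If the lookup fails, check the model's source for the exact class name of its transformer layer (such as `GPT2Block`, `GPTNeoXLayer`, or `LlamaDecoderLayer`).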
Compatibility Notes
- BasicTrainer: Uses `device_map='balanced'` to naively split model layers across available GPUs. No FSDP or distributed init required.
- FSDPTrainer: Requires NCCL backend (Linux only). Uses `torch.distributed` with `mp.spawn`. Mixed precision supported via `MixedPrecision` policy.
- TensorParallelTrainer: Uses `tensor_parallel` library. Experimental; sampling is extremely slow (see BlackSamorez/tensor_parallel#66).
- TF32 Precision: Enabled globally. Provides faster matmul on Ampere+ GPUs with minimal precision loss.
Related Pages
- Implementation:Eric_mitchell_Direct_preference_optimization_Preference_Loss
- Implementation:Eric_mitchell_Direct_preference_optimization_Get_Batch_Logps
- Implementation:Eric_mitchell_Direct_preference_optimization_Disable_Dropout
- Implementation:Eric_mitchell_Direct_preference_optimization_BasicTrainer_Train
- Implementation:Eric_mitchell_Direct_preference_optimization_BasicTrainer_Save
- Implementation:Eric_mitchell_Direct_preference_optimization_Concatenated_Forward
- Implementation:Eric_mitchell_Direct_preference_optimization_Concatenated_Inputs_Fn
- Implementation:Eric_mitchell_Direct_preference_optimization_Torch_Load_State_Dict
- Implementation:Eric_mitchell_Direct_preference_optimization_Get_Batch_Metrics