Environment:Princeton nlp SimPO CUDA Training

Knowledge Sources	SimPO PyTorch alignment-handbook
Domains	Infrastructure, Deep_Learning, NLP
Last Updated	2026-02-08 05:00 GMT

Overview

A Linux environment with CUDA 12.1, Python 3.10, PyTorch 2.2.2, and the Hugging Face stack, used for SimPO preference-optimization training and reward model inference.

Description

This environment provides the full GPU-accelerated context for running SimPO training (scripts/run_simpo.py) and reward model annotation (on_policy_data_gen/reward_model_annotate.py). It is built on Python 3.10 with CUDA 12.1 libraries (via nvidia-cuda-runtime-cu12), PyTorch 2.2.2, Transformers 4.44.2, TRL 0.9.6, and Accelerate 0.29.2. The environment supports DeepSpeed ZeRO-3 and FSDP distributed training strategies with bf16 mixed precision. Flash Attention 2 is required for Llama and Mistral models; Gemma-2 models use eager attention instead. The training configs are designed for 4x H100 GPUs.

Usage

Use this environment for any SimPO training or reward model annotation workflow that requires GPU acceleration. It is a prerequisite for running the SimPOTrainer and Reward_Model_Annotate_Script implementations.

System Requirements

Category	Requirement	Notes
OS	Linux (Ubuntu recommended)	Tested on Linux with CUDA toolkit
Hardware	NVIDIA GPU (4x H100 recommended)	Training configs set for 4 GPUs; adjust num_processes and per_device_train_batch_size for other setups
VRAM	Minimum 40GB per GPU (H100 80GB preferred)	7B-9B parameter models with bf16 and gradient checkpointing
Disk	100GB+ SSD	Model weights, datasets, and checkpoints
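
When moving to a different GPU count, one common approach is to hold the global batch size constant and recompute `gradient_accumulation_steps`. The sketch below assumes the usual relation global batch = num_processes x per_device_train_batch_size x gradient_accumulation_steps. The helper name `rescale_grad_accum` and the reference numbers are hypothetical and are not taken from the repository's configs.

```python
def rescale_grad_accum(ref_gpus, ref_per_device_bs, ref_grad_accum,
                       new_gpus, new_per_device_bs):
    """Return the gradient_accumulation_steps that keeps the global batch size fixed.

    Hypothetical helper: it only does the arithmetic and reads no config file.
    """
    global_batch = ref_gpus * ref_per_device_bs * ref_grad_accum
    samples_per_step = new_gpus * new_per_device_bs
    if global_batch % samples_per_step != 0:
        raise ValueError(
            f"global batch {global_batch} is not divisible by "
            f"{new_gpus} GPUs x {new_per_device_bs} per device"
        )
    return global_batch // samples_per_step


# Illustrative reference setup (not from the repo): 4 GPUs x 2 per device x 16 steps = 128.
print(rescale_grad_accum(4, 2, 16, new_gpus=2, new_per_device_bs=2))  # 32
print(rescale_grad_accum(4, 2, 16, new_gpus=8, new_per_device_bs=2))  # 8
```

Raising an error on a non-divisible combination is a deliberate choice here: silently rounding would change the effective batch size and make runs harder to compare.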

Dependencies

System Packages

`cuda-toolkit` = 12.1 (nvidia-cuda-runtime-cu12==12.1.105)
`cudnn` = 8.9.2 (nvidia-cudnn-cu12==8.9.2.26)
`nccl` = 2.19.3 (nvidia-nccl-cu12==2.19.3, for multi-GPU communication)

Python Packages

Core Framework:

`python` = 3.10
`torch` = 2.2.2
`torchvision` = 0.17.2
`torchaudio` = 2.2.2
`triton` = 2.2.0

HuggingFace Stack:

`transformers` = 4.44.2
`trl` = 0.9.6
`accelerate` = 0.29.2
`datasets` = 2.18.0
`tokenizers` = 0.15.2
`peft` = 0.7.1
`bitsandbytes` = 0.41.2.post2
`safetensors` = 0.4.2

Distributed Training:

`deepspeed` = 0.12.2

Attention:

`flash-attn` = 2.5.7 (required for Llama/Mistral; install with `pip install flash-attn --no-build-isolation`)

Logging:

`wandb` = 0.13.11
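
A quick way to confirm that an existing environment matches the pins above is to compare them against the installed distributions. This is a minimal sketch using the standard-library `importlib.metadata`. The `PINNED` subset and the helper name `find_mismatches` are illustrative choices, and stripping a local build tag such as `+cu121` before comparing is an assumption about how the wheels are labeled.

```python
from importlib.metadata import PackageNotFoundError, version

# Subset of the pins listed above; extend as needed.
PINNED = {
    "torch": "2.2.2",
    "transformers": "4.44.2",
    "trl": "0.9.6",
    "accelerate": "0.29.2",
    "deepspeed": "0.12.2",
    "flash-attn": "2.5.7",
}


def find_mismatches(pinned, lookup=version):
    """Return {name: (wanted, found)} for packages that are missing or differ."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            found = lookup(name)
        except PackageNotFoundError:
            found = None
        # Ignore local build tags such as "+cu121" when comparing.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

The `lookup` parameter exists only so the check can be exercised without the real packages installed.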

Conda Environment

The repository provides an environment.yml for exact reproducibility:

conda env create -f environment.yml
conda activate simpo

Credentials

The following environment variables may be needed:

`HF_TOKEN`: HuggingFace API token for downloading gated models (e.g., Llama-3, Gemma-2)
`WANDB_API_KEY`: Weights & Biases API key for experiment tracking (required if `report_to: wandb` in config)
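
Checking for these variables before launching a long job can save a failed run. The following is a minimal sketch; the function name `missing_credentials` and its two boolean flags are hypothetical and not part of the repository.

```python
import os


def missing_credentials(env, uses_gated_model, reports_to_wandb):
    """Return the names of required variables that are unset or empty in `env`."""
    required = []
    if uses_gated_model:
        required.append("HF_TOKEN")
    if reports_to_wandb:
        required.append("WANDB_API_KEY")
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials(os.environ, uses_gated_model=True, reports_to_wandb=True)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
```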

Quick Install

# Create conda environment
conda create -n handbook python=3.10 && conda activate handbook

# Install PyTorch (hardware-dependent, see https://pytorch.org/get-started/locally/)
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2

# Install alignment-handbook (includes transformers, trl, accelerate, etc.)
git clone https://github.com/huggingface/alignment-handbook.git
cd alignment-handbook && python -m pip install . && cd ..

# Install Flash Attention 2
pip install flash-attn --no-build-isolation

# Install remaining dependencies
pip install deepspeed==0.12.2 wandb bitsandbytes peft

Code Evidence

GPU device detection from `alignment/model_utils.py:33-35`:

def get_current_device() -> int:
    """Get the current device. For GPU we return the local process index to enable multiple GPU training."""
    return Accelerator().local_process_index if torch.cuda.is_available() else "cpu"

Quantized model device mapping from `alignment/model_utils.py:38-40`:

def get_kbit_device_map() -> Dict[str, int] | None:
    """Useful for running inference with quantized models by setting `device_map=get_peft_device_map()`"""
    return {"": get_current_device()} if torch.cuda.is_available() else None
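
To make the return values of the two helpers above concrete, the stand-in below mirrors their logic with CUDA availability and the local process index passed in explicitly. The function and parameter names are hypothetical and are not part of alignment-handbook; the real helpers query `torch.cuda.is_available()` and `Accelerator()` instead.

```python
from typing import Dict, Optional, Union


def current_device(cuda_available: bool, local_process_index: int) -> Union[int, str]:
    """Mirror of the device choice: the process-local GPU index, or "cpu"."""
    return local_process_index if cuda_available else "cpu"


def kbit_device_map(cuda_available: bool,
                    local_process_index: int) -> Optional[Dict[str, Union[int, str]]]:
    """Mirror of the device map: the "" key places the whole model on one device."""
    if not cuda_available:
        return None
    return {"": current_device(cuda_available, local_process_index)}


print(kbit_device_map(True, 2))   # {'': 2}
print(kbit_device_map(False, 0))  # None
```

With one process per GPU, each rank therefore loads its own full copy of the quantized model onto its own device.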

Reward model loaded on CUDA with bf16 from `on_policy_data_gen/reward_model_annotate.py:26-28`:

model = AutoModelForSequenceClassification.from_pretrained(args.reward_model,
                                                           device_map="cuda",
                                                           trust_remote_code=True, torch_dtype=torch.bfloat16)

CUDA autocast for bf16 PEFT models from `scripts/simpo_trainer.py:744`:

compute_loss_context_manager = torch.cuda.amp.autocast if self._peft_has_been_casted_to_bf16 else nullcontext

Flash Attention 2 configuration from `alignment/configs.py:150-157`:

attn_implementation: Optional[str] = field(
    default=None,
    metadata={
        "help": (
            "Which attention implementation to use; you can use --attn_implementation=flash_attention_2, "
            "in which case you must install this manually by running `pip install flash-attn --no-build-isolation`"
        )
    },
)

Common Errors

Error Message	Cause	Solution
`RuntimeError: FlashAttention only supports Ampere GPUs or newer`	GPU compute capability < 8.0	Use `attn_implementation: eager` instead of `flash_attention_2` in training config
`CUDA out of memory`	Insufficient VRAM for model + batch	Reduce `per_device_train_batch_size`, increase `gradient_accumulation_steps`, or enable `gradient_checkpointing: true`
`ImportError: ... flash-attn`	flash-attn not installed	`pip install flash-attn --no-build-isolation`
`ValueError: PEFT is not installed and you passed a peft_config`	peft package missing	`pip install peft`
`AttributeError: Your Trainer does not have an accelerator object`	Outdated transformers version	Upgrade transformers to >= 4.37
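
The first and third rows of the table can be avoided by selecting the attention implementation up front. A minimal sketch, assuming the only two choices are `flash_attention_2` and `eager`; the helper name `choose_attn_implementation` is hypothetical.

```python
from typing import Tuple

# Ampere GPUs start at compute capability 8.0.
AMPERE = (8, 0)


def choose_attn_implementation(capability: Tuple[int, int],
                               flash_attn_installed: bool) -> str:
    """Pick flash_attention_2 only when the GPU and the install both allow it."""
    if capability >= AMPERE and flash_attn_installed:
        return "flash_attention_2"
    return "eager"


print(choose_attn_implementation((9, 0), True))   # flash_attention_2
print(choose_attn_implementation((7, 5), True))   # eager
print(choose_attn_implementation((8, 0), False))  # eager
```

In a real script the inputs could come from `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple, and from `importlib.util.find_spec("flash_attn") is not None`.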

Compatibility Notes

Flash Attention 2: Required for Llama-3 and Mistral models (`attn_implementation: flash_attention_2`). For Gemma-2 models, use `attn_implementation: eager` instead.
DeepSpeed ZeRO-3 vs FSDP: The repo provides configs for both. DeepSpeed ZeRO-3 (`accelerate_configs/deepspeed_zero3.yaml`) is configured for 4 GPUs. FSDP (`accelerate_configs/fsdp.yaml`) is configured for 8 GPUs. Choose based on your hardware setup.
bf16 Mixed Precision: All training configs use `bf16: true`. Requires GPU with bf16 support (Ampere or newer).
Gradient Checkpointing: All training configs enable `gradient_checkpointing: true` with `use_reentrant: False` to reduce VRAM usage.
4-bit Quantization: Supported via bitsandbytes with `load_in_4bit: true` and `bnb_4bit_quant_type: nf4`.
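
As a rough guide to how precision affects VRAM, weight memory scales with bytes per parameter: 2 bytes for bf16 and about 0.5 bytes for 4-bit nf4. The sketch below covers weights only. It deliberately ignores gradients, optimizer state, activations, and quantization metadata, so treat the numbers as lower bounds; the helper name `weight_memory_gib` is hypothetical.

```python
# Bytes needed to store one parameter in each format (weights only).
BYTES_PER_PARAM = {"bf16": 2.0, "nf4": 0.5}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Lower-bound estimate of weight memory in GiB for a given storage format."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


print(round(weight_memory_gib(7e9, "bf16"), 1))  # 13.0
print(round(weight_memory_gib(7e9, "nf4"), 1))   # 3.3
```

Full fine-tuning needs several times the weight footprint once gradients and optimizer state are included, which is why the configs pair bf16 with gradient checkpointing and sharded training.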
