Environment:Princeton nlp SimPO CUDA Training
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning, NLP |
| Last Updated | 2026-02-08 05:00 GMT |
Overview
Linux environment with CUDA 12.1, Python 3.10, PyTorch 2.2.2, and HuggingFace stack for SimPO preference optimization training and reward model inference.
Description
This environment provides the GPU-accelerated setup needed to run SimPO training (scripts/run_simpo.py) and reward model annotation (on_policy_data_gen/reward_model_annotate.py). It is built on Python 3.10 with CUDA 12.1 libraries (via nvidia-cuda-runtime-cu12), PyTorch 2.2.2, Transformers 4.44.2, TRL 0.9.6, and Accelerate 0.29.2. It supports DeepSpeed ZeRO-3 and FSDP distributed training with bf16 mixed precision. Flash Attention 2 is required for Llama and Mistral models. The training configs target 4x H100 GPUs.
Usage
Use this environment for any SimPO Training or Reward Model Annotation workflow that requires GPU acceleration. It is a prerequisite for running both the SimPOTrainer implementation and the Reward_Model_Annotate_Script implementation.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | Tested on Linux with CUDA toolkit |
| Hardware | NVIDIA GPU (4x H100 recommended) | Training configs set for 4 GPUs; adjust num_processes and per_device_train_batch_size for other setups |
| VRAM | Minimum 40GB per GPU (H100 80GB preferred) | Sized for 7B-9B parameter models trained in bf16 with gradient checkpointing |
| Disk | 100GB+ SSD | Model weights, datasets, and checkpoints |
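As a rough sanity check on the VRAM row, the memory taken by the model weights alone can be estimated from the parameter count and the bits stored per parameter. The sketch below is illustrative only: `weight_memory_gib` is a hypothetical helper, not part of the repository, and it ignores gradients, optimizer states, and activations, which add substantially more during training (DeepSpeed ZeRO-3 and FSDP shard much of that across GPUs).

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GiB.

    Hypothetical helper for illustration; not part of the SimPO repository.
    """
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


# bf16 stores each parameter in 16 bits (2 bytes).
for label, params in (("7B", 7e9), ("9B", 9e9)):
    print(f"{label}: {weight_memory_gib(params, 16):.1f} GiB of weights in bf16")
```

Training needs considerably more than this figure, which is why the table asks for at least 40GB per GPU even with gradient checkpointing.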
Dependencies
System Packages
- `cuda-toolkit` = 12.1 (nvidia-cuda-runtime-cu12==12.1.105)
- `cudnn` = 8.9.2 (nvidia-cudnn-cu12==8.9.2.26)
- `nccl` = 2.19.3 (nvidia-nccl-cu12==2.19.3, for multi-GPU communication)
Python Packages
Core Framework:
- `python` = 3.10
- `torch` = 2.2.2
- `torchvision` = 0.17.2
- `torchaudio` = 2.2.2
- `triton` = 2.2.0
HuggingFace Stack:
- `transformers` = 4.44.2
- `trl` = 0.9.6
- `accelerate` = 0.29.2
- `datasets` = 2.18.0
- `tokenizers` = 0.15.2
- `peft` = 0.7.1
- `bitsandbytes` = 0.41.2.post2
- `safetensors` = 0.4.2
Distributed Training:
- `deepspeed` = 0.12.2
Attention:
- `flash-attn` = 2.5.7 (required for Llama/Mistral; install with `pip install flash-attn --no-build-isolation`)
Logging:
- `wandb` = 0.13.11
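To confirm that an existing environment matches these pins, the installed versions can be compared against the list. The following is a minimal sketch using only the standard library; `PINS`, `installed_version`, and `find_mismatches` are hypothetical names introduced here, and only a subset of the pinned packages is included.

```python
from importlib import metadata
from typing import Callable, Dict, Optional, Tuple

# Subset of the pins listed above; extend as needed.
PINS = {
    "torch": "2.2.2",
    "transformers": "4.44.2",
    "trl": "0.9.6",
    "accelerate": "0.29.2",
    "deepspeed": "0.12.2",
    "flash-attn": "2.5.7",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(
    pins: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each mismatched package to (wanted, found)."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        # A local build tag such as "2.2.2+cu121" still counts as a match.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINS).items():
        print(f"{name}: expected {wanted}, found {found}")
```

Passing `lookup` as a parameter keeps the comparison testable without installing the GPU stack.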
Conda Environment
The repository provides an environment.yml for exact reproducibility:
conda env create -f environment.yml
conda activate simpo
Credentials
The following environment variables may be needed:
- `HF_TOKEN`: HuggingFace API token for downloading gated models (e.g., Llama-3, Gemma-2)
- `WANDB_API_KEY`: Weights & Biases API key for experiment tracking (required if `report_to: wandb` in config)
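A launch script could verify these variables up front instead of failing partway through a download or a logging call. The sketch below assumes the caller knows whether the run uses a gated model and whether it reports to wandb; `missing_credentials` is a hypothetical helper, not something the repository provides.

```python
import os
from typing import List, Mapping


def missing_credentials(
    env: Mapping[str, str],
    uses_gated_model: bool,
    report_to_wandb: bool,
) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    required = []
    if uses_gated_model:
        required.append("HF_TOKEN")
    if report_to_wandb:
        required.append("WANDB_API_KEY")
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials(os.environ, uses_gated_model=True, report_to_wandb=True)
    if missing:
        print("Set before launching:", ", ".join(missing))
```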
Quick Install
# Create conda environment
conda create -n handbook python=3.10 && conda activate handbook
# Install PyTorch (hardware-dependent, see https://pytorch.org/get-started/locally/)
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2
# Install alignment-handbook (includes transformers, trl, accelerate, etc.)
git clone https://github.com/huggingface/alignment-handbook.git
cd alignment-handbook && python -m pip install . && cd ..
# Install Flash Attention 2
pip install flash-attn --no-build-isolation
# Install remaining dependencies
pip install deepspeed==0.12.2 wandb bitsandbytes peft
Code Evidence
GPU device detection from `alignment/model_utils.py:33-35`:
def get_current_device() -> int:
    """Get the current device. For GPU we return the local process index to enable multiple GPU training."""
    return Accelerator().local_process_index if torch.cuda.is_available() else "cpu"
Quantized model device mapping from `alignment/model_utils.py:38-40`:
def get_kbit_device_map() -> Dict[str, int] | None:
    """Useful for running inference with quantized models by setting `device_map=get_peft_device_map()`"""
    return {"": get_current_device()} if torch.cuda.is_available() else None
Reward model loaded on CUDA with bf16 from `on_policy_data_gen/reward_model_annotate.py:26-28`:
model = AutoModelForSequenceClassification.from_pretrained(args.reward_model,
    device_map="cuda",
    trust_remote_code=True, torch_dtype=torch.bfloat16)
CUDA autocast for bf16 PEFT models from `scripts/simpo_trainer.py:744`:
compute_loss_context_manager = torch.cuda.amp.autocast if self._peft_has_been_casted_to_bf16 else nullcontext
Flash Attention 2 configuration from `alignment/configs.py:150-157`:
attn_implementation: Optional[str] = field(
    default=None,
    metadata={
        "help": (
            "Which attention implementation to use; you can use --attn_implementation=flash_attention_2, "
            "in which case you must install this manually by running `pip install flash-attn --no-build-isolation`"
        )
    },
)
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: FlashAttention only supports Ampere GPUs or newer` | GPU compute capability < 8.0 | Use `attn_implementation: eager` instead of `flash_attention_2` in training config |
| `CUDA out of memory` | Insufficient VRAM for model + batch | Reduce `per_device_train_batch_size`, increase `gradient_accumulation_steps`, or enable `gradient_checkpointing: true` |
| `ImportError: ... flash-attn` | flash-attn not installed | `pip install flash-attn --no-build-isolation` |
| `ValueError: PEFT is not installed and you passed a peft_config` | peft package missing | `pip install peft` |
| `AttributeError: Your Trainer does not have an accelerator object` | Outdated transformers version | Upgrade transformers to >= 4.37 |
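The out-of-memory remedy in the table trades per-device batch size for gradient accumulation steps. In data-parallel training the global batch size is `num_processes * per_device_train_batch_size * gradient_accumulation_steps`, so the accumulation steps can be recomputed to keep it constant when the other two change. The sketch below shows that arithmetic; `grad_accum_steps` is a hypothetical helper, and the numbers are example values rather than the repository's actual settings.

```python
def grad_accum_steps(global_batch: int, num_processes: int, per_device_batch: int) -> int:
    """Return the gradient_accumulation_steps that keeps the global batch size fixed."""
    per_step = num_processes * per_device_batch
    if per_step <= 0:
        raise ValueError("num_processes and per_device_batch must be positive")
    if global_batch % per_step != 0:
        raise ValueError(
            f"global batch {global_batch} is not divisible by {per_step} samples per step"
        )
    return global_batch // per_step


# Example values only: a 4-GPU run with 2 samples per device and 16 accumulation steps.
reference_global_batch = 4 * 2 * 16

# Halving the per-device batch to save VRAM doubles the accumulation steps.
print(grad_accum_steps(reference_global_batch, num_processes=4, per_device_batch=1))

# Moving to 8 GPUs at the same per-device batch halves them.
print(grad_accum_steps(reference_global_batch, num_processes=8, per_device_batch=2))
```

Keeping the global batch size fixed means the learning rate and other hyperparameters tuned for the original setup are more likely to carry over.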
Compatibility Notes
- Flash Attention 2: Required for Llama-3 and Mistral models (`attn_implementation: flash_attention_2`). For Gemma-2 models, use `attn_implementation: eager` instead.
- DeepSpeed ZeRO-3 vs FSDP: The repo provides configs for both. DeepSpeed ZeRO-3 (`accelerate_configs/deepspeed_zero3.yaml`) is configured for 4 GPUs. FSDP (`accelerate_configs/fsdp.yaml`) is configured for 8 GPUs. Choose based on your hardware setup.
- bf16 Mixed Precision: All training configs use `bf16: true`. Requires GPU with bf16 support (Ampere or newer).
- Gradient Checkpointing: All training configs enable `gradient_checkpointing: true` with `use_reentrant: False` to reduce VRAM usage.
- 4-bit Quantization: Supported via bitsandbytes with `load_in_4bit: true` and `bnb_4bit_quant_type: nf4`.
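The Flash Attention note above and the compute capability row in Common Errors can be combined into a single selection rule. The sketch below encodes only what this page states (eager for GPUs below compute capability 8.0 and for Gemma-2, `flash_attention_2` otherwise). `choose_attn_implementation` is a hypothetical helper, and the substring match on the model name is a simplifying assumption.

```python
def choose_attn_implementation(compute_capability_major: int, model_name: str) -> str:
    """Pick a value for `attn_implementation` following the notes on this page.

    Hypothetical helper; the model-name check is a simple substring match.
    """
    # FlashAttention needs Ampere or newer (compute capability >= 8.0).
    if compute_capability_major < 8:
        return "eager"
    # This page recommends eager attention for Gemma-2 models.
    if "gemma-2" in model_name.lower():
        return "eager"
    return "flash_attention_2"


# On a real machine the major version could be read with
# torch.cuda.get_device_capability()[0]; fixed values are used here
# so the example runs without a GPU.
print(choose_attn_implementation(9, "meta-llama/Meta-Llama-3-8B-Instruct"))
print(choose_attn_implementation(9, "google/gemma-2-9b-it"))
print(choose_attn_implementation(7, "meta-llama/Meta-Llama-3-8B-Instruct"))
```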