
Environment:ContextualAI HALOs CUDA 12.1 Training Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Deep_Learning, LLM_Alignment
Last Updated: 2026-02-08 03:00 GMT

Overview

Linux environment with CUDA 12.1, Python 3.10.14, PyTorch 2.4.0, and multi-GPU FSDP support for LLM alignment training (DPO, KTO, GRPO, PPO).

Description

This environment provides the full GPU-accelerated stack required to train and evaluate language models using the HALOs framework. It is built on a Conda base with Python 3.10.14, PyTorch 2.4.0 with CUDA 12.1, and FlashAttention 2.6.3. The stack includes HuggingFace Transformers, Accelerate (with FSDP support), PEFT for LoRA training, vLLM for batched sampling, and evaluation tools (AlpacaEval, lm-evaluation-harness). Training uses Fully Sharded Data Parallel (FSDP) by default and supports configurations from 2 to 16 GPUs across 1 or 2 nodes.
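
As a sketch of how the supported topologies map to Accelerate config files: only `accelerate_config/fsdp_4gpu.yaml` appears verbatim later on this page, so the other filenames below follow the same naming pattern and are assumptions.

```python
# Minimal sketch: map a (nodes, gpus_per_node) topology to the Accelerate
# FSDP config file the launch scripts would select. Only fsdp_4gpu.yaml is
# confirmed by this page; the other filenames are assumed from the pattern.
def fsdp_config_name(nodes: int, gpus_per_node: int) -> str:
    total = nodes * gpus_per_node
    if not (2 <= total <= 16):
        raise ValueError("supported configurations span 2 to 16 GPUs")
    if nodes == 1:
        return f"accelerate_config/fsdp_{gpus_per_node}gpu.yaml"
    return f"accelerate_config/fsdp_{nodes}x{gpus_per_node}gpu.yaml"

print(fsdp_config_name(1, 4))   # accelerate_config/fsdp_4gpu.yaml
print(fsdp_config_name(2, 8))   # accelerate_config/fsdp_2x8gpu.yaml
```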

Usage

Use this environment for all HALOs workflows: supervised fine-tuning (SFT), preference alignment (DPO, KTO, GRPO, PPO, CDPO, IPO, SimPO, SLiC), reward model training (Bradley-Terry), online iterative alignment (sample-label-train loops), and model evaluation (AlpacaEval, LM Eval Harness). It is the mandatory prerequisite for every Implementation page in this repository.

System Requirements

  • OS: Linux (tested on SLURM clusters) — launch scripts use `srun`, `scontrol`, and `module load`
  • Hardware: NVIDIA GPU(s) with CUDA 12.1 support — minimum 2 GPUs; typical configurations use 4 or 8
  • RAM: ~800 GB for Llama-13B-scale models — per the launch.py docstring ("allocate enough RAM")
  • Disk: enough space for model checkpoints plus cached datasets — models are saved to `cache_dir/exp_name/FINAL`
  • Network: NCCL for multi-GPU/multi-node communication — ports 29500-29510 are used for distributed training
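
The launch scripts check port availability with `netstat`; a minimal Python equivalent for picking a free `MASTER_PORT` in the reserved range might look like this (a sketch, not code from the repo):

```python
import socket

# Sketch (not from the repo): find a free MASTER_PORT in the 29500-29510
# range that the SLURM launch scripts reserve for distributed training.
def find_free_port(start: int = 29500, end: int = 29510) -> int:
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("", port))  # binding succeeds only if the port is free
                return port
            except OSError:
                continue  # port in use; try the next one
    raise RuntimeError(f"no free port in {start}-{end}")

print(find_free_port())
```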

Dependencies

System Packages

  • `cuda-toolkit` = 12.1 (via `pytorch-cuda=12.1`)
  • `ninja` (build tool for FlashAttention compilation)
  • `git` (for cloning lm-evaluation-harness)
  • `netstat` (used in launch scripts for port checking)

Python Runtime

  • Python 3.10.14 (via Conda)

Python Packages

  • `torch` = 2.4.0
  • `flash-attn` = 2.6.3
  • `transformers` = 4.51.3
  • `peft` = 0.12.0
  • `datasets` = 2.20.0
  • `accelerate` = 0.33.0
  • `vllm` = 0.6.3.post1
  • `hydra-core` = 1.3.2
  • `omegaconf` (Hydra dependency)
  • `wandb` (experiment tracking)
  • `openai` (API-based labeling)
  • `alpaca-eval` (evaluation)
  • `immutabledict` (alpaca-eval dependency)
  • `langdetect` (alpaca-eval dependency)
  • `numpy`
  • `tqdm`
  • `lm-eval` (installed from source: EleutherAI/lm-evaluation-harness)
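
Since the README warns that these pins matter, a small sanity check against the installed environment can catch version drift early. This helper is a sketch, not part of the HALOs repo; the keys are the pip distribution names listed above.

```python
from importlib import metadata

# Pinned versions from this page; the README warns that changing them
# may break the code.
PINS = {
    "torch": "2.4.0",
    "flash-attn": "2.6.3",
    "transformers": "4.51.3",
    "peft": "0.12.0",
    "datasets": "2.20.0",
    "accelerate": "0.33.0",
    "vllm": "0.6.3.post1",
    "hydra-core": "1.3.2",
}

def check_pins(pins: dict) -> list:
    """Return a list of mismatch/missing messages; empty means all good."""
    problems = []
    for name, want in pins.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (want {want})")
            continue
        if have != want:
            problems.append(f"{name}: {have} != {want}")
    return problems

for line in check_pins(PINS):
    print(line)
```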

Credentials

The following environment variables may be required depending on the workflow:

  • `HF_TOKEN`: HuggingFace API token for downloading gated models (e.g., Llama). Set via `huggingface-cli login`.
  • `HF_HOME`: Controls where HuggingFace caches models and datasets.
  • `HF_DATASETS_OFFLINE`: Set to `1` for offline mode (auto-detected by `set_offline_if_needed()`).
  • `HF_HUB_OFFLINE`: Set to `1` for offline mode on disconnected clusters.
  • `OPENAI_API_KEY`: Required for AlpacaEval benchmarking and API-based labeling.
  • `WANDB_API_KEY`: Required for Weights & Biases experiment tracking (set via `wandb login`).
  • `WANDB_CACHE_DIR`: Automatically set to `config.cache_dir` during training.
  • `MASTER_ADDR`: Set automatically by SLURM scripts for distributed training.
  • `MASTER_PORT`: Set automatically by SLURM scripts (port 29500-29510).
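
A rough reimplementation of the offline auto-detection described above might look like the following. The real `set_offline_if_needed()` lives in the HALOs repo; this standalone version, including the probed host and timeout, is an assumption.

```python
import os
import socket

# Sketch loosely modeled on the set_offline_if_needed() behavior described
# above (the real helper is in the HALOs repo): if huggingface.co is
# unreachable, flip the HuggingFace offline flags and report it.
def set_offline_if_needed(host: str = "huggingface.co", timeout: float = 3.0) -> bool:
    try:
        socket.create_connection((host, 443), timeout=timeout).close()
        return False  # Hub reachable; stay online
    except OSError:
        os.environ["HF_DATASETS_OFFLINE"] = "1"
        os.environ["HF_HUB_OFFLINE"] = "1"
        return True   # disconnected cluster; fall back to local caches

print(set_offline_if_needed())
```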

Quick Install

# Create conda environment
conda create --name halos python=3.10.14
conda activate halos

# Install core packages
conda install pip
pip install packaging ninja
conda install pytorch=2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install flash-attn==2.6.3 --no-build-isolation
pip install transformers==4.51.3 peft==0.12.0 datasets==2.20.0 accelerate==0.33.0
pip install vllm==0.6.3.post1
pip install alpaca-eval immutabledict langdetect wandb omegaconf openai hydra-core==1.3.2

# Install lm-eval from source
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness && pip install -e . && cd ..
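
After installation, a quick import-resolution check confirms the stack is visible to Python. The module names here are the import names corresponding to the pip packages above (e.g. `flash-attn` imports as `flash_attn`); the check itself is a sketch, not part of the repo.

```python
from importlib import util

# Sketch: confirm every package from the install steps above is importable.
# find_spec() locates a module without importing it, so this is cheap and
# safe even when packages are missing.
MODULES = ["torch", "flash_attn", "transformers", "peft", "datasets",
           "accelerate", "vllm", "hydra", "omegaconf", "wandb", "openai",
           "alpaca_eval", "lm_eval"]

missing = [m for m in MODULES if util.find_spec(m) is None]
print("all imports resolved" if not missing else f"missing: {missing}")
```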

Code Evidence

Environment setup from `install.sh:1-31`:

conda create --name halos python=3.10.14
conda activate halos
conda install pip
pip install packaging ninja
conda install pytorch=2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install flash-attn==2.6.3 --no-build-isolation
pip install transformers==4.51.3
pip install peft==0.12.0

CUDA requirement from `train/utils.py:215-224` (GPU memory diagnostic):

def print_gpu_memory(rank: int = None, message: str = ''):
    if torch.cuda.is_available():
        device_count = torch.cuda.device_count()
        for i in range(device_count):
            device = torch.device(f'cuda:{i}')
            allocated_bytes = torch.cuda.memory_allocated(device)

Batch size divisibility check from `launch.py:65-68`:

if config.model.batch_size % (accelerator.num_processes * config.model.gradient_accumulation_steps) == 0:
    config.model.microbatch_size = config.model.batch_size / (accelerator.num_processes * config.model.gradient_accumulation_steps)
else:
    raise ValueError(f"{config.model.batch_size} needs to be divisible by the number of processes * gradient_accumulation_steps")
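
As a worked example of this check (using integer division for the microbatch size):

```python
# batch_size must divide evenly by num_processes * gradient_accumulation_steps;
# the quotient is the per-process microbatch size.
batch_size, num_processes, grad_accum = 32, 4, 2

assert batch_size % (num_processes * grad_accum) == 0
microbatch_size = batch_size // (num_processes * grad_accum)
print(microbatch_size)  # 4 sequences per process per accumulation step
```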

FSDP configuration from `accelerate_config/fsdp_4gpu.yaml:1-24`:

distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_cpu_ram_efficient_loading: true
num_processes: 4

Common Errors

  • `torch.distributed.DistBackendError: Socket Timeout` — WandB is not configured, so the master node waits for interactive input while the workers block. Fix: run `wandb login`, then `wandb offline` if the GPU nodes lack Internet access.
  • `ValueError: ... needs to be divisible by the number of processes * gradient_accumulation_steps` — the batch size does not divide evenly across GPUs and accumulation steps. Fix: set `model.batch_size` to a multiple of `num_processes * gradient_accumulation_steps`.
  • `eval_every must be divisible by batch_size` — the evaluation interval is not aligned with the batch size. The code auto-corrects by rounding `eval_every` down to the nearest multiple of `batch_size`.
  • `can't use batch size of 1 with UnpairedPreferenceDataLoader` — KTO/GRPO needs both chosen and rejected examples in each microbatch. Fix: ensure `microbatch_size * num_processes > 1`.
  • FlashAttention build failure — the `ninja` build tool is missing. Fix: `pip install packaging ninja` before installing `flash-attn`.
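
Several of these configuration errors can be caught before a job is submitted. The following pre-flight helper is a sketch, not part of the HALOs repo; it mirrors the divisibility check, the `eval_every` auto-correction, and the unpaired-dataloader constraint described above.

```python
# Sketch: pre-flight validation mirroring the configuration errors above.
# Parameter names follow the Hydra config keys used in the error messages.
def preflight(batch_size: int, num_processes: int, grad_accum: int,
              eval_every: int, unpaired_loss: bool = False) -> dict:
    denom = num_processes * grad_accum
    if batch_size % denom != 0:
        raise ValueError(f"{batch_size} needs to be divisible by the number "
                         "of processes * gradient_accumulation_steps")
    microbatch_size = batch_size // denom
    if unpaired_loss and microbatch_size * num_processes <= 1:
        raise ValueError("can't use batch size of 1 with "
                         "UnpairedPreferenceDataLoader")
    # the trainer auto-corrects eval_every by rounding it down to the
    # nearest multiple of batch_size; replicate that here
    eval_every = (eval_every // batch_size) * batch_size
    return {"microbatch_size": microbatch_size, "eval_every": eval_every}

print(preflight(32, 4, 2, eval_every=2000))  # {'microbatch_size': 4, 'eval_every': 1984}
```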

Compatibility Notes

  • FSDP Configurations: Pre-built Accelerate configs provided for 2, 4, 8 GPUs (single-node) and 2x2, 2x4, 2x8 GPUs (multi-node). Custom configs needed for other topologies.
  • SLURM Integration: Launch scripts assume SLURM scheduler with `module load anaconda3/2024.2`. Adjust for non-SLURM environments.
  • Offline Mode: The `set_offline_if_needed()` utility auto-detects whether HuggingFace Hub is accessible and falls back to offline mode. Scripts also explicitly set `HF_DATASETS_OFFLINE=1` and `HF_HUB_OFFLINE=1`.
  • FlashAttention: Optional; set `model.attn_implementation=flash_attention_2` in config. Only works with `float16` or `bfloat16` dtypes; falls back to `eager` for other dtypes.
  • vLLM: Required only for the sampling step (`train.sample`). Uses tensor parallelism for multi-GPU inference.
  • Package Versions: The README explicitly warns: "The package versions are important---if you change them, there is no guarantee the code will run."
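
The FlashAttention dtype fallback noted above can be expressed as a tiny selection rule. This standalone helper (with dtypes as strings) is a sketch; the real selection happens inside the HALOs model-loading code.

```python
# Sketch of the fallback rule: flash_attention_2 only supports
# half-precision dtypes, so anything else drops to eager attention.
def pick_attn_implementation(dtype: str, requested: str = "flash_attention_2") -> str:
    if requested == "flash_attention_2" and dtype not in ("float16", "bfloat16"):
        return "eager"
    return requested

print(pick_attn_implementation("bfloat16"))  # flash_attention_2
print(pick_attn_implementation("float32"))   # eager
```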
