
Environment:ContextualAI HALOs CUDA 12.1 Training Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Deep_Learning, LLM_Alignment
Last Updated: 2026-02-08 03:00 GMT

Overview

Linux environment with CUDA 12.1, Python 3.10.14, PyTorch 2.4.0, and multi-GPU FSDP support for LLM alignment training (DPO, KTO, GRPO, PPO).

Description

This environment provides the full GPU-accelerated stack required to train and evaluate language models using the HALOs framework. It is built on a Conda base with Python 3.10.14, PyTorch 2.4.0 with CUDA 12.1, and FlashAttention 2.6.3. The stack includes HuggingFace Transformers, Accelerate (with FSDP support), PEFT for LoRA training, vLLM for batched sampling, and evaluation tools (AlpacaEval, lm-evaluation-harness). Training uses Fully Sharded Data Parallel (FSDP) by default and supports configurations from 2 to 16 GPUs across 1 or 2 nodes.
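
As a sketch of how the supported topologies map to Accelerate config files: only `accelerate_config/fsdp_4gpu.yaml` appears verbatim later on this page, so the other filenames below follow the same naming pattern and are assumptions.

```python
# Minimal sketch: map a (nodes, gpus_per_node) topology to the Accelerate
# FSDP config file the launch scripts would select. Only fsdp_4gpu.yaml is
# confirmed by this page; the other filenames are assumed from the pattern.
def fsdp_config_name(nodes: int, gpus_per_node: int) -> str:
    total = nodes * gpus_per_node
    if not (2 <= total <= 16):
        raise ValueError("supported configurations span 2 to 16 GPUs")
    if nodes == 1:
        return f"accelerate_config/fsdp_{gpus_per_node}gpu.yaml"
    return f"accelerate_config/fsdp_{nodes}x{gpus_per_node}gpu.yaml"

print(fsdp_config_name(1, 4))   # accelerate_config/fsdp_4gpu.yaml
print(fsdp_config_name(2, 8))   # accelerate_config/fsdp_2x8gpu.yaml
```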

Usage

Use this environment for all HALOs workflows: supervised fine-tuning (SFT), preference alignment (DPO, KTO, GRPO, PPO, CDPO, IPO, SimPO, SLiC), reward model training (Bradley-Terry), online iterative alignment (sample-label-train loops), and model evaluation (AlpacaEval, LM Eval Harness). It is the mandatory prerequisite for every Implementation page in this repository.

System Requirements

  • OS: Linux (tested on SLURM clusters) — launch scripts use `srun`, `scontrol`, and `module load`
  • Hardware: NVIDIA GPU(s) with CUDA 12.1 support — minimum 2 GPUs; typical configurations use 4 or 8
  • RAM: ~800 GB for Llama-13B-scale models — per the launch.py docstring ("allocate enough RAM")
  • Disk: enough space for model checkpoints plus cached datasets — models are saved to `cache_dir/exp_name/FINAL`
  • Network: NCCL for multi-GPU/multi-node communication — ports 29500-29510 are used for distributed training
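
The launch scripts check port availability with `netstat`; a minimal Python equivalent for picking a free `MASTER_PORT` in the reserved range might look like this (a sketch, not code from the repo):

```python
import socket

# Sketch (not from the repo): find a free MASTER_PORT in the 29500-29510
# range that the SLURM launch scripts reserve for distributed training.
def find_free_port(start: int = 29500, end: int = 29510) -> int:
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("", port))  # binding succeeds only if the port is free
                return port
            except OSError:
                continue  # port in use; try the next one
    raise RuntimeError(f"no free port in {start}-{end}")

print(find_free_port())
```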

Dependencies

System Packages

  • `cuda-toolkit` = 12.1 (via `pytorch-cuda=12.1`)
  • `ninja` (build tool for FlashAttention compilation)
  • `git` (for cloning lm-evaluation-harness)
  • `netstat` (used in launch scripts for port checking)

Python Runtime

  • Python 3.10.14 (via Conda)

Python Packages

  • `torch` = 2.4.0
  • `flash-attn` = 2.6.3
  • `transformers` = 4.51.3
  • `peft` = 0.12.0
  • `datasets` = 2.20.0
  • `accelerate` = 0.33.0
  • `vllm` = 0.6.3.post1
  • `hydra-core` = 1.3.2
  • `omegaconf` (Hydra dependency)
  • `wandb` (experiment tracking)
  • `openai` (API-based labeling)
  • `alpaca-eval` (evaluation)
  • `immutabledict` (alpaca-eval dependency)
  • `langdetect` (alpaca-eval dependency)
  • `numpy`
  • `tqdm`
  • `lm-eval` (installed from source: EleutherAI/lm-evaluation-harness)
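
Since the README warns that these pins matter, a small sanity check against the installed environment can catch version drift early. This helper is a sketch, not part of the HALOs repo; the keys are the pip distribution names listed above.

```python
from importlib import metadata

# Pinned versions from this page; the README warns that changing them
# may break the code.
PINS = {
    "torch": "2.4.0",
    "flash-attn": "2.6.3",
    "transformers": "4.51.3",
    "peft": "0.12.0",
    "datasets": "2.20.0",
    "accelerate": "0.33.0",
    "vllm": "0.6.3.post1",
    "hydra-core": "1.3.2",
}

def check_pins(pins: dict) -> list:
    """Return a list of mismatch/missing messages; empty means all good."""
    problems = []
    for name, want in pins.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (want {want})")
            continue
        if have != want:
            problems.append(f"{name}: {have} != {want}")
    return problems

for line in check_pins(PINS):
    print(line)
```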

Credentials

The following environment variables may be required depending on the workflow:

  • `HF_TOKEN`: HuggingFace API token for downloading gated models (e.g., Llama). Set via `huggingface-cli login`.
  • `HF_HOME`: Controls where HuggingFace caches models and datasets.
  • `HF_DATASETS_OFFLINE`: Set to `1` for offline mode (auto-detected by `set_offline_if_needed()`).
  • `HF_HUB_OFFLINE`: Set to `1` for offline mode on disconnected clusters.
  • `OPENAI_API_KEY`: Required for AlpacaEval benchmarking and API-based labeling.
  • `WANDB_API_KEY`: Required for Weights & Biases experiment tracking (set via `wandb login`).
  • `WANDB_CACHE_DIR`: Automatically set to `config.cache_dir` during training.
  • `MASTER_ADDR`: Set automatically by SLURM scripts for distributed training.
  • `MASTER_PORT`: Set automatically by SLURM scripts (port 29500-29510).
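
A rough reimplementation of the offline auto-detection described above might look like the following. The real `set_offline_if_needed()` lives in the HALOs repo; this standalone version, including the probed host and timeout, is an assumption.

```python
import os
import socket

# Sketch loosely modeled on the set_offline_if_needed() behavior described
# above (the real helper is in the HALOs repo): if huggingface.co is
# unreachable, flip the HuggingFace offline flags and report it.
def set_offline_if_needed(host: str = "huggingface.co", timeout: float = 3.0) -> bool:
    try:
        socket.create_connection((host, 443), timeout=timeout).close()
        return False  # Hub reachable; stay online
    except OSError:
        os.environ["HF_DATASETS_OFFLINE"] = "1"
        os.environ["HF_HUB_OFFLINE"] = "1"
        return True   # disconnected cluster; fall back to local caches

print(set_offline_if_needed())
```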

Quick Install

# Create conda environment
conda create --name halos python=3.10.14
conda activate halos

# Install core packages
conda install pip
pip install packaging ninja
conda install pytorch=2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install flash-attn==2.6.3 --no-build-isolation
pip install transformers==4.51.3 peft==0.12.0 datasets==2.20.0 accelerate==0.33.0
pip install vllm==0.6.3.post1
pip install alpaca-eval immutabledict langdetect wandb omegaconf openai hydra-core==1.3.2

# Install lm-eval from source
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness && pip install -e . && cd ..
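
After installation, a quick import-resolution check confirms the stack is visible to Python. The module names here are the import names corresponding to the pip packages above (e.g. `flash-attn` imports as `flash_attn`); the check itself is a sketch, not part of the repo.

```python
from importlib import util

# Sketch: confirm every package from the install steps above is importable.
# find_spec() locates a module without importing it, so this is cheap and
# safe even when packages are missing.
MODULES = ["torch", "flash_attn", "transformers", "peft", "datasets",
           "accelerate", "vllm", "hydra", "omegaconf", "wandb", "openai",
           "alpaca_eval", "lm_eval"]

missing = [m for m in MODULES if util.find_spec(m) is None]
print("all imports resolved" if not missing else f"missing: {missing}")
```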

Code Evidence

Environment setup from `install.sh:1-31`:

conda create --name halos python=3.10.14
conda activate halos
conda install pip
pip install packaging ninja
conda install pytorch=2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install flash-attn==2.6.3 --no-build-isolation
pip install transformers==4.51.3
pip install peft==0.12.0

CUDA requirement from `train/utils.py:215-224` (GPU memory diagnostic):

def print_gpu_memory(rank: int = None, message: str = ''):
    if torch.cuda.is_available():
        device_count = torch.cuda.device_count()
        for i in range(device_count):
            device = torch.device(f'cuda:{i}')
            allocated_bytes = torch.cuda.memory_allocated(device)

Batch size divisibility check from `launch.py:65-68`:

if config.model.batch_size % (accelerator.num_processes * config.model.gradient_accumulation_steps) == 0:
    config.model.microbatch_size = config.model.batch_size / (accelerator.num_processes * config.model.gradient_accumulation_steps)
else:
    raise ValueError(f"{config.model.batch_size} needs to be divisible by the number of processes * gradient_accumulation_steps")
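
As a worked example of this check (using integer division for the microbatch size):

```python
# batch_size must divide evenly by num_processes * gradient_accumulation_steps;
# the quotient is the per-process microbatch size.
batch_size, num_processes, grad_accum = 32, 4, 2

assert batch_size % (num_processes * grad_accum) == 0
microbatch_size = batch_size // (num_processes * grad_accum)
print(microbatch_size)  # 4 sequences per process per accumulation step
```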

FSDP configuration from `accelerate_config/fsdp_4gpu.yaml:1-24`:

distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_cpu_ram_efficient_loading: true
num_processes: 4

Common Errors

  • `torch.distributed.DistBackendError: Socket Timeout` — WandB is not configured, so the master node waits for interactive input while the workers block. Fix: run `wandb login`, then `wandb offline` if the GPU nodes lack Internet access.
  • `ValueError: ... needs to be divisible by the number of processes * gradient_accumulation_steps` — the batch size does not divide evenly across GPUs and accumulation steps. Fix: set `model.batch_size` to a multiple of `num_processes * gradient_accumulation_steps`.
  • `eval_every must be divisible by batch_size` — the evaluation interval is not aligned with the batch size. The code auto-corrects by rounding `eval_every` down to the nearest multiple of `batch_size`.
  • `can't use batch size of 1 with UnpairedPreferenceDataLoader` — KTO/GRPO needs both chosen and rejected examples in each microbatch. Fix: ensure `microbatch_size * num_processes > 1`.
  • FlashAttention build failure — the `ninja` build tool is missing. Fix: `pip install packaging ninja` before installing `flash-attn`.
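
Several of these configuration errors can be caught before a job is submitted. The following pre-flight helper is a sketch, not part of the HALOs repo; it mirrors the divisibility check, the `eval_every` auto-correction, and the unpaired-dataloader constraint described above.

```python
# Sketch: pre-flight validation mirroring the configuration errors above.
# Parameter names follow the Hydra config keys used in the error messages.
def preflight(batch_size: int, num_processes: int, grad_accum: int,
              eval_every: int, unpaired_loss: bool = False) -> dict:
    denom = num_processes * grad_accum
    if batch_size % denom != 0:
        raise ValueError(f"{batch_size} needs to be divisible by the number "
                         "of processes * gradient_accumulation_steps")
    microbatch_size = batch_size // denom
    if unpaired_loss and microbatch_size * num_processes <= 1:
        raise ValueError("can't use batch size of 1 with "
                         "UnpairedPreferenceDataLoader")
    # the trainer auto-corrects eval_every by rounding it down to the
    # nearest multiple of batch_size; replicate that here
    eval_every = (eval_every // batch_size) * batch_size
    return {"microbatch_size": microbatch_size, "eval_every": eval_every}

print(preflight(32, 4, 2, eval_every=2000))  # {'microbatch_size': 4, 'eval_every': 1984}
```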

Compatibility Notes

  • FSDP Configurations: Pre-built Accelerate configs provided for 2, 4, 8 GPUs (single-node) and 2x2, 2x4, 2x8 GPUs (multi-node). Custom configs needed for other topologies.
  • SLURM Integration: Launch scripts assume SLURM scheduler with `module load anaconda3/2024.2`. Adjust for non-SLURM environments.
  • Offline Mode: The `set_offline_if_needed()` utility auto-detects whether HuggingFace Hub is accessible and falls back to offline mode. Scripts also explicitly set `HF_DATASETS_OFFLINE=1` and `HF_HUB_OFFLINE=1`.
  • FlashAttention: Optional; set `model.attn_implementation=flash_attention_2` in config. Only works with `float16` or `bfloat16` dtypes; falls back to `eager` for other dtypes.
  • vLLM: Required only for the sampling step (`train.sample`). Uses tensor parallelism for multi-GPU inference.
  • Package Versions: The README explicitly warns: "The package versions are important---if you change them, there is no guarantee the code will run."
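
The FlashAttention dtype fallback noted above can be expressed as a tiny selection rule. This standalone helper (with dtypes as strings) is a sketch; the real selection happens inside the HALOs model-loading code.

```python
# Sketch of the fallback rule: flash_attention_2 only supports
# half-precision dtypes, so anything else drops to eager attention.
def pick_attn_implementation(dtype: str, requested: str = "flash_attention_2") -> str:
    if requested == "flash_attention_2" and dtype not in ("float16", "bfloat16"):
        return "eager"
    return requested

print(pick_attn_implementation("bfloat16"))  # flash_attention_2
print(pick_attn_implementation("float32"))   # eager
```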
