Environment:ContextualAI HALOs CUDA 12.1 Training Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning, LLM_Alignment |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
Linux environment with CUDA 12.1, Python 3.10.14, PyTorch 2.4.0, and multi-GPU FSDP support for LLM alignment training (DPO, KTO, GRPO, PPO).
Description
This environment provides the full GPU-accelerated stack required to train and evaluate language models using the HALOs framework. It is built on a Conda base with Python 3.10.14, PyTorch 2.4.0 with CUDA 12.1, and FlashAttention 2.6.3. The stack includes HuggingFace Transformers, Accelerate (with FSDP support), PEFT for LoRA training, vLLM for batched sampling, and evaluation tools (AlpacaEval, lm-evaluation-harness). Training uses Fully Sharded Data Parallel (FSDP) by default and supports configurations from 2 to 16 GPUs across 1 or 2 nodes.
Usage
Use this environment for all HALOs workflows: supervised fine-tuning (SFT), preference alignment (DPO, KTO, GRPO, PPO, CDPO, IPO, SimPO, SLiC), reward model training (Bradley-Terry), online iterative alignment (sample-label-train loops), and model evaluation (AlpacaEval, LM Eval Harness). It is the mandatory prerequisite for every Implementation page in this repository.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (tested with SLURM clusters) | Scripts use `srun`, `scontrol`, `module load` |
| Hardware | NVIDIA GPU(s) with CUDA 12.1 support | Minimum 2 GPUs; typical configs use 4 or 8 GPUs |
| RAM | ~800 GB for Llama-13B scale models | Per launch.py docstring: "allocate enough RAM" |
| Disk | Sufficient for model checkpoints + cached datasets | Models saved to `cache_dir/exp_name/FINAL` |
| Network | NCCL for multi-GPU/multi-node communication | Port range 29500-29510 used for distributed training |
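The 29500-29510 rendezvous port range can also be probed directly from Python instead of with `netstat`; the sketch below is an illustrative stdlib helper (the function name and fallback behavior are assumptions, not part of HALOs):

```python
import socket

def find_free_port(start: int = 29500, end: int = 29510) -> int:
    """Return the first free TCP port in [start, end], mirroring the
    range the launch scripts reserve for distributed training."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("127.0.0.1", port))
                return port  # bind succeeded, so the port is free
            except OSError:
                continue  # port in use; try the next one
    raise RuntimeError(f"no free port in {start}-{end}")
```

A value returned by this helper could then be exported as `MASTER_PORT` in a non-SLURM launch script.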
Dependencies
System Packages
- `cuda-toolkit` = 12.1 (via `pytorch-cuda=12.1`)
- `ninja` (build tool for FlashAttention compilation)
- `git` (for cloning lm-evaluation-harness)
- `netstat` (used in launch scripts for port checking)
Python Runtime
- Python 3.10.14 (via Conda)
Python Packages
- `torch` = 2.4.0
- `flash-attn` = 2.6.3
- `transformers` = 4.51.3
- `peft` = 0.12.0
- `datasets` = 2.20.0
- `accelerate` = 0.33.0
- `vllm` = 0.6.3.post1
- `hydra-core` = 1.3.2
- `omegaconf` (Hydra dependency)
- `wandb` (experiment tracking)
- `openai` (API-based labeling)
- `alpaca-eval` (evaluation)
- `immutabledict` (alpaca-eval dependency)
- `langdetect` (alpaca-eval dependency)
- `numpy`
- `tqdm`
- `lm-eval` (installed from source: EleutherAI/lm-evaluation-harness)
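Because the README warns that exact versions matter (see Compatibility Notes), it can be worth diffing the installed environment against the pins above. The helper below is an illustrative sketch, not part of HALOs; it uses the stdlib `importlib.metadata` and accepts an injectable `lookup` so it can be tested without the real packages installed:

```python
from importlib.metadata import version, PackageNotFoundError

# Version pins from the dependency list above
PINS = {
    "torch": "2.4.0",
    "flash-attn": "2.6.3",
    "transformers": "4.51.3",
    "peft": "0.12.0",
    "datasets": "2.20.0",
    "accelerate": "0.33.0",
    "vllm": "0.6.3.post1",
    "hydra-core": "1.3.2",
}

def check_pins(pins: dict, lookup=version) -> dict:
    """Return {package: (expected, found)} for every mismatched or
    missing package; an empty dict means the environment matches."""
    mismatches = {}
    for pkg, expected in pins.items():
        try:
            found = lookup(pkg)
        except PackageNotFoundError:
            found = None  # package not installed at all
        if found != expected:
            mismatches[pkg] = (expected, found)
    return mismatches
```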
Credentials
The following environment variables may be required depending on the workflow:
- `HF_TOKEN`: HuggingFace API token for downloading gated models (e.g., Llama). Set via `huggingface-cli login`.
- `HF_HOME`: Controls where HuggingFace caches models and datasets.
- `HF_DATASETS_OFFLINE`: Set to `1` for offline mode (auto-detected by `set_offline_if_needed()`).
- `HF_HUB_OFFLINE`: Set to `1` for offline mode on disconnected clusters.
- `OPENAI_API_KEY`: Required for AlpacaEval benchmarking and API-based labeling.
- `WANDB_API_KEY`: Required for Weights & Biases experiment tracking (set via `wandb login`).
- `WANDB_CACHE_DIR`: Automatically set to `config.cache_dir` during training.
- `MASTER_ADDR`: Set automatically by SLURM scripts for distributed training.
- `MASTER_PORT`: Set automatically by SLURM scripts (port 29500-29510).
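The behavior described for `set_offline_if_needed()` (probe the HuggingFace Hub, fall back to offline env vars) can be sketched as follows. This is a hedged approximation of the utility's described behavior, not its actual source; the `probe` parameter is an assumption added here so the logic is testable without network access:

```python
import os
import urllib.request

def set_offline_if_needed(probe=None, timeout: float = 3.0) -> bool:
    """Enable HF offline mode when huggingface.co is unreachable.

    Returns True if offline mode was enabled. By default the probe is a
    real HTTP request; pass a callable to stub it out in tests.
    """
    if probe is None:
        def probe():
            urllib.request.urlopen("https://huggingface.co", timeout=timeout)
    try:
        probe()
        return False  # Hub reachable; stay in online mode
    except Exception:
        # Disconnected cluster: force datasets and hub into offline mode
        os.environ["HF_DATASETS_OFFLINE"] = "1"
        os.environ["HF_HUB_OFFLINE"] = "1"
        return True
```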
Quick Install
```shell
# Create conda environment
conda create --name halos python=3.10.14
conda activate halos

# Install core packages
conda install pip
pip install packaging ninja
conda install pytorch=2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install flash-attn==2.6.3 --no-build-isolation
pip install transformers==4.51.3 peft==0.12.0 datasets==2.20.0 accelerate==0.33.0
pip install vllm==0.6.3.post1
pip install alpaca-eval immutabledict langdetect wandb omegaconf openai hydra-core==1.3.2

# Install lm-eval from source
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness && pip install -e . && cd ..
```
Code Evidence
Environment setup from `install.sh:1-31`:
```shell
conda create --name halos python=3.10.14
conda activate halos
conda install pip
pip install packaging ninja
conda install pytorch=2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install flash-attn==2.6.3 --no-build-isolation
pip install transformers==4.51.3
pip install peft==0.12.0
```
CUDA requirement from `train/utils.py:215-224` (GPU memory diagnostic):
```python
def print_gpu_memory(rank: int = None, message: str = ''):
    if torch.cuda.is_available():
        device_count = torch.cuda.device_count()
        for i in range(device_count):
            device = torch.device(f'cuda:{i}')
            allocated_bytes = torch.cuda.memory_allocated(device)
```
Batch size divisibility check from `launch.py:65-68`:
```python
if config.model.batch_size % (accelerator.num_processes * config.model.gradient_accumulation_steps) == 0:
    config.model.microbatch_size = config.model.batch_size / (accelerator.num_processes * config.model.gradient_accumulation_steps)
else:
    raise ValueError(f"{config.model.batch_size} needs to be divisible by the number of processes * gradient_accumulation_steps")
```
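As a worked example of this divisibility check (the numbers are chosen for illustration): with a batch size of 32, 4 processes, and 2 gradient accumulation steps, each process handles a microbatch of 4:

```python
batch_size = 32
num_processes = 4
gradient_accumulation_steps = 2

divisor = num_processes * gradient_accumulation_steps  # 8
assert batch_size % divisor == 0  # passes the launch.py check
microbatch_size = batch_size // divisor
print(microbatch_size)  # 4
```

A batch size of 30 with the same settings would fail the check and raise the `ValueError` shown above.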
FSDP configuration from `accelerate_config/fsdp_4gpu.yaml:1-24`:
```yaml
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_cpu_ram_efficient_loading: true
num_processes: 4
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `torch.distributed.DistBackendError: Socket Timeout` | WandB not configured; master node waits for input while workers block | Run `wandb login` then `wandb offline` if GPUs lack Internet access |
| `ValueError: batch_size needs to be divisible by num_processes * gradient_accumulation_steps` | Batch size not evenly divisible across GPUs and accumulation steps | Adjust `model.batch_size` to be a multiple of `num_processes * gradient_accumulation_steps` |
| `eval_every must be divisible by batch_size` | Evaluation interval not aligned with batch size | Code auto-corrects: rounds down `eval_every` to nearest multiple of `batch_size` |
| `can't use batch size of 1 with UnpairedPreferenceDataLoader` | KTO/GRPO requires mixed chosen/rejected examples in each microbatch | Ensure `microbatch_size * num_processes > 1` |
| FlashAttention build failure | Missing `ninja` build tool | `pip install packaging ninja` before installing `flash-attn` |
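The auto-correction described for `eval_every` amounts to rounding down to the nearest multiple of `batch_size`. A one-line sketch of that arithmetic (illustrative, not the framework's actual code):

```python
def round_down_eval_every(eval_every: int, batch_size: int) -> int:
    """Round eval_every down to the nearest multiple of batch_size."""
    return eval_every - (eval_every % batch_size)

print(round_down_eval_every(1000, 32))  # 992
```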
Compatibility Notes
- FSDP Configurations: Pre-built Accelerate configs provided for 2, 4, 8 GPUs (single-node) and 2x2, 2x4, 2x8 GPUs (multi-node). Custom configs needed for other topologies.
- SLURM Integration: Launch scripts assume SLURM scheduler with `module load anaconda3/2024.2`. Adjust for non-SLURM environments.
- Offline Mode: The `set_offline_if_needed()` utility auto-detects whether HuggingFace Hub is accessible and falls back to offline mode. Scripts also explicitly set `HF_DATASETS_OFFLINE=1` and `HF_HUB_OFFLINE=1`.
- FlashAttention: Optional; set `model.attn_implementation=flash_attention_2` in config. Only works with `float16` or `bfloat16` dtypes; falls back to `eager` for other dtypes.
- vLLM: Required only for the sampling step (`train.sample`). Uses tensor parallelism for multi-GPU inference.
- Package Versions: The README explicitly warns: "The package versions are important---if you change them, there is no guarantee the code will run."
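For a topology not covered by the pre-built configs, a custom Accelerate config can be derived from the provided ones. The fragment below is a hypothetical sketch for a single node with 6 GPUs, adapted from the `fsdp_4gpu.yaml` excerpt shown above; it has not been validated against Accelerate's full config schema:

```yaml
# Hypothetical accelerate_config/fsdp_6gpu.yaml, adapted from fsdp_4gpu.yaml
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_cpu_ram_efficient_loading: true
num_machines: 1
num_processes: 6
```

Remember that `model.batch_size` must remain divisible by `num_processes * gradient_accumulation_steps` for the chosen topology.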
Related Pages
- Implementation:ContextualAI_HALOs_Install_Sh
- Implementation:ContextualAI_HALOs_Dataset_Loaders
- Implementation:ContextualAI_HALOs_SFTTrainer_Train
- Implementation:ContextualAI_HALOs_Alignment_Trainers
- Implementation:ContextualAI_HALOs_BasicTrainer_Save
- Implementation:ContextualAI_HALOs_Sample_Main
- Implementation:ContextualAI_HALOs_Label_Main
- Implementation:ContextualAI_HALOs_Online_Training_Main
- Implementation:ContextualAI_HALOs_Online_Loop_Script
- Implementation:ContextualAI_HALOs_AutoModelForBradleyTerry_From_Pretrained
- Implementation:ContextualAI_HALOs_BradleyTerryTrainer_Train
- Implementation:ContextualAI_HALOs_Alpaca_Eval_CLI
- Implementation:ContextualAI_HALOs_LM_Eval_CLI
- Implementation:ContextualAI_HALOs_Summarize_Metrics