Environment:LLMBook_zh_LLMBook_zh_github_io_PyTorch_CUDA_GPU_Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning |
| Last Updated | 2026-02-08 04:30 GMT |
Overview
Linux environment with NVIDIA CUDA GPU support, PyTorch with CUDA backend, and NumPy for tensor computation across all LLM training, inference, and architecture code.
Description
This environment provides the foundational GPU-accelerated compute layer for all deep learning operations in the LLMBook codebase. PyTorch serves as the core tensor computation framework, used across 13+ source files for model architecture definitions (RMSNorm, RoPE, ALiBi, MoE, LLaMA), training loops (pre-training, SFT, LoRA, DPO), and inference/quantization workflows. CUDA GPU access is required for training scripts that use device_map="auto", mixed-precision BF16 training, and GPU memory monitoring via torch.cuda.memory_allocated().
Usage
Use this environment for all training, fine-tuning, and inference workflows in the LLMBook codebase. It is a mandatory prerequisite for every Implementation that imports torch or uses nn.Module subclasses. The GPU requirement is especially critical for pre-training (Ch. 6), SFT (Ch. 7), DPO alignment (Ch. 8), and quantization/inference (Ch. 9).
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | CUDA toolkit requires Linux for full support |
| Hardware | NVIDIA GPU with CUDA support | Minimum 8GB VRAM for quantized models; 16GB+ for full-precision training of 7B models |
| Hardware | BF16-capable GPU | Ampere (A100) or newer for native BF16; RTX 30/40 series also supported |
| Disk | 50GB+ SSD | For model weights (7B model ~14GB in fp16) and dataset caching |
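The hardware rows above can be checked programmatically. A minimal sketch (not part of the LLMBook code) that reports the visible GPU's name, VRAM, and compute capability, assuming device 0 is the training GPU when CUDA is present:

```python
import torch

def describe_gpu() -> str:
    """Summarize device 0 against the hardware table above.
    Sketch only; returns a fixed message on CPU-only machines."""
    if not torch.cuda.is_available():
        return "No CUDA GPU detected"
    p = torch.cuda.get_device_properties(0)
    return (f"{p.name}: {p.total_memory / 2**30:.1f} GiB VRAM, "
            f"compute capability {p.major}.{p.minor}")

print(describe_gpu())
```

Compute capability 8.0 or higher (Ampere) satisfies the BF16 row; `total_memory` can be compared against the 8GB/16GB VRAM thresholds.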
Dependencies
System Packages
- `nvidia-driver` >= 525
- `cuda-toolkit` >= 11.7
Python Packages
- `torch` >= 1.13 (with CUDA support)
- `numpy` >= 1.21
Credentials
No credentials required for this base environment.
Quick Install
```shell
# Install PyTorch with CUDA support
pip install torch numpy

# Verify CUDA availability
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
```
Code Evidence
GPU memory monitoring from `code/9.3 bitsandbytes实践.py:7`:
```python
print(f"memory usage: {torch.cuda.memory_allocated()/1000/1000/1000} GB")
```
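Note that the book's snippet divides by 1000³, i.e. decimal GB. A hypothetical helper (not in the LLMBook code) reporting binary GiB instead, safe on CPU-only machines:

```python
import torch

def allocated_gib() -> float:
    """Currently allocated CUDA memory in GiB (binary units).
    Hypothetical helper; returns 0.0 when no GPU is present."""
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.memory_allocated() / 2**30
```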
Automatic device mapping from `code/9.3 bitsandbytes实践.py:6`:
```python
model_8bit = AutoModelForCausalLM.from_pretrained(name, device_map="auto", load_in_8bit=True)
```
BF16 mixed precision requirement from `code/6.2 预训练实践.py:37-38`:
```python
bf16: bool = HfArg(
    default=True,
    help="Whether to use bf16 (mixed) precision instead of 32-bit.",
)
```
PyTorch tensor operations used extensively across architecture files, e.g. `code/5.1 RMSNorm.py`, `code/5.2 RoPE.py`, `code/5.4 MoE.py`.
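As an illustration of the kind of tensor code in those architecture files, here is a minimal RMSNorm sketch in the spirit of `code/5.1 RMSNorm.py`; the book's exact implementation and hyperparameters may differ:

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm sketch (hedged; see code/5.1 RMSNorm.py for the original)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the reciprocal root-mean-square over the last dimension
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * inv_rms)

x = torch.randn(2, 4, 8)
print(RMSNorm(8)(x).shape)  # torch.Size([2, 4, 8])
```

This runs on CPU as well, but the training scripts above assume the module is moved to a CUDA device.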
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: CUDA out of memory` | Insufficient GPU VRAM for model size | Use quantization (4-bit/8-bit) or reduce batch size |
| `RuntimeError: No CUDA GPUs are available` | No NVIDIA GPU detected | Verify `nvidia-smi` output; install NVIDIA drivers |
| `RuntimeError: expected scalar type BFloat16` | GPU does not support BF16 | Use `fp16=True` instead of `bf16=True` on pre-Ampere GPUs |
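For the OOM row, one common mitigation pattern is to retry with a halved batch size. A hypothetical sketch (`step` is any callable taking a batch size; `torch.cuda.OutOfMemoryError` requires `torch` >= 1.13, consistent with the dependency list above):

```python
import torch

def run_with_backoff(step, batch_size, min_batch=1):
    """Retry `step` with a halved batch size on CUDA OOM.
    Hypothetical pattern, not part of the LLMBook code."""
    while batch_size >= min_batch:
        try:
            return step(batch_size)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release cached blocks before retrying
            batch_size //= 2
    raise RuntimeError("CUDA OOM even at the minimum batch size")
```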
Compatibility Notes
- Pre-Ampere GPUs (V100, RTX 2080): BF16 not natively supported; use FP16 mixed precision instead.
- Multi-GPU: DeepSpeed integration available in LoRA training script (`code/7.4 LoRA实践.py`) for distributed training.
- CPU-only: Data preprocessing scripts (Ch. 4: quality filtering, deduplication, BPE) do not require GPU.
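The BF16/FP16 choice above can be automated. A sketch (hypothetical helper, not in the LLMBook code) that maps Ampere-or-newer GPUs (compute capability >= 8) to BF16, older GPUs to FP16, and CPU-only machines to FP32:

```python
import torch

def pick_precision() -> str:
    """Choose a mixed-precision mode per the compatibility notes above.
    Sketch only; assumes device 0 is the training GPU."""
    if not torch.cuda.is_available():
        return "fp32"  # CPU fallback: no mixed precision
    major, _ = torch.cuda.get_device_capability()
    return "bf16" if major >= 8 else "fp16"  # Ampere is capability 8.x
```

The returned string could drive the `bf16`/`fp16` training arguments shown in the Code Evidence section.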
Related Pages
- Implementation:LLMBook_zh_LLMBook_zh_github_io_AutoModelForCausalLM_From_Pretrained_Pretraining
- Implementation:LLMBook_zh_LLMBook_zh_github_io_LlamaForCausalLM_Forward
- Implementation:LLMBook_zh_LLMBook_zh_github_io_Trainer_Train_Pretraining
- Implementation:LLMBook_zh_LLMBook_zh_github_io_Trainer_Save_Model_Pretraining
- Implementation:LLMBook_zh_LLMBook_zh_github_io_AutoModelForCausalLM_From_Pretrained_SFT
- Implementation:LLMBook_zh_LLMBook_zh_github_io_Trainer_Train_SFT
- Implementation:LLMBook_zh_LLMBook_zh_github_io_LoRALinear
- Implementation:LLMBook_zh_LLMBook_zh_github_io_LoraConfig_Get_Peft_Model
- Implementation:LLMBook_zh_LLMBook_zh_github_io_Trainer_Train_LoRA
- Implementation:LLMBook_zh_LLMBook_zh_github_io_AutoPeftModelForCausalLM_Merge_And_Unload
- Implementation:LLMBook_zh_LLMBook_zh_github_io_LlamaRewardModel
- Implementation:LLMBook_zh_LLMBook_zh_github_io_AutoModelForCausalLM_From_Pretrained_DPO
- Implementation:LLMBook_zh_LLMBook_zh_github_io_DPOTrainer_Train
- Implementation:LLMBook_zh_LLMBook_zh_github_io_VLLM_LLM_Generate
- Implementation:LLMBook_zh_LLMBook_zh_github_io_Quantize_Func
- Implementation:LLMBook_zh_LLMBook_zh_github_io_AutoModelForCausalLM_From_Pretrained_Bitsandbytes
- Implementation:LLMBook_zh_LLMBook_zh_github_io_GPTQConfig_Quantization