# Environment: VainF Torch-Pruning CUDA GPU Benchmarking
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning |
| Last Updated | 2026-02-08 12:00 GMT |
## Overview
NVIDIA CUDA GPU environment required for latency benchmarking, memory profiling, and GPU-accelerated training workflows.
## Description
This environment extends the core PyTorch/Python environment with a CUDA-capable NVIDIA GPU. While the Torch-Pruning library's core operations (dependency-graph analysis, pruning execution, importance estimation) work on CPU, the benchmarking utilities (`measure_latency`, `measure_memory`, `measure_fps`, `measure_throughput`) unconditionally use CUDA timing events and memory APIs. Training workflows in the `examples` and `reproduce` directories also assume GPU availability.

The benchmarking module uses `torch.cuda.Event(enable_timing=True)` for precise latency measurement, `torch.cuda.synchronize()` for timing accuracy, and `torch.cuda.max_memory_allocated()` for peak memory tracking. These calls fail without a CUDA-capable GPU.
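GPU kernels launch asynchronously, so naive wall-clock timing around a forward pass would mostly measure kernel-launch overhead; CUDA events time the work on the device itself, provided `synchronize()` is called before reading them. A minimal sketch of this pattern, with a wall-clock fallback on CPU (the helper name `time_forward_ms` is ours, not part of Torch-Pruning):

```python
import time

def time_forward_ms(fn, *args, repeat=10):
    """Average latency of one fn(*args) call in milliseconds (sketch)."""
    try:
        import torch
        use_cuda = torch.cuda.is_available()
    except ImportError:
        torch, use_cuda = None, False
    if use_cuda:
        # CUDA events are recorded on the device stream; synchronize()
        # blocks until all queued kernels (and both events) have completed.
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(repeat):
            fn(*args)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / repeat  # elapsed_time returns ms
    # CPU fallback: execution is synchronous, so wall clock is accurate.
    t0 = time.perf_counter()
    for _ in range(repeat):
        fn(*args)
    return (time.perf_counter() - t0) * 1000.0 / repeat
```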
## Usage
Use this environment when:
- Running latency benchmarks to measure inference speed before/after pruning
- Running memory profiling to measure peak VRAM usage
- Training or fine-tuning models in the examples or reproduce directories
- Distributed training with multiple GPUs (requires the `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` environment variables)
## System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (recommended), Windows with CUDA | macOS not supported for CUDA |
| Hardware | NVIDIA GPU with CUDA support | No minimum VRAM specified; depends on model size |
| Driver | NVIDIA CUDA Toolkit | Must match PyTorch CUDA version |
## Dependencies
### System Packages
- NVIDIA GPU driver (compatible with CUDA toolkit version)
- CUDA toolkit (bundled with PyTorch binary or installed separately)
### Python Packages
- `torch>=2.0` (CUDA build, not CPU-only)
- All core dependencies from the PyTorch_Python_Core environment
## Credentials
Distributed training environment variables (optional, only for multi-GPU):
- `RANK`: Global rank of the current process
- `WORLD_SIZE`: Total number of processes
- `LOCAL_RANK`: Local rank on the current node
- `SLURM_PROCID`: Alternative rank source for SLURM clusters
## Quick Install
```shell
# Install PyTorch with CUDA support (example for CUDA 12.1)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# Verify CUDA availability
python -c "import torch; print(torch.cuda.is_available())"
```
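Beyond a bare `True`/`False`, it can help to see which device PyTorch found and how much VRAM it has. A hedged sketch (the helper name `describe_device` is ours):

```python
def describe_device():
    """Report the CUDA device, or explain why benchmarking will not run."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if torch.cuda.is_available():
        # get_device_properties exposes name, total memory, compute capability
        props = torch.cuda.get_device_properties(0)
        return (f"{props.name}, {props.total_memory // 2**20} MiB VRAM, "
                f"compute capability {props.major}.{props.minor}")
    return "CPU only: the benchmarking utilities will raise CUDA errors"

print(describe_device())
```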
## Code Evidence
CUDA timing events in `torch_pruning/utils/benchmark.py:31-39`:

```python
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
if run_fn is not None:
    _ = run_fn(model, example_inputs)
else:
    _ = model(example_inputs)
end.record()
torch.cuda.synchronize()
```
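Once `synchronize()` returns, the elapsed milliseconds can be read with `start.elapsed_time(end)`. A self-contained sketch of the full pattern that degrades gracefully on machines without a GPU (the helper name `latency_ms` is ours):

```python
def latency_ms(model, example_inputs):
    """One timed forward pass via CUDA events; None without a CUDA device."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None  # Event timing requires a CUDA device
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    _ = model(example_inputs)
    end.record()
    torch.cuda.synchronize()  # block until both events are recorded
    return start.elapsed_time(end)  # float, milliseconds
```

In practice a few warm-up passes before timing (to exclude one-off allocation and kernel-compilation costs) give more stable numbers.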
Peak memory measurement in `torch_pruning/utils/benchmark.py:59-65`:

```python
torch.cuda.reset_peak_memory_stats()
model.eval()
if run_fn is not None:
    _ = run_fn(model, example_inputs)
else:
    _ = model(example_inputs)
return torch.cuda.max_memory_allocated(device=device)
```
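`max_memory_allocated` returns bytes; `reset_peak_memory_stats` must be called first or the high-water mark will include earlier allocations. A guarded sketch that reports MiB (the helper name `peak_vram_mib` is ours):

```python
def peak_vram_mib(model, example_inputs):
    """Peak VRAM of a single forward pass, in MiB; None without CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    torch.cuda.reset_peak_memory_stats()  # clear the allocator's high-water mark
    with torch.no_grad():                 # exclude autograd buffers
        _ = model(example_inputs)
    return torch.cuda.max_memory_allocated() / 2**20  # bytes -> MiB
```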
Distributed training environment variable detection in `reproduce/engine/utils/imagenet_utils/utils.py:246-251`:

```python
if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
    args.rank = int(os.environ["RANK"])
    args.world_size = int(os.environ["WORLD_SIZE"])
    args.gpu = int(os.environ.get("LOCAL_RANK", 0))
elif "SLURM_PROCID" in os.environ:
    args.rank = int(os.environ["SLURM_PROCID"])
```
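`torchrun` sets `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` automatically, so the first branch fires under torchrun and the second under SLURM. The same detection logic, sketched as a pure function over an environment mapping (the name `detect_rank` is ours):

```python
def detect_rank(environ):
    """Mirror of the detection logic above, taking any mapping of env vars."""
    if "RANK" in environ and "WORLD_SIZE" in environ:
        # torchrun / torch.distributed.launch path
        return {"rank": int(environ["RANK"]),
                "world_size": int(environ["WORLD_SIZE"]),
                "gpu": int(environ.get("LOCAL_RANK", 0))}
    if "SLURM_PROCID" in environ:
        # SLURM path: rank comes from the scheduler
        return {"rank": int(environ["SLURM_PROCID"])}
    return None  # single-process run: no distributed variables set
```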
## Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: CUDA error: no CUDA-capable device is detected` | No NVIDIA GPU present or driver not installed | Install NVIDIA drivers and ensure the GPU is visible via `nvidia-smi` |
| `RuntimeError: CUDA out of memory` | Model too large for available VRAM | Reduce batch size, use a smaller model, or use gradient checkpointing |
| `AssertionError: Default process group has not been initialized` | Distributed env vars not set | Set `RANK`, `WORLD_SIZE`, `LOCAL_RANK` or launch with `torchrun` |
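Since the first error above only surfaces once a benchmark call is reached, scripts can fail fast with a clearer message instead. A minimal guard, assuming nothing beyond `torch.cuda.is_available()` (the helper name `require_cuda` is ours):

```python
def require_cuda():
    """Raise early with an actionable message instead of a mid-run CUDA error."""
    try:
        import torch
    except ImportError:
        raise RuntimeError("PyTorch is not installed; install a CUDA build first")
    if not torch.cuda.is_available():
        raise RuntimeError(
            "No CUDA device visible. Check `nvidia-smi` output and confirm "
            "the installed torch wheel is a CUDA build, not CPU-only."
        )
```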
## Compatibility Notes

- Benchmark functions: `measure_latency`, `measure_memory`, `measure_fps`, and `measure_throughput` all require CUDA and will fail on CPU-only machines.
- Core pruning operations: The dependency graph builder, pruning functions, and importance estimators work on any device (CPU or GPU).
- Test suite: Tests use `device = 'cuda' if torch.cuda.is_available() else 'cpu'` for conditional device selection.
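The same conditional-device pattern the test suite uses can be wrapped as a small helper for scripts that should run (minus benchmarking) on CPU-only machines (the name `pick_device` is ours):

```python
def pick_device():
    """Return 'cuda' when a GPU is usable, else 'cpu' (also covers missing torch)."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```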