

Environment:VainF Torch Pruning CUDA GPU Benchmarking

From Leeroopedia


Knowledge Sources
Domains Infrastructure, Deep_Learning
Last Updated 2026-02-08 12:00 GMT

Overview

An NVIDIA CUDA GPU environment is required for latency benchmarking, memory profiling, and GPU-accelerated training workflows.

Description

This environment extends the core PyTorch/Python environment with a CUDA-capable NVIDIA GPU. While the Torch-Pruning library's core operations (dependency graph analysis, pruning execution, importance estimation) work on CPU, the benchmarking utilities (measure_latency, measure_memory, measure_fps, measure_throughput) unconditionally use CUDA timing events and memory APIs. Training workflows in the examples and reproduce directories also assume GPU availability.

The benchmarking module uses torch.cuda.Event(enable_timing=True) for precise latency measurement, torch.cuda.synchronize() for timing accuracy, and torch.cuda.max_memory_allocated() for peak memory tracking. These calls will fail without a CUDA-capable GPU.
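On a machine without a GPU, those calls raise immediately. One way to keep a script portable is to fall back to wall-clock timing on CPU; a minimal sketch of that idea (the helper name safe_latency_ms is ours, not part of Torch-Pruning):

```python
import time
import torch

def safe_latency_ms(model, example_inputs, repeats=10):
    """Average forward latency in milliseconds.

    Uses CUDA events when a GPU is available (as the library's
    benchmark module does), otherwise falls back to wall-clock
    timing. Assumes model and inputs already sit on the target device.
    """
    model.eval()
    with torch.no_grad():
        if torch.cuda.is_available():
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            torch.cuda.synchronize()
            start.record()
            for _ in range(repeats):
                model(example_inputs)
            end.record()
            torch.cuda.synchronize()
            return start.elapsed_time(end) / repeats
        # CPU fallback: plain wall-clock timing
        t0 = time.perf_counter()
        for _ in range(repeats):
            model(example_inputs)
        return (time.perf_counter() - t0) * 1000.0 / repeats
```

Note that the library's own measure_latency does not perform this fallback; guard calls to it with torch.cuda.is_available() if your script must also run on CPU-only machines.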

Usage

Use this environment when:

  • Running latency benchmarks to measure inference speed before/after pruning
  • Running memory profiling to measure peak VRAM usage
  • Training or fine-tuning models in the examples or reproduce directories
  • Distributed training with multiple GPUs (requires RANK, WORLD_SIZE, LOCAL_RANK environment variables)
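For the multi-GPU case above, launching with torchrun exports RANK, WORLD_SIZE, and LOCAL_RANK for each worker automatically, so they need not be set by hand (the script name below is illustrative):

```shell
# torchrun spawns one process per GPU and sets RANK, WORLD_SIZE, LOCAL_RANK
torchrun --nproc_per_node=4 train.py
```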

System Requirements

  • OS: Linux (recommended) or Windows with CUDA; macOS is not supported for CUDA
  • Hardware: NVIDIA GPU with CUDA support; no minimum VRAM specified, depends on model size
  • Driver: NVIDIA CUDA Toolkit; must match the PyTorch CUDA version

Dependencies

System Packages

  • NVIDIA GPU driver (compatible with CUDA toolkit version)
  • CUDA toolkit (bundled with PyTorch binary or installed separately)

Python Packages

  • torch >= 2.0 (CUDA build, not CPU-only)
  • All core dependencies from the PyTorch_Python_Core environment

Credentials

Distributed training environment variables (optional, only for multi-GPU):

  • RANK: Global rank of the current process
  • WORLD_SIZE: Total number of processes
  • LOCAL_RANK: Local rank on the current node
  • SLURM_PROCID: Alternative rank source for SLURM clusters
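The detection order matters: RANK/WORLD_SIZE (set by torchrun) take precedence, with SLURM_PROCID as the fallback. A small sketch of that resolution logic (the helper name resolve_dist_rank is hypothetical; the repository does this inline):

```python
import os

def resolve_dist_rank(env=None):
    """Return (rank, world_size, local_rank) from environment variables.

    Mirrors the detection order used in the reproduce utilities:
    torchrun-style variables first, then SLURM, then a
    single-process fallback (the fallback values are an assumption).
    """
    env = os.environ if env is None else env
    if "RANK" in env and "WORLD_SIZE" in env:
        return (int(env["RANK"]),
                int(env["WORLD_SIZE"]),
                int(env.get("LOCAL_RANK", 0)))
    if "SLURM_PROCID" in env:
        # The source only derives the global rank from SLURM
        return (int(env["SLURM_PROCID"]), None, None)
    return (0, 1, 0)  # single-process default
```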

Quick Install

# Install PyTorch with CUDA support (example for CUDA 12.1)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# Verify CUDA availability
python -c "import torch; print(torch.cuda.is_available())"

Code Evidence

CUDA timing events in torch_pruning/utils/benchmark.py:31-39:

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
if run_fn is not None:
    _ = run_fn(model, example_inputs)
else:
    _ = model(example_inputs)
end.record()
torch.cuda.synchronize()

Peak memory measurement in torch_pruning/utils/benchmark.py:59-65:

torch.cuda.reset_peak_memory_stats()
model.eval()
if run_fn is not None:
    _ = run_fn(model, example_inputs)
else:
    _ = model(example_inputs)
return torch.cuda.max_memory_allocated(device=device)

Distributed training environment variable detection from reproduce/engine/utils/imagenet_utils/utils.py:246-251:

if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
    args.rank = int(os.environ["RANK"])
    args.world_size = int(os.environ["WORLD_SIZE"])
    args.gpu = int(os.environ.get("LOCAL_RANK", 0))
elif "SLURM_PROCID" in os.environ:
    args.rank = int(os.environ["SLURM_PROCID"])

Common Errors

  • RuntimeError: CUDA error: no CUDA-capable device is detected
    Cause: No NVIDIA GPU present, or driver not installed. Solution: Install NVIDIA drivers and confirm the GPU is visible via nvidia-smi.
  • RuntimeError: CUDA out of memory
    Cause: Model too large for available VRAM. Solution: Reduce the batch size, use a smaller model, or use gradient checkpointing.
  • AssertionError: Default process group has not been initialized
    Cause: Distributed environment variables not set. Solution: Set RANK, WORLD_SIZE, and LOCAL_RANK, or launch with torchrun.
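For the out-of-memory case, a common recovery is to retry with a smaller batch. A sketch of that pattern, assuming torch >= 2.0 where torch.cuda.OutOfMemoryError is a distinct exception class (the helper name run_with_oom_backoff is ours):

```python
import torch

def run_with_oom_backoff(fn, batch_size, min_batch=1):
    """Call fn(batch_size), halving the batch on CUDA OOM.

    fn is any callable that accepts a batch size (e.g. a closure
    around a forward pass). Raises if even min_batch does not fit.
    """
    while batch_size >= min_batch:
        try:
            return fn(batch_size)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release cached blocks before retrying
            batch_size //= 2
    raise RuntimeError("Could not fit even the minimum batch size")
```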

Compatibility Notes

  • Benchmark functions: measure_latency, measure_memory, measure_fps, measure_throughput all require CUDA and will fail on CPU-only machines.
  • Core pruning operations: The dependency graph builder, pruning functions, and importance estimators work on any device (CPU or GPU).
  • Test suite: Tests use device = 'cuda' if torch.cuda.is_available() else 'cpu' for conditional device selection.
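Putting the compatibility notes together: core operations can run on whatever device is present, while GPU-only measurements are gated behind an availability check. A minimal sketch in plain torch (the model and shapes are illustrative):

```python
import torch

# Conditional device selection, the same pattern the test suite uses
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = torch.nn.Linear(8, 4).to(device)
x = torch.randn(2, 8, device=device)

with torch.no_grad():
    out = model(x)  # forward pass works on either device

if torch.cuda.is_available():
    # GPU-only measurement; would raise on a CPU-only machine
    peak_bytes = torch.cuda.max_memory_allocated()
```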
