

Environment:VainF Torch Pruning CUDA GPU Benchmarking

From Leeroopedia


Knowledge Sources
Domains Infrastructure, Deep_Learning
Last Updated 2026-02-08 12:00 GMT

Overview

An NVIDIA CUDA GPU environment is required for latency benchmarking, memory profiling, and GPU-accelerated training workflows.

Description

This environment extends the core PyTorch/Python environment with a CUDA-capable NVIDIA GPU. While the Torch-Pruning library's core operations (dependency graph analysis, pruning execution, importance estimation) work on CPU, the benchmarking utilities (measure_latency, measure_memory, measure_fps, measure_throughput) unconditionally use CUDA timing events and memory APIs. Training workflows in the examples and reproduce directories also assume GPU availability.

The benchmarking module uses torch.cuda.Event(enable_timing=True) for precise latency measurement, torch.cuda.synchronize() for timing accuracy, and torch.cuda.max_memory_allocated() for peak memory tracking. These calls will fail without a CUDA-capable GPU.
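On a machine without a GPU, those calls raise immediately. One way to keep a script portable is to fall back to wall-clock timing on CPU; a minimal sketch of that idea (the helper name safe_latency_ms is ours, not part of Torch-Pruning):

```python
import time
import torch

def safe_latency_ms(model, example_inputs, repeats=10):
    """Average forward latency in milliseconds.

    Uses CUDA events when a GPU is available (as the library's
    benchmark module does), otherwise falls back to wall-clock
    timing. Assumes model and inputs already sit on the target device.
    """
    model.eval()
    with torch.no_grad():
        if torch.cuda.is_available():
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            torch.cuda.synchronize()
            start.record()
            for _ in range(repeats):
                model(example_inputs)
            end.record()
            torch.cuda.synchronize()
            return start.elapsed_time(end) / repeats
        # CPU fallback: plain wall-clock timing
        t0 = time.perf_counter()
        for _ in range(repeats):
            model(example_inputs)
        return (time.perf_counter() - t0) * 1000.0 / repeats
```

Note that the library's own measure_latency does not perform this fallback; guard calls to it with torch.cuda.is_available() if your script must also run on CPU-only machines.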

Usage

Use this environment when:

  • Running latency benchmarks to measure inference speed before/after pruning
  • Running memory profiling to measure peak VRAM usage
  • Training or fine-tuning models in the examples or reproduce directories
  • Distributed training with multiple GPUs (requires RANK, WORLD_SIZE, LOCAL_RANK environment variables)
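For the multi-GPU case above, launching with torchrun exports RANK, WORLD_SIZE, and LOCAL_RANK for each worker automatically, so they need not be set by hand (the script name below is illustrative):

```shell
# torchrun spawns one process per GPU and sets RANK, WORLD_SIZE, LOCAL_RANK
torchrun --nproc_per_node=4 train.py
```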

System Requirements

  • OS: Linux (recommended) or Windows with CUDA; macOS is not supported for CUDA
  • Hardware: NVIDIA GPU with CUDA support; no minimum VRAM specified, depends on model size
  • Driver: NVIDIA CUDA Toolkit; must match the PyTorch CUDA version

Dependencies

System Packages

  • NVIDIA GPU driver (compatible with CUDA toolkit version)
  • CUDA toolkit (bundled with PyTorch binary or installed separately)

Python Packages

  • torch >= 2.0 (CUDA build, not CPU-only)
  • All core dependencies from the PyTorch_Python_Core environment

Credentials

Distributed training environment variables (optional, only for multi-GPU):

  • RANK: Global rank of the current process
  • WORLD_SIZE: Total number of processes
  • LOCAL_RANK: Local rank on the current node
  • SLURM_PROCID: Alternative rank source for SLURM clusters
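The detection order matters: RANK/WORLD_SIZE (set by torchrun) take precedence, with SLURM_PROCID as the fallback. A small sketch of that resolution logic (the helper name resolve_dist_rank is hypothetical; the repository does this inline):

```python
import os

def resolve_dist_rank(env=None):
    """Return (rank, world_size, local_rank) from environment variables.

    Mirrors the detection order used in the reproduce utilities:
    torchrun-style variables first, then SLURM, then a
    single-process fallback (the fallback values are an assumption).
    """
    env = os.environ if env is None else env
    if "RANK" in env and "WORLD_SIZE" in env:
        return (int(env["RANK"]),
                int(env["WORLD_SIZE"]),
                int(env.get("LOCAL_RANK", 0)))
    if "SLURM_PROCID" in env:
        # The source only derives the global rank from SLURM
        return (int(env["SLURM_PROCID"]), None, None)
    return (0, 1, 0)  # single-process default
```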

Quick Install

# Install PyTorch with CUDA support (example for CUDA 12.1)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# Verify CUDA availability
python -c "import torch; print(torch.cuda.is_available())"

Code Evidence

CUDA timing events in torch_pruning/utils/benchmark.py:31-39:

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
if run_fn is not None:
    _ = run_fn(model, example_inputs)
else:
    _ = model(example_inputs)
end.record()
torch.cuda.synchronize()

Peak memory measurement in torch_pruning/utils/benchmark.py:59-65:

torch.cuda.reset_peak_memory_stats()
model.eval()
if run_fn is not None:
    _ = run_fn(model, example_inputs)
else:
    _ = model(example_inputs)
return torch.cuda.max_memory_allocated(device=device)

Distributed training environment variable detection from reproduce/engine/utils/imagenet_utils/utils.py:246-251:

if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
    args.rank = int(os.environ["RANK"])
    args.world_size = int(os.environ["WORLD_SIZE"])
    args.gpu = int(os.environ.get("LOCAL_RANK", 0))
elif "SLURM_PROCID" in os.environ:
    args.rank = int(os.environ["SLURM_PROCID"])

Common Errors

  • RuntimeError: CUDA error: no CUDA-capable device is detected
    Cause: No NVIDIA GPU present, or driver not installed. Solution: Install NVIDIA drivers and confirm the GPU is visible via nvidia-smi.
  • RuntimeError: CUDA out of memory
    Cause: Model too large for available VRAM. Solution: Reduce the batch size, use a smaller model, or use gradient checkpointing.
  • AssertionError: Default process group has not been initialized
    Cause: Distributed environment variables not set. Solution: Set RANK, WORLD_SIZE, and LOCAL_RANK, or launch with torchrun.
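For the out-of-memory case, a common recovery is to retry with a smaller batch. A sketch of that pattern, assuming torch >= 2.0 where torch.cuda.OutOfMemoryError is a distinct exception class (the helper name run_with_oom_backoff is ours):

```python
import torch

def run_with_oom_backoff(fn, batch_size, min_batch=1):
    """Call fn(batch_size), halving the batch on CUDA OOM.

    fn is any callable that accepts a batch size (e.g. a closure
    around a forward pass). Raises if even min_batch does not fit.
    """
    while batch_size >= min_batch:
        try:
            return fn(batch_size)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release cached blocks before retrying
            batch_size //= 2
    raise RuntimeError("Could not fit even the minimum batch size")
```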

Compatibility Notes

  • Benchmark functions: measure_latency, measure_memory, measure_fps, measure_throughput all require CUDA and will fail on CPU-only machines.
  • Core pruning operations: The dependency graph builder, pruning functions, and importance estimators work on any device (CPU or GPU).
  • Test suite: Tests use device = 'cuda' if torch.cuda.is_available() else 'cpu' for conditional device selection.
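Putting the compatibility notes together: core operations can run on whatever device is present, while GPU-only measurements are gated behind an availability check. A minimal sketch in plain torch (the model and shapes are illustrative):

```python
import torch

# Conditional device selection, the same pattern the test suite uses
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = torch.nn.Linear(8, 4).to(device)
x = torch.randn(2, 8, device=device)

with torch.no_grad():
    out = model(x)  # forward pass works on either device

if torch.cuda.is_available():
    # GPU-only measurement; would raise on a CPU-only machine
    peak_bytes = torch.cuda.max_memory_allocated()
```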
