Principle: VainF Torch-Pruning Latency Benchmarking
Metadata
| Field | Value |
|---|---|
| Domains | Deep_Learning, Model_Analysis, Benchmarking |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Empirical measurement of actual GPU inference time to validate that theoretical FLOPs reduction translates to real-world speedup.
Description
FLOPs reduction does not always translate linearly to wall-clock speedup due to memory bandwidth, parallelism, and hardware-specific effects. Latency benchmarking measures actual inference time by running the model multiple times on GPU with proper warmup, using CUDA events for precise timing. This provides the ground truth for whether pruning achieved meaningful speedup.
Usage
Use after pruning (and optionally before) to measure actual speedup on target hardware. Important for validating that the pruned architecture is actually faster, not just theoretically smaller.
Theoretical Basis
Measure latency as follows:
- Warm up for N iterations (to stabilize GPU state and caches).
- Time M iterations using CUDA events for hardware-accurate measurement.
- Report mean +/- std of per-iteration latency.
- Speedup ratio = latency_original / latency_pruned.
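The steps above can be sketched as a small helper. This is a hypothetical function (not part of Torch-Pruning's public API) that uses `torch.cuda.Event` for device-side timing when a GPU is available and falls back to wall-clock timing on CPU:

```python
import time
import torch

@torch.no_grad()
def benchmark_latency(model, example_inputs, warmup=20, iters=100):
    """Return (mean, std) per-iteration inference latency in milliseconds."""
    model.eval()
    use_cuda = next(model.parameters()).device.type == "cuda"

    # Warmup: stabilize GPU clocks/caches, trigger lazy kernel compilation.
    for _ in range(warmup):
        model(example_inputs)

    times_ms = []
    if use_cuda:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        for _ in range(iters):
            start.record()
            model(example_inputs)
            end.record()
            torch.cuda.synchronize()  # wait so elapsed_time() is valid
            times_ms.append(start.elapsed_time(end))  # milliseconds
    else:
        # CPU fallback: plain wall-clock timing.
        for _ in range(iters):
            t0 = time.perf_counter()
            model(example_inputs)
            times_ms.append((time.perf_counter() - t0) * 1e3)

    t = torch.tensor(times_ms)
    return t.mean().item(), t.std().item()
```

Run it on both the original and pruned models with the same `example_inputs`; the speedup ratio is the mean latency of the original divided by that of the pruned model.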
Key considerations:
- CUDA events record timestamps on the GPU stream itself, so the measurement excludes CPU-side launch and synchronization overhead and reflects device execution time.
- Warmup iterations eliminate cold-start effects such as kernel compilation and memory allocation.
- Standard deviation captures variance from GPU scheduling and thermal throttling.
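Putting the considerations together, here is a minimal end-to-end sketch computing the speedup ratio. The two models are placeholder architectures for illustration (in practice the pruned model would come from a Torch-Pruning pass), and the CPU timing path is used when no GPU is present:

```python
import time
import torch

@torch.no_grad()
def mean_latency_ms(model, x, warmup=10, iters=50):
    """Average per-iteration latency in ms, with warmup and sync barriers."""
    model.eval()
    for _ in range(warmup):
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()  # drain pending work before timing
    t0 = time.perf_counter()
    for _ in range(iters):
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()  # ensure all timed work has finished
    return (time.perf_counter() - t0) / iters * 1e3

# Placeholder "original" and "pruned" models: the pruned one has a
# narrower hidden layer, mimicking the effect of channel pruning.
original = torch.nn.Sequential(
    torch.nn.Linear(256, 256), torch.nn.ReLU(), torch.nn.Linear(256, 256))
pruned = torch.nn.Sequential(
    torch.nn.Linear(256, 128), torch.nn.ReLU(), torch.nn.Linear(128, 256))

x = torch.randn(32, 256)
speedup = mean_latency_ms(original, x) / mean_latency_ms(pruned, x)
print(f"speedup: {speedup:.2f}x")
```

Reporting the standard deviation alongside the mean (as in the procedure above) makes run-to-run variance from GPU scheduling and thermal throttling visible in the result.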