Implementation:VainF Torch Pruning Measure Latency

Metadata

Field	Value
Source	Torch-Pruning
Domains	Deep_Learning, Benchmarking
Last Updated	2026-02-08 00:00 GMT

Overview

Concrete tool for measuring GPU inference latency provided by Torch-Pruning.

Description

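measure_latency puts the model in eval mode, runs warmup untimed forward passes, and then times repeat forward passes with torch.cuda.Event, so the measurement reflects time spent on the GPU rather than host-side kernel-launch time. It returns the mean and standard deviation of the per-pass latency in milliseconds. Because the timing relies on CUDA events, the model and example_inputs must be on a CUDA device.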
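
The warmup-then-time pattern described above can be sketched as follows. This is a minimal illustration, not the library source: the function name latency_sketch, the torch.no_grad() context, the CPU fallback, and the small repeat/warmup values are all assumptions chosen so the sketch runs anywhere.

```python
import time

import torch
import torch.nn as nn


def latency_sketch(model, example_inputs, repeat=20, warmup=5):
    """Hypothetical re-creation of the warmup-then-time pattern (not the library code)."""
    model.eval()
    use_cuda = torch.cuda.is_available() and example_inputs.is_cuda
    samples_ms = []
    with torch.no_grad():  # assumption: gradients are not needed for timing
        # Warmup passes are executed but never timed.
        for _ in range(warmup):
            model(example_inputs)
        for _ in range(repeat):
            if use_cuda:
                start = torch.cuda.Event(enable_timing=True)
                end = torch.cuda.Event(enable_timing=True)
                start.record()
                model(example_inputs)
                end.record()
                # Wait for the GPU to finish before reading the event timestamps.
                torch.cuda.synchronize()
                samples_ms.append(start.elapsed_time(end))  # milliseconds
            else:
                # CPU fallback so the sketch still runs without a GPU.
                t0 = time.perf_counter()
                model(example_inputs)
                samples_ms.append((time.perf_counter() - t0) * 1000.0)
    samples = torch.tensor(samples_ms)
    return samples.mean().item(), samples.std().item()


device = "cuda" if torch.cuda.is_available() else "cpu"
tiny_model = nn.Linear(8, 4).to(device)
tiny_input = torch.randn(2, 8, device=device)
mean_ms, std_ms = latency_sketch(tiny_model, tiny_input)
print(f"{mean_ms:.3f} +/- {std_ms:.3f} ms")
```

Recording the events around the forward pass and synchronizing before calling elapsed_time is what makes the reading reflect GPU execution; a plain wall-clock timer around an asynchronous CUDA call would mostly measure launch overhead.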

Code Reference

Source: torch_pruning/utils/benchmark.py, Lines 6-43
Signature:

def measure_latency(model, example_inputs, repeat=300, warmup=50, run_fn=None):
    """Measure model inference latency.

    Returns:
        Tuple of (mean_latency_ms, std_latency_ms).
    """

Import:

import torch_pruning as tp
tp.utils.benchmark.measure_latency

I/O Contract

Inputs

Parameter	Type	Required	Default
model	nn.Module	Yes	—
example_inputs	Tensor	Yes	—
repeat	int	No	300
warmup	int	No	50
run_fn	Callable	No	None
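
The page does not document how run_fn is invoked. The sketch below assumes it is called as run_fn(model, example_inputs) in place of the default model(example_inputs), which would let a model with a non-standard forward signature be timed. TwoInputNet and run_two_inputs are hypothetical names introduced for illustration; check the calling convention against the source before relying on it.

```python
import importlib.util

import torch
import torch.nn as nn


class TwoInputNet(nn.Module):
    """Hypothetical model whose forward takes two tensors."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 4)

    def forward(self, a, b):
        return self.fc(a + b)


def run_two_inputs(model, example_inputs):
    # Assumed calling convention: run_fn(model, example_inputs).
    return model(*example_inputs)


model = TwoInputNet().eval()
example_inputs = (torch.randn(1, 16), torch.randn(1, 16))

# Sanity-check the custom runner on CPU before timing anything.
out = run_two_inputs(model, example_inputs)

has_tp = importlib.util.find_spec("torch_pruning") is not None
if torch.cuda.is_available() and has_tp:
    from torch_pruning.utils.benchmark import measure_latency

    model = model.cuda()
    example_inputs = tuple(t.cuda() for t in example_inputs)
    mean_ms, std_ms = measure_latency(model, example_inputs, run_fn=run_two_inputs)
    print(f"{mean_ms:.2f} +/- {std_ms:.2f} ms")
```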

Outputs

(mean_latency_ms: float, std_latency_ms: float)

Usage Examples

import torch
import torch.nn as nn
import torch_pruning as tp
from torch_pruning.utils.benchmark import measure_latency

# Build a simple model and move to GPU
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.Conv2d(64, 128, 3, padding=1),
).cuda().eval()

example_inputs = torch.randn(1, 3, 224, 224).cuda()

# Measure latency BEFORE pruning
mean_before, std_before = measure_latency(model, example_inputs)
print(f"Before pruning: {mean_before:.2f} +/- {std_before:.2f} ms")

# ... apply pruning ...

# Measure latency AFTER pruning
mean_after, std_after = measure_latency(model, example_inputs)
print(f"After pruning:  {mean_after:.2f} +/- {std_after:.2f} ms")
print(f"Speedup: {mean_before / mean_after:.2f}x")
