# Heuristic: ProtectAI LLM Guard PyTorch Compile Warmup
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Infrastructure |
| Last Updated | 2026-02-14 12:00 GMT |
## Overview
Improve PyTorch inference performance by setting float32 matrix-multiplication precision to 'high' and enabling the inductor FX graph cache to reduce warm compile times.
## Description
PyTorch provides several global settings that can improve inference performance. The benchmarks module in LLM Guard uses two specific optimizations: (1) setting torch.set_float32_matmul_precision('high') to allow TF32 operations on Ampere+ GPUs, and (2) enabling torch._inductor.config.fx_graph_cache = True to cache compiled graphs across runs. The API server also uses torch.set_num_threads(1) to prevent thread contention in multi-worker deployments.
## Usage
Use these heuristics when running benchmarks or GPU-accelerated inference on Ampere (A100) or newer NVIDIA GPUs. The FX graph cache is beneficial for repeated inference with the same model architecture, reducing compilation overhead on subsequent runs. The thread limit is important for multi-worker API deployments to prevent CPU oversubscription.
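For repeated runs to actually hit the cache, it can help to pin the inductor cache to a persistent directory so cached graphs survive process or container restarts. A minimal sketch using the standard `TORCHINDUCTOR_CACHE_DIR` environment variable; the path shown is illustrative, not from the source:

```python
import os

# Pin the inductor FX graph cache to a persistent directory so cached
# compiled graphs survive process restarts. TORCHINDUCTOR_CACHE_DIR is
# the standard PyTorch environment variable; the path is illustrative.
os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", "/var/cache/inductor")
```

This must run before the first compiled forward pass, alongside the settings below.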
## The Insight (Rule of Thumb)
- Action 1: Set `torch.set_float32_matmul_precision('high')` at the start of your script.
- Action 2: Enable `torch._inductor.config.fx_graph_cache = True`.
- Action 3: For multi-worker serving, set `torch.set_num_threads(1)`.
- Value: All boolean/enum flags; no parameter tuning needed.
- Trade-off: `float32_matmul_precision='high'` uses TF32, which has slightly reduced precision (10-bit mantissa vs. 23-bit for FP32). This is generally acceptable for inference classification tasks. Setting `num_threads=1` reduces single-request throughput but prevents contention.
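To make the precision trade-off concrete: TF32 keeps FP32's 8-bit exponent but truncates the mantissa to 10 explicit bits, so the worst-case relative rounding error is bounded by 2⁻¹¹ ≈ 4.9e-4. The pure-Python sketch below is not from the source; it only simulates the mantissa rounding to show the error scale:

```python
import math

def quantize(x: float, mantissa_bits: int) -> float:
    """Round x to the given number of explicit mantissa bits
    (10 for TF32, 23 for FP32)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                # x = m * 2**e, 0.5 <= |m| < 1
    scale = 2.0 ** (mantissa_bits + 1)  # +1 for the implicit leading bit
    return math.ldexp(round(m * scale) / scale, e)

x = 1.2345678
tf32 = quantize(x, 10)                   # 1.234375, ~1.6e-4 relative error
fp32 = quantize(x, 23)
assert abs(tf32 - x) / x <= 2 ** -11     # TF32 worst-case bound
assert abs(fp32 - x) / x <= 2 ** -24     # FP32 is far tighter
```

Errors of this magnitude are swamped by the softmax in a classification head, which is why the precision loss is considered acceptable here.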
## Reasoning
On Ampere+ GPUs, TF32 matrix multiplications are up to 8x faster than FP32 for the same operations. For classification tasks (where outputs are softmax probabilities), the precision loss is negligible. The FX graph cache avoids re-compilation of PyTorch inductor graphs when the same model shapes are used across sessions. Thread limiting prevents the common problem where multiple Uvicorn workers each spawn N PyTorch threads, causing CPU oversubscription.
```python
import torch

# From benchmarks/run.py:17-21
torch.set_float32_matmul_precision("high")
import torch._inductor.config
torch._inductor.config.fx_graph_cache = True

# From llm_guard_api/app/scanner.py:31
torch.set_num_threads(1)
```
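The oversubscription problem is simple arithmetic: each worker process sizes its own PyTorch intraop thread pool to roughly the machine's core count, so thread counts multiply. An illustrative sketch (the worker and core counts are hypothetical, not from the source):

```python
cores = 8     # physical cores on the host (illustrative)
workers = 4   # Uvicorn worker processes (illustrative)

# Default: each worker's PyTorch intraop pool sizes itself to the core count,
# so total threads = workers * cores -- far more threads than cores.
default_total = workers * cores   # 32 threads competing for 8 cores

# With torch.set_num_threads(1) called in each worker:
limited_total = workers * 1       # 4 threads, one per worker, no contention

assert default_total > cores      # oversubscribed
assert limited_total <= cores     # fits on the machine
```

This is why the limit matters only for multi-worker serving: a single benchmark process benefits from the full intraop pool.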