
Heuristic: ProtectAI LLM Guard PyTorch Compile Warmup

From Leeroopedia
Knowledge Sources
Domains Optimization, Infrastructure
Last Updated 2026-02-14 12:00 GMT

Overview

Improve PyTorch inference performance by setting float32 matrix multiplication precision to 'high' (enabling TF32 on supported GPUs) and enabling the inductor FX graph cache to reduce warm compile times.

Description

PyTorch provides several global settings that can improve inference performance. The benchmarks module in LLM Guard uses two specific optimizations: (1) setting torch.set_float32_matmul_precision('high') to allow TF32 operations on Ampere+ GPUs, and (2) enabling torch._inductor.config.fx_graph_cache = True to cache compiled graphs across runs. The API server also uses torch.set_num_threads(1) to prevent thread contention in multi-worker deployments.

Usage

Use these heuristics when running benchmarks or GPU-accelerated inference on Ampere (A100) or newer NVIDIA GPUs. The FX graph cache is beneficial for repeated inference with the same model architecture, reducing compilation overhead on subsequent runs. The thread limit is important for multi-worker API deployments to prevent CPU oversubscription.
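The FX graph cache only pays off when the model is actually compiled. A minimal configuration sketch of the intended usage follows; the `torch.nn.Linear` stand-in and the `torch.compile` call are illustrative, not taken from the LLM Guard sources:

```python
import torch
import torch._inductor.config

# Global settings, applied once at process start.
torch.set_float32_matmul_precision("high")
torch._inductor.config.fx_graph_cache = True

# Stand-in for a classifier model (hypothetical, for illustration only).
model = torch.nn.Linear(768, 2)
compiled = torch.compile(model)  # first call compiles; the cache makes later runs warm

with torch.inference_mode():
    logits = compiled(torch.randn(1, 768))
```

On a fresh process with the cache populated, `torch.compile` reuses the cached inductor graph instead of recompiling, which is where the warm-start savings come from.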

The Insight (Rule of Thumb)

  • Action 1: Set torch.set_float32_matmul_precision('high') at the start of your script.
  • Action 2: Enable torch._inductor.config.fx_graph_cache = True.
  • Action 3: For multi-worker serving, set torch.set_num_threads(1).
  • Value: All boolean/enum flags; no parameter tuning needed.
  • Trade-off: float32_matmul_precision='high' uses TF32 which has slightly reduced precision (10-bit mantissa vs 23-bit for FP32). This is generally acceptable for inference classification tasks. Setting num_threads=1 reduces single-request throughput but prevents contention.
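The mantissa gap in the trade-off above can be illustrated in plain Python by rounding a value to 10 explicit mantissa bits (TF32-style) versus 23 (FP32-style). The `quantize_mantissa` helper is hypothetical, written only for this sketch:

```python
import math

def quantize_mantissa(x: float, mantissa_bits: int) -> float:
    """Round x to a float with the given number of explicit mantissa bits
    (hypothetical helper; models TF32/FP32 rounding, not PyTorch API)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x = m * 2**e, with 0.5 <= |m| < 1
    scale = 2 ** (mantissa_bits + 1)  # +1 accounts for the implicit leading bit
    return math.ldexp(round(m * scale) / scale, e)

x = 1.0 / 3.0
tf32 = quantize_mantissa(x, 10)   # TF32-style 10-bit mantissa
fp32 = quantize_mantissa(x, 23)   # FP32-style 23-bit mantissa
rel_err_tf32 = abs(tf32 - x) / x  # on the order of 2**-11
rel_err_fp32 = abs(fp32 - x) / x  # on the order of 2**-24
```

The TF32 relative error lands around 1e-4, versus roughly 1e-8 for FP32; for softmax-based classification this difference is far below the decision margin of typical models.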

Reasoning

On Ampere+ GPUs, TF32 matrix multiplications are up to 8x faster than FP32 for the same operations. For classification tasks (where outputs are softmax probabilities), the precision loss is negligible. The FX graph cache avoids re-compilation of PyTorch inductor graphs when the same model shapes are used across sessions. Thread limiting prevents the common problem where multiple Uvicorn workers each spawn N PyTorch threads, causing CPU oversubscription.

# From benchmarks/run.py:17-21
import torch
import torch._inductor.config

torch.set_float32_matmul_precision("high")
torch._inductor.config.fx_graph_cache = True

# From llm_guard_api/app/scanner.py:31
torch.set_num_threads(1)
