Heuristic: ProtectAI LLM Guard ONNX Runtime Optimization
| Knowledge Sources | |
|---|---|
| Domains | Optimization, NLP |
| Last Updated | 2026-02-14 12:00 GMT |
Overview
Performance optimization technique that runs transformer-based scanner inference through ONNX Runtime, yielding significant speedups, especially on CPU-only deployments.
Description
ONNX (Open Neural Network Exchange) Runtime provides optimized inference for machine learning models. LLM Guard ships pre-exported ONNX model variants for most scanner models. When enabled, the `optimum` library loads ONNX-optimized versions of models (e.g., DeBERTa for prompt injection, RoBERTa for toxicity) instead of standard PyTorch models. This optimization is especially impactful on CPU, where ONNX Runtime applies graph-level optimizations, operator fusion, and hardware-specific kernels.
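The loading behavior can be pictured as a simple dispatch: prefer a pre-exported ONNX artifact when ONNX is requested and present, and fall back to the PyTorch weights otherwise. A minimal sketch, assuming a dict-free file layout; the helper name and filenames here are illustrative, not LLM Guard's actual loader:

```python
from pathlib import Path

def pick_model_file(model_dir: str, use_onnx: bool) -> str:
    """Illustrative loader dispatch: prefer the pre-exported ONNX
    variant when requested and present, else the PyTorch weights.
    (Hypothetical helper -- not LLM Guard's real loading code.)"""
    onnx_path = Path(model_dir) / "model.onnx"          # pre-exported ONNX variant
    torch_path = Path(model_dir) / "pytorch_model.bin"  # standard PyTorch weights
    if use_onnx and onnx_path.exists():
        return str(onnx_path)
    return str(torch_path)
```

With `use_onnx=True` but no ONNX artifact on disk, a real loader would export one first; this is the "initial export on first run" cost noted under Trade-off below.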
Usage
Use this heuristic when deploying LLM Guard in production or when scanner latency is a concern. Especially recommended for CPU-only deployments where ONNX provides the largest speedup. The API server automatically enables ONNX for all supported scanners.
The Insight (Rule of Thumb)
- Action: Pass `use_onnx=True` when initializing any scanner that supports it.
- Value: Set per-scanner: `scanner = PromptInjection(use_onnx=True)`.
- Trade-off: Requires installing `optimum[onnxruntime]` (an extra ~200 MB). Some models may need an initial export on first run if no pre-exported ONNX variant exists.
- Coverage: The API server forces ONNX for: Anonymize, BanCode, BanTopics, Code, EmotionDetection, Gibberish, Language, PromptInjection, and Toxicity (input); and BanCode, BanTopics, Bias, Code, EmotionDetection, Language, LanguageSame, MaliciousURLs, NoRefusal, FactualConsistency, Gibberish, Relevance, Sensitive, and Toxicity (output).
Reasoning
Transformer models used by LLM Guard scanners (DeBERTa, RoBERTa, BGE) are compute-intensive during inference. ONNX Runtime applies graph optimizations (constant folding, operator fusion, memory planning) and uses hardware-specific execution providers (MKL-DNN for CPU, cuDNN for GPU) that are more efficient than the default PyTorch eager execution. The speedup is most pronounced on CPU, where PyTorch lacks these low-level optimizations. The API server codifies this as a best practice by forcing `use_onnx=True` for all supported scanners in the scanner initialization logic.
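One way to check the claimed speedup on your own hardware is to time the two backends on identical inputs. A backend-agnostic timing harness (a sketch; the callables you pass in, e.g. a PyTorch scanner's and an ONNX scanner's `scan`, are stand-ins):

```python
import time

def mean_latency_ms(fn, payload, warmup=3, runs=20):
    """Average wall-clock latency of fn(payload) in milliseconds.
    Warm-up iterations are excluded so one-time costs (graph
    optimization, kernel selection, first-run export) don't skew
    the average."""
    for _ in range(warmup):
        fn(payload)
    start = time.perf_counter()
    for _ in range(runs):
        fn(payload)
    return (time.perf_counter() - start) / runs * 1000.0
```

Compare e.g. `mean_latency_ms(pytorch_scanner.scan, prompt)` against `mean_latency_ms(onnx_scanner.scan, prompt)`; on CPU the ONNX figure should be noticeably lower.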
```python
# From llm_guard_api/app/scanner.py:119-130
# API server forces ONNX for all supported scanners
if scanner_name in [
    "Anonymize", "BanCode", "BanTopics", "Code",
    "EmotionDetection", "Gibberish", "Language",
    "PromptInjection", "Toxicity",
]:
    scanner_config["use_onnx"] = True
```
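The same forcing rule can be expressed as a standalone helper for sanity-checking deployment configs. A sketch: the dict-based `scanner_config` shape and the helper name are assumptions, not the API server's actual structure:

```python
# Input scanners the API server forces ONNX for (from the snippet above)
ONNX_INPUT_SCANNERS = frozenset([
    "Anonymize", "BanCode", "BanTopics", "Code",
    "EmotionDetection", "Gibberish", "Language",
    "PromptInjection", "Toxicity",
])

def apply_onnx_policy(scanner_name: str, scanner_config: dict) -> dict:
    """Return a config with use_onnx forced on for supported scanners,
    leaving unsupported scanners' configs untouched. (Hypothetical
    helper mirroring the server's initialization logic.)"""
    if scanner_name in ONNX_INPUT_SCANNERS:
        return {**scanner_config, "use_onnx": True}
    return scanner_config
```

Returning a new dict rather than mutating in place keeps the helper safe to call on shared config objects.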