Heuristic: ProtectAI LLM Guard ONNX Runtime Optimization
| Knowledge Sources | |
|---|---|
| Domains | Optimization, NLP |
| Last Updated | 2026-02-14 12:00 GMT |
Overview
Performance optimization technique that runs transformer-based scanner inference through ONNX Runtime, yielding significant speedups, especially on CPU-only deployments.
Description
ONNX (Open Neural Network Exchange) Runtime provides optimized inference for machine learning models. LLM Guard ships pre-exported ONNX model variants for most scanner models. When enabled, the `optimum` library loads ONNX-optimized versions of models (e.g., DeBERTa for prompt injection, RoBERTa for toxicity) instead of standard PyTorch models. This optimization is especially impactful on CPU, where ONNX Runtime applies graph-level optimizations, operator fusion, and hardware-specific kernels.
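The loading behavior can be pictured as a simple dispatch: prefer a pre-exported ONNX artifact when ONNX is requested and present, and fall back to the PyTorch weights otherwise. A minimal sketch, assuming a dict-free file layout; the helper name and filenames here are illustrative, not LLM Guard's actual loader:

```python
from pathlib import Path

def pick_model_file(model_dir: str, use_onnx: bool) -> str:
    """Illustrative loader dispatch: prefer the pre-exported ONNX
    variant when requested and present, else the PyTorch weights.
    (Hypothetical helper -- not LLM Guard's real loading code.)"""
    onnx_path = Path(model_dir) / "model.onnx"          # pre-exported ONNX variant
    torch_path = Path(model_dir) / "pytorch_model.bin"  # standard PyTorch weights
    if use_onnx and onnx_path.exists():
        return str(onnx_path)
    return str(torch_path)
```

With `use_onnx=True` but no ONNX artifact on disk, a real loader would export one first; this is the "initial export on first run" cost noted under Trade-off below.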
Usage
Use this heuristic when deploying LLM Guard in production or when scanner latency is a concern. Especially recommended for CPU-only deployments where ONNX provides the largest speedup. The API server automatically enables ONNX for all supported scanners.
The Insight (Rule of Thumb)
- Action: Pass `use_onnx=True` when initializing any scanner that supports it.
- Value: Set per-scanner: `scanner = PromptInjection(use_onnx=True)`.
- Trade-off: Requires installing `optimum[onnxruntime]` (an extra ~200 MB). Some models may need an initial export on first run if no pre-exported ONNX variant exists.
- Coverage: The API server forces ONNX for: Anonymize, BanCode, BanTopics, Code, EmotionDetection, Gibberish, Language, PromptInjection, and Toxicity (input); and BanCode, BanTopics, Bias, Code, EmotionDetection, Language, LanguageSame, MaliciousURLs, NoRefusal, FactualConsistency, Gibberish, Relevance, Sensitive, and Toxicity (output).
Reasoning
Transformer models used by LLM Guard scanners (DeBERTa, RoBERTa, BGE) are compute-intensive during inference. ONNX Runtime applies graph optimizations (constant folding, operator fusion, memory planning) and uses hardware-specific execution providers (MKL-DNN for CPU, cuDNN for GPU) that are more efficient than the default PyTorch eager execution. The speedup is most pronounced on CPU, where PyTorch lacks these low-level optimizations. The API server codifies this as a best practice by forcing `use_onnx=True` for all supported scanners in the scanner initialization logic.
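One way to check the claimed speedup on your own hardware is to time the two backends on identical inputs. A backend-agnostic timing harness (a sketch; the callables you pass in, e.g. a PyTorch scanner's and an ONNX scanner's `scan`, are stand-ins):

```python
import time

def mean_latency_ms(fn, payload, warmup=3, runs=20):
    """Average wall-clock latency of fn(payload) in milliseconds.
    Warm-up iterations are excluded so one-time costs (graph
    optimization, kernel selection, first-run export) don't skew
    the average."""
    for _ in range(warmup):
        fn(payload)
    start = time.perf_counter()
    for _ in range(runs):
        fn(payload)
    return (time.perf_counter() - start) / runs * 1000.0
```

Compare e.g. `mean_latency_ms(pytorch_scanner.scan, prompt)` against `mean_latency_ms(onnx_scanner.scan, prompt)`; on CPU the ONNX figure should be noticeably lower.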
```python
# From llm_guard_api/app/scanner.py:119-130
# API server forces ONNX for all supported scanners
if scanner_name in [
    "Anonymize", "BanCode", "BanTopics", "Code",
    "EmotionDetection", "Gibberish", "Language",
    "PromptInjection", "Toxicity",
]:
    scanner_config["use_onnx"] = True
```
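The same forcing rule can be expressed as a standalone helper for sanity-checking deployment configs. A sketch: the dict-based `scanner_config` shape and the helper name are assumptions, not the API server's actual structure:

```python
# Input scanners the API server forces ONNX for (from the snippet above)
ONNX_INPUT_SCANNERS = frozenset([
    "Anonymize", "BanCode", "BanTopics", "Code",
    "EmotionDetection", "Gibberish", "Language",
    "PromptInjection", "Toxicity",
])

def apply_onnx_policy(scanner_name: str, scanner_config: dict) -> dict:
    """Return a config with use_onnx forced on for supported scanners,
    leaving unsupported scanners' configs untouched. (Hypothetical
    helper mirroring the server's initialization logic.)"""
    if scanner_name in ONNX_INPUT_SCANNERS:
        return {**scanner_config, "use_onnx": True}
    return scanner_config
```

Returning a new dict rather than mutating in place keeps the helper safe to call on shared config objects.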