Workflow:Protectai Llm guard Scanner Benchmarking

Knowledge Sources	LLM Guard LLM Guard Docs
Domains	LLM_Security, Performance_Testing, Benchmarking
Last Updated	2026-02-14 12:00 GMT

Overview

End-to-end process for measuring and comparing the latency, throughput, and variance of individual LLM Guard scanners using the built-in benchmarking CLI tool.

Description

This workflow uses the LLM Guard benchmarking harness to profile individual scanner performance. The tool instantiates a scanner with configurable parameters (including ONNX optimization), runs it against representative test data multiple times, and reports statistical metrics including average latency, percentile latencies (p90, p95, p99), variance, and throughput (characters per second). Results are output as structured JSON for easy integration into CI pipelines and performance dashboards.

Usage

Execute this workflow when you need to evaluate scanner performance before deploying to production, compare PyTorch versus ONNX Runtime inference speeds, identify bottleneck scanners in your pipeline, or validate that scanner latency meets your application's SLA requirements.

Execution Steps

Step 1: Prepare benchmark test data

Create or verify the JSON test data files that provide representative inputs for each scanner. The input test data file contains prompt strings keyed by scanner name. The output test data file contains prompt-output pairs keyed by scanner name. Each entry should be representative of real-world inputs the scanner will process in production.

Key considerations:

Input examples are stored in input_examples.json with scanner names as keys
Output examples are stored in output_examples.json with scanner names as keys, each containing a prompt-output pair
Test data should include both benign and adversarial examples to measure worst-case performance
Input length affects throughput calculations, so use production-representative text lengths

Step 2: Select the scanner and configuration

Choose the scanner to benchmark, its type (input or output), the number of repetitions, and whether to use ONNX Runtime optimization. The benchmark tool accepts these as command-line arguments.

Key considerations:

The type argument specifies whether to benchmark an input scanner or output scanner
The scanner argument is the scanner class name (e.g., PromptInjection, Toxicity, Anonymize)
The repeat argument controls how many times to run the scanner (default 5) for statistical reliability
The use-onnx flag enables ONNX Runtime inference where supported, which is typically faster for production

Step 3: Run the benchmark

Execute the benchmarking CLI tool. The tool instantiates the selected scanner, loads test data, performs a warmup run, and then executes the scanner the specified number of times while recording latency for each run.

What happens:

The scanner is instantiated with the configured parameters (including ONNX if requested)
PyTorch float32 matmul precision is set to "high" and inductor graph caching is enabled
The timeit.repeat function measures wall-clock time for each scanner invocation
All runs use the same input data to ensure consistent comparison

Step 4: Analyze the results

Review the structured JSON output containing performance statistics. The output includes average latency in milliseconds, latency variance, percentile latencies (p90, p95, p99), input length, number of test runs, and throughput in characters per second (QPS).

Key considerations:

Average latency gives the typical per-request overhead this scanner adds to the pipeline
Variance indicates consistency: high variance suggests the scanner has unpredictable performance
p99 latency is critical for SLA planning as it represents worst-case tail latency
QPS (throughput) helps estimate how many requests a single scanner instance can handle
Compare ONNX vs. PyTorch results to determine the optimal backend for your deployment

Step 5: Optimize based on results

Use benchmark results to optimize your scanner pipeline configuration. Reorder scanners, adjust thresholds, switch between PyTorch and ONNX backends, or configure model parameters like batch size and max sequence length to meet your latency targets.

Key considerations:

Place fast scanners before slow scanners with fail_fast enabled to minimize average pipeline latency
ONNX Runtime typically provides 2-3x speedup over PyTorch for inference-only workloads
Reducing model_max_length trades accuracy for speed on long inputs
Consider running expensive scanners in parallel (as the API server scan endpoints do) rather than sequentially

Execution Diagram

GitHub URL

Workflow Repository