Workflow:Protectai Llm guard API Server Deployment
| Knowledge Sources | |
|---|---|
| Domains | LLM_Security, API_Deployment, DevOps |
| Last Updated | 2026-02-14 12:00 GMT |
Overview
End-to-end process for deploying LLM Guard as a production REST API service using FastAPI, Docker, and configurable YAML-based scanner pipelines.
Description
This workflow covers deploying the LLM Guard API server, a FastAPI application that exposes HTTP endpoints for scanning LLM prompts and outputs. The server is configured through a YAML file that defines which scanners to load, their parameters, authentication settings, rate limiting, and observability (OpenTelemetry tracing and Prometheus metrics). The API supports both sequential scanning (analyze endpoints that return sanitized text) and parallel scanning (scan endpoints that run scanners concurrently for lower latency). Deployment is handled through Docker Compose with a Uvicorn ASGI server.
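The difference between the two scanning modes can be illustrated with a toy model. This is a minimal sketch, not the server's implementation: the two scanners, the `analyze` and `scan` function names, and the result shapes are all hypothetical stand-ins chosen for illustration.

```python
import asyncio
import re
from typing import Callable, Dict, Tuple

# Toy scanner signature (an assumption): text -> (text, is_valid, risk_score).
Scanner = Callable[[str], Tuple[str, bool, float]]


def redact_emails(text: str) -> Tuple[str, bool, float]:
    """Hypothetical scanner that rewrites the text it is given."""
    cleaned = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[REDACTED_EMAIL]", text)
    found = cleaned != text
    return cleaned, not found, 1.0 if found else 0.0


def ban_secret_word(text: str) -> Tuple[str, bool, float]:
    """Hypothetical scanner that only judges the text and never rewrites it."""
    found = "secret" in text.lower()
    return text, not found, 1.0 if found else 0.0


SCANNERS: Dict[str, Scanner] = {
    "RedactEmails": redact_emails,
    "BanSecretWord": ban_secret_word,
}


def analyze(text: str, scanners: Dict[str, Scanner]):
    """Sequential mode: each scanner receives the previous scanner's output."""
    results = {}
    for name, scanner in scanners.items():
        text, valid, score = scanner(text)
        results[name] = (valid, score)
    return text, results


async def scan(text: str, scanners: Dict[str, Scanner]):
    """Parallel mode: every scanner sees the original text; no text is returned."""

    async def run(name: str, scanner: Scanner):
        # Run the blocking scanner in a worker thread so scanners overlap.
        _, valid, score = await asyncio.to_thread(scanner, text)
        return name, (valid, score)

    pairs = await asyncio.gather(*(run(n, s) for n, s in scanners.items()))
    return dict(pairs)
```

In this model `analyze` returns the rewritten text alongside the verdicts, while `asyncio.run(scan(...))` returns verdicts only, which mirrors the trade-off between sanitization and latency described above.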
Usage
Execute this workflow when you need to provide LLM Guard scanning as a shared service that multiple applications can call over HTTP. This is the recommended approach for production deployments where scanning should be centralized, independently scalable, and accessible to applications written in any language.
Execution Steps
Step 1: Write the scanner configuration YAML
Create a YAML configuration file that defines the complete scanner pipeline. The file specifies application settings (logging, timeouts, fail-fast mode), authentication (HTTP Bearer or Basic), rate limiting, observability exporters (OpenTelemetry, Prometheus), and ordered lists of input and output scanners with their parameters.
Key considerations:
- Scanners are applied in the order they appear in the YAML file
- Environment variables can be used in the YAML with ${VAR_NAME:default} syntax
- Each scanner entry has a type field (scanner class name) and optional params dictionary
- Set lazy_load to true for faster startup: scanners are then loaded on the first request instead of at boot, so the first request takes longer
- Configure scan_fail_fast to stop the pipeline at the first scanner failure
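To show how the `${VAR_NAME:default}` placeholders might resolve, here is a minimal sketch of the substitution applied to raw YAML text before parsing. The `expand_env` helper, the environment variable names, and the exact key layout of the sample config are assumptions for illustration; the server's own loader may behave differently, in particular when a variable has no default.

```python
import os
import re

# Matches ${NAME} and ${NAME:default}; the default may be empty.
_PLACEHOLDER = re.compile(
    r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?::(?P<default>[^}]*))?\}"
)


def expand_env(text, env=None):
    """Replace ${VAR:default} placeholders using env (defaults to os.environ)."""
    env = os.environ if env is None else env

    def replace(match):
        name = match.group("name")
        default = match.group("default")
        if name in env:
            return env[name]
        if default is not None:
            return default
        # Assumption: a missing variable without a default is treated as an error.
        raise KeyError(f"environment variable {name} is not set and has no default")

    return _PLACEHOLDER.sub(replace, text)


# Hypothetical config fragment; key names follow the fields named in this step.
EXAMPLE_CONFIG = """\
app:
  lazy_load: ${LAZY_LOAD:true}
  scan_fail_fast: ${SCAN_FAIL_FAST:false}
input_scanners:
  - type: TokenLimit
    params:
      limit: ${TOKEN_LIMIT:4096}
"""

if __name__ == "__main__":
    print(expand_env(EXAMPLE_CONFIG, env={"TOKEN_LIMIT": "1024"}))
```

Passing `env` explicitly keeps the helper easy to test; in a deployment the values would come from the container environment.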
Step 2: Configure authentication and rate limiting
Set up API authentication to protect the scanning endpoints. The server supports HTTP Bearer token authentication and HTTP Basic authentication. Configure rate limiting to prevent abuse of the scanning endpoints.
Key considerations:
- Bearer token auth checks requests against a configured token value
- Basic auth validates username and password pairs
- Rate limiting uses slowapi and is configured as requests per time period (e.g., "100/minute")
- Auth credentials should be injected via environment variables, not hardcoded in YAML
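The checks described above can be sketched with the standard library alone. This is an illustration of the idea, not the server's code: `is_authorized`, `parse_rate_limit`, and the `AUTH_TOKEN` variable name are hypothetical, and the real server delegates rate limiting to slowapi rather than parsing the string itself.

```python
import hmac
import os


def is_authorized(authorization_header, expected_token):
    """Return True if the header carries the expected Bearer token."""
    if not expected_token or not authorization_header:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison avoids leaking token contents through timing.
    return hmac.compare_digest(
        presented.strip().encode("utf-8"), expected_token.encode("utf-8")
    )


def parse_rate_limit(spec):
    """Split a limit such as "100/minute" into (count, period)."""
    count, _, period = spec.partition("/")
    return int(count), period.strip()


if __name__ == "__main__":
    # Assumption: the token is injected through an AUTH_TOKEN environment variable.
    token = os.environ.get("AUTH_TOKEN", "")
    print(is_authorized(f"Bearer {token}", token))
    print(parse_rate_limit("100/minute"))
```

Reading the token from the environment at startup keeps the secret out of the YAML file and out of version control.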
Step 3: Configure observability
Set up tracing and metrics exporters for production monitoring. The server integrates with OpenTelemetry for distributed tracing and supports Prometheus for metrics collection. Metrics include per-scanner validity counts and request latencies.
Key considerations:
- Tracing supports two exporters: otel_http (OpenTelemetry HTTP exporter) and console
- Metrics support three exporters: otel_http, prometheus, and console
- The /metrics endpoint is only available when the Prometheus exporter is configured
- OpenTelemetry instruments FastAPI routes, HTTP clients, and system resources automatically
Step 4: Build and deploy with Docker
Package the API server using Docker and deploy it with Docker Compose. The entrypoint script starts a Uvicorn ASGI server with configurable worker count and binding address. The configuration YAML is mounted into the container.
Key considerations:
- The Docker Compose configuration maps the config directory and exposes the API port
- Uvicorn workers should be set based on available CPU cores and expected concurrency
- The CONFIG_FILE environment variable points to the scanner YAML configuration path
- Health check endpoints (/healthz and /readyz) are available for container orchestration
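After the container starts, a deployment script can wait for the readiness endpoint before routing traffic. The sketch below polls `/readyz` using only the standard library; the `wait_until_ready` helper, the retry settings, and the `http://localhost:8000` address are assumptions, so substitute the host and port your Docker Compose file actually exposes.

```python
import time
import urllib.error
import urllib.request


def endpoint_ok(url, timeout=2.0):
    """Return True if a GET to url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_ready(base_url, attempts=30, delay=1.0, check=endpoint_ok):
    """Poll <base_url>/readyz until it succeeds or attempts run out."""
    url = base_url.rstrip("/") + "/readyz"
    for attempt in range(attempts):
        if check(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    # Assumed address; adjust to the port mapped in your deployment.
    print("ready" if wait_until_ready("http://localhost:8000") else "not ready")
```

The `check` parameter is injectable so the polling logic can be exercised without a running server.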
Step 5: Use the scanning API endpoints
Send HTTP requests to the API to scan prompts and outputs. The server provides four main endpoints: /analyze/prompt and /analyze/output for sequential scanning with sanitization, and /scan/prompt and /scan/output for parallel scanning that returns only validity and risk scores without sanitization.
Key considerations:
- Analyze endpoints run scanners sequentially and return sanitized text plus risk scores
- Scan endpoints run scanners in parallel for lower latency but do not sanitize the text
- Requests can suppress specific scanners using the scanners_suppress field
- Timeouts are configurable per endpoint type (prompt vs. output) in the YAML config
- All endpoints require authentication if configured in the YAML
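A client call might look like the following sketch, which builds a POST request for `/analyze/prompt` with the standard library. The `build_scan_request` helper, the base URL, the token value, and the `prompt` field name are assumptions for illustration; only the endpoint paths and the `scanners_suppress` field come from the description above, so check the server's request schema before relying on the payload shape.

```python
import json
import urllib.request


def build_scan_request(base_url, endpoint, payload, token=None):
    """Build a JSON POST request, adding a Bearer header when a token is given."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_scan_request(
    "http://localhost:8000",  # assumed address
    "/analyze/prompt",
    {
        "prompt": "Summarize the email from bob@example.com",  # assumed field name
        "scanners_suppress": ["Anonymize"],  # skip this scanner for this request
    },
    token="example-token",  # placeholder; inject the real token from the environment
)

if __name__ == "__main__":
    # Sending requires a running server; the response is expected to be JSON.
    with urllib.request.urlopen(request, timeout=30) as response:
        print(json.loads(response.read()))
```

Switching the endpoint to `/scan/prompt` would request the parallel variant with the same payload, under the same assumptions.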
Step 6: Monitor and maintain
Monitor scanner performance through the configured observability stack. Track scanner validity rates, request latencies, and error rates to identify issues and tune scanner thresholds.
Key considerations:
- The /metrics Prometheus endpoint exposes scanner validity counters by scanner name and source (input/output)
- Tracing data includes per-scanner execution times for identifying performance bottlenecks
- Log output format is configurable: JSON for production log aggregation, plain text for development
- OpenAPI docs are available at /openapi.json only when debug mode is enabled
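As one way to turn the validity counters into a tuning signal, the sketch below parses Prometheus text-format samples and computes the share of invalid verdicts per scanner and source. The metric name `scanners_valid_total`, its label names, and the sample values are hypothetical; inspect the actual `/metrics` output to find the real names before adapting this.

```python
import re
from collections import defaultdict

# Hypothetical scrape output; real metric and label names may differ.
SAMPLE_METRICS = """\
# HELP scanners_valid_total Hypothetical counter of scanner verdicts
# TYPE scanners_valid_total counter
scanners_valid_total{source="input",valid="true",scanner="PromptInjection"} 90
scanners_valid_total{source="input",valid="false",scanner="PromptInjection"} 10
scanners_valid_total{source="output",valid="true",scanner="Sensitive"} 48
scanners_valid_total{source="output",valid="false",scanner="Sensitive"} 2
"""

_SAMPLE = re.compile(r"^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$")
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def invalid_rates(metrics_text, metric_name):
    """Return {(source, scanner): fraction of verdicts that were invalid}."""
    totals = defaultdict(float)
    invalid = defaultdict(float)
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if not match or match.group("name") != metric_name:
            continue  # skips comments, blank lines, and other metrics
        labels = dict(_LABEL.findall(match.group("labels")))
        key = (labels.get("source", ""), labels.get("scanner", ""))
        value = float(match.group("value"))
        totals[key] += value
        if labels.get("valid") == "false":
            invalid[key] += value
    return {key: invalid[key] / total for key, total in totals.items() if total > 0}


if __name__ == "__main__":
    for key, rate in invalid_rates(SAMPLE_METRICS, "scanners_valid_total").items():
        print(key, f"{rate:.1%}")
```

A scanner whose invalid rate is unexpectedly high on benign traffic is a candidate for a looser threshold; in practice the same ratio would usually be computed in the monitoring system rather than in a script.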