Implementation:Protectai_Llm_guard_Toxicity
| Knowledge Sources | |
|---|---|
| Domains | NLP, Content_Moderation, Text_Classification |
| Last Updated | 2026-02-14 12:00 GMT |
Overview
A concrete tool from the LLM Guard library for detecting toxic language in text using the unitary/unbiased-toxic-roberta model.
Description
The Toxicity class is an input scanner that uses the unitary/unbiased-toxic-roberta model for multi-label toxicity classification. It detects seven categories of toxic content: toxicity, severe toxicity, obscenity, threats, insults, identity attacks, and sexually explicit content. The model uses sigmoid activation for independent label scoring.
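Because the model uses sigmoid rather than softmax activation, each of the seven labels is scored independently: a single text can score high on both insult and obscenity at once, and the scores do not sum to 1. A minimal sketch of the difference, using toy logits (hypothetical values, not real model output):

```python
import math

def sigmoid(x: float) -> float:
    """Independent per-label probability: each label is scored on its own."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs: list[float]) -> list[float]:
    """Competing probabilities: all labels are forced to sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for three of the seven labels (illustrative, not model output)
logits = {"toxicity": 2.0, "insult": 1.5, "threat": -3.0}

# Multi-label (sigmoid): toxicity and insult can both exceed 0.5 together
multi = {label: sigmoid(z) for label, z in logits.items()}

# Single-label (softmax): the same logits would be forced to share one unit of mass
single = softmax(list(logits.values()))
```

This is why the scanner flags a prompt when *any* label crosses the threshold, rather than requiring one dominant class.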
Usage
Use this scanner to detect and block toxic content in user prompts; a matching output scanner is available for checking LLM responses.
Code Reference
Source Location
- Repository: llm-guard
- File: llm_guard/input_scanners/toxicity.py
- Lines: L50-131
Signature
class Toxicity(Scanner):
    def __init__(
        self,
        *,
        model: Model | None = None,
        threshold: float = 0.5,
        match_type: MatchType | str = MatchType.FULL,
        use_onnx: bool = False,
    ) -> None:
        """
        Args:
            model: HuggingFace model for classification. Default: unitary/unbiased-toxic-roberta.
            threshold: Toxicity score threshold. Default: 0.5.
            match_type: FULL or SENTENCE level matching. Default: FULL.
            use_onnx: Use ONNX runtime for inference. Default: False.
        """

    def scan(self, prompt: str) -> tuple[str, bool, float]:
        """
        Classify text for toxic content.

        Returns:
            - Original prompt (unmodified)
            - False if toxicity detected above threshold, True if safe
            - Risk score normalized against threshold
        """
Import
from llm_guard.input_scanners import Toxicity
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | Model | No | HuggingFace model config (default: unbiased-toxic-roberta) |
| threshold | float | No | Toxicity score threshold (default: 0.5) |
| match_type | MatchType or str | No | FULL or SENTENCE level (default: FULL) |
| use_onnx | bool | No | Use ONNX runtime (default: False) |
Outputs
| Name | Type | Description |
|---|---|---|
| prompt | str | Original prompt (unmodified) |
| is_valid | bool | False if any toxic label exceeds threshold |
| risk_score | float | Normalized highest toxicity score |
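The shape of this contract can be sketched in isolation: validity flips as soon as any label's score exceeds the threshold, and the risk score tracks the worst label. A toy illustration with hand-picked scores (this mirrors the contract only, not LLM Guard's internals; the library additionally normalizes the raw score against the threshold):

```python
def decide(label_scores: dict[str, float], threshold: float = 0.5) -> tuple[bool, float]:
    """Illustrative only: reproduces the shape of the scan contract,
    not LLM Guard's actual implementation."""
    worst = max(label_scores.values())
    is_valid = worst <= threshold  # False once any label exceeds the threshold
    return is_valid, worst

# Hypothetical per-label scores for two inputs
safe_scores = {"toxicity": 0.03, "insult": 0.01, "threat": 0.02}
toxic_scores = {"toxicity": 0.91, "insult": 0.87, "threat": 0.05}

assert decide(safe_scores) == (True, 0.03)
assert decide(toxic_scores) == (False, 0.91)
```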
Usage Examples
Basic Toxicity Detection
from llm_guard.input_scanners import Toxicity
scanner = Toxicity(threshold=0.5)
# Safe text
_, is_valid, score = scanner.scan("What is the weather today?")
# is_valid: True
# Toxic text
_, is_valid, score = scanner.scan("You are a terrible, worthless person!")
# is_valid: False
With ONNX Optimization
from llm_guard.input_scanners import Toxicity
scanner = Toxicity(threshold=0.5, use_onnx=True)
_, is_valid, score = scanner.scan(prompt)
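Sentence-Level Matching (Sketch)

With match_type set to SENTENCE, each sentence is scored separately instead of the full text, which helps catch a single toxic sentence buried in long benign text. A rough sketch of the idea with a stand-in classifier (toy_score below is a hypothetical placeholder; the real scanner runs the RoBERTa model per sentence and uses a proper sentence tokenizer):

```python
import re

def toy_score(text: str) -> float:
    """Hypothetical stand-in for the model's highest toxicity-label score."""
    return 0.9 if "worthless" in text.lower() else 0.05

def scan_sentences(prompt: str, threshold: float = 0.5) -> tuple[bool, float]:
    # Naive regex split on sentence-ending punctuation (illustration only)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", prompt) if s]
    worst = max(toy_score(s) for s in sentences)
    return worst <= threshold, worst

# One toxic sentence inside otherwise benign text still trips the scan,
# even though the full text might dilute the score in FULL mode
prompt = "The weather is lovely today. You are a worthless person! See you soon."
is_valid, score = scan_sentences(prompt)
assert not is_valid
```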
Related Pages
Implements Principle
Requires Environment
- Environment:Protectai_Llm_guard_Python_Runtime_Dependencies
- Environment:Protectai_Llm_guard_ONNX_Runtime_Acceleration