Implementation:ProtectAI LLM Guard Toxicity

From Leeroopedia
Domains NLP, Content_Moderation, Text_Classification
Last Updated 2026-02-14 12:00 GMT

Overview

A concrete tool for detecting toxic language in text using the unitary/unbiased-toxic-roberta model, provided by the LLM Guard library.

Description

The Toxicity class is an input scanner that uses the unitary/unbiased-toxic-roberta model for multi-label toxicity classification. It detects seven categories of toxic content: toxicity, severe toxicity, obscenity, threats, insults, identity attacks, and sexually explicit content. The model uses sigmoid activation for independent label scoring.
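To make the multi-label behavior concrete, here is an illustrative sketch that scores text with the same unitary/unbiased-toxic-roberta model directly through the Hugging Face transformers library. It shows only the underlying sigmoid scoring; the scanner wraps this kind of pipeline with thresholding and match-type handling, and its internals may differ.

# Illustrative only: multi-label scoring with the underlying model.
# Assumes the transformers and torch packages are installed.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unitary/unbiased-toxic-roberta")
model = AutoModelForSequenceClassification.from_pretrained("unitary/unbiased-toxic-roberta")

inputs = tokenizer("You are a terrible, worthless person!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Sigmoid, not softmax: every label is scored independently, so one text can
# exceed the threshold for several categories (e.g. insult and toxicity) at once.
scores = torch.sigmoid(logits)[0]
for label_id, score in enumerate(scores):
    print(model.config.id2label[label_id], round(score.item(), 3))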

Usage

Use this scanner to detect and block toxic content in user prompts. A counterpart output scanner is also available for checking LLM responses.
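For output scanning, a hedged sketch assuming the output-scanner counterpart follows LLM Guard's usual scan(prompt, output) convention:

from llm_guard.output_scanners import Toxicity

scanner = Toxicity(threshold=0.5)

# Output scanners receive both the original prompt and the LLM response;
# the response is what gets classified here.
sanitized_output, is_valid, risk_score = scanner.scan(
    "Tell me about my coworker.",
    "Your coworker is an idiot and deserves to be fired.",
)
# is_valid is expected to be False because the response is insulting.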

Code Reference

Source Location

  • Repository: llm-guard
  • File: llm_guard/input_scanners/toxicity.py
  • Lines: L50-131

Signature

class Toxicity(Scanner):
    def __init__(
        self,
        *,
        model: Model | None = None,
        threshold: float = 0.5,
        match_type: MatchType | str = MatchType.FULL,
        use_onnx: bool = False,
    ) -> None:
        """
        Args:
            model: HuggingFace model for classification. Default: unitary/unbiased-toxic-roberta.
            threshold: Toxicity score threshold. Default: 0.5.
            match_type: FULL or SENTENCE level matching. Default: FULL.
            use_onnx: Use ONNX runtime for inference. Default: False.
        """

    def scan(self, prompt: str) -> tuple[str, bool, float]:
        """
        Classify text for toxic content.

        Returns:
            - Original prompt (unmodified)
            - False if toxicity detected above threshold, True if safe
            - Risk score normalized against threshold
        """

Import

from llm_guard.input_scanners import Toxicity

I/O Contract

Inputs

Name Type Required Description
model Model No HuggingFace model config (default: unbiased-toxic-roberta)
threshold float No Toxicity score threshold (default: 0.5)
match_type MatchType or str No FULL or SENTENCE level (default: FULL)
use_onnx bool No Use ONNX runtime (default: False)
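A minimal configuration sketch covering the non-default inputs above; the lowercase string "sentence" for match_type is an assumption based on the MatchType enum in the signature, and the enum member MatchType.SENTENCE can be passed instead:

from llm_guard.input_scanners import Toxicity

scanner = Toxicity(
    threshold=0.7,           # flag only higher-confidence toxicity
    match_type="sentence",   # score each sentence rather than the full prompt
    use_onnx=False,          # set True to run inference on the ONNX runtime
)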

Outputs

Name Type Description
prompt str Original prompt (unmodified)
is_valid bool False if any toxic label exceeds threshold
risk_score float Normalized highest toxicity score
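A minimal gating sketch showing how a caller might act on the three outputs; the rejection message is illustrative and not part of the library:

from llm_guard.input_scanners import Toxicity

scanner = Toxicity(threshold=0.5)
user_prompt = "You are a terrible, worthless person!"  # example input

prompt, is_valid, risk_score = scanner.scan(user_prompt)

if not is_valid:
    # risk_score is normalized against the threshold: larger values mean the
    # strongest toxic label exceeded the threshold by a wider margin.
    print(f"Prompt rejected, toxicity risk score {risk_score:.2f}")
else:
    print("Prompt is safe to forward to the LLM unchanged.")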

Usage Examples

Basic Toxicity Detection

from llm_guard.input_scanners import Toxicity

scanner = Toxicity(threshold=0.5)

# Safe text
_, is_valid, score = scanner.scan("What is the weather today?")
# is_valid: True

# Toxic text
_, is_valid, score = scanner.scan("You are a terrible, worthless person!")
# is_valid: False

With ONNX Optimization

from llm_guard.input_scanners import Toxicity

# use_onnx=True requires the optional ONNX runtime dependencies to be installed.
scanner = Toxicity(threshold=0.5, use_onnx=True)

prompt = "What is the weather today?"
_, is_valid, score = scanner.scan(prompt)
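Combining with Other Input Scanners

In a full pipeline this scanner is typically chained with other input scanners. A hedged sketch, assuming the scan_prompt helper and result layout documented in the LLM Guard README:

from llm_guard import scan_prompt
from llm_guard.input_scanners import PromptInjection, Toxicity

input_scanners = [Toxicity(threshold=0.5), PromptInjection()]

sanitized_prompt, results_valid, results_score = scan_prompt(
    input_scanners, "What is the weather today?"
)

# results_valid and results_score are keyed by scanner name,
# e.g. results_valid["Toxicity"] and results_score["Toxicity"].
if all(results_valid.values()):
    print("Prompt passed all input scanners:", sanitized_prompt)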

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
