Implementation:ProtectAI LLM Guard Toxicity

From Leeroopedia
Domains NLP, Content_Moderation, Text_Classification
Last Updated 2026-02-14 12:00 GMT

Overview

A concrete tool for detecting toxic language in text using the unitary/unbiased-toxic-roberta model, provided by the LLM Guard library.

Description

The Toxicity class is an input scanner that uses the unitary/unbiased-toxic-roberta model for multi-label toxicity classification. It detects seven categories of toxic content: toxicity, severe toxicity, obscenity, threats, insults, identity attacks, and sexually explicit content. The model uses sigmoid activation for independent label scoring.
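To make the multi-label behavior concrete, here is an illustrative sketch that scores text with the same unitary/unbiased-toxic-roberta model directly through the Hugging Face transformers library. It shows only the underlying sigmoid scoring; the scanner wraps this kind of pipeline with thresholding and match-type handling, and its internals may differ.

# Illustrative only: multi-label scoring with the underlying model.
# Assumes the transformers and torch packages are installed.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unitary/unbiased-toxic-roberta")
model = AutoModelForSequenceClassification.from_pretrained("unitary/unbiased-toxic-roberta")

inputs = tokenizer("You are a terrible, worthless person!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Sigmoid, not softmax: every label is scored independently, so one text can
# exceed the threshold for several categories (e.g. insult and toxicity) at once.
scores = torch.sigmoid(logits)[0]
for label_id, score in enumerate(scores):
    print(model.config.id2label[label_id], round(score.item(), 3))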

Usage

Use this scanner to detect and block toxic content in user prompts. A counterpart output scanner is also available for checking LLM responses.
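For output scanning, a hedged sketch assuming the output-scanner counterpart follows LLM Guard's usual scan(prompt, output) convention:

from llm_guard.output_scanners import Toxicity

scanner = Toxicity(threshold=0.5)

# Output scanners receive both the original prompt and the LLM response;
# the response is what gets classified here.
sanitized_output, is_valid, risk_score = scanner.scan(
    "Tell me about my coworker.",
    "Your coworker is an idiot and deserves to be fired.",
)
# is_valid is expected to be False because the response is insulting.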

Code Reference

Source Location

  • Repository: llm-guard
  • File: llm_guard/input_scanners/toxicity.py
  • Lines: L50-131

Signature

class Toxicity(Scanner):
    def __init__(
        self,
        *,
        model: Model | None = None,
        threshold: float = 0.5,
        match_type: MatchType | str = MatchType.FULL,
        use_onnx: bool = False,
    ) -> None:
        """
        Args:
            model: HuggingFace model for classification. Default: unitary/unbiased-toxic-roberta.
            threshold: Toxicity score threshold. Default: 0.5.
            match_type: FULL or SENTENCE level matching. Default: FULL.
            use_onnx: Use ONNX runtime for inference. Default: False.
        """

    def scan(self, prompt: str) -> tuple[str, bool, float]:
        """
        Classify text for toxic content.

        Returns:
            - Original prompt (unmodified)
            - False if toxicity detected above threshold, True if safe
            - Risk score normalized against threshold
        """

Import

from llm_guard.input_scanners import Toxicity

I/O Contract

Inputs

Name Type Required Description
model Model No HuggingFace model config (default: unbiased-toxic-roberta)
threshold float No Toxicity score threshold (default: 0.5)
match_type MatchType or str No FULL or SENTENCE level (default: FULL)
use_onnx bool No Use ONNX runtime (default: False)
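A minimal configuration sketch covering the non-default inputs above; the lowercase string "sentence" for match_type is an assumption based on the MatchType enum in the signature, and the enum member MatchType.SENTENCE can be passed instead:

from llm_guard.input_scanners import Toxicity

scanner = Toxicity(
    threshold=0.7,           # flag only higher-confidence toxicity
    match_type="sentence",   # score each sentence rather than the full prompt
    use_onnx=False,          # set True to run inference on the ONNX runtime
)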

Outputs

Name Type Description
prompt str Original prompt (unmodified)
is_valid bool False if any toxic label exceeds threshold
risk_score float Normalized highest toxicity score
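A minimal gating sketch showing how a caller might act on the three outputs; the rejection message is illustrative and not part of the library:

from llm_guard.input_scanners import Toxicity

scanner = Toxicity(threshold=0.5)
user_prompt = "You are a terrible, worthless person!"  # example input

prompt, is_valid, risk_score = scanner.scan(user_prompt)

if not is_valid:
    # risk_score is normalized against the threshold: larger values mean the
    # strongest toxic label exceeded the threshold by a wider margin.
    print(f"Prompt rejected, toxicity risk score {risk_score:.2f}")
else:
    print("Prompt is safe to forward to the LLM unchanged.")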

Usage Examples

Basic Toxicity Detection

from llm_guard.input_scanners import Toxicity

scanner = Toxicity(threshold=0.5)

# Safe text
_, is_valid, score = scanner.scan("What is the weather today?")
# is_valid: True

# Toxic text
_, is_valid, score = scanner.scan("You are a terrible, worthless person!")
# is_valid: False

With ONNX Optimization

from llm_guard.input_scanners import Toxicity

# use_onnx=True requires the optional ONNX runtime dependencies to be installed.
scanner = Toxicity(threshold=0.5, use_onnx=True)

prompt = "What is the weather today?"
_, is_valid, score = scanner.scan(prompt)
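Combining with Other Input Scanners

In a full pipeline this scanner is typically chained with other input scanners. A hedged sketch, assuming the scan_prompt helper and result layout documented in the LLM Guard README:

from llm_guard import scan_prompt
from llm_guard.input_scanners import PromptInjection, Toxicity

input_scanners = [Toxicity(threshold=0.5), PromptInjection()]

sanitized_prompt, results_valid, results_score = scan_prompt(
    input_scanners, "What is the weather today?"
)

# results_valid and results_score are keyed by scanner name,
# e.g. results_valid["Toxicity"] and results_score["Toxicity"].
if all(results_valid.values()):
    print("Prompt passed all input scanners:", sanitized_prompt)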

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
