Principle: ProtectAI LLM Guard Toxicity Detection
| Knowledge Sources | |
|---|---|
| Domains | NLP, Content_Moderation, Text_Classification |
| Last Updated | 2026-02-14 12:00 GMT |
Overview
A multi-label text classification technique that detects toxic, obscene, threatening, insulting, and identity-attacking content in text using fine-tuned transformer models with sigmoid activation.
Description
Toxicity detection uses multi-label classification to identify harmful content across several categories simultaneously: toxicity, severe toxicity, obscenity, threats, insults, identity attacks, and sexually explicit content. Unlike binary classification, multi-label models assign an independent probability score to each category, allowing detection of text that is, for example, both insulting and threatening.
The detection model uses sigmoid activation (not softmax) so labels are not mutually exclusive. The highest score across all toxic labels is compared against a configurable threshold to make the final validity determination.
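To illustrate why sigmoid (rather than softmax) activation makes the labels independent, here is a minimal sketch; the logit values are hypothetical, not outputs of any real model:

```python
import math

def sigmoid(x):
    # Maps a raw logit to an independent probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical logits from a multi-label classification head for one input.
logits = {"toxicity": 2.0, "insult": 1.5, "threat": -3.0}

# Each label is scored independently, so several labels can exceed the
# threshold at once -- unlike softmax, the scores need not sum to 1.
threshold = 0.5
scores = {label: sigmoid(z) for label, z in logits.items()}
flagged = [label for label, s in scores.items() if s > threshold]
```

With these logits, both "toxicity" and "insult" clear the threshold while "threat" does not, which a softmax over the same head could not express as cleanly.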
Input can be analyzed as a whole or split into sentences for finer-grained detection.
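Sentence-level scanning can be sketched with a naive regex splitter (the splitter below is an illustrative assumption; production scanners typically use a proper sentence tokenizer):

```python
import re

def split_sentences(text):
    # Naive split on sentence-ending punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

text = "The weather is nice today. You are a complete idiot!"
segments = split_sentences(text)
# Each sentence is classified separately, so a single toxic sentence can
# be flagged even when the text as a whole would score as mild.
```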
Usage
Use this principle in both input and output scanning pipelines to prevent toxic content from being sent to or returned from LLMs. Essential for content moderation, safety compliance, and preventing the amplification of harmful language.
Theoretical Basis
```
# Pseudocode for multi-label toxicity detection
toxic_labels = ["toxicity", "severe_toxicity", "obscene", "threat",
                "insult", "identity_attack", "sexual_explicit"]
segments = match_type.get_inputs(text)  # full text, or split into sentences
results = classifier(segments)          # multi-label scores via sigmoid

highest_score = 0.0
for result_set in results:
    for label_result in result_set:
        if label_result["label"] in toxic_labels:
            highest_score = max(highest_score, label_result["score"])

if highest_score > threshold:
    return TOXIC_DETECTED
return SAFE
```
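The pseudocode above can be made concrete with a self-contained sketch. The classifier here is a stub returning fixed scores so the threshold logic is runnable; a real deployment would substitute a fine-tuned multi-label transformer (an assumption, not part of the source):

```python
TOXIC_LABELS = {
    "toxicity", "severe_toxicity", "obscene", "threat",
    "insult", "identity_attack", "sexual_explicit",
}

def stub_classifier(segments):
    # Stand-in for a multi-label (sigmoid) classifier: returns one list
    # of {label, score} dicts per input segment. Scores are invented.
    fake_scores = {
        "you are great": [{"label": "toxicity", "score": 0.02}],
        "you are an idiot": [
            {"label": "toxicity", "score": 0.91},
            {"label": "insult", "score": 0.88},
        ],
    }
    return [fake_scores.get(s, [{"label": "toxicity", "score": 0.0}])
            for s in segments]

def is_toxic(segments, classifier, threshold=0.5):
    # Take the highest score across all toxic labels and all segments,
    # then compare it against the configurable threshold once.
    highest = 0.0
    for result_set in classifier(segments):
        for r in result_set:
            if r["label"] in TOXIC_LABELS:
                highest = max(highest, r["score"])
    return highest > threshold, highest

toxic, score = is_toxic(["you are an idiot"], stub_classifier)
```

Returning the highest score alongside the boolean verdict mirrors how scanners commonly expose a risk score for logging in addition to the pass/fail decision.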