Principle:Protectai Llm guard Topic Filtering
| Knowledge Sources | |
|---|---|
| Domains | Content_Filtering, Zero_Shot_Classification |
| Last Updated | 2026-02-14 12:00 GMT |
Overview
Detecting and blocking text about specified topics using zero-shot text classification with Natural Language Inference models.
Description
Topic Filtering is a content filtering principle that identifies the topical content of text without requiring topic-specific training data. It leverages zero-shot classification powered by Natural Language Inference (NLI) models to determine whether a given text discusses any of a set of user-specified banned topics.
The key insight is that topic classification can be reformulated as an entailment problem. Given a text and a candidate topic label, an NLI model determines whether the text entails the hypothesis "This text is about {topic}". A high entailment score indicates that the text is about that topic. Because the topic label appears only inside the hypothesis string, the system can classify text against arbitrary labels without fine-tuning: adding or removing a banned topic means editing the topic list, not retraining a model.
Supported model architectures include DeBERTa, RoBERTa, and BGE-M3, each offering different trade-offs between accuracy, speed, and multilingual support. The model scores each candidate topic independently, and any topic exceeding a configurable threshold triggers a detection.
Usage
Use this principle when you need to restrict language model interactions to specific subject areas or prevent discussion of sensitive topics. Applications include blocking discussions of violence, politics, or adult content in family-friendly deployments; preventing off-topic conversations in domain-specific assistants; and enforcing regulatory compliance by blocking discussion of restricted subjects. The zero-shot nature of this approach is especially valuable when topic lists change frequently or when there is no labeled training data for the specific topics of interest.
Theoretical Basis
The zero-shot topic classification algorithm works as follows:
Hypothesis Construction:
- For each banned topic T, construct a hypothesis string: "This text is about {T}"
- The input text serves as the premise in the NLI framework
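The hypothesis construction step can be sketched in a few lines. The helper name `build_nli_pairs` and the `template` parameter are illustrative assumptions rather than part of any library API; only the template wording comes from the description above.

```python
from typing import List, Tuple

# Template wording taken from the text above; the constant name is illustrative.
HYPOTHESIS_TEMPLATE = "This text is about {}"


def build_nli_pairs(text: str, topics: List[str],
                    template: str = HYPOTHESIS_TEMPLATE) -> List[Tuple[str, str]]:
    """Return one (premise, hypothesis) pair per banned topic.

    The input text is reused unchanged as the premise of every pair.
    """
    return [(text, template.format(topic)) for topic in topics]


pairs = build_nli_pairs(
    "The senate vote is scheduled for Monday.",
    ["politics", "violence"],
)
for premise, hypothesis in pairs:
    print(f"premise={premise!r} hypothesis={hypothesis!r}")
```

Each pair would then be passed to the NLI model separately, so the number of model evaluations grows linearly with the number of banned topics.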
NLI Scoring:
- For each (premise, hypothesis) pair, the NLI model produces logits for three classes: entailment, neutral, and contradiction
- Apply softmax to obtain a probability distribution across the three classes
- The entailment probability represents the confidence that the text is about the topic
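As a minimal sketch of the scoring step, the snippet below converts three class logits into an entailment probability. The label order (entailment at index 0) and the logit values are assumptions made for illustration; real NLI checkpoints order their classes differently, so the model's own label mapping should be consulted.

```python
import math
from typing import List, Sequence

# Assumed label order for this sketch only; it varies between NLI checkpoints.
ENTAILMENT, NEUTRAL, CONTRADICTION = 0, 1, 2


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entailment_probability(logits: Sequence[float]) -> float:
    """Probability mass the model assigns to the entailment class."""
    return softmax(logits)[ENTAILMENT]


# Hypothetical logits for one (premise, hypothesis) pair.
example_logits = [3.2, 0.4, -1.5]
print(round(entailment_probability(example_logits), 3))
```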
Decision:
- For each topic, compare the entailment probability against the configured threshold
- If P(entailment) >= threshold for any banned topic, the text is flagged
- The scanner returns both the decision and the per-topic confidence scores
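The decision rule can be illustrated as follows. The function name `decide`, its return shape, and the scores are hypothetical; the sketch only mirrors the three bullets above and is not the scanner's actual interface.

```python
from typing import Dict, Tuple


def decide(topic_scores: Dict[str, float],
           threshold: float) -> Tuple[bool, Dict[str, float]]:
    """Flag the text if any banned topic reaches the threshold.

    Returns (is_flagged, topic_scores) so callers can log the
    per-topic confidence alongside the decision.
    """
    is_flagged = any(score >= threshold for score in topic_scores.values())
    return is_flagged, topic_scores


# Hypothetical entailment probabilities for two banned topics.
flagged, scores = decide({"politics": 0.91, "violence": 0.07}, threshold=0.6)
print(flagged, scores)
```

Returning the scores together with the boolean makes it easier to tune the threshold later, since borderline cases remain visible in logs.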
Multi-label handling:
- Each topic is scored independently (not as competing classes)
- A text can be flagged for multiple topics simultaneously
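To show why independent scoring matters, the sketch below contrasts it with a single softmax taken across topics. All names and logits are hypothetical, and index 0 is assumed to be the entailment class. When two topics are equally strong, competing normalization splits the probability mass between them, while independent scoring lets both exceed the threshold.

```python
import math
from typing import Dict, List, Sequence


def _softmax(values: Sequence[float]) -> List[float]:
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def independent_scores(logits_by_topic: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Per-topic entailment probability from that topic's own three-class softmax."""
    return {topic: _softmax(logits)[0] for topic, logits in logits_by_topic.items()}


def competing_scores(logits_by_topic: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Softmax over the entailment logits of all topics, so the scores sum to 1."""
    topics = list(logits_by_topic)
    probs = _softmax([logits_by_topic[t][0] for t in topics])
    return dict(zip(topics, probs))


# Hypothetical logits: the text strongly entails both topics.
logits = {"politics": [4.0, 0.0, -2.0], "violence": [4.0, 0.0, -2.0]}
threshold = 0.6

flagged_independent = [t for t, s in independent_scores(logits).items() if s >= threshold]
flagged_competing = [t for t, s in competing_scores(logits).items() if s >= threshold]

print(flagged_independent)  # ['politics', 'violence']
print(flagged_competing)    # []
```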