Principle:Protectai Llm guard Topic Filtering

From Leeroopedia
Knowledge Sources
Domains: Content_Filtering, Zero_Shot_Classification
Last Updated: 2026-02-14 12:00 GMT

Overview

Topic Filtering detects and blocks text about specified topics by applying zero-shot text classification with Natural Language Inference (NLI) models.

Description

Topic Filtering is a content filtering principle that identifies the topical content of text without requiring topic-specific training data. It leverages zero-shot classification powered by Natural Language Inference (NLI) models to determine whether a given text discusses any of a set of user-specified banned topics.

The key insight is that topic classification can be reformulated as an entailment problem. Given a text and a candidate topic label, an NLI model determines whether the text entails the hypothesis "This text is about {topic}." A high entailment score indicates that the text is about that topic. Because the topic enters only through the hypothesis string, the system can classify text against arbitrary topic labels without fine-tuning, so new topics can be added by changing configuration rather than retraining a model.

Supported model architectures include DeBERTa, RoBERTa, and BGE-M3, each offering different trade-offs between accuracy, speed, and multilingual support. The model scores each candidate topic independently, and any topic exceeding a configurable threshold triggers a detection.

Usage

Use this principle when you need to restrict language model interactions to specific subject areas or prevent discussion of sensitive topics. Applications include blocking discussions of violence, politics, or adult content in family-friendly deployments; preventing off-topic conversations in domain-specific assistants; and enforcing regulatory compliance by blocking discussion of restricted subjects. The zero-shot nature of this approach is especially valuable when topic lists change frequently or when there is no labeled training data for the specific topics of interest.

Theoretical Basis

The zero-shot topic classification algorithm works as follows:

Hypothesis Construction:

  • For each banned topic T, construct a hypothesis string: "This text is about {T}"
  • The input text serves as the premise in the NLI framework
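The hypothesis construction step can be sketched in Python as follows. This is a minimal illustration, not the library's code: the function name `build_nli_pairs` and the example sentence are hypothetical, and only the template string comes from the description above.

```python
from typing import List, Tuple

# Template taken from the description above.
HYPOTHESIS_TEMPLATE = "This text is about {}"


def build_nli_pairs(text: str, banned_topics: List[str]) -> List[Tuple[str, str]]:
    """Return one (premise, hypothesis) pair per banned topic.

    The input text is the premise; each topic is substituted into the template
    to form the hypothesis. (Hypothetical helper, for illustration only.)
    """
    return [(text, HYPOTHESIS_TEMPLATE.format(topic)) for topic in banned_topics]


# Made-up example input.
pairs = build_nli_pairs(
    "The senate passed the bill last night.",
    ["politics", "violence"],
)
for premise, hypothesis in pairs:
    print(f"premise={premise!r}  hypothesis={hypothesis!r}")
```

Each pair would then be passed to an NLI model, which is not shown here.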

NLI Scoring:

  • For each (premise, hypothesis) pair, the NLI model produces logits for three classes: entailment, neutral, contradiction
  • Apply softmax to obtain a probability distribution across the three classes
  • The entailment probability represents the confidence that the text is about the topic
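The scoring step can be illustrated with a small, dependency-free sketch. The logits below are made up, and the label order `(entailment, neutral, contradiction)` is an assumption: real NLI checkpoints order their classes differently, so the index of the entailment class should be read from the model's own label mapping.

```python
import math
from typing import List, Sequence

# Assumed label order for this sketch only; real models may differ.
LABELS = ("entailment", "neutral", "contradiction")


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entailment_probability(logits: Sequence[float], labels: Sequence[str] = LABELS) -> float:
    """Return the probability assigned to the entailment class."""
    probs = softmax(logits)
    return probs[list(labels).index("entailment")]


# Made-up logits for one (premise, hypothesis) pair.
example_logits = [2.0, 0.5, -1.0]
p_entail = entailment_probability(example_logits)
print(f"P(entailment) = {p_entail:.3f}")
```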

Decision:

  • For each topic, compare the entailment probability against the configured threshold
  • If P(entailment) >= threshold for any banned topic, the text is flagged
  • The scanner returns both the decision and the per-topic confidence scores
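A minimal sketch of the decision rule, assuming the per-topic entailment probabilities have already been computed. The `TopicDecision` type, the `decide` function, and the example scores are hypothetical and exist only to illustrate the `>=` comparison and the "decision plus per-topic scores" return value described above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TopicDecision:
    flagged: bool                # True if any banned topic met the threshold
    matched_topics: List[str]    # topics whose score met the threshold
    scores: Dict[str, float]     # per-topic entailment probabilities


def decide(scores: Dict[str, float], threshold: float) -> TopicDecision:
    """Flag the text if P(entailment) >= threshold for any banned topic."""
    matched = [topic for topic, p in scores.items() if p >= threshold]
    return TopicDecision(flagged=bool(matched), matched_topics=matched, scores=dict(scores))


# Made-up scores for illustration.
example_scores = {"politics": 0.91, "violence": 0.12, "adult content": 0.60}
decision = decide(example_scores, threshold=0.6)
print(decision.flagged, decision.matched_topics)
```

Note that a score exactly equal to the threshold counts as a match under the `>=` rule stated above.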

Multi-label handling:

  • Each topic is scored independently (not as competing classes)
  • A text can be flagged for multiple topics simultaneously
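To make the difference between independent and competing scoring concrete, the sketch below contrasts the two on made-up logits. It assumes the label order `(entailment, neutral, contradiction)` and is not taken from any particular library. With independent scoring, each topic gets its own three-class softmax, so scores need not sum to 1 across topics. With competing scoring, the entailment logits are normalized across topics, so one strong topic suppresses the others.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up (entailment, neutral, contradiction) logits per topic.
logits_by_topic: Dict[str, List[float]] = {
    "politics": [3.0, 0.0, -2.0],
    "violence": [2.5, 0.2, -1.5],
    "cooking": [-2.0, 0.5, 3.0],
}

# Independent (multi-label): one softmax per topic, keep the entailment probability.
independent = {t: softmax(l)[0] for t, l in logits_by_topic.items()}

# Competing (single-label): one softmax over the entailment logits of all topics.
competing = dict(zip(logits_by_topic, softmax([l[0] for l in logits_by_topic.values()])))

threshold = 0.5
flagged_independent = [t for t, p in independent.items() if p >= threshold]
flagged_competing = [t for t, p in competing.items() if p >= threshold]
print("independent:", flagged_independent)
print("competing:  ", flagged_competing)
```

In this example the text scores highly for two topics at once, which only the independent formulation can report.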
